Strings as features in decision trees/random forests



I am working on some problems involving decision trees/random forests. I am trying to fit a problem that has numbers as well as strings (such as country names) as features. The scikit-learn library takes only numbers as parameters, but I want to inject the strings as well, since they carry a significant amount of knowledge.

How do I handle such a scenario?

I can convert the strings to numbers by some mechanism such as hashing in Python, but I would like to know the best practice for handling strings in decision tree problems.


Posted 2015-02-25T01:07:14.717

Reputation: 895

In the case of scikit-learn, I have seen that we need to encode the categorical variables; otherwise the fit method throws an error saying ValueError: could not convert string to float – Kar – 2016-08-31T23:53:27.997



In most well-established machine learning systems, categorical variables are handled naturally. For example, in R you would use factors, and in WEKA you would use nominal variables. This is not the case in scikit-learn: the decision trees implemented there use only numerical features, and these features are always interpreted as continuous numeric variables.

Thus, simply replacing the strings with a hash code should be avoided, because any coding you use, being treated as a continuous numerical feature, will induce an order which simply does not exist in your data.

For example, coding ['red','green','blue'] as [1,2,3] produces weird artifacts: 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. A more subtle example can arise when you code ['low','medium','high'] as [1,2,3]. In that case the ordering might make sense; however, subtle inconsistencies can still appear when 'medium' is not actually in the middle of 'low' and 'high'.
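A small illustration of the pitfall described above, integer-coding the colors with scikit-learn's LabelEncoder, which silently imposes an alphabetical order:

```python
# Integer-coding categorical values imposes an ordering that does not
# exist in the data; LabelEncoder sorts classes alphabetically.
from sklearn.preprocessing import LabelEncoder

codes = LabelEncoder().fit_transform(['red', 'green', 'blue', 'green', 'red'])
print(codes)  # [2 1 0 1 2], i.e. blue=0, green=1, red=2
# A tree can now split on "color <= 1.5", grouping blue with green
# against red, an ordering the data never implied.
```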

Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red','green','blue'] with 3 columns, one for each category, containing 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, one-of-k encoding, and so on. You can check the documentation here for encoding categorical features and for feature extraction with hashing and dicts. Obviously, one-hot encoding expands your space requirements, and sometimes it hurts performance as well.
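A minimal sketch of this coding using scikit-learn's OneHotEncoder (calling .toarray() on the sparse result to keep it readable and version-agnostic):

```python
# One-hot encode a single string column: one 0/1 column per category.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['green'], ['blue'], ['green']])
encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()

# Categories are sorted alphabetically, so columns are blue, green, red.
# Each row contains a single 1 in the column of its category:
# red -> [0, 0, 1], green -> [0, 1, 0], blue -> [1, 0, 0]
print(onehot)
```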


Posted 2015-02-25T01:07:14.717

Reputation: 3 864

3 It's the scikit-learn implementation that doesn't handle categorical variables properly. Recoding as this answer suggests is probably the best you can do. More serious users might look for an alternative package. – SmallChess – 2016-09-18T01:46:20.643

3 One can use sklearn.preprocessing.LabelBinarizer for one-hot encoding of categorical variables. – GuSuku – 2016-11-09T09:29:14.593

@rapaio I think binary coding is not the same as one-hot encoding. Binary coding is when you represent 8 categories with 3 columns, or 9 to 16 categories with 4 columns, and so on. Am I wrong? – Alok Nayak – 2017-04-21T10:59:03.197
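A quick sketch of the binary coding this comment describes: eight categories fit in ceil(log2(8)) = 3 binary columns instead of the 8 columns one-hot would use. The helper below is hypothetical, written here only for illustration:

```python
# Binary-code categories: each category's integer index is written out
# as a fixed number of 0/1 bit columns.
categories = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
index = {cat: i for i, cat in enumerate(categories)}

def binary_code(cat, width=3):
    """Encode a category's integer index as `width` 0/1 columns."""
    i = index[cat]
    return [(i >> bit) & 1 for bit in reversed(range(width))]

print(binary_code('a'))  # [0, 0, 0]
print(binary_code('h'))  # [1, 1, 1]
```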

the patsy python package will deal with one-hot encoding of categorical variables.

– ZSH – 2017-05-17T18:13:22.763


Do not use LabelBinarizer; use sklearn.preprocessing.OneHotEncoder. If you are using pandas to import and pre-process your data, you can also do that directly using pandas.get_dummies. It sucks that scikit-learn does not support categorical variables.

– Ricardo Cruz – 2017-07-28T09:52:24.077

One-hot (1-of-K) encoding is inefficient when there are many categorical features with a large number of unique values. This may result in the curse of dimensionality. – CKM – 2017-12-22T08:15:59.733

1What are Python alternatives to trees that accept categoricals as strings? – Daniel Möller – 2020-03-13T17:11:01.163

Be careful with one-hot encoding!!

– polvoazul – 2020-08-28T03:51:42.433


You need to encode your strings as numeric features that scikit-learn can use in its ML algorithms. This functionality is handled in the preprocessing module (e.g., see sklearn.preprocessing.LabelEncoder for an example).


Posted 2015-02-25T01:07:14.717

Reputation: 1 453

6rapaio explains in his answer why this would get an incorrect result – Keith – 2017-04-25T16:14:48.900


You should usually one-hot encode categorical variables for scikit-learn models, including random forest. Random forest will often work ok without one-hot encoding but usually performs better if you do one-hot encode. One-hot encoding and "dummying" variables mean the same thing in this context. Scikit-learn has sklearn.preprocessing.OneHotEncoder and Pandas has pandas.get_dummies to accomplish this.

However, there are alternatives. The article "Beyond One-Hot" at KDnuggets does a great job of explaining why you need to encode categorical variables and alternatives to one-hot encoding.

There are alternative implementations of random forest that do not require one-hot encoding, such as those in R or H2O. The implementation in R is computationally expensive and will not work if your features have many categories; H2O will work with large numbers of categories. Continuum has made H2O available in Anaconda Python.

There is an ongoing effort to make scikit-learn handle categorical features directly.

This article has an explanation of the algorithm used in H2O. It references the academic paper A Streaming Parallel Decision Tree Algorithm and a longer version of the same paper.


Posted 2015-02-25T01:07:14.717

Reputation: 171


2018 Update!

You can create an embedding (dense vector) space for your categorical variables. Many of you are familiar with word2vec and fastText, which embed words in a meaningful dense vector space. The same idea applies here: your categorical variables map to vectors with some meaning.

From the Guo/Berkhahn paper:

Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features.

The authors found that representing categorical variables this way improved the effectiveness of all machine learning algorithms tested, including random forest.

The best example might be Pinterest's application of the technique to group related Pins:

[image: Pinterest's related Pins grouped together in the embedding space]

The folks at fastai have implemented categorical embeddings and created a very nice blog post with companion demo notebook.

Additional Details and Explanation

A neural net is used to create the embeddings, i.e., to assign a vector to each categorical value. Once you have the vectors, you may use them in any model which accepts numerical values. Each component of the vector becomes an input variable. For example, if you used 3-D vectors to embed your categorical list of colors, you might get something like red=(0, 1.5, -2.3), blue=(1, 1, 0), etc. You would use three input variables in your random forest corresponding to the three components. For red things, c1=0, c2=1.5, and c3=-2.3. For blue things, c1=1, c2=1, and c3=0.

You don't actually need to use a neural network to create embeddings (although I don't recommend shying away from the technique). You're free to create your own embeddings by hand or other means, when possible. Some examples:

  1. Map colors to RGB vectors.
  2. Map locations to lat/long vectors.
  3. In a U.S. political model, map cities to some vector components representing left/right alignment, tax burden, etc.
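A hand-rolled version of example 1 above: map colors to RGB vectors and feed the three components to a random forest as numeric features. The rgb mapping, the color list, and the "warm color" target are invented here purely for illustration:

```python
# Use a hand-made embedding (color -> RGB) as numeric input to a forest.
from sklearn.ensemble import RandomForestClassifier

rgb = {'red': (255, 0, 0), 'green': (0, 128, 0), 'blue': (0, 0, 255)}

colors = ['red', 'red', 'green', 'blue', 'blue', 'green']
warm = [1, 1, 0, 0, 0, 0]  # toy target: 1 if the color is "warm"

X = [rgb[c] for c in colors]  # each color contributes 3 numeric features
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, warm)
print(clf.predict([rgb['red']]))
```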


Posted 2015-02-25T01:07:14.717

Reputation: 754

OK, cool, but unless I missed something this is for nets start to finish. How do we create an embedding and then pass that embedding into a forest? I guess you have to train a whole net with all the features and then take the first few layers and use that as your input feature to your forest. It's not clear how this would be done. – Keith – 2019-06-11T22:12:35.343

@Keith a neural net is used to create the embeddings i.e. assign a vector to each categorical value. Once you have the vectors, you may use them in any model which accepts numerical values. Each component of vector becomes an input variable. For example, if you used 3-D vectors to embed your categorical list of colors, you might get something like: red = (0, 1.5, -2.3), blue=(1, 1, 0) etc. You would use three input variables in your random forest corresponding to the three components. For red things, c1 = 0, c2 = 1.5, and c3 = -2.3. For blue things, c1 = 1, c2 = 1, and c3 = 0. – Pete – 2019-06-14T04:04:15.530

I totally get the concept since it is pretty simple. I mean how would this be done in the implementation? The demo notebook you link has a bit with a RandomForestRegressor at the end but I do not really see how this adds in the embeddings. – Keith – 2019-06-14T16:08:56.583

I think this may be a good example of code in Keras

– Keith – 2019-06-14T17:59:35.280


You can use dummy variables in such scenarios. With pandas' pandas.get_dummies you can create dummy variables for the strings you want to put into a decision tree or random forest.


import pandas as pd

# Build a toy frame with one numeric column and one string column
d = {'one': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
     'two': pd.Series(['Paul', 'John', 'Micheal', 'George'], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Replace column 'two' with one 0/1 dummy column per distinct name
df_with_dummies = pd.get_dummies(df, columns=["two"], drop_first=False)


Posted 2015-02-25T01:07:14.717

Reputation: 31


Turn them into numbers; for example, assign each unique country a unique number (like 1, 2, 3, ...).

Also, you don't need to use one-hot encoding (aka dummy variables) when working with random forests, because trees don't work like other algorithms (such as linear/logistic regression): they don't work with distances, they work by finding good splits on your features. So there is NO NEED for one-hot encoding.

Arash Jamshidi

Posted 2015-02-25T01:07:14.717

Reputation: 29

1 It actually depends on the particular algorithm that trains the tree. In particular, scikit-learn does NOT support categorical variables. – chuse – 2019-02-01T16:05:53.973