## strings as features in decision tree/random forest

32


I am new to machine learning!

Right now I am doing some problems on an application of decision trees/random forests. I am trying to fit a problem which has numbers as well as strings (such as country names) as features. Now the library, scikit-learn, takes only numbers as parameters, but I want to inject the strings as well, since they carry a significant amount of knowledge.

How do I handle such a scenario?

I can convert the strings to numbers by some mechanism such as hashing in Python. But I would like to know the best practice for how strings are handled in decision tree problems.

Thanks for your support!

In the case of scikit-learn I have seen that we need to encode the categorical variables, else the fit method throws an error saying ValueError: could not convert string to float. Kar 2016-08-31T23:53:27.997

## Answers

29

In most well-established machine learning systems, categorical variables are handled naturally. For example, in R you would use factors, and in WEKA you would use nominal variables. This is not the case in scikit-learn. The decision trees implemented in scikit-learn use only numerical features, and these features are always interpreted as continuous numeric variables.

Thus, simply replacing the strings with a hash code should be avoided: because the feature is treated as a continuous numerical variable, any coding you use will induce an order which simply does not exist in your data.

One example: coding ['red','green','blue'] with [1,2,3] would produce weird things like 'red' being lower than 'blue', and averaging a 'red' and a 'blue' giving a 'green'. Another, more subtle example might happen when you code ['low', 'medium', 'high'] with [1,2,3]. In the latter case the ordering may make sense; however, subtle inconsistencies can still arise when 'medium' is not exactly in the middle of 'low' and 'high'.
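The averaging artifact above can be made concrete in a few lines of plain Python; the mapping below is arbitrary and purely for illustration:

```python
# A minimal sketch of the spurious ordering an integer coding induces.
# The mapping is arbitrary -- it does not come from any library.
codes = {'red': 1, 'green': 2, 'blue': 3}

# 'red' now compares as "lower" than 'blue': an artifact of the coding
assert codes['red'] < codes['blue']

# averaging a 'red' and a 'blue' lands exactly on 'green'
avg = (codes['red'] + codes['blue']) / 2
print(avg)  # 2.0, which is the code for 'green'
```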

Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red','green','blue'] with 3 columns, one for each category, having 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, one-of-k encoding or whatever. You can check the documentation here for encoding categorical features and feature extraction - hashing and dicts. Obviously one-hot encoding expands your space requirements, and sometimes it hurts performance as well.

1 It's the scikit-learn implementation that doesn't handle categorical variables properly. Recoding as this answer suggests is probably the best you can do. More serious users might look for an alternative package. SmallChess 2016-09-18T01:46:20.643

3 One can use sklearn.preprocessing.LabelBinarizer for one-hot encoding of a categorical variable. crackjack 2016-11-09T09:29:14.593

@rapaio I think binary coding is not the same as one-hot encoding. Binary coding is when you represent 8 categories with 3 columns, or between 9 and 16 categories with 4 columns, and so on. Am I wrong? Alok Nayak 2017-04-21T10:59:03.197

The patsy Python package will deal with one-hot encoding of categorical variables. http://patsy.readthedocs.io/en/latest/quickstart.html zhespelt 2017-05-17T18:13:22.763

1

Do not use LabelBinarizer, use sklearn.preprocessing.OneHotEncoder. If you are using pandas to import and pre-process your data, you can also do that directly using pandas.get_dummies. It sucks that scikit-learn does not support categorical variables.

Ricardo Cruz 2017-07-28T09:52:24.077

1-hot-K encoding is inefficient when there are many categorical features with a large number of unique values. This may result in the curse of dimensionality. chandresh 2017-12-22T08:15:59.733

17

While previous answers are correct in describing how categorical variables can be encoded with sets of binary variables (binarization or one-hot encoding), I'm surprised that no one has pointed out the obvious...

Neither decision trees nor random forests require one-hot encoding for categorical variables. This is one of the very convenient aspects of both decision tree and random forest classifiers: they can operate on integer features, floating point features, ordinal features, and categorical features, and on a heterogeneous combination of them.

So go ahead and use your categorical variables without encoding them. Decision trees have the added benefit that you can visualize them to see how the variables are oriented in the decision surface.

I know this question is quite old, but I wanted to add so that future readers are aware of this feature.

Thanks!

12

I just tried: neither decision tree nor random forest in scikit-learn works without one-hot encoding, at least as of scikit-learn 0.17.1. They both try to convert all values to float, as documented here.

Zebra Propulsion Lab 2016-04-08T22:43:45.797

9 Your statement that sklearn can accept categorical input variables appears to be incorrect. cammil 2016-05-04T15:54:49.340

2 In theory, decision trees and forests work fine with categorical values. But in sklearn, this is not so. It sucks to one-hot encode when you have dozens of categories. Ricardo Cruz 2016-05-25T14:33:41.117

15 If you're using other languages (R/SAS) then your answer is correct. However, the question is specific to scikit-learn, which does not accept categorical/string variables. Vishal 2016-06-08T21:11:12.313

For sklearn, you can map the string variables to numbers with df['string_feature'].map(my_map), then use the random forest. nos 2016-08-05T15:31:06.970

The random forest in scikit-learn will run with categorical variables coded as integers, but that is not the correct way to do it. You should one-hot encode categorical variables even if they are integers. If you have the categories [red, blue, green, yellow] and convert them to the integers [1,2,3,4], the decision tree algorithm in scikit-learn will interpret this as 1 < 2 < 3 < 4, which is not true for categorical data. denson 2017-08-05T11:50:52.140

8

You need to encode your strings as numeric features that scikit-learn can use for its ML algorithms. This functionality is handled in the preprocessing module (e.g., see sklearn.preprocessing.LabelEncoder for an example).
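As a quick sketch, LabelEncoder maps each distinct string to an integer index into the alphabetically sorted class labels. Note that, as rapaio's answer explains, this induces an ordering that nominal features do not actually have, so one-hot encoding is usually preferable:

```python
# Sketch: LabelEncoder maps strings to integer indices into the sorted
# class labels. Beware: for nominal features this induces an artificial
# ordering, so one-hot encoding is usually the better choice.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['paris', 'tokyo', 'paris', 'amsterdam'])

print(le.classes_)  # ['amsterdam' 'paris' 'tokyo']
print(codes)        # [1 2 1 0]
```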

2 rapaio explains in his answer why this would give an incorrect result. Keith 2017-04-25T16:14:48.900

2

You should usually one-hot encode categorical variables for scikit-learn models, including random forest. Random forest will often work ok without one-hot encoding but usually performs better if you do one-hot encode. One-hot encoding and "dummying" variables mean the same thing in this context. Scikit-learn has sklearn.preprocessing.OneHotEncoder and Pandas has pandas.get_dummies to accomplish this.

However, there are alternatives. The article "Beyond One-Hot" at KDnuggets does a great job of explaining why you need to encode categorical variables and alternatives to one-hot encoding.

There are alternative implementations of random forest that do not require one-hot encoding such as R or H2O. The implementation in R is computationally expensive and will not work if your features have many categories. H2O will work with large numbers of categories. Continuum has made H2O available in Anaconda Python.

This article has an explanation of the algorithm used in H2O. It references the academic paper A Streaming Parallel Decision Tree Algorithm and a longer version of the same paper.

Very helpful, thanks! ste_kwr 2017-10-04T21:21:33.050

1

You can use dummy variables in such scenarios. With pandas' pandas.get_dummies you can create dummy variables for the strings you want to feed into a decision tree or random forest.

Example:

```python
import pandas as pd

# Toy DataFrame with one numeric column and one string column
d = {'one': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
     'two': pd.Series(['Paul', 'John', 'Michael', 'George'],
                      index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Replace the string column 'two' with one 0/1 dummy column per category
df_with_dummies = pd.get_dummies(df, columns=["two"], drop_first=False)
print(df_with_dummies)
```