When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?

I have been building models with categorical data for a while now, and in this situation I basically default to using scikit-learn's LabelEncoder to transform the data prior to building a model.

I understand the difference between OHE, LabelEncoder and DictVectorizor in terms of what they are doing to the data, but what isn't clear to me is when you might choose to employ one technique over another. Are there certain algorithms or situations in which one has advantages/disadvantages with respect to the others?
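For concreteness, here is a minimal sketch of what each of the three does to the same toy feature (assuming a scikit-learn version recent enough that OneHotEncoder accepts string inputs):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

animals = ["dog", "cat", "dog", "mouse", "cat"]

# LabelEncoder: one integer column; codes are assigned alphabetically
# (cat=0, dog=1, mouse=2), so the result is [1, 0, 1, 2, 0]
le_codes = LabelEncoder().fit_transform(animals)

# OneHotEncoder: one binary column per category; expects a 2-D input
ohe_matrix = OneHotEncoder().fit_transform([[a] for a in animals]).toarray()

# DictVectorizer: one-hot encodes string values straight from dicts,
# producing features like "animal=cat", "animal=dog", "animal=mouse"
dv_matrix = DictVectorizer(sparse=False).fit_transform(
    [{"animal": a} for a in animals]
)
```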

Many thanks.

anthr

Posted 2015-12-19T19:30:35.527

Reputation: 218

Answers

There are some cases where LabelEncoder or DictVectorizor are useful, but these are quite limited in my opinion due to ordinality.

LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
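A quick sketch of that spurious ordinality (note that scikit-learn assigns codes alphabetically, so the exact integers differ from the illustration above):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["dog", "cat", "dog", "mouse", "cat"])
# codes assigned alphabetically: cat=0, dog=1, mouse=2 -> [1, 0, 1, 2, 0]

# The spurious ordinality: averaging the codes for "dog" and "mouse"
# lands on a point of the number line, as if some category sat "between" them.
dog = le.transform(["dog"])[0]
mouse = le.transform(["mouse"])[0]
midpoint = (dog + mouse) / 2  # 1.5, a meaningless "in-between" value
```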

One-Hot-Encoding has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. The disadvantage is that for high-cardinality features, the feature space can blow up quickly and you start fighting the curse of dimensionality. In these cases, I typically employ one-hot encoding followed by PCA for dimensionality reduction. I find that the judicious combination of one-hot encoding plus PCA can seldom be beaten by other encoding schemes. PCA finds the linear overlap, so it will naturally tend to group similar categories into the same feature.
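A minimal sketch of that one-hot-plus-PCA combination, using a hypothetical high-cardinality feature (1000 rows drawn from 50 categories):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality categorical feature
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(1000, 1)).astype(str)

# One-hot encode: 50 binary columns, one per category
X_ohe = OneHotEncoder().fit_transform(X).toarray()

# PCA to shrink the blown-up feature space back down to 10 dense columns
X_reduced = PCA(n_components=10).fit_transform(X_ohe)
```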

Hope this helps!

AN6U5

Posted 2015-12-19T19:30:35.527

Reputation: 3 638

Thank you very much - this is very helpful and makes a lot of sense. Are there any other encoding schemes you use for specific/edge cases? Do you ever find that you're in a situation where you'll use different encoding schemes for different features? – anthr 2015-12-21T20:36:59.493

In reference to AN6U5's answer, and this statement:

Still there are algorithms like decision trees and random forests that can work with categorical variables just fine and LabelEncoder can be used to store values using less disk space.

Wouldn't using LabelEncoder transform a categorical feature into a numeric one, thereby causing a decision tree to perform splits at some value which doesn't really make sense, since the mapping is arbitrary?
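A quick sketch of what such a split looks like in practice, on hypothetical toy data where the target depends only on whether the animal is "dog":

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: label 1 iff the animal is "dog"
animals = ["dog", "cat", "dog", "mouse", "cat", "mouse"]
y = [1, 0, 1, 0, 0, 0]

# Codes assigned alphabetically: cat=0, dog=1, mouse=2
codes = LabelEncoder().fit_transform(animals).reshape(-1, 1)
tree = DecisionTreeClassifier(random_state=0).fit(codes, y)

# The tree splits on numeric thresholds over the arbitrary codes (here it
# needs two splits to isolate "dog", whose code happens to sit between
# "cat" and "mouse") -- the thresholds encode no real ordering.
```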

Nico

Posted 2015-12-19T19:30:35.527

Reputation: 31