## Encoding categorical variables using likelihood estimation


I am trying to understand how I can encode categorical variables using likelihood estimation, but have had little success so far.

Any suggestions would be greatly appreciated.

What do we do at prediction time, when we don't have the target label? – Ranjeet Singh – 2018-03-06T10:30:35.647

26

I was learning about this topic too, and this is what I found:

• This type of encoding is called likelihood encoding, impact coding or target coding.

• The idea is to encode your categorical variable using the target variable (continuous or categorical, depending on the task). For example, in a regression task you can encode the categorical variable with the mean of the target: for every category, you calculate the mean of the target over the rows in that category and replace the category value with this mean.

• In a classification task, you calculate the relative frequency of the target class within each category value.

• From a mathematical point of view, this encoding is an estimate of the probability of your target, conditional on each category value.

• If you do it the simple way, as I described above, you will probably get a biased estimate, because each row's own target value contributes to its encoding. That's why the Kaggle community usually uses 2 levels of cross-validation. Read this comment by raddar here. The corresponding notebook is here.

The quote:

It's taking mean value of y. But not plain mean, but in cross-validation within cross-validation way;

Let's say we have 20-fold cross validation. we need somehow to calculate mean value of the feature for #1 fold using information from #2-#20 folds only.

So, you take #2-#20 folds, create another cross validation set within it (i did 10-fold). calculate means for every leave-one-out fold (in the end you get 10 means). You average these 10 means and apply that vector for your primary #1 validation set. Repeat that for remaining 19 folds.

It is tough to explain, hard to understand and to master :) But if done correctly it can bring many benefits:)
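To make the quote a bit more concrete, here is a minimal dependency-free sketch of one possible reading of it. It is not raddar's actual code: the function names, the interleaved (unshuffled) fold split, the fold counts and the fallback to the training-fold mean for categories not seen in the training folds are all my assumptions.

```python
from collections import defaultdict

def category_means(cats, ys):
    """Mean of the target for every category value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def fold_indices(n, k):
    """Split range(n) into k interleaved folds (no shuffling, an assumption)."""
    return [list(range(i, n, k)) for i in range(k)]

def nested_target_encode(cats, ys, outer_k=5, inner_k=3):
    """Encode each outer fold using only the other outer folds (outer_k >= 2)."""
    n = len(cats)
    encoded = [None] * n
    for val_idx in fold_indices(n, outer_k):
        val_set = set(val_idx)
        train_idx = [i for i in range(n) if i not in val_set]
        # Fallback for categories unseen in the training folds (assumption).
        fallback = sum(ys[i] for i in train_idx) / len(train_idx)
        # Inner loop: one table of means per "all inner folds but one" subset.
        tables = []
        for held_out in fold_indices(len(train_idx), inner_k):
            held = set(held_out)
            keep = [train_idx[j] for j in range(len(train_idx)) if j not in held]
            tables.append(category_means([cats[i] for i in keep],
                                         [ys[i] for i in keep]))
        # Average the inner means and apply them to the outer validation fold.
        for i in val_idx:
            vals = [t[cats[i]] for t in tables if cats[i] in t]
            encoded[i] = sum(vals) / len(vals) if vals else fallback
    return encoded

# Hypothetical toy data with a binary target.
cats = ["a", "b"] * 10
ys = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0] * 2
encoded = nested_target_encode(cats, ys)
```

The point of the construction is that the encoded value of a row never uses that row's own target, which is what removes the leakage of the simple approach.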

• Another implementation of this encoding is here.

• The R library vtreat has an implementation of impact encoding. See this post.

• The CatBoost library has a lot of options for categorical variable encoding, including target encoding.

• There is no such encoding in sklearn yet.

UPDATE: There is a nice package for sklearn models and pipelines! https://github.com/scikit-learn-contrib/category_encoders
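For reference, the simple (unregularized) version described in the first bullets can be sketched in a few lines of plain Python. The names `fit_target_means` and `apply_target_means` and the toy data are hypothetical, and sending unseen categories to the global mean is just one common choice, not something prescribed above.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Return (per-category mean of the target, global mean of the target)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def apply_target_means(categories, means, global_mean):
    """Replace each category by its learned mean; unseen ones get the global mean."""
    return [means.get(cat, global_mean) for cat in categories]

# Hypothetical training data; with a 0/1 target each mean is a relative frequency.
train_city = ["a", "a", "b", "b", "b", "c"]
train_y = [1, 0, 1, 1, 0, 0]
means, global_mean = fit_target_means(train_city, train_y)

# New data has no target: the mapping learned on the training set is reused.
encoded_new = apply_target_means(["a", "c", "d"], means, global_mean)
```

With pandas the same thing is usually written as `df.groupby(col)[target].mean()` followed by `.map()` on the column. Either way the mapping is learned on the training data only, so no target label is needed at prediction time.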

1

There is Target Encoding in Sklearn-contrib Category Encoders

– josh – 2018-04-13T10:28:20.093

How would you implement feature interaction in case you use target encoding? For example, you target-encoded F1 and F2. Would you just multiply the encoded values F1*F2? – Michael Larionov – 2018-07-17T19:45:55.877

If you calculate the mean for each LOO fold and then take the average of them, it is exactly the same as taking the mean of folds #2-#20, so I don't see why this can be considered CV. Also, I don't understand what he means by "vector" when he averages those 10 means. – SiXUlm – 2019-02-20T10:36:16.817

A late comment; the target encoding in Category Encoders is a simple mean encoding; it does not perform the folds-within-folds regularization described by raddar. – Dan Scally – 2019-03-24T07:40:57.393

8

Target encoding is now available for sklearn models and pipelines through the category_encoders package (a scikit-learn-contrib project, not part of sklearn itself).

Target Encoder

class category_encoders.target_encoder.TargetEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute', min_samples_leaf=1, smoothing=1)

Target Encode for categorical features. Based on leave one out approach.

As noted by josh in the comment above.
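The signature above has a `smoothing` parameter. I am not reproducing the library's formula here; the sketch below only illustrates the general idea with a generic additive (m-estimate) smoothing, where the function name and the prior weight `m` are hypothetical. Categories with few rows are pulled towards the global mean.

```python
from collections import defaultdict

def smoothed_target_means(categories, targets, m=2.0):
    """Shrink each category mean toward the global mean.

    m is a hypothetical prior weight: it acts like m extra rows
    whose target equals the global mean.
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (sums[cat] + m * prior) / (counts[cat] + m) for cat in sums}

# Hypothetical toy data: "a" has four rows, "b" has a single row.
smoothed = smoothed_target_means(["a", "a", "a", "a", "b"], [1, 1, 1, 0, 0], m=2.0)
```

Here the raw mean of "b" would be 0.0 based on a single row; after smoothing it moves much closer to the global mean of 0.6 than the four-row category "a" does.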

1

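Likelihood encoding is still not available in scikit-learn. You can do it by creating a dictionary that maps each category to its encoded value and then replacing the values in the column. With pandas, `map` does the replacement in one step (the column name `category` and the sample rows are placeholders):

    import pandas as pd

    df = pd.DataFrame({'category': ['cate1', 'cate3', 'cate4']})  # placeholder data
    dict1 = {'cate1': 1, 'cate2': 1, 'cate3': 2, 'cate4': 2}
    # Replace every category with its encoded value; no explicit loop is needed.
    df['category'] = df['category'].map(dict1)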