Target mean encoding worse than ordinal encoding with GBDT (XGBoost, CatBoost)



I have a dataset of 23k rows with an imbalanced target (85/15 ratio) and 10 variables, 9 of which are categorical. I'm using CatBoost and XGBoost for binary classification. I applied cross-validated (a 5-iteration loop) mean target encoding to the categorical variables and got a certain accuracy. Ordinal encoding of the same categorical features gives better accuracy than the mean encoding.

How is that possible? If my understanding is correct, mean target encoding does not only encode 'object'-type variables numerically; it also orders the categories by their impact on the target, and the differences between the encoded values reflect that impact as well.

Why do GBDTs perform better on a randomly encoded variable than on a "well-encoded" one? Is it overfitting? Do GBDTs (CatBoost/XGBoost) handle ordinal encoding well enough that mean encoding is not needed? Or is it something else?

Here's how I'm doing cross-validated mean encoding, with a smoothing value of alpha = 10:

Edit: I got a slightly better result by increasing the smoothing value to 30, but mean encoding still underperforms ordinal encoding.

    ## inside the loop
    # NOTE: the per-category statistics are computed on X_val itself,
    # the same fold they are then mapped onto.
    means1 = X_val.groupby(column1).bank_account.agg('mean')
    nrows1 = X_val.groupby(column1).size()
    # Smoothed mean: shrink each category mean toward the global mean by alpha
    score1 = (means1 * nrows1 + globalmean * alpha) / (nrows1 + alpha)
    X_val.loc[:, encoded_column1] = X_val[column1].map(score1)

    ## after the loop: average the encodings for each category across all folds
    ## and update the new encoded column
    meanz1 = train_new.groupby(column1)[encoded_column1].mean()
    train_new[encoded_column1] = train_new[column1].map(meanz1)
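
For comparison, here is a minimal sketch of out-of-fold smoothed target encoding, where each fold is encoded with statistics computed only from the other folds. This is one common variant, not necessarily what the loop above does. The function name `oof_target_encode`, the random fold assignment, the demo frame, and the `city` column are all hypothetical; only `bank_account` and `alpha` come from the question.

```python
import numpy as np
import pandas as pd


def oof_target_encode(df, column, target, alpha=10, n_splits=5, seed=0):
    """Out-of-fold smoothed target-mean encoding of df[column] (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Hypothetical fold assignment: shuffle row positions, then deal them into folds.
    fold_id = rng.permutation(len(df)) % n_splits
    encoded = np.full(len(df), np.nan)
    for k in range(n_splits):
        val_mask = fold_id == k
        fit_part = df[~val_mask]  # rows used to compute the statistics
        global_mean = fit_part[target].mean()
        stats = fit_part.groupby(column)[target].agg(["mean", "count"])
        # Same smoothing formula as in the question, fitted on the other folds only.
        score = (stats["mean"] * stats["count"] + global_mean * alpha) / (
            stats["count"] + alpha
        )
        # Levels unseen in the fitting folds fall back to the global mean.
        encoded[val_mask] = (
            df.loc[val_mask, column].map(score).fillna(global_mean).to_numpy()
        )
    return pd.Series(encoded, index=df.index, name=f"{column}_te")


# Synthetic demo data, for illustration only.
demo_rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "city": demo_rng.choice(["a", "b", "c"], size=200),
        "bank_account": demo_rng.integers(0, 2, size=200),
    }
)
demo_df["city_te"] = oof_target_encode(demo_df, "city", "bank_account", alpha=10)
```

With this variant, no row's own target value contributes to its own encoded value, which is the usual motivation for doing the encoding inside a cross-validation loop.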


Posted 2019-08-22T17:10:33.457

As a first guess, target-mean encoding makes the model overfit more readily? – Ben Reiniger – 2019-08-22T18:53:15.700

How do you combat that? More smoothing? – Blenz – 2019-08-23T00:36:21.967

And does that mean that ordinal encoding of categorical features is handled well by those classifiers? Does the classifier ignore the numerical relationship between the numerically encoded values and treat them as independent? – Blenz – 2019-08-23T00:58:37.393

Tree models (er, at least the most common ones) never care about the numerical values of features, only their relative ordering, since splits are made as "X<=a vs X>a". In the random ordinal encoding then, we get splits of the categorical levels into two sets, but not every bipartition is possible, and the chosen ordering will affect the result. (Some tree models can split levels truly independently, but not XGBoost and not [I think] CatBoost.) – Ben Reiniger – 2019-08-23T02:20:59.393
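
To make the point about bipartitions concrete, here is a small pure-Python sketch. It counts which two-way groupings of the levels a single threshold split "X<=a vs X>a" can reach under a given ordinal ordering, and compares that with all possible groupings. The helper names and the four-level example are hypothetical.

```python
from itertools import combinations


def canonical(side, levels):
    """Represent a bipartition by the side that contains the first level."""
    side = frozenset(side)
    return side if levels[0] in side else frozenset(levels) - side


def threshold_bipartitions(order, levels):
    """Bipartitions reachable by one 'X <= a' split when levels are coded in `order`."""
    return {canonical(order[:i], levels) for i in range(1, len(order))}


def all_bipartitions(levels):
    """Every way to split the levels into two non-empty groups."""
    parts = set()
    for r in range(1, len(levels)):
        for side in combinations(levels, r):
            parts.add(canonical(side, levels))
    return parts


levels = ["a", "b", "c", "d"]
wanted = canonical({"a", "c"}, levels)  # the grouping {a, c} vs {b, d}

print(len(all_bipartitions(levels)))                                   # 7
print(len(threshold_bipartitions(["a", "b", "c", "d"], levels)))       # 3
print(wanted in threshold_bipartitions(["a", "b", "c", "d"], levels))  # False
print(wanted in threshold_bipartitions(["a", "c", "b", "d"], levels))  # True
```

A single split only reaches k-1 of the 2^(k-1)-1 groupings for k levels. A tree can still isolate other groupings with several successive splits, so the ordering mostly affects how many splits are needed.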

More smoothing is worth trying at least. Possibly lump small categorical levels together before their target mapping. How many levels do your categoricals have, and how are the data distributed among them? (I'll still leave this as a comment, as I'm not sure I've got the right underlying problem.) – Ben Reiniger – 2019-08-23T02:28:40.107
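
A minimal sketch of lumping small levels together before target mapping, assuming pandas. The function name `lump_rare_levels`, the `min_count=50` threshold, and the `"OTHER"` label are hypothetical choices, not values from this thread.

```python
import pandas as pd


def lump_rare_levels(series, min_count=50, other_label="OTHER"):
    """Replace levels seen fewer than `min_count` times with a single shared label."""
    counts = series.value_counts()
    rare = counts.index[counts < min_count]
    # Keep values that are not rare; replace the rest with `other_label`.
    return series.where(~series.isin(rare), other_label)


# Synthetic example: two frequent levels and two rare ones.
raw_levels = pd.Series(["a"] * 100 + ["b"] * 60 + ["c"] * 5 + ["d"] * 3)
lumped = lump_rare_levels(raw_levels)
print(lumped.value_counts().to_dict())  # {'a': 100, 'b': 60, 'OTHER': 8}
```

The lumped column can then go through the same target mapping as any other categorical, with the rare levels sharing one encoded value.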

I think you've got a point with your reasoning about categorical levels. I have 3 variables with two categories each; the rest of my variables have 4 to 7 categories. I also have 2 numerical variables that I'm binning for this purpose. – Blenz – 2019-08-23T08:24:18.670

Hrm, that's not as many levels as I would expect to be contributing to overfitting in this way... – Ben Reiniger – 2019-08-23T14:24:59.953

Still, you're right in a way with that reasoning, because the distribution of the categories is very unbalanced for some variables: 10k occurrences for one category and 15 for another. I figured that this category imbalance could cause overfitting, so I replaced the very rare categories with the most frequent one. Still the same problem. – Blenz – 2019-08-23T14:28:00.720

But the issue here is that, overfitting or not, XGBoost is able to ignore the numerical differences between the randomly encoded values, and that's the opposite of what I've always heard on forums. – Blenz – 2019-08-23T14:29:03.627

Try sklearn's categorical-encoder. You can also try increasing the fold count, or use inner folds. Check this out. – silverstone – 2019-08-23T20:06:17.573

@Blenz: not sure this will be helpful, but I have found target encoding to be of little or no help with advanced tree methods such as XGBoost. My reasoning was that XGBoost would effectively compute such target means and split the variables accordingly... You would merely gain one or two splits. I'm quite surprised it had such an impact for you. How much did it affect accuracy? – lcrmorin – 2020-03-19T23:25:47.113

Not much, honestly. This is an old post; I remember I eventually used ordinal encoding, as target encoding was performing worse. – Blenz – 2020-03-20T13:44:47.090

No answers