I have an unbalanced dataset of 23k rows (85/15 class ratio) with 10 variables, 9 of which are categorical, and I'm using CatBoost and XGBoost for binary classification. I applied cross-validated (5-fold loop) mean target encoding to the categorical variables and got a certain accuracy. Ordinal encoding of the same categorical features gives better accuracy than the mean encoding. How is that possible?

If my understanding is correct, mean target encoding doesn't just numerically encode 'object'-type variables: it orders the categories by their impact on the target, and the distances between the encoded values also reflect that impact. Why would GBDTs perform better on an arbitrarily encoded variable than on a "well-encoded" one? Overfitting? Do GBDTs (CatBoost/XGBoost) handle ordinal encoding well enough that mean encoding isn't needed? Or something else?
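To make the contrast concrete, here is a toy illustration (made-up data, not my real dataset) of the two encodings: ordinal encoding assigns arbitrary integer codes, while mean target encoding replaces each category with its per-category target mean.

```python
import pandas as pd

# Toy data: a categorical column and a binary target.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A"],
    "bank_account": [1, 0, 1, 0, 0, 0],
})

# Ordinal encoding: category -> arbitrary integer code (alphabetical here).
df["city_ordinal"] = df["city"].astype("category").cat.codes

# Mean target encoding: category -> mean of the target within that category.
df["city_target"] = df["city"].map(df.groupby("city")["bank_account"].mean())
print(df)
```

The ordinal codes carry no information about the target, while the target-encoded values order the categories by their observed event rate.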
Here's how I'm doing cross-validated mean encoding, with a smoothing value of alpha = 10:
Edit: I got a slightly better result by increasing the smoothing value to 30, but mean encoding is still underperforming compared to ordinal encoding.
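For reference, the smoothing used below is score = (mean * n + global_mean * alpha) / (n + alpha), which pulls rare categories toward the global mean. A quick check with made-up numbers (the 0.15 global mean matches my 85/15 class ratio; everything else is illustrative):

```python
def smoothed_score(m, n, global_mean, alpha):
    """Smoothed category score: in-category mean shrunk toward the global mean."""
    return (m * n + global_mean * alpha) / (n + alpha)

# A rare category (n=5) with an in-category mean of 1.0 is pulled hard
# toward the 0.15 global mean:
print(smoothed_score(1.0, 5, 0.15, 10))     # (5 + 1.5) / 15, roughly 0.433

# A frequent category (n=2000) with the same mean barely moves:
print(smoothed_score(1.0, 2000, 0.15, 10))  # roughly 0.996
```

Raising alpha from 10 to 30 just shrinks every category harder toward the global mean.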
```python
## inside the loop
means1 = X_val.groupby(column1).bank_account.mean()
nrows1 = X_val.groupby(column1).size()
score1 = (means1 * nrows1 + globalmean * alpha) / (nrows1 + alpha)
X_val.loc[:, encoded_column1] = X_val[column1].map(score1)

## After the loop is over, I average the encodings for each category
## across all folds and update the value for my new encoded column
meanz1 = train_new.groupby(column1)[encoded_column1].mean()
train_new[encoded_column1] = train_new[column1].map(meanz1)
```