Target Encoding: missing value imputation before or after encoding


I want to apply target encoding to my categorical features, but I am not sure when to impute if any of them has missing values. Let's say I have a few continuous features, Cnt1-Cnt5 (without NAs), and two categorical features, Cat1 and Cat2, with Cat2 having missing values. Let's also assume that I want to use Random Forest (RF) as the imputation method. Which approach would be the correct one?

  1. Impute Cat2, treating Cat1 and Cnt1-Cnt5 as predictors in the RF, and then perform target encoding on the categorical variables.

  2. Target encode Cat1 and the non-missing values of Cat2, then build an RF and impute the missing values of Cat2 (which is now numeric, not categorical).

  3. Any other approach?

More generally: should missing values in any kind of variable (including continuous ones) be imputed before or after target encoding?

I see at least one benefit of imputing after target encoding: if the test data contains levels of a categorical variable that were not seen in training (which become NAs in the test set after target encoding), those NAs can be imputed by the RF built on the training data, without any potential error due to new levels.
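To make the unseen-level case concrete, here is a minimal sketch in plain Python. The toy data and the helper names `fit_target_encoding` and `apply_target_encoding` are made up for illustration and are not taken from any library; the encoding is the plain per-level mean of the target, without smoothing.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Return {level: mean target}, computed on the training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}


def apply_target_encoding(categories, mapping):
    """Encode each level; levels unseen in training become None (an NA)."""
    return [mapping.get(level) for level in categories]


# Hypothetical toy data: level "c" never appears in the training set.
train_cat2 = ["a", "a", "b", "b", "b"]
train_y = [1, 0, 1, 1, 0]
test_cat2 = ["a", "b", "c"]

mapping = fit_target_encoding(train_cat2, train_y)
encoded_test = apply_target_encoding(test_cat2, mapping)
print(encoded_test)  # the last entry is None, because "c" is an unseen level
```

The `None` in the encoded test column is exactly the kind of NA that an imputer fitted on the training data would then have to fill.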


Posted 2019-03-16T10:57:11.730




If you want to use TargetEncoder, you have to impute the missing values first.

  1. First, convert your categorical features to integers using LabelEncoder or OrdinalEncoder. Before running OrdinalEncoder, I filled the NaN values with a large numeric placeholder (my choice: 8888). Then cast your matrix to int, which is more efficient.

  2. For the imputation of missing values, you may use different strategies:

    2.1. Fill with the most frequent value in each feature (column); if you did the previous step, you do not need this. If you go on to apply target encoding, imputing with the most frequent value generally works well.

    2.2. Fill using IterativeImputer. The estimator argument lets you choose the model to use (I tested RandomForestRegressor, for numeric features). Warning: IterativeImputer only works with regressors. If you did the previous step, do not forget to pass the argument missing_values=8888.
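As a rough sketch of both strategies with scikit-learn, assuming the categorical column has already been converted to integer codes and that 8888 marks the missing entries: the toy matrix, the column layout, and the final rounding step are assumptions made for illustration. Treating category codes as a regression target is a simplification that follows from IterativeImputer working with regressors.

```python
import numpy as np
# IterativeImputer is experimental; this import must come before it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, SimpleImputer

SENTINEL = 8888  # placeholder used to mark missing values

# Hypothetical toy matrix: two continuous columns plus one integer-coded
# categorical column (codes 0, 1, 2) that depends on the continuous ones.
rng = np.random.default_rng(0)
cnt = rng.normal(size=(40, 2))
cat_codes = (cnt[:, 0] > 0).astype(int) + (cnt[:, 1] > 0).astype(int)
X = np.column_stack([cnt, cat_codes]).astype(float)
X[::5, 2] = SENTINEL  # mark every fifth category code as missing

# Strategy 2.1: replace the placeholder with the most frequent value per column.
mode_imputer = SimpleImputer(missing_values=SENTINEL, strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

# Strategy 2.2: iterative imputation with a random forest regressor.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    missing_values=SENTINEL,
    max_iter=5,
    random_state=0,
)
X_rf = rf_imputer.fit_transform(X)

# A regressor predicts real numbers, so round back to a valid category code.
X_rf[:, 2] = np.clip(np.rint(X_rf[:, 2]), 0, 2)
```

Either result (`X_mode` or `X_rf`) no longer contains the placeholder, so the categorical column can be passed on to the encoder.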

You can apply your TargetEncoder now.

Catalina Chircu

Posted 2019-03-16T10:57:11.730
