Aggregating target-encoded array-like categorical features?


I am trying to find commonly used techniques for handling high-cardinality, multi-valued categorical variables in machine learning classification algorithms.

One-hot encoding leads to very high dimensionality. The approach I've landed on is target encoding (mean encoding). I understand how to use this when the categorical feature is a single choice (e.g., current zip code). But when the feature can take on multiple values from a large list (e.g., favorite hobbies, illness symptoms, university coursework), I am not sure how to combine the encoded values.

My intuition says that the wrong approach would be to take each unique combination as its own factor and encode that, as it would lead to overfitting. Other things that come to mind would be simple aggregations like sum/avg/product/variance.
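As a minimal sketch of what those simple aggregations could look like, the snippet below encodes each individual value with a smoothed target mean and then collapses each row's list of encoded values into a few summary numbers. The function names (`fit_value_encoding`, `aggregate_row`), the additive smoothing scheme, and the toy hobby data are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pvariance


def fit_value_encoding(rows, targets, smoothing=1.0):
    """Smoothed target mean for each individual value (hypothetical helper)."""
    prior = mean(targets)
    total, count = defaultdict(float), defaultdict(int)
    for values, y in zip(rows, targets):
        for v in set(values):  # count a value once per row
            total[v] += y
            count[v] += 1
    # Shrink each value's mean toward the global prior.
    enc = {v: (total[v] + smoothing * prior) / (count[v] + smoothing) for v in count}
    return enc, prior


def aggregate_row(values, enc, prior):
    """Collapse the encoded values of one row into summary features."""
    # Unseen values and empty rows fall back to the global prior.
    scores = [enc.get(v, prior) for v in values] or [prior]
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "sum": sum(scores),
        "var": pvariance(scores),
    }


# Toy data: each row is a list of hobbies, the target is binary.
rows = [["chess", "hiking"], ["hiking"], ["chess", "painting"], ["painting"]]
targets = [1, 0, 1, 0]

enc, prior = fit_value_encoding(rows, targets, smoothing=1.0)
features = [aggregate_row(r, enc, prior) for r in rows]
```

Note that this fits the encoding on the same rows it then transforms, which leaks the target into the features; in practice the encoding would presumably be fitted out-of-fold.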

How should target encoded values be combined?


Posted 2019-04-09T18:41:03.810

Reputation: 320



There are several options:

  • Domain knowledge - Given what you know about the domain, combine the categories that make the most sense.

  • Empirical - Treat combining categories as a hyperparameter. Search through the space of options and pick the best combinations based on cross-validation score.
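The empirical option could be sketched roughly as follows, using only the Python standard library. Each candidate aggregation is scored with out-of-fold target encoding, and the one with the best cross-validated score is kept. Everything here is an assumption for illustration: the helper names, the synthetic data, the smoothing, and the use of the AUC of the single aggregated feature as a stand-in for the cross-validation score of a full model.

```python
import random
from statistics import mean

# Candidate ways of combining a row's encoded values (the "hyperparameter").
AGGREGATORS = {"mean": mean, "max": max, "min": min, "sum": sum}


def fit_encoding(rows, targets, smoothing=1.0):
    """Smoothed target mean per individual value (hypothetical helper)."""
    prior = mean(targets)
    stats = {}
    for values, y in zip(rows, targets):
        for v in set(values):
            s = stats.setdefault(v, [0.0, 0])
            s[0] += y
            s[1] += 1
    enc = {v: (t + smoothing * prior) / (n + smoothing) for v, (t, n) in stats.items()}
    return enc, prior


def auc(scores, targets):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, targets) if y == 1]
    neg = [s for s, y in zip(scores, targets) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cv_score(rows, targets, agg, n_folds=5, seed=0):
    """Out-of-fold score: the encoding is fitted without the held-out rows."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    oof = [0.0] * len(rows)
    for k in range(n_folds):
        held_out = set(idx[k::n_folds])
        train = [i for i in idx if i not in held_out]
        enc, prior = fit_encoding([rows[i] for i in train], [targets[i] for i in train])
        for i in held_out:
            scores = [enc.get(v, prior) for v in rows[i]] or [prior]
            oof[i] = agg(scores)
    return auc(oof, targets)


# Synthetic data: the target is 1 when a row contains any "risky" value.
rng = random.Random(42)
vocab = [f"v{i}" for i in range(30)]
risky = set(vocab[:5])
rows = [rng.sample(vocab, rng.randint(1, 4)) for _ in range(300)]
targets = [int(any(v in risky for v in r)) for r in rows]

results = {name: cv_score(rows, targets, agg) for name, agg in AGGREGATORS.items()}
best = max(results, key=results.get)
print(results, "best:", best)
```

In a real pipeline the aggregated column would be fed to the actual classifier and scored with its own cross-validation metric; several aggregations can also be kept side by side as separate features rather than choosing only one.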

Brian Spiering

Posted 2019-04-09T18:41:03.810

Reputation: 10 864