## Instead of one-hot encoding a categorical variable, could I profile the data and use the percentile value from its cumulative density distribution?


I have a categorical variable which has thousands of values, for a dataset which has millions of records. The data is being used to create a binary classification model. I am in the early steps of feature selection, but I am trying out Random Forest, Boosted Trees, and Logistic Regression to see what works.

If I find the frequency of each category and sort it, I see that about 50 values make up the top 80%. Is it valid to condense this feature into a binary flag indicating whether the value is in that set or not? By 'valid', I mean: is it likely that this sort of transformation retains any useful information for a model? My concern is that sorting these categorical values, which have no inherent order, introduces incorrect assumptions.

The frequency distribution looks a little like this:

```
A;10%
D;5%
E;1.2%
B;1.1%
...
Z;0.004%
W;0.0037%
...
```
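A minimal sketch of that condensation in pandas (the column name `cat_var`, the toy data, and the 80% cutoff are assumptions for illustration, not my real dataset):

```python
import pandas as pd

# Toy data standing in for the real categorical column
df = pd.DataFrame({"cat_var": list("AAAAAADDDEEZ")})

# Relative frequency of each category, most frequent first
freq = df["cat_var"].value_counts(normalize=True)
cum = freq.cumsum()

# Keep every category needed to reach 80% cumulative coverage
top_set = set(cum[cum.shift(fill_value=0) < 0.80].index)

# Binary flag: is the value in the high-frequency set?
df["cat_var_top80"] = df["cat_var"].isin(top_set).astype(int)
```

The `shift` trick includes the category that crosses the 80% line, so the selected set actually covers at least 80% of records.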


Going one step further, is it valid to profile each class in my dataset and do the same? Say Categories A-F comprise the top 80% of class 0 and Categories D-H are the top 80% of class 1. I would convert:

```
data_id;cat_var
1;B
2;F
3;H
4;Z
```


to

```
data_id;cat_var_top80class0;cat_var_top80class1
1;1;0
2;1;1
3;0;1
4;0;0
```
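The per-class variant could be sketched like this (the `top80_for` helper name, the `label` column, and the toy data are my own illustrative choices):

```python
import pandas as pd

train = pd.DataFrame({
    "cat_var": list("AABBFFHHZZ"),
    "label":   [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
})

def top80_for(frame, label, cutoff=0.80):
    """Categories covering the top `cutoff` share of one class's records."""
    freq = frame.loc[frame["label"] == label, "cat_var"].value_counts(normalize=True)
    cum = freq.cumsum()
    return set(cum[cum.shift(fill_value=0) < cutoff].index)

top0 = top80_for(train, 0)
top1 = top80_for(train, 1)

train["cat_var_top80class0"] = train["cat_var"].isin(top0).astype(int)
train["cat_var_top80class1"] = train["cat_var"].isin(top1).astype(int)
```

One caveat: the two sets should be computed on the training split only and then reused on validation/test data, otherwise the encoding leaks label information.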


Adding a picture to hopefully clear up this idea. In yellow are the pre-calculated distributions of cat_var (***_id in the picture) for classes 0 and 1, based on the training set. The right side shows how the transformation would be applied:

The percentile approach is definitely worth trying. A percentile is the value of the CDF of the frequency/density distribution. I have seen people use this frequency concept to generate new features. Since it works there, it is probably useful in your case too. In the working solutions I have seen, the approach is to create features as counts grouped by the categories. Frequency is proportional to the counts, so it should work the same way. For example, with three categorical features A, B, C, new features can be count_A_by_B_C, count_B_by_A_C, and count_C_by_A_B. Another approa – Diansheng – 2018-04-04T03:04:12.107
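One reading of the grouped-count features mentioned in this comment, sketched in pandas (the column names and toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("xxyy"),
    "B": list("uuuv"),
    "C": list("mnmn"),
})

# count_A_by_B_C: for each row, how many records share its (B, C) pair
df["count_A_by_B_C"] = df.groupby(["B", "C"])["A"].transform("count")
```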


(Edited after @D.W.'s suggestion.)

To the best of my knowledge, there is nothing wrong with what you have in mind; it is certainly valid. As you said, you have to try out all the approaches you can think of and see which one works best. The most important point when adopting a particular encoding is to be aware that a certain amount of information gets lost, and how much varies from one case to another depending on the problem at hand. For example, with binary coding based on high/low-frequency sublevels, a lot of information (detail) is lost that could have helped the algorithm classify. I like your idea of percentile coding based on the cumulative density distribution. You may also want to look at quantile-based discretization, which is available as pandas.qcut.
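A rough illustration of the percentile idea: frequency-encode the category, then bin the frequencies with `pandas.qcut` (the bin count of 4 and the toy data are arbitrary choices):

```python
import pandas as pd

df = pd.DataFrame({"cat_var": list("AAAAAADDDEEZ")})

# Replace each category with its relative frequency (the "percentile" idea)
freq = df["cat_var"].value_counts(normalize=True)
df["cat_var_freq"] = df["cat_var"].map(freq)

# Quantile-based discretization: bin the frequencies into (up to) 4 quantile bins;
# duplicates="drop" handles repeated bin edges from tied frequencies
df["cat_var_bin"] = pd.qcut(df["cat_var_freq"], q=4, labels=False, duplicates="drop")
```

Each category maps to a single frequency, so every occurrence of a category lands in the same bin, and rarer categories get lower bin indices.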

The rest is my earlier answer, kept as it was (below). I intended to suggest trying other techniques on top of what you already had in mind, but apparently the message was not clear. Please note that I do not seek to get my answer marked as the final answer, as I am aware that it still does not fully answer your question; it is simply to brainstorm and exchange ideas at length. ;-)

Perhaps you have already dug up enough about ways to convert categorical variables into continuous data. In case you have not, and missed this answer, check it out. As discussed there, there are many ways to convert categorical to numerical data, and your problem is one of the hardest yet most common across many domains. You have high cardinality in your categorical variable and, as far as I understood, imbalanced distributions of its sublevels, and you are not sure whether messing up the order of those sublevels matters. You may want to try ordinal encoding (if order really matters) or the weight of evidence (WoE) transformation (see this blog post, for instance), which I have heard of but not tried, or even go beyond by mixing them in a meaningful way to represent your categorical data properly.
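A hand-rolled weight-of-evidence sketch, assuming a binary `label` column (the additive smoothing constant and toy data are my own additions, included to avoid division by zero for categories with an empty class):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat_var": list("AABBFFHH"),
    "label":   [0, 0, 0, 1, 1, 1, 1, 0],
})

# Per-category counts of events (label 1) and non-events (label 0)
counts = df.groupby("cat_var")["label"].agg(events="sum", total="count")
counts["non_events"] = counts["total"] - counts["events"]

eps = 0.5  # additive smoothing so empty cells don't blow up the log
pct_event = (counts["events"] + eps) / (counts["events"].sum() + eps)
pct_non   = (counts["non_events"] + eps) / (counts["non_events"].sum() + eps)

# WoE: log ratio of a category's share of events to its share of non-events
woe = np.log(pct_event / pct_non)
df["cat_var_woe"] = df["cat_var"].map(woe)
```

Categories dominated by class 1 get positive WoE, categories dominated by class 0 get negative WoE, and balanced ones sit near zero, so the encoding carries class-separation information into a single numeric column.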

What I have learned is that, despite all efforts in the field, this problem is still an open challenge in data science and machine learning. There is no single best solution or well-established method as far as I have checked. Please do let me know if you come across one.

@D.W. True. I understand that my answer does not directly answer the question. Yet I tried to convey the message that there is no valid/invalid way of doing categorical encoding. Yes, one can do binary coding based on high/low-frequency sublevels, but as was also mentioned, it may be the case that a lot of information gets lost. I suggested looking into more options beyond simple binary coding, definitely after trying what the user had in mind. Maybe this was not clear from my answer, but this was simply my intention (I may edit my answer to make this point clearer!). Cheers – TwinPenguins – 2018-04-05T07:55:59.610