I have a categorical variable with thousands of distinct values in a dataset with millions of records. The data is being used to build a binary classification model. I am in the early steps of feature selection, but I am trying out Random Forest, Boosted Trees, and Logistic Regression to see what works.
If I compute the frequency of each category and sort by it, I see that about 50 values make up the top 80%. Is it valid to condense this feature into a binary indicator of whether or not the value is in that top set? By 'valid', I mean: is this sort of transformation likely to retain any useful information for a model? My concern is that sorting categorical values which have no inherent order introduces incorrect assumptions.
The frequency distribution looks a little like this:

A;10%
D;5%
E;1.2%
B;1.1%
...
Z;0.004%
W;0.0037%
...
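To make the idea concrete, here is a minimal sketch (with made-up data and column names) of the transformation described above: rank categories by frequency, keep the ones whose cumulative share stays within 80%, and flag membership in that set. Note that with `cumsum() <= 0.80` the category that crosses the 80% line is excluded; that cutoff choice is a judgment call.

```python
import pandas as pd

# Hypothetical long-tailed categorical column (real data has thousands of values).
df = pd.DataFrame({"cat_var": ["A"] * 7 + ["B"] * 4 + ["C"] * 2 + ["D"]})

# Relative frequency of each category, sorted most frequent first.
freq = df["cat_var"].value_counts(normalize=True)

# Categories whose cumulative frequency stays within the top 80%.
top = set(freq[freq.cumsum() <= 0.80].index)

# Binary feature: is this row's category in the high-frequency set?
df["cat_var_top80"] = df["cat_var"].isin(top).astype(int)
```

With this toy data, A and B together cover about 79% of rows, so they form the top set and all other categories map to 0.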
Going one step further, is it valid to profile each class in my dataset and do the same? Say categories A-F comprise the top 80% of class 0 and categories D-H comprise the top 80% of class 1. I would convert:
data_id;cat_var
1;B
2;F
3;H
4;Z

to:

data_id;cat_var_top80class0;cat_var_top80class1
1;1;0
2;1;1
3;0;1
4;0;0
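The per-class variant could be sketched like this. The two sets are hard-coded here to match the example (A-F for class 0, D-H for class 1); in practice they would be pre-computed from the training split only, e.g. with the commented-out helper calls, to avoid leaking the target into the features.

```python
import pandas as pd

def top80_categories(series: pd.Series) -> set:
    """Categories whose cumulative frequency stays within the top 80%."""
    freq = series.value_counts(normalize=True)
    return set(freq[freq.cumsum() <= 0.80].index)

# Computed on the training split only, e.g.:
#   top0 = top80_categories(train.loc[train["target"] == 0, "cat_var"])
#   top1 = top80_categories(train.loc[train["target"] == 1, "cat_var"])
# Hard-coded here to match the example above.
top0 = set("ABCDEF")  # top 80% of class 0
top1 = set("DEFGH")   # top 80% of class 1

# The same four rows as the example table.
df = pd.DataFrame({"data_id": [1, 2, 3, 4], "cat_var": ["B", "F", "H", "Z"]})

# One membership flag per class-conditional top-80% set.
df["cat_var_top80class0"] = df["cat_var"].isin(top0).astype(int)
df["cat_var_top80class1"] = df["cat_var"].isin(top1).astype(int)
```

This reproduces the converted table above: row 2 (F) is in both sets, row 4 (Z) in neither.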
I am adding a picture to hopefully clear up this idea. In yellow are the pre-calculated distributions of cat_var (***_id in the picture) for classes 0 and 1, based on the training set. The right side shows how the transformation would be applied: