Treating missing data in categorical features

3

I have a dataset in which one categorical column has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category of another column.

For example:

column1      column2
======================
Google       -
Google       -
Google       -
Google       -
Facebook     Image
Facebook     Video
Facebook     Image

My column of interest has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average or predicted, and those rows cannot be dropped either.

In such a situation, is it wise to treat the missing value '-' as a separate category in one-hot encoding? Or will this hurt my machine learning model?

Bharathi

Posted 2020-08-21T08:35:22.900

Reputation: 117

1 To me, it depends: you have to test both with and without the variable. Did you also try merging column 1 and column 2? (In your example, you could make 3 variables: Google, FacebookImage and FacebookVideo.) That is another thing you can try, to avoid having two highly correlated columns. – BeamsAdept – 2020-08-21T14:19:28.390
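A minimal sketch of the merging idea from the comment above, assuming the data sits in a pandas DataFrame with columns named `column1` and `column2` and that the '-' placeholder has been read in as NaN (the toy frame and the `combined` column name are illustrative assumptions, not part of the original post):

```python
import numpy as np
import pandas as pd

# Toy data mirroring the question; NaN stands in for the '-' placeholder (assumption).
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [np.nan] * 4 + ["Image", "Video", "Image"],
})

# Concatenate the two columns; rows with no column2 value keep column1 on its own.
df["combined"] = df["column1"] + df["column2"].fillna("")

# One indicator column per merged category: FacebookImage, FacebookVideo, Google.
combined_dummies = pd.get_dummies(df["combined"], dtype=int)
print(combined_dummies)
```

This yields one feature per combination instead of two correlated columns; whether it helps the model is something to check by experiment, as the comment suggests.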

Answers

3

You could break column 2 from your example into a number of indicator columns: Image, Video, ...

So the new features will be like:

Column1   Image  Video
Google      0      0
Google      0      0
Facebook    1      0
Facebook    0      1

Shiv

Posted 2020-08-21T08:35:22.900

Reputation: 544

Can we follow this method for all kinds of categorical columns? – Vikas Ukani – 2020-09-19T05:23:26.497

Suppose there is a categorical feature with too many unique values. This method breaks down in that case, right? – Vikas Ukani – 2020-09-19T05:24:40.880

2

You can try this:

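import pandas as pd

# Sample data from the question (assumed); None marks the missing column2 values
df = pd.DataFrame({'column1': ['Google'] * 4 + ['Facebook'] * 3,
                   'column2': [None] * 4 + ['Image', 'Video', 'Image']})
# dtype=int yields 0/1 columns (pandas 2.0+ defaults to True/False)
df_new = pd.get_dummies(df, columns=['column2'], dtype=int)
print(df_new)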

Output:

    column1  column2_Image  column2_Video
0    Google              0              0
1    Google              0              0
2    Google              0              0
3    Google              0              0
4  Facebook              1              0
5  Facebook              0              1
6  Facebook              1              0
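If you would rather keep "missing" as an explicit category, as the question asks, here is a hedged sketch of two ways to do it with pandas. It assumes the '-' placeholder is stored as NaN; the label `NotApplicable` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the question; NaN stands in for the '-' placeholder (assumption).
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [np.nan] * 4 + ["Image", "Video", "Image"],
})

# Option A: ask get_dummies for an extra indicator column that flags NaN rows.
with_nan_flag = pd.get_dummies(df, columns=["column2"], dummy_na=True, dtype=int)

# Option B: give the missing value a readable label first, then encode as usual.
labelled = df.assign(column2=df["column2"].fillna("NotApplicable"))
with_label = pd.get_dummies(labelled, columns=["column2"], dtype=int)

print(with_nan_flag)
print(with_label)
```

In this particular dataset the extra column is 1 exactly when `column1` is Google, so it carries no information beyond what `column1` already provides; it is worth comparing model results with and without it.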

Soumendra Mishra

Posted 2020-08-21T08:35:22.900

Reputation: 199

What if there are many unique values in column2, for instance Image, Video, PDF, DOC, Excel, Audio, etc.? – Vikas Ukani – 2020-09-19T05:26:35.353

1 It will work. For example, if you add a new value (Email), a new column will be added: column2_Email, column2_Image, column2_Video – Soumendra Mishra – 2020-09-19T06:37:46.863

Are there any disadvantages to having too many feature columns? Suppose I use this method and end up with 200+ features in my DataFrame. Is there a downside in that case? – Vikas Ukani – 2020-09-19T06:51:28.940

1 There are no performance issues. It all depends on your use case. – Soumendra Mishra – 2020-09-19T06:54:32.633
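If the number of dummy columns does become a concern for your use case, one common option is to collapse infrequent categories into a single label before encoding. The sketch below is only an illustration: the sample values, the `min_count` threshold and the `Other` label are all assumptions, not something taken from the thread.

```python
import pandas as pd

# Hypothetical column with two frequent and several rare file types.
file_types = pd.Series(
    ["Image"] * 5 + ["Video"] * 4 + ["PDF", "DOC", "Excel", "Audio"],
    name="column2",
)

min_count = 2  # assumed threshold; tune it for the actual dataset

# Find the categories that occur fewer than min_count times.
counts = file_types.value_counts()
rare = counts[counts < min_count].index

# Keep frequent values as they are and replace every rare one with "Other".
lumped = file_types.where(~file_types.isin(rare), "Other")

# Three indicator columns instead of six.
dummies = pd.get_dummies(lumped, prefix="column2", dtype=int)
print(dummies.sum())
```

The trade-off is that the model can no longer tell the lumped categories apart, so the threshold is worth validating against model performance.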