Mass convert categorical columns in Pandas (not one-hot encoding)

12

8

I have a pandas dataframe with tons of categorical columns, which I am planning to use in a decision tree with scikit-learn. I need to convert them to numerical values (not one-hot vectors). I can do it with LabelEncoder from scikit-learn. The problem is that there are too many of them, and I do not want to convert them manually.

What would be an easy way to automate this process?
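
For context, the manual per-column approach being avoided would look roughly like this (a sketch; col1 and col2 are hypothetical column names on the question's df):

from sklearn.preprocessing import LabelEncoder

# One encoder per column, applied by hand -- tedious with many columns
le1 = LabelEncoder()
df['col1'] = le1.fit_transform(df['col1'])

le2 = LabelEncoder()
df['col2'] = le2.fit_transform(df['col2'])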

user1700890

Posted 2016-09-18T16:45:15.647

Reputation: 335

The get_dummies function in pandas can help you. Check the documentation here for more details. I think it perfectly covers this use case and you can further tweak the behavior by supplying custom prefixes.

– hssay – 2016-09-19T04:51:47.960
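
A quick sketch of that suggestion (note this is one-hot encoding, which the question wants to avoid; the column names and prefix mapping here are purely illustrative):

import pandas as pd

# get_dummies with custom per-column prefixes (hypothetical columns a and b)
dummies = pd.get_dummies(df, columns=['a', 'b'], prefix={'a': 'colA', 'b': 'colB'})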

Answers

11

If your categorical columns are currently character/object you can use something like this to do each one:

char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index

for c in char_cols:
    df[c] = pd.factorize(df[c])[0]

If you need to be able to get back to the categories I'd create a dictionary to save the encoding; something like:

char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
label_mapping = {}

for c in char_cols:
    df[c], label_mapping[c] = pd.factorize(df[c])

Using Julien's MCVE (from his answer below), this will output:

In [3]: df
Out[3]: 
   a  b  c         d
0  0  0  0  0.155463
1  1  1  1  0.496427
2  0  0  2  0.168625
3  2  0  1  0.209681
4  0  2  1  0.661857

In [4]: label_mapping
Out[4]:
{'a': Index(['Var2', 'Var3', 'Var1'], dtype='object'),
 'b': Index(['Var2', 'Var1', 'Var3'], dtype='object'),
 'c': Index(['Var3', 'Var2', 'Var1'], dtype='object')}
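
If you later need to map the integer codes back to the original labels, the saved mapping can be applied in reverse. A minimal sketch, assuming the df and label_mapping from above and no missing values (pd.factorize encodes NaN as -1, which this does not handle):

# Replace each column's integer codes with the original labels
for c in label_mapping:
    df[c] = label_mapping[c].take(df[c])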

george_w_kush

Posted 2016-09-18T16:45:15.647

Reputation: 126

Your code to find the object columns is very useful. – StephenBoesch – 2018-08-22T20:32:56.577

6

First, let's create an MCVE (minimal, complete, verifiable example) to play with:

import pandas as pd
import numpy as np

In [1]: categorical_array = np.random.choice(['Var1', 'Var2', 'Var3'],
                                             size=(5, 3), p=[0.25, 0.5, 0.25])
        df = pd.DataFrame(categorical_array,
                          columns=[chr(97 + x) for x in range(categorical_array.shape[1])])
        # Add another column that isn't categorical but float
        df['d'] = np.random.rand(len(df))
        df

Out[1]:
      a     b     c         d
0  Var3  Var3  Var3  0.953153
1  Var1  Var2  Var1  0.924896
2  Var2  Var2  Var2  0.273205
3  Var2  Var1  Var3  0.459676
4  Var2  Var1  Var1  0.114358

Now we can use pd.get_dummies to encode the first three columns.

Note that I'm using the drop_first parameter because N-1 dummies are sufficient to fully describe N possibilities (e.g. if a_Var2 and a_Var3 are 0, then it's a_Var1). Also, I'm explicitly passing the columns here, but I don't have to: by default, all columns with dtype either object or category will be used (more below).

In [2]: df_encoded = pd.get_dummies(df, columns=['a', 'b', 'c'], drop_first=True)
        df_encoded
Out[2]:
          d  a_Var2  a_Var3  b_Var2  b_Var3  c_Var2  c_Var3
0  0.953153       0       1       0       1       0       1
1  0.924896       0       0       1       0       0       0
2  0.273205       1       0       1       0       1       0
3  0.459676       1       0       0       0       0       1
4  0.114358       1       0       0       0       0       0

In your specific application, you'll have to provide a list of columns that are categorical, or you'll have to infer which columns are categorical.

In the best-case scenario, your dataframe already has these columns with dtype=category, and you can pass columns=df.columns[df.dtypes == 'category'] to get_dummies.

Otherwise, I suggest setting the dtype of all other columns as appropriate (hint: pd.to_numeric, pd.to_datetime, etc.). You'll then be left with columns that have an object dtype, and those should be your categorical columns.
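
A minimal sketch of that workflow; the columns 'price' and 'when' are hypothetical, just to illustrate the dtype clean-up:

# Give non-categorical columns their proper dtypes first
df['price'] = pd.to_numeric(df['price'])
df['when'] = pd.to_datetime(df['when'])

# Whatever is still of object dtype is treated as categorical
cat_cols = df.columns[df.dtypes == 'object']
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)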

The pd.get_dummies parameter columns defaults as follows:

columns : list-like, default None
    Column names in the DataFrame to be encoded.
    If `columns` is None then all the columns with
    `object` or `category` dtype will be converted.
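
So with the example dataframe above, calling get_dummies without the columns argument would encode a, b and c automatically and leave the float column d untouched:

df_encoded = pd.get_dummies(df, drop_first=True)  # encodes a, b, c; keeps d as-is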

Julien Marrec

Posted 2016-09-18T16:45:15.647

Reputation: 240

2

In order to convert the types of multiple columns at once, I would use something like this:

df2 = df.select_dtypes(include=['type_of_interest'])

df2 = df2.apply(lambda x: x.astype('category'))

Then I would join them back to the original df.
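
A minimal sketch of that last step, assuming df2 holds the converted columns from above:

# Overwrite the original columns with their converted counterparts
df[df2.columns] = df2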

Ozkan Serttas

Posted 2016-09-18T16:45:15.647

Reputation: 121

I think df2[df2.columns] = df2[df2.columns].astype('category') does the same, no apply, no lambda. – paulperry – 2019-01-31T19:50:14.880