How to automate the encoding process?


I am working on the Boston challenge hosted on Kaggle and I'm still refining my features. Looking at the dataset, I realize that some columns need to be encoded in binary, some need to be encoded as integers (ranking them on a scale of n), and some need to be one-hot encoded. I've collected these columns and sorted them into distinct lists (at least based on my judgement of how their data should be encoded):

categorical_columns = ['MSSubClass', 'MSZoning', 'Alley', 'LandContour', 'Neighborhood', 'Condition1', 'Condition2',
                       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating',
                       'Functional', 'GarageType', 'PavedDrive', 'SaleType', 'SaleCondition']

binary_columns = ['Street', 'CentralAir']

ranked_columns = ['LotShape', 'Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
                  'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
                  'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

A fellow Stack Exchange user suggested that I use the pandas.get_dummies() function to one-hot encode categorical variables like MSZoning and assign the result to a variable like this:

OHE_MSZoning = pd.get_dummies(train['MSZoning'])

I'd like to know how I can automate this process using functions and control-flow statements such as a for loop.

Andros Adrianopolos

Posted 2019-06-12T08:57:30.523

Reputation: 322



I'm the fellow Stack Exchange user, hi! I wrote a function that applies one-hot encoding to each of your categorical_columns in turn:

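import pandas as pd

def serial_OHE(df, categorical_columns):

    # iterate on each categorical column
    for col in categorical_columns:

        # take one-hot encoding; prefix=col keeps names unique when two
        # columns share values (e.g. Condition1 and Condition2)
        OHE_sdf = pd.get_dummies(df[col], prefix=col)

        # drop the old categorical column from original df
        df.drop(col, axis=1, inplace=True)

        # attach one-hot encoded columns to original dataframe
        # (no ignore_index here: with axis=1 it would replace the column
        # names with integers and break df[col] on the next iteration)
        df = pd.concat([df, OHE_sdf], axis=1)

    return df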

So you can call it like this:

df = serial_OHE(df, categorical_columns)
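As a possible shortcut, pd.get_dummies also accepts a whole dataframe plus a columns argument, which encodes several columns in one call without an explicit loop. A minimal sketch on a made-up toy frame (the values and the name toy are only for illustration, not taken from the Kaggle data):

```python
import pandas as pd

# Toy frame standing in for the training data (values are invented)
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# columns= restricts the encoding to the listed columns; each new
# column is named <source column>_<value>, the rest are left as they are
encoded = pd.get_dummies(toy, columns=["MSZoning", "Street"])
print(encoded.columns.tolist())
```

The original columns listed in columns= are removed from the result, so no separate drop is needed.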

Let me know if there are any problems.


Posted 2019-06-12T08:57:30.523

Reputation: 4 928

So this is returning an entire new training dataset? – Andros Adrianopolos – 2019-06-12T10:47:33.303

it is returning a new dataset in which each categorical_column has been substituted by its one-hot encoded counterpart. – Leevo – 2019-06-12T10:55:05.173

Ah okay. Thank you very much. I'll accept and upvote your answer. If you think that this was a well asked question, could you give me an upvote as well? – Andros Adrianopolos – 2019-06-12T11:01:14.743

I gotta do this for binary and decimal encoding as well. How can I do all 3 in the same function? It just means that I need to reassign the dataset 3 times. – Andros Adrianopolos – 2019-06-12T11:29:22.203

It should work with the other lists as well, I think. Try calling df = serial_OHE(df, binary_columns) and df = serial_OHE(df, ranked_columns), and let me know if it works. – Leevo – 2019-06-12T12:41:13.823
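One-hot encoding the ranked columns would work mechanically but would discard their order, so one possible alternative is a single function that treats the three lists differently. The sketch below is only an illustration: encode_all and rank_orders are hypothetical names, and the worst-to-best order given for ExterQual is an assumption that should be checked against the competition's data description.

```python
import pandas as pd

def encode_all(df, onehot_cols, binary_cols, rank_orders):
    """Return a new frame with all three encodings applied.

    rank_orders maps a column name to its categories, worst to best.
    """
    out = df.copy()  # leave the caller's frame untouched

    # Ranked columns: replace each category by its position in the order.
    # Values missing from the order (including NaN) become NaN.
    for col, order in rank_orders.items():
        out[col] = out[col].map({cat: i for i, cat in enumerate(order)})

    # Binary columns: integer codes assigned in sorted order of the values.
    # Missing values get the code -1.
    for col in binary_cols:
        out[col] = out[col].astype("category").cat.codes

    # Nominal columns: one-hot encode, prefixed with the source column name.
    return pd.get_dummies(out, columns=onehot_cols)

# Toy data (invented values) to show the call
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "CentralAir": ["Y", "N", "Y"],
    "ExterQual": ["TA", "Gd", "Ex"],
})
result = encode_all(
    toy,
    onehot_cols=["MSZoning"],
    binary_cols=["CentralAir"],
    rank_orders={"ExterQual": ["Po", "Fa", "TA", "Gd", "Ex"]},  # assumed order
)
print(result)
```

With this layout the dataset is reassigned only once, and each ranked column needs its own entry in rank_orders because the scales differ from column to column.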

If you don't mind, could you break this line down for me: df.drop(col, axis=1, inplace=True)? I don't get how the parameters relate. – Andros Adrianopolos – 2019-06-13T05:44:43.823

Sure. The first part, df.drop(col), tells df to drop whatever is labelled col. The axis=1 argument says: "I want you to drop a column (axis=1) and not a row (that would be axis=0)". Finally, inplace=True means: "modify df itself instead of returning a new dataframe", i.e. the result of the drop replaces the original df. – Leevo – 2019-06-13T07:12:59.067
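A small sketch of the parameters described above, on a made-up two-column frame (the names toy, smaller and fewer_rows are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Without inplace, drop returns a new frame and leaves toy as it was
smaller = toy.drop("a", axis=1)
print(toy.columns.tolist())      # still has both columns
print(smaller.columns.tolist())  # only "b" remains

# axis=0 looks for a row label instead: this drops the row with index 0
fewer_rows = toy.drop(0, axis=0)

# With inplace=True the call returns None and toy itself is modified
returned = toy.drop("a", axis=1, inplace=True)
print(returned)
print(toy.columns.tolist())
```

Because the inplace call returns None, writing df = df.drop(col, axis=1, inplace=True) would overwrite df with None; either reassign without inplace or use inplace without reassigning.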

Thank you sir.. – Andros Adrianopolos – 2019-06-13T07:40:20.493