one-hot-encoding categorical data gives error

3

I am currently working on the Boston problem hosted on Kaggle. The dataset is nothing like the Titanic dataset: there are many categorical columns, and I'm trying to one-hot-encode them. I've decided to start with the MSZoning column to get the approach working, and then work out a strategy for the other categorical columns. This is a small snippet of the dataset:

(screenshot of the first few rows of the dataset)

These are the distinct values present in MSZoning, so plain integer (label) encoding on its own would obviously be a bad idea:

['RL' 'RM' 'C (all)' 'FV' 'RH']

Here is my attempt in Python to assign the new one-hot-encoded data back to MSZoning. I do know that one-hot encoding turns each value into a column of its own and assigns binary values to each of them, so I realize that writing the result back into a single column isn't exactly a good idea, but I wanted to try it anyway:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

labelEncoder = LabelEncoder()

train['MSZoning'] = labelEncoder.fit_transform(train['MSZoning'])
train_OHE = OneHotEncoder(categorical_features=train['MSZoning'])
train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()


print(train['MSZoning'])

This gives me the following (obvious) error:

C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 11, in <module>
    train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
    self._handle_deprecations(X)
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 394, in _handle_deprecations
    n_features = X.shape[1]
IndexError: tuple index out of range

I did read through some Medium posts on this, but they didn't exactly relate to what I was trying to do with my dataset, as they were working with dummy data that had only a couple of categorical columns. What I want to know is: how do I make proper use of one-hot encoding after the (attempted) step above?

Andros Adrianopolos

Posted 2019-06-10T09:36:50.037

Reputation: 322

Quick note: you have loaded the same dataframe for both train and test – Leevo – 2019-06-10T09:41:49.677

I've been displeased with how OneHotEncoder works compared to how LabelEncoder works. Like the accepted answer, pd.get_dummies does one-hot encoding without the (unnecessary) hassle of setting up the class. – MattR – 2019-06-10T20:25:12.010

Answers

3

First of all, I noticed you have loaded the same dataframe for both train and test. Change the code like this:

import numpy as np
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

At this point, one-hot encode each variable you want with pandas' get_dummies() function:

# One-hot encode a given variable
OHE_MSZoning = pd.get_dummies(train['MSZoning'])

It will be returned as a pandas dataframe. In my Jupyter Notebook it looks like this:

OHE_MSZoning.head()

(screenshot of OHE_MSZoning.head(): a dataframe with one 0/1 column per category — 'C (all)', 'FV', 'RH', 'RL', 'RM')

You can repeat the same command for all the variables you want to one-hot encode.
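If you'd rather not repeat the call column by column, get_dummies can also take the whole dataframe plus a list of columns. A minimal sketch of that (assuming you simply want to encode every object-dtype column; categorical_cols and train_encoded are just illustrative names):

# Pick out the object-dtype (categorical) columns of the training set
categorical_cols = train.select_dtypes(include='object').columns

# One-hot encode all of them in one call; each original column is replaced
# by 0/1 indicator columns named <column>_<category>
train_encoded = pd.get_dummies(train, columns=categorical_cols)

# Check the result for MSZoning
print(train_encoded.filter(like='MSZoning_').head())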

Hope this helps, otherwise let me know.

Leevo

Posted 2019-06-10T09:36:50.037

Reputation: 4 928

How come you're using pandas.get_dummies() over the sklearn function? – Andros Adrianopolos – 2019-06-10T09:57:00.947

It's the method I'm used to; I work with pandas dataframes all the time and I find it useful. But it's not necessarily better than sklearn. I used it here because I'm sure it works. – Leevo – 2019-06-10T10:51:14.040

I'll definitely give it a try. Thank you. I'll accept your answer. If you think that this was a well asked question, could you give me an upvote? – Andros Adrianopolos – 2019-06-10T10:52:54.097

So do you just create a new variable for every instance of this? This dataset has many categorical variables. – Andros Adrianopolos – 2019-06-11T06:43:30.003

3

Here is an approach using the encoders from sklearn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")
labelEncoder = LabelEncoder()
MSZoning_label = labelEncoder.fit_transform(train['MSZoning'])

The mapping between the original classes and the integer labels assigned by sklearn's LabelEncoder can be seen from its classes_ property:

labelEncoder.classes_
array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)

onehotEncoder = OneHotEncoder(n_values=len(labelEncoder.classes_))
MSZoning_onehot_sparse = onehotEncoder.fit_transform([MSZoning_label])

  • Convert MSZoning_onehot_sparse from a sparse matrix to a dense array
  • Reshape the dense array to (n_examples, n_classes)
  • Convert from float to int type

MSZoning_onehot = MSZoning_onehot_sparse.toarray().reshape(len(MSZoning_label), -1).astype(int)

Pack it back into a dataframe if you want:

MSZoning_label_onehot = pd.DataFrame(MSZoning_onehot,columns=labelEncoder.classes_)
MSZoning_label_onehot.head(10)

(screenshot of MSZoning_label_onehot.head(10): one 0/1 column per class)
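Note that n_values is, like the categorical_features keyword in the question's traceback, deprecated in recent scikit-learn versions. For reference, a rough sketch of the same result with the newer API (assuming scikit-learn 0.20+, where OneHotEncoder accepts string columns directly and no LabelEncoder is needed):

from sklearn.preprocessing import OneHotEncoder

# Fit on a 2-D selection (a one-column DataFrame), not a 1-D Series;
# passing a 1-D Series is what triggers the IndexError in the question
ohe = OneHotEncoder(sparse=False)
MSZoning_onehot = ohe.fit_transform(train[['MSZoning']]).astype(int)

# Column names come from the categories learned during fit
MSZoning_df = pd.DataFrame(MSZoning_onehot, columns=ohe.categories_[0], index=train.index)
print(MSZoning_df.head())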

dustindorroh

Posted 2019-06-10T09:36:50.037

Reputation: 41

I don't get this line array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object). MSZoning is already type object. – Andros Adrianopolos – 2019-06-11T09:39:08.613

That is the output of the line above it. In [1]: print(labelEncoder.classes_) Out[2]: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object) – dustindorroh – 2019-06-11T09:52:56.133

When you pack it back into the dataframe, the dataframe isn't train. Shouldn't you submit your OHE variables back into the mother data? – Andros Adrianopolos – 2019-06-11T10:20:24.100

I created a new dataframe in the example, but you can add it back to the train dataframe if you like. The indexes between the two are mapped. – dustindorroh – 2019-06-22T07:44:41.490
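For completeness, one way to attach the encoded columns back onto train (just a sketch; it assumes the row order hasn't changed, so the default indexes line up):

# Drop the original column and concatenate the one-hot columns alongside the rest
train = pd.concat([train.drop(columns=['MSZoning']), MSZoning_label_onehot], axis=1)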