How to use a one-hot encoded nominal feature in a classifier in scikit-learn?

7

I'm working on a genre classification problem on a songs dataset. Since genre is a nominal feature, I used sklearn's LabelBinarizer to get its one-hot encoding for every row in the dataset. I'm then left with a DataFrame (df_train_num) with two numeric columns and a Series object (df_train_genre) in which every row value is a numpy array: the one-hot encoding of the genre. I now want to fit a classifier on this data. What I did was:

from sklearn.svm import LinearSVC

svm_classifier = LinearSVC()
svm_classifier.fit(df_train_num, df_train_genre)

This gives me:

ValueError: Unknown label type: 'unknown'

What exactly is causing this error? Am I not allowed to use a Series object together with a DataFrame object to fit a classifier? Replacing df_train_genre with df_train_genre.values, so as to pass the numpy array directly to the fit method, doesn't change anything either: same error.

Here is a view of the two pandas objects:

df_train_num.head(5)

        Unique_Word_Count  Sentiment Polarity
157277                126            0.027766
90109                 114           -0.199545
106224                 16            0.000000
221087                103           -0.058025
247082                409           -0.170143

df_train_genre.head(5)

157277    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
90109     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...
106224    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
221087    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
247082    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: Genre_Encoded, dtype: object

Mudit Jha

Posted 2019-03-25T20:33:12.757

Reputation: 81

Answers

1

I think you should try pd.get_dummies to encode the categories. It creates new columns in the dataframe, and you can then pass that df to the classifier.
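As a rough illustration of what pd.get_dummies does, here is a minimal sketch on a toy frame. The "Mood" column and all values are made up for the example and are not from the question's dataset. Note that this kind of encoding is meant for nominal input features; it is an assumption here that you have such a feature, since in the question the encoded column is the target.

```python
import pandas as pd

# Toy data: "Mood" is a hypothetical nominal input feature.
df = pd.DataFrame({
    "Unique_Word_Count": [126, 114, 16],
    "Mood": ["happy", "sad", "happy"],
})

# One new indicator column per category; the original "Mood" column is dropped.
encoded = pd.get_dummies(df, columns=["Mood"])
print(encoded.columns.tolist())
```

The resulting frame has one flat column per category, so it can be passed as X to a classifier's fit method.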

cap

Posted 2019-03-25T20:33:12.757

Reputation: 394

0

I used sklearn's LabelBinarizer to get the one-hot encoding for this feature for every row in the dataset.

I think this might be the mistake. Have a look here. Instead, do the encoding column-wise; then I guess the fit method should work just fine.

naive

Posted 2019-03-25T20:33:12.757

Reputation: 368

0

LinearSVC does multiclass classification on integer targets; you don't need to use the LabelBinarizer. See for example https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html (where iris.target produces an array of values in $\{0,1,2\}$).
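To make this concrete, here is a minimal sketch using toy data; the genre names and numeric values are invented stand-ins for the question's df_train_num and df_train_genre. It shows two things under those assumptions: fitting directly on the raw labels, and, if only the Series of one-hot arrays is left, stacking it into a 2-D array and taking the argmax to get back one integer class per row.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the question's data (all values are made up).
X = pd.DataFrame({
    "Unique_Word_Count": [126, 114, 16, 103, 409, 90],
    "Sentiment_Polarity": [0.03, -0.20, 0.00, -0.06, -0.17, 0.10],
})
genres = pd.Series(["rock", "pop", "jazz", "rock", "pop", "jazz"])

# Option 1: pass the raw labels as the target; no binarizing needed.
clf = LinearSVC().fit(X, genres)

# Option 2: only a Series of one-hot arrays is available (as in the question).
lb = LabelBinarizer().fit(genres)
encoded = pd.Series(list(lb.transform(genres)), index=genres.index)

onehot = np.vstack(encoded.tolist())   # shape (n_samples, n_classes)
y_idx = onehot.argmax(axis=1)          # one integer class index per row
clf_from_onehot = LinearSVC().fit(X, y_idx)
```

On unscaled toy data like this, LinearSVC may emit a convergence warning; that is separate from the label-type error in the question.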

Ben Reiniger

Posted 2019-03-25T20:33:12.757

Reputation: 7 097