Label Encoder encoding the same class as two integers


First, I defined the classes of a LabelEncoder using the keys of a dictionary. Then I used that label encoder to encode some strings. But for the same string, it's giving different integers. Why is this happening?

grid_dict = {'8861892541fffff': 588,'886189254dfffff': 65, '8861892543fffff': 103, '8861892547fffff': 83, '8861892545fffff': 401,
 '8861892569fffff': 131, '8861892727fffff': 39, '886189255dfffff': 13, '8861893597fffff': 2, '8861892555fffff': 4, '886189209bfffff': 13, '8861892549fffff': 3}

le = LabelEncoder()
le.classes_ = list(grid_dict.keys())


le.transform(['8861892547fffff']) #4th class

gives me array([1]), but le.transform(['8861892545fffff']) #5th class also gives me the same array([1]), which should not happen. Also, le.inverse_transform(0) gives '8861892541fffff', which is the second class. And le.transform(['8861892549fffff']) #last class gives array([5]), while le.transform(['8861892569fffff']) #class 6 also gives array([5]).

I am not fitting the label encoder anywhere. I have stored the keys in a dictionary and am declaring the label encoder's classes with the dict keys. Why is this happening?


Posted 2018-09-19T12:04:18.537

Reputation: 51

The correct way of using sklearn's LabelEncoder is with the fit/fit_transform/transform/inverse_transform methods. Why are you defining the classes_ attribute yourself? – qmeeus – 2018-09-19T12:58:30.970

I have some new labels in the test data, so I am adding all those labels to a dict and then using the dict keys to label-encode. I have accepted your answer. Thanks. – ravishankar – 2018-09-20T09:48:29.910

If you have unseen classes in your testing set, your algorithm will not be able to predict them, or the predictions won't be reliable because it was not trained on them... You're better off removing them or making sure that all classes are present in the training set – qmeeus – 2018-09-26T08:04:42.077
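As a minimal sketch of the filtering approach suggested in the comment above (the label values here are illustrative, not the asker's full data): drop any test labels the fitted encoder has never seen before transforming.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_labels = ['8861892541fffff', '886189254dfffff', '8861892543fffff']
# the last test label never appears in the training data
test_labels = ['8861892541fffff', '8861892543fffff', 'ffffffffffffffff']

le = LabelEncoder().fit(train_labels)

# keep only the test labels that the encoder was fitted on
seen = np.isin(test_labels, le.classes_)
encoded = le.transform(np.asarray(test_labels)[seen])
print(encoded)  # the unseen label is dropped
```

This avoids the ValueError that transform raises on unseen labels, at the cost of silently discarding those rows.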



The correct way of using sklearn's LabelEncoder is the following:

from sklearn.preprocessing import LabelEncoder
import random

keys = [
    '8861892541fffff', '886189254dfffff', '8861892543fffff', '8861892547fffff',
    '8861892545fffff', '8861892569fffff', '8861892727fffff', '886189255dfffff',
    '8861893597fffff', '8861892555fffff', '886189209bfffff', '8861892549fffff'
]
data = [random.choice(keys) for _ in range(100)]

le = LabelEncoder().fit(data)

print(le.classes_)
# ['886189209bfffff' '8861892541fffff' '8861892543fffff' '8861892545fffff'
#  '8861892547fffff' '8861892549fffff' '886189254dfffff' '8861892555fffff'
#  '886189255dfffff' '8861892569fffff' '8861892727fffff' '8861893597fffff']

print(le.transform(['8861892727fffff']))
# [10]

print(le.inverse_transform([10]))
# ['8861892727fffff']


Posted 2018-09-19T12:04:18.537

Reputation: 726


I found out why this is happening (I got it from a question/comment on Stack Exchange, but I couldn't find that link again, so I'm posting it here).

When defining the classes_ of a label encoder manually, they have to be given in sorted order. Otherwise it behaves as in the case above.
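To illustrate, here is a sketch of setting classes_ manually in sorted order, using the dictionary from the question. Note the hedge: the explanation in the comment below (that transform relies on np.searchsorted, which assumes a sorted array) applies to the scikit-learn versions current at the time; inverse_transform indexes into classes_ by position either way, so a sorted numpy array is the safe choice.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

grid_dict = {'8861892541fffff': 588, '886189254dfffff': 65, '8861892543fffff': 103,
             '8861892547fffff': 83, '8861892545fffff': 401, '8861892569fffff': 131,
             '8861892727fffff': 39, '886189255dfffff': 13, '8861893597fffff': 2,
             '8861892555fffff': 4, '886189209bfffff': 13, '8861892549fffff': 3}

le = LabelEncoder()
# classes_ must be a *sorted* array: older scikit-learn versions look labels
# up with np.searchsorted, which silently returns wrong indices on unsorted input
le.classes_ = np.array(sorted(grid_dict.keys()))

print(le.transform(['8861892547fffff']))  # -> [4]
print(le.inverse_transform([0]))          # -> ['886189209bfffff']
```

With the sorted array, the encode/decode round trip is consistent, which it was not with the unsorted dict keys.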


Posted 2018-09-19T12:04:18.537

Reputation: 51