Label Encoder encoding the same class as two integers

1

First, I defined the classes of a LabelEncoder from the keys of a dictionary. Then I used that encoder to transform some strings. But the encodings are wrong: different strings are mapped to the same integer. Why is this happening?

grid_dict = {
    '8861892541fffff': 588, '886189254dfffff': 65, '8861892543fffff': 103,
    '8861892547fffff': 83, '8861892545fffff': 401, '8861892569fffff': 131,
    '8861892727fffff': 39, '886189255dfffff': 13, '8861893597fffff': 2,
    '8861892555fffff': 4, '886189209bfffff': 13, '8861892549fffff': 3
}

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.classes_ = list(grid_dict.keys())

Now

le.transform(['8861892547fffff']) #4th class

gives me array([1]), but le.transform(['8861892545fffff']) #5th class also gives the same array([1]), which should not happen. Also, le.inverse_transform(0) gives '8861892541fffff', which is the second class. And le.transform(['8861892549fffff']) #last class gives array([5]), while le.transform(['8861892569fffff']) #class6 also gives array([5]).

I am not fitting the label encoder anywhere. I have stored the keys in a dictionary and am assigning the label encoder's classes from the dict keys. Why is this happening?

ravishankar

Posted 2018-09-19T12:04:18.537

Reputation: 51

The correct way of using sklearn's LabelEncoder is with the fit/fit_transform/transform/inverse_transform methods. Why are you defining the classes_ attribute yourself? – qmeeus – 2018-09-19T12:58:30.970

I have some new labels in the test data, so I am adding all those labels to a dict and then using the dict keys to label encode. I have accepted your answer. Thanks. – ravishankar – 2018-09-20T09:48:29.910

If you have unseen classes in your testing set, your algorithm will not be able to predict them, or the predictions won't be reliable because it was not trained on them... You're better off removing them or making sure that all classes are present in the training set – qmeeus – 2018-09-26T08:04:42.077

Answers

1

The correct way of using sklearn's LabelEncoder is the following:

from sklearn.preprocessing import LabelEncoder
import random

keys = [
    '8861892541fffff', '886189254dfffff', '8861892543fffff', '8861892547fffff', 
    '8861892545fffff', '8861892569fffff', '8861892727fffff', '886189255dfffff', 
    '8861893597fffff', '8861892555fffff', '886189209bfffff', '8861892549fffff'
]

data = [random.choice(keys) for _ in range(100)]

le = LabelEncoder().fit(data)

print(le.classes_)
# ['886189209bfffff' '8861892541fffff' '8861892543fffff' '8861892545fffff'
#  '8861892547fffff' '8861892549fffff' '886189254dfffff' '8861892555fffff'
#  '886189255dfffff' '8861892569fffff' '8861892727fffff' '8861893597fffff']

print(le.transform(['8861892727fffff']))
# [10]

print(le.inverse_transform([10]))
# ['8861892727fffff']
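As the comment thread notes, labels that never appeared at fit time are a separate problem: a fitted LabelEncoder refuses to transform them. A minimal illustration with toy labels (not from the original post):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['a', 'b', 'c'])

# A label absent from fit() raises a ValueError at transform time,
# so unseen test-set labels must be removed or handled beforehand.
try:
    le.transform(['d'])
except ValueError as e:
    print('unseen label rejected:', e)
```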

qmeeus

Posted 2018-09-19T12:04:18.537

Reputation: 726

0

I found out why this is happening (I got it from a question/comment on Stack Exchange but couldn't find the link again, so I'm posting it here).

When setting the classes_ attribute of a label encoder manually, the classes must be given in sorted order. Otherwise it behaves as in the case above.
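To illustrate the point: in sklearn's implementation, transform effectively does a sorted lookup (np.searchsorted) on classes_, so a manually assigned classes_ only works when it is a sorted array. A minimal sketch with a subset of the original keys:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

keys = ['8861892541fffff', '886189254dfffff', '8861892543fffff']

le = LabelEncoder()
# classes_ must be a sorted numpy array, matching what fit() would produce
le.classes_ = np.array(sorted(keys))

# Sorted order: ...541 < ...543 < ...54d, so ...54d is index 2
print(le.transform(['886189254dfffff']))   # [2]
print(le.inverse_transform([2]))           # ['886189254dfffff']
```

With the sorted array, transform and inverse_transform are consistent with each other, which was not the case with the unsorted dict keys in the question.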

ravishankar

Posted 2018-09-19T12:04:18.537

Reputation: 51