Label encoding with predefined classes


I have trained a Random Forest model and now want to use it to predict on the data for a particular day. One column is categorical; over the whole training period it takes the values (say) a, b, c, d, e, but on a given day only some of them appear (say b and d).

To one-hot encode the column I use sklearn's LabelEncoder followed by OneHotEncoder. If I fit the label encoder on that day's column alone, it labels only 'b' and 'd' (say as 1 and 2), and the one-hot vector has length 2. What I need is this: if the training-time encoder labelled (a, b, c, d, e) as (1, 2, 3, 4, 5), then (b, d) should be labelled (2, 4) and the one-hot vector should have length 5.

What I am doing is saving the label encoder fitted during training and using it to label encode the column for that day, and passing the width of the training one-hot matrix as n_values for the one-day data. But I am not getting proper results. Am I doing it the right way?

That is my main question. The other one: suppose that on a given day I see a new category that was not present during training. How should I proceed? Should I treat all new categories as a single 'unknown' category and give them all the same one-hot encoding, or is there a better method?

Thank you for any suggestions.
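To make the target behaviour concrete, here is a minimal plain-Python sketch, not tied to sklearn. It assumes the category-to-column mapping is built once from the training values and that every unseen value shares one extra 'unknown' column, so the vector has one more entry than the number of training categories. The names `UNKNOWN`, `build_vocab` and `one_hot` are hypothetical, the sample values are made up, and the indices are zero-based.

```python
UNKNOWN = "<unknown>"  # hypothetical label for anything not seen in training


def build_vocab(train_values):
    """Map each training category to a fixed column index; the last column is UNKNOWN."""
    vocab = {value: idx for idx, value in enumerate(sorted(set(train_values)))}
    vocab[UNKNOWN] = len(vocab)
    return vocab


def one_hot(values, vocab):
    """Encode values with the training-time vocab; unseen values share the UNKNOWN column."""
    rows = []
    for value in values:
        row = [0] * len(vocab)
        row[vocab.get(value, vocab[UNKNOWN])] = 1
        rows.append(row)
    return rows


# The vocabulary is built once, from the training column only.
vocab = build_vocab(["a", "b", "c", "d", "e", "b", "a"])

# One day's data: two known values and one value never seen in training.
day_rows = one_hot(["b", "d", "f"], vocab)
print(day_rows)
```

Here 'b' and 'd' keep the columns they had at training time (indices 1 and 3), and 'f' falls into the last column, regardless of which values happen to be present on that day.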

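from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def get_onehot(arr):
    # Fit both encoders on the full training column.
    label_enc = LabelEncoder()
    onehot_enc = OneHotEncoder(sparse=False)
    int_enc = label_enc.fit_transform(arr)
    int_enc = int_enc.reshape(len(int_enc),1)
    onehot = onehot_enc.fit_transform(int_enc)
    # Return the fitted label encoder so it can be reused at prediction time.
    return onehot, label_enc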

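def get_onehot_per_day(arr_perday, label_enc_train, length_onehot_train):
    # n_values fixes the number of one-hot columns to the training width.
    onehot_enc = OneHotEncoder(sparse=False,n_values=length_onehot_train)
    # Reuse the label encoder fitted on the training data (transform only, no refit).
    int_enc = label_enc_train.transform(arr_perday)
    int_enc = int_enc.reshape(len(int_enc),1)
    onehot = onehot_enc.fit_transform(int_enc)
    return onehot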
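
For comparison, here is one possible alternative to the two helpers above, sketched under the assumption of scikit-learn 0.20 or later. A single OneHotEncoder is given the full category list through `categories`, fitted once on the training column, and then only `transform` is called on the one-day data. Setting `handle_unknown="ignore"` encodes unseen values as an all-zero row, which is one way (not necessarily the best) to deal with new categories. The sample values and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the training column contains every known category.
train_col = np.array(["a", "b", "c", "d", "e", "b", "a"], dtype=object).reshape(-1, 1)
# One day's column: a subset of the known values plus an unseen one ("f").
day_col = np.array(["b", "d", "f"], dtype=object).reshape(-1, 1)

# Fix the category order explicitly so the column layout never depends on
# which values happen to be present on a given day.
encoder = OneHotEncoder(categories=[["a", "b", "c", "d", "e"]], handle_unknown="ignore")
encoder.fit(train_col)

# transform() returns a sparse matrix by default; convert it to a dense array.
onehot_day = encoder.transform(day_col).toarray()
print(onehot_day)
```

The result has three rows and five columns: 'b' and 'd' set columns 1 and 3, and the unseen 'f' becomes a row of zeros. The fitted encoder can be saved (for example with pickle) and reloaded for each day's prediction instead of being refitted.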


Posted 2018-09-07T05:30:33.553
