One-hot encode multi-class multi-label sequences


Suppose I want to build a time series where each timestep is represented by a categorical array: the encoded sequences look like [[2, 0, 5], [3, 1, 4], ...], and each entry has a different number of possible values (categories). For example, the first entry can take values 0-3, the second 0-1, and so on.

I want to train an LSTM model to predict the next timestep, so I defined a one-hot encoding for each entry using the maximum number of classes:

For example [2, 0, 5] becomes

[[0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]
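A minimal NumPy sketch of this encoding (6 is the overall maximum number of classes, taken from the largest category):

```python
import numpy as np

def encode_timestep(entries, num_classes=6):
    """One-hot encode one timestep, padding every entry to num_classes."""
    out = np.zeros((len(entries), num_classes))
    out[np.arange(len(entries)), entries] = 1.0
    return out

print(encode_timestep([2, 0, 5]))  # reproduces the 3x6 matrix above
```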

Unfortunately, this representation raises the error

ValueError: Invalid shape for y: (1, 3, 5)

I have three questions:

  1. Is it possible to pass a 3d y target to Keras?
  2. Should I instead define a single one-hot encoding over all possible triplets of categorical values? The problem is that I would then lose the correlation between label occurrences within the same category, because every combination of labels would become independent of the others.
  3. Should I only one-hot encode the target y or also the input X?


Posted 2019-02-26T12:32:50.310




You may want to try cat2vec, which converts categorical features into vector representations using a Word2Vec-style approach. Also check this link for multi-feature inputs into an LSTM.
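At its core, the embedding idea replaces each one-hot entry with a learned dense vector looked up from a per-category table. A pure-NumPy sketch of the lookup (the cardinalities 4, 2, 6 come from the question; the embedding dimension is an arbitrary choice, and real tables would be learned, not random):

```python
import numpy as np

rng = np.random.default_rng(0)
cardinalities = [4, 2, 6]   # per-entry class counts from the question
emb_dim = 3                 # arbitrary embedding size for illustration

# One embedding table per categorical entry, as a cat2vec/Word2Vec
# approach would learn them during training.
tables = [rng.normal(size=(n, emb_dim)) for n in cardinalities]

def embed_timestep(entries):
    """Look up each entry in its own table and concatenate into one
    dense vector, which is what the LSTM would consume per timestep."""
    return np.concatenate([tables[i][e] for i, e in enumerate(entries)])

v = embed_timestep([2, 0, 5])
print(v.shape)  # (9,) -- 3 entries * emb_dim, instead of a 3x6 one-hot block
```

In Keras this corresponds to one `Embedding` layer per feature, concatenated before the LSTM.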

For the target y, one-hot encoding is the better technique for NN-based models.
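One way to keep the per-category structure (and avoid the combined-triplet encoding the question worries about) is to give the model one softmax output head per entry, each sized to its own cardinality. The targets then become three separate one-hot arrays rather than a single 3D tensor. A sketch of the target preparation, assuming the cardinalities 4, 2, 6 from the question:

```python
import numpy as np

cardinalities = [4, 2, 6]  # per-entry class counts from the question

def split_targets(y):
    """Turn an (N, 3) integer target array into three one-hot arrays,
    one per category, each sized to its own cardinality -- suitable as
    targets for three separate softmax output heads."""
    y = np.asarray(y)
    return [np.eye(n)[y[:, i]] for i, n in enumerate(cardinalities)]

ys = split_targets([[2, 0, 5], [3, 1, 4]])
print([a.shape for a in ys])  # [(2, 4), (2, 2), (2, 6)]
```

Each head gets its own categorical cross-entropy loss, so correlations within a category are preserved while avoiding the invalid 3D y shape.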

Mulugeta Weldezgina

Posted 2019-02-26T12:32:50.310
