When using padding in sequence models, is Keras validation accuracy valid/ reliable?


I have a group of non zero sequences with different lengths and I am using Keras LSTM to model these sequences. I use Keras Tokenizer to tokenize (tokens start from 1). In order to make sequences have the same lengths, I use padding.

An example of padding:

# [0,0,0,0,0,10,3]
# [0,0,0,0,10,3,4]
# [0,0,0,10,3,4,5]
# [10,3,4,5,6,9,8]

In order to evaluate if the model is able to generalize, I use a validation set with 70/30 ratio. In the end of each epoch Keras shows the training and validation accuracy.

My big doubt is whether Keras validation accuracy is reliable when using padding (When you run Keras over several epochs, in the end of each epochs it prints training accuracy and validation accuracy). Because the validation set can simply be sequences of 0's --> [0,0,0]. Since there are a lot of sequences of 0's (because of padding), the model can easily learn and predict the sequences of 0's correctly, and as a result, create a fake high validation accuracy. In other words the model may learn sequences of zeros and not learn the real sequences.

So, does padding influences the validation accuracy in Keras?

Amir Jalilifard

Posted 2020-07-19T19:47:48.377

Reputation: 123

No answers