Why is my test data accuracy higher than my training data?


I'm using four years of data, training on the first three and testing on the fourth. I'm using an LSTM with Keras. My test set (which has no overlap at all with the training data) consistently performs better than my training data. How should I interpret this? It seems very unusual. Here's the tail end of the model output. You can see my training accuracy for a given epoch hovers around 80%, but the test accuracy jumps to about 86%:

Epoch 8/10
9092/9092 [==============================] - 9s 964us/step - loss: 0.9870 - acc: 0.8185
Epoch 9/10
9092/9092 [==============================] - 9s 1ms/step - loss: 0.9670 - acc: 0.7996
Epoch 10/10
9092/9092 [==============================] - 9s 937us/step - loss: 0.9799 - acc: 0.7895
Test Set Accuracy: 85.96%

            predicted 0   predicted 1
actual 0        2639          238
actual 1         211          111
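For reference, the numbers in that matrix can be checked directly against the reported accuracy (assuming rows are actual classes and columns are predicted classes):

```python
# Sanity-check the reported test accuracy against the confusion matrix
# (assuming rows are actual classes and columns are predicted classes).
tn, fp = 2639, 238   # actual 0: predicted 0 / predicted 1
fn, tp = 211, 111    # actual 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_1 = tp / (tp + fn)        # recall on the minority class
baseline = (tn + fp) / total     # accuracy of always predicting class 0

print(f"accuracy  = {accuracy:.4f}")          # ~0.8596, matching the 85.96% above
print(f"recall(1) = {recall_1:.4f}")          # ~0.34: most minority samples are missed
print(f"majority baseline = {baseline:.4f}")  # ~0.90: always predicting 0 scores higher
```

Note that under this reading, a constant "always predict 0" classifier would actually beat the reported test accuracy, which points at class imbalance rather than genuine skill.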

Edit: Here's my code to create & compile the model:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_vector_length = 32
days = 30

model = Sequential()
model.add(Embedding(2080, embedding_vector_length, input_length=days))
# An LSTM layer is needed between Embedding and Dense (the unit count is arbitrary);
# without it the Dense layer receives a 3D tensor and the model won't train as described.
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, epochs=3, batch_size=64, class_weight={0: 1., 1: 1.})
scores = model.evaluate(test_x, test_y, verbose=0)
print("Test Set Accuracy: %.2f%%" % (scores[1] * 100))

Jim the fourth

Posted 2019-05-23T02:00:35.050

Reputation: 135



I assume you're using structured data (numerical, categorical, nominal, ordinal, ...):

- It's probably due to class imbalance.
- If you use scikit-learn, you can set class_weight="balanced", which automatically weights classes inversely proportional to their frequency.
- Testing this should confirm whether it's a class imbalance problem.
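As a sketch of what "balanced" does under the hood (the label counts here are illustrative, not taken from the question):

```python
from collections import Counter

# scikit-learn's class_weight="balanced" assigns each class the weight
#   n_samples / (n_classes * n_samples_in_class),
# i.e. inversely proportional to its frequency.
y = [0] * 2877 + [1] * 322          # an imbalanced label vector (illustrative counts)
counts = Counter(y)
n, k = len(y), len(counts)

weights = {c: n / (k * counts[c]) for c in counts}
print(weights)  # the rare class gets the larger weight
```

Misclassifying a rare-class sample then costs roughly nine times as much as a common one, which discourages the model from collapsing onto the majority class.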

PS: Francois Chollet (creator of Keras) states that traditional algorithms are superior to Deep Learning for structured data. Personally, with structured data, I've never been able to match the performance of XGBoost with Deep Learning.



Posted 2019-05-23T02:00:35.050

Reputation: 901

My data is a time series, the problem is binary classification. Where would that put me with respect to deep learning & something like XGBoost? I've tried feature extraction (tsfresh) on the series data and used that with traditional ML, and not had very good results at all. Of course there might not be much signal to distinguish between the two classifications. – Jim the fourth – 2019-05-24T14:04:26.127

I don't quite get your question. If it's a classification problem, run a Random Forest with scikit-learn and set class_weight to "balanced". Your test accuracy should then be lower than your train accuracy; if it is, you'll know the issue lies with your Deep Learning setup. – FrancoSwiss – 2019-05-24T19:53:07.910

Thanks, I'll give it a shot. I'm actually more comfortable with traditional algorithms, but not with time series data. LSTM looked like a good candidate, but perhaps not. I appreciate your input (and on my other question too). – Jim the fourth – 2019-05-25T04:28:38.953

Sure. Good luck! – FrancoSwiss – 2019-05-25T08:00:30.653


Your test set may simply be a lucky split.

Use cross-validation.

Also make sure the train/test split is handled properly (e.g. randomized), so that it isn't biased.
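Since the data here is temporal, an expanding-window scheme (train on earlier data, validate on the next block) preserves the "predict the future from the past" setup while still yielding several validation estimates. A minimal sketch using scikit-learn's TimeSeriesSplit (the sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(40).reshape(-1, 1)    # stand-in for 4 "years" of ordered samples
tscv = TimeSeriesSplit(n_splits=3)  # 3 expanding-window train/validation splits

for train_idx, val_idx in tscv.split(X):
    # every validation index comes strictly after every training index
    assert train_idx.max() < val_idx.min()
    print(f"train: 0..{train_idx.max()}  validate: {val_idx.min()}..{val_idx.max()}")
```

Each fold trains on a longer prefix of the history, so no fold ever peeks at future samples, unlike a random shuffle.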


Posted 2019-05-23T02:00:35.050

Reputation: 2 248

Very likely this, +1. OP does not tell us how he generated the test set. That said, Keras can already hold out a validation set during .fit() (via validation_split). You do not want to do full k-fold over all data during training, otherwise your hyperparameters can overfit to the dataset (instead of just the parameters overfitting the dataset in a single model). OP should just manually check a handful of other splits; if this discrepancy between test and validation scores happens again, then one must search for different causes (e.g. a small number of outliers). – grochmal – 2019-05-23T16:59:40.373

Thanks, this is in fact the case, but my test set mimics the conditions under which I work. The most recent year's data shows more of the less-likely outcome than prior years. That is expected to continue, and the nature of the data is annual batches. So if my model can't use prior years to predict the current year, which is what my current train-test split mimics, then I'm not sure how useful it will be. – Jim the fourth – 2019-05-25T04:26:21.113


It seems your model is biased towards one class, and somehow the fourth year (the testing data) contains more samples of that class. You can also see this kind of problem when there is too little training data.

Probable solution: add more samples to the training data.

Note: you can also try shuffling all 4 years of data and then splitting into training vs. testing before training the model.

vipin bansal

Posted 2019-05-23T02:00:35.050

Reputation: 1 322

It is biased, but weighting in favor of the unlikely category actually makes the train-test accuracy gap worse. Shuffling all four years wouldn't accurately mimic the conditions I work under. I will need to predict year 5 all at once, when it arrives, using years 1-4. So, I need to be able to predict year 4 using only 1-3 to properly reproduce those conditions. – Jim the fourth – 2019-05-25T04:34:02.983


You may be using regularization techniques to avoid over-fitting the training data (for example, dropout). Dropout is active during training but disabled at evaluation time, so the reported training accuracy can be lower than the test accuracy. It would be easier to analyze if you could post your code snippet.
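One concrete way dropout produces this gap: it randomly zeroes activations during training (depressing the reported training metric) but is a no-op at evaluation time. A small numpy sketch of inverted dropout illustrating the two modes (this is not the question's model):

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units during training and
    scale the survivors by 1/(1-rate); do nothing at evaluation time."""
    if not training:
        return x                        # evaluation: layer is a no-op
    mask = rng.random(x.shape) >= rate  # keep each unit with probability 1-rate
    return x * mask / (1.0 - rate)      # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(100_000)

train_out = dropout(x, rate=0.5, training=True, rng=rng)
eval_out = dropout(x, rate=0.5, training=False, rng=rng)

print(train_out.mean())             # close to 1.0 on average, but individual units are noisy
print(np.array_equal(eval_out, x))  # True: evaluation leaves activations intact
```

The training-time noise injected by the mask is exactly what makes per-epoch training accuracy a pessimistic estimate compared to the clean evaluation pass.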


Posted 2019-05-23T02:00:35.050

Reputation: 1 271

Thanks, I added the code that creates & compiles the model. The loss function is binary_crossentropy. I guess I'm underfitting? But I would expect my test set accuracy to still be lower in that case. – Jim the fourth – 2019-05-23T12:50:04.890