Which is more important - stable training results or good test results?



For instance, is it better to obtain an unstable training accuracy across epochs but a good test accuracy, or a stable training accuracy across epochs but a poor test accuracy?

Which is the better choice?


Posted 2018-09-13T06:21:34.023

Reputation: 1 216



I would personally prefer the scenario where the training accuracy is high and the testing accuracy is low, because it would mean that your model was at least well fitted to the training data but unfortunately performed poorly on unseen data. It is also a conservative scenario: it can't get worse than that, so you know clearly that your model needs improvement.

On the other hand, if your training accuracy is low, you shouldn't even test the model on unseen data, because you already know it is not trained well enough. If you insist on testing it anyway and the testing results turn out better than the training results, this probably means you were lucky during testing. Under no circumstances should you draw any conclusions about the model's quality in such a scenario.

The solution to your problem would be to use k-fold cross-validation, meaning that you randomly split your dataset multiple times into different training/test sets and evaluate the training/testing accuracy over all the splits, to get a better sense of how the model behaves on your dataset.
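As a minimal sketch of that idea, assuming scikit-learn is available (the dataset and classifier below are synthetic placeholders, not from the question):

```python
# k-fold cross-validation: evaluate train/test accuracy over several splits
# to see how much the scores vary, rather than trusting a single split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Placeholder data standing in for your real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=5,                     # 5 different train/test splits
    return_train_score=True,  # also report accuracy on the training folds
)

print("train acc: %.3f +/- %.3f"
      % (scores["train_score"].mean(), scores["train_score"].std()))
print("test  acc: %.3f +/- %.3f"
      % (scores["test_score"].mean(), scores["test_score"].std()))
```

A large gap between the mean train and test scores, or a large standard deviation across folds, tells you which of the two problems (overfitting vs. instability) you actually have.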


Posted 2018-09-13T06:21:34.023

Reputation: 3 275


First of all, this situation isn't very common. It would imply that your training and test data do not originate from the same underlying distribution! That being said, here are a few other thoughts on the matter:

  1. If the volatile training accuracies and the good test accuracy occur consistently, i.e. they are reproducible, then I would probably be happy with that. Then start adding different types of regularisation to smooth the training loss curves a little. This could be in the form of batch normalisation or dropout, for example.

  2. If the test accuracies are all over the place, also varying dramatically, then I would definitely want to sort out the training curves first. Try more data pre-processing, different batch sizes (perhaps yours is too small, hence the volatility), as well as adding regularisation as mentioned above.

  3. Another approach might be to try some form of dimensionality reduction of your input data. This maps your features to a latent space, where the information may be denser and (hopefully) more uniform. Which method to try will greatly depend on your type of data: text, images, videos, sound, temperature, stock prices and so on. Have a look at things like t-SNE, which will work for most data types, or Word2Vec for text.
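To illustrate point 3, here is a hedged sketch of projecting features into a low-dimensional latent space with scikit-learn's t-SNE (the input data below is random noise, purely for illustration):

```python
# Dimensionality reduction with t-SNE: map high-dimensional features
# to a 2-D latent space for inspection or downstream use.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features (placeholder data)

# perplexity must be smaller than the number of samples.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_2d.shape)
```

Note that t-SNE is mainly useful for visualising structure; for producing features to train on, something like PCA or a learned embedding is usually the more appropriate choice.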


Posted 2018-09-13T06:21:34.023

Reputation: 12 573

Thanks @n1k31t4. If you look at my post https://datascience.stackexchange.com/questions/38121/how-to-resolve-the-instability-of-average-reward-per-episode-in-training-of-dqn you can see this is common in DQN! What can we say about that case, even when pre-processing has already been done?

– user10296606 – 2018-09-13T16:24:36.340


In most cases in industry, we should go for stable performance rather than strong accuracy. But it still depends. Here's an example from my current business.

I currently deploy deep learning algorithms for algorithmic trading as my job. My algorithms do not predict the market; rather, they predict the trading strategy every 15 minutes for FOREX EUR/USD, XAU/USD, etc. Without going deep into the specifics, let's say my algorithm predicts the true trading strategy with a stable average accuracy of 65% (±3%). It can provide long-term stable profits for my portfolio. However, with an average accuracy of 70% (±15%), my algorithm could have 3 weeks of aggressive profits, yet wipe out the same portfolio over one week. This scenario is a real-life example of what I have experienced with most of my models.

However, if my task were to detect wanted criminal suspects from certain features in public CCTV footage by computer vision, unstable performance could be acceptable. The reason is that it's fine if the algorithm sometimes makes mistakes on regular people when identifying them, as long as its stronger overall performance lets it find the specific targets more often. By analogy, a strong-performing but unstable model has some kind of magic, but the wand may not always work; whereas a stabilized model is regular German machinery, working most of the time with no magical powers.


Posted 2018-09-13T06:21:34.023

Reputation: 460