What would I prefer - an over-fitted model or a less accurate model?



Let's say we have trained two models, and we are looking for good accuracy. The first has an accuracy of 100% on the training set and 84% on the test set. Clearly over-fitted. The second has an accuracy of 83% on the training set and 83% on the test set.

On the one hand, model #1 is over-fitted, but on the other hand it still yields better performance on an unseen test set than the well-generalizing model #2.

Which model would you choose to use in production, the first or the second, and why?


Posted 2020-01-12T13:48:25.717

Reputation: 449

The difference of 1% on the test set makes it easy to choose the 2nd one. But I think if the difference were 10% on the test set, then people might choose the "overfit" model. – jerlich – 2020-01-13T06:55:12.667

1% difference in test accuracy will not be statistically significant for many test sets. How big is yours? – Will – 2020-01-13T09:27:30.437
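Whether a 1% gap is meaningful depends heavily on the test-set size. A minimal sketch (pure Python, using the normal approximation for a binomial proportion; the sample sizes are invented for illustration) shows how wide the uncertainty on a measured accuracy is:

```python
import math

def accuracy_ci_halfwidth(acc, n, z=1.96):
    """Half-width of a ~95% normal-approximation confidence interval
    for an accuracy estimated from n test examples."""
    return z * math.sqrt(acc * (1 - acc) / n)

for n in (100, 1_000, 100_000):
    hw = accuracy_ci_halfwidth(0.84, n)
    print(f"n={n:>7}: 84% +/- {100 * hw:.2f} percentage points")
```

With 100 test examples the interval is roughly +/- 7 percentage points, so 84% vs 83% is pure noise; only around the 100,000-example scale does a 1% gap start to be resolvable.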

Can we have a third option: use the second to further improve into a third? The second is likely more salvageable than the first, but neither would be ideal in production. What is your goal? – Mast – 2020-01-13T10:41:31.620

The test and train set metrics are random variables. If, for example, you're using k-fold cross-validation, you can estimate the confidence of both from the variance; this could indicate that you cannot really say model 1 is higher than model 2 with, say, 90% confidence. So, while it's a bit hand-wavy, I would prefer the model which generalises better. – David Waterworth – 2020-01-13T21:57:10.057

As mentioned in @Ray's answer, it is not correct that the first model is necessarily overfit. Random Forests, for example, are explicitly designed to produce this situation. – Matthew Drury – 2020-01-17T19:23:30.873



There are a couple of nuances here.

  1. The complexity question is very important - Occam's razor.
  2. CV - is this truly the case, 84%/83%? (Test it on train+test with CV.)
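Point 2 - checking whether the 84%/83% figures hold up under cross-validation - can be sketched like this (scikit-learn assumed; the synthetic dataset and logistic-regression estimator stand in for the real ones):

```python
# Estimate test accuracy and its spread with k-fold CV,
# rather than trusting a single train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# If the two models' CV intervals overlap heavily, a 1% gap
# seen on one split is not evidence that either model is better.
```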

Given this, personal opinion: Second one.

It is better to catch general patterns. You already know that the first model failed at that, because of the train/test difference. 1% says nothing.

Noah Weber


I disagree with the blanket statement that "1% says nothing". With a lot of data, that can be a statistically significant difference, and in situations with class imbalance, that might be a very important 1% that's being misclassified (although in that situation, accuracy is a terrible measure to start with). – Nuclear Hoagie – 2020-01-13T14:44:30.587

You are reading it wrong: 1% says nothing in THIS situation. 1% in general is a big difference! – Noah Weber – 2020-01-13T14:45:58.253

"Better to catch general patterns. You already know that first model failed on that because of the train and test difference." No, the 84% says it caught general patterns. The 100% says those general patterns applied very well to the test set. – Acccumulation – 2020-01-13T21:25:25.117

Could you explain what CV is in the context of your answer? – Fritz – 2020-01-20T09:19:47.773

Cross-validation. – Noah Weber – 2020-01-20T10:05:34.607


It depends mostly on the problem context. If predictive performance is all you care about, and you believe the test set to be representative of future unseen data, then the first model is better. (This might be the case for, say, health predictions.)

There are a number of things that would change this decision.

  1. Interpretability / explainability. This is indirect, but parametric models tend to be less overfit, and are also generally easier to interpret or explain. If your problem lies in a regulated industry, it might be substantially easier to answer requests with a simpler model. Related, there may be some ethical concerns with high-variance models or non-intuitive non-monotonicity.

  2. Concept drift. If your test set is not expected to be representative of production data (most business uses), then it may be the case that more-overfit models suffer more quickly from model decay. If instead the test data is just bad, the test scores may not mean much in the first place.

  3. Ease of deployment. While ML model deployment options are now becoming much easier and more sophisticated, a linear model is still generally easier to deploy and monitor.

See also
Can we use a model that overfits?
What to choose: an overfit model with higher evaluation score or a non-overfit model with lower one?

(One last note: the first model may well be amenable to some sort of regularization, which will trade away training accuracy for a simpler model and, hopefully, a better testing accuracy.)
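That last note can be illustrated with a quick sketch (scikit-learn assumed; in `LogisticRegression`, smaller `C` means stronger L2 regularization, and the dataset here is synthetic):

```python
# Sweep regularization strength and watch train accuracy fall
# while, hopefully, test accuracy improves.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=50,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for C in (100.0, 1.0, 0.01):  # smaller C = stronger penalty
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
    print(f"C={C:>6}: train={clf.score(X_tr, y_tr):.2f} "
          f"test={clf.score(X_te, y_te):.2f}")
```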

Ben Reiniger



There's an interesting paper, https://arxiv.org/abs/1812.11118v2, claiming that with a sufficiently powerful network you get 'beyond overfitting': test accuracy starts to improve again (due to regularization and other prior structural bias) after overfitting, while the train accuracy stays at 100%.

– Peteris – 2020-01-13T14:18:43.727


The first has an accuracy of 100% on training set and 84% on test set. Clearly over-fitted.

Maybe not. It's true that 100% training accuracy is usually a strong indicator of overfitting, but it's also true that an overfit model should perform worse on the test set than a model that isn't overfit. So if you're seeing these numbers, something unusual is going on.

If both model #1 and model #2 used the same method for the same amount of time, then I would be rather reluctant to trust model #1. (And if the difference in test error is only 1%, it wouldn't be worth the risk in any case; 1% is noise.) But different methods have different characteristics with regard to overfitting. When using AdaBoost, for example, test error has often been observed not only to stop increasing, but to actually continue decreasing even after the training error has gone to 0 (an explanation can be found in Schapire et al., 1997). So if model #1 used boosting, I would be much less worried about overfitting, whereas if it used linear regression, I'd be extremely worried.
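The AdaBoost behaviour is easy to examine yourself (a sketch assuming scikit-learn, whose `staged_score` method reports accuracy after each boosting round; the dataset is synthetic):

```python
# Track train/test accuracy per boosting round; with boosting, test
# accuracy can keep improving even after train accuracy stops moving.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
train_curve = list(clf.staged_score(X_tr, y_tr))  # accuracy after round 1, 2, ...
test_curve = list(clf.staged_score(X_te, y_te))
print(f"final train={train_curve[-1]:.3f}, final test={test_curve[-1]:.3f}")
```

Plotting the two curves side by side shows whether the test curve is still falling toward the train curve or has turned around.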

The solution in practice would be to not make the decision based only on those numbers. Instead, retrain on a different training/test split and see if you get similar results (time permitting). If you see approximately 100%/83% training/test accuracy consistently across several different training/test splits, you can probably trust that model. If you get 100%/83% one time, 100%/52% the next, and 100%/90% a third time, you obviously shouldn't trust the model's ability to generalize. You might also keep training for a few more epochs and see what happens to the test error. If it is overfitting, the test error will probably (but not necessarily) continue increasing.
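That retraining check can be sketched as follows (scikit-learn assumed; an unpruned decision tree stands in for "model #1", since it typically fits the training set perfectly):

```python
# Repeat the train/test split several times: a trustworthy model shows
# a stable test accuracy, not 83% one run and 52% the next.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

test_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    test_scores.append(clf.score(X_te, y_te))

print(f"test accuracy per split: {np.round(test_scores, 3)}")
print(f"spread (std): {np.std(test_scores):.3f}")
```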




Obviously the answer is highly subjective; in my case, clearly the SECOND. Why? There's nothing worse than seeing a customer run a model in production and have it not perform as expected. I've literally had a technical CEO who wanted a report comparing how many customers had actually left in a given month against what the customer churn prediction model had predicted. It was not fun :-(. Since then, I strongly favor high-bias/low-variance models.




These numbers suggest that the first model is not, in fact, overfit. Rather, they suggest that your training data had few data points near the decision boundary. Suppose you're trying to classify everyone as older or younger than 13 y.o. If your data contains only infants and sumo wrestlers, then "older if weight > 100 kg, otherwise younger" is going to work really well on both the training and test sets, but not so well on the general population.
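The weight-threshold example can be made concrete (a toy sketch; all weights and labels are invented for illustration):

```python
def predict_older_than_13(weight_kg):
    """Toy rule learned from a skewed sample: 'older if weight > 100 kg'."""
    return weight_kg > 100

# Skewed data: infants (~5 kg, younger) and sumo wrestlers (~150 kg, older).
skewed = [(5, False), (7, False), (150, True), (160, True)]
assert all(predict_older_than_13(w) == label for w, label in skewed)  # looks perfect

# General population: a 70 kg adult is clearly older than 13, but the rule,
# having never seen the boundary region, gets it wrong.
print(predict_older_than_13(70))  # False, i.e. misclassified as younger
```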

The bad part of overfitting isn't that it's doing really well on the test set, it's that it's doing poorly in the real world. Doing really well on the test set is an indicator of this possibility, not a bad thing in and of itself.

If I absolutely had to choose one, I would take the first, but with trepidation. I'd really want to do more investigation. What are the differences between train and test set, that are resulting in such discrepancies? The two models are both wrong on about 16% of the cases. Are they the same 16% of cases, or are they different? If different, are there any patterns about where the models disagree? Is there a meta-model that can predict better than chance which one is right when they disagree?
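Those follow-up questions are straightforward to check once both models' predictions are in hand (a sketch; `y_true`, `pred1`, and `pred2` are assumed label arrays, with toy values here):

```python
import numpy as np

def error_overlap(y_true, pred1, pred2):
    """How often do two models err on the same cases, and how often
    do they disagree (candidates for a meta-model to arbitrate)?"""
    err1 = pred1 != y_true
    err2 = pred2 != y_true
    both_wrong = err1 & err2
    disagree = pred1 != pred2
    return int(both_wrong.sum()), int(disagree.sum())

y_true = np.array([0, 1, 1, 0, 1, 0])
pred1  = np.array([0, 1, 0, 0, 1, 1])  # wrong on indices 2 and 5
pred2  = np.array([0, 0, 0, 0, 1, 0])  # wrong on indices 1 and 2
print(error_overlap(y_true, pred1, pred2))  # (1, 2)
```

If the "both wrong" count is far below either model's individual error count, the models fail on different cases, and a meta-model over the disagreement set may beat both.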




If your options are indeed "100% on train / 84% on validation" vs. "83% on train / 83% on validation", I'd feel safer with the second one. But really, I'd take a third option: try to tweak the first model to reduce overfitting (with the usual methods), hopefully squeezing a bit more accuracy out of it.

Itamar Mushkin
