Why Is Overfitting Bad in Machine Learning?

43

15

It is often said that overfitting a model limits its capacity to generalize, though this might only mean that overfitting stops a model from improving after a certain complexity. Does overfitting cause models to become worse regardless of the complexity of the data, and if so, why is this the case?


Related: Followup to the question above, "When is a Model Underfitted?"

blunders

Posted 2014-05-14T18:09:01.940

Reputation: 868

Overfitting is bad by definition. If it weren't, it wouldn't be over-fitting. – Gala 2014-06-16T10:51:52.677

-1 for not being stated clearly. I would propose a fix, but the only ones I can think of would fundamentally change the meaning of the question. I think blunders might be conflating "overfitting" with "adding model complexity". – Nathan Gould 2014-06-16T14:06:19.127

@NathanGould: Thanks for commenting, though it appears you're both inferring a meaning that is simply not present in the question and quoting text that is also not present; meaning, nowhere in the text are the words "adding model complexity". – blunders 2014-06-16T15:57:44.980

I didn't mean to quote you on "adding model complexity" -- I was just highlighting the phrase. Anyhow, I guess my issue is basically the same as @GaLa's, which is that overfitting means fitting too much. So it seems you are asking us to confirm a tautology. So, I would tend to think that you actually meant to ask a different question. E.g., does increasing model complexity cause models to become worse? Or, how does complexity of the data relate to the tendency of a model to overfit? – Nathan Gould 2014-06-16T20:19:32.450

@NathanGould: I think you're overthinking the question, or somehow expecting that I would ask a question on a topic that I completely understood. Every single reference by me to complexity within the question related to the data being modeled, not the complexity of the model. – blunders 2014-06-19T11:50:57.280

@blunders Maybe the question should simply be “What is overfitting?” or “How can a model fitting the data more precisely be worse?” – Gala 2014-06-19T20:27:23.540

@GaLa: Thanks for the feedback, though I wanted to know why overfitting is bad, not what overfitting is, or how fitting the model to the data more precisely is bad. Beyond that, at this point, given there are 3 answers, and 40+ votes, I do not feel that it would either be fair or a good idea to change the meaning and/or request made by the question. If you believe that the question might be better expressed, my suggestion would be just to post a question yourself; please feel free to link to it in the comments here. Again, thanks! – blunders 2014-06-19T21:56:03.637

@blunders We are back to my original comment but it seems to me that if you know what overfitting is, you know why it's bad. Since you seemed to suggest earlier you didn't fully understand the topic (which is fine), I would think the first thing to ask is simply what overfitting is. – Gala 2014-06-20T07:37:44.590

Also, the answers seem quite messy, perhaps because the question was so confusing to begin with, but some of them do seem to address it in the way I suggest. – Gala 2014-06-20T07:39:37.753

@GaLa: If you want to ask your own question, as stated above, please do. If you have any issues with the answers, please address them within the comments to the answer. The top answer, which I selected when it had I believe one or two votes, now has 18+ votes, and the question received a number of valid answers; meaning nothing to see, moving on. – blunders 2014-06-20T11:49:26.910

@blunders I just think this question is a bit of a mess and since the comments seemed to confuse you, I tried to explain them a bit. – Gala 2014-06-21T09:17:29.030

Is your question actually whether there is a case where it's impossible to overfit? – Sean Owen 2014-05-16T13:09:01.880

@SeanOwen: No, how would it be impossible to overfit? – blunders 2014-05-16T13:13:46.987

Agree, just checking, as you asked if overfitting caused models to become worse regardless of the data. – Sean Owen 2014-05-16T13:14:24.237

Just to be clear: to me, asking whether it's "impossible to overfit" and whether "overfitting caused models to become worse regardless of the data" are completely different questions; that said, as far as I know, both are impossible. – blunders 2014-05-16T13:49:13.207

Answers

43

Overfitting is empirically bad. Suppose you have a data set which you split in two: test and training. An overfitted model is one that performs much worse on the test dataset than on the training dataset. It is often observed that such models also, in general, perform worse on additional (new) test datasets than models which are not overfitted.

One way to understand that intuitively is that a model may use some relevant parts of the data (signal) and some irrelevant parts (noise). An overfitted model uses more of the noise, which increases its performance in the case of known noise (training data) and decreases its performance in the case of novel noise (test data). The difference in performance between training and test data indicates how much noise the model picks up; and picking up noise directly translates into worse performance on test data (including future data).
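As a small illustration of this train/test gap (a sketch added here, not part of the original answer; the synthetic dataset and the choice of a 1-nearest-neighbour model are purely illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with deliberate label noise (flip_y) that an overfitted model will memorize.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# k=1 nearest neighbour: the most literal "memorize the training set" model.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # 1.0: every training point recalled
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower: memorized noise
```

The 1-nearest-neighbour model reproduces its training labels perfectly, noise included, which is exactly the "picking up noise" described above showing up as a train/test gap.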

Summary: overfitting is bad by definition; this does not have much to do with either complexity or ability to generalize, but rather with mistaking noise for signal.

P.S. On the "ability to generalize" part of the question, it is very possible to have a model which has inherently limited ability to generalize due to the structure of the model (for example linear SVM, ...) but is still prone to overfitting. In a sense overfitting is just one way that generalization may fail.

Alex I

Posted 2014-05-14T18:09:01.940

Reputation: 2 044

17

Overfitting, in a nutshell, means taking into account too much information from your data and/or prior knowledge and using it in a model. To make it more straightforward, consider the following example: you're hired by some scientists to provide them with a model to predict the growth of some kind of plant. The scientists have given you information collected from their work with such plants throughout a whole year, and they will continue to give you information on the future development of their plantation.

So, you run through the data received and build up a model out of it. Now suppose that, in your model, you considered as many characteristics as possible to always pin down the exact behavior of the plants you saw in the initial dataset. Now, as the production continues, you'll always take those characteristics into account, and will produce very fine-grained results. However, if the plantation eventually suffers from some seasonal change, the results you receive may fit your model in such a way that your predictions begin to fail (either saying that the growth will slow down while it will actually speed up, or the opposite).

Apart from being unable to detect such small variations, and usually classifying your entries incorrectly, the fine grain of the model, i.e., the great number of variables, may make the processing too costly. Now, imagine that your data is already complex. Overfitting your model to the data will not only make the classification/evaluation very complex, but will most probably cause you to err in the prediction over the slightest variation in the input.

Edit: This might as well be of some use, perhaps adding dynamicity to the above explanation :D

Rubens

Posted 2014-05-14T18:09:01.940

Reputation: 2 452

14

Roughly speaking, over-fitting typically occurs when the ratio

$$\frac{\text{model complexity}}{\text{amount of training data}}$$

is too high.

Think of over-fitting as a situation where your model learns the training data by heart instead of learning the big picture, which prevents it from being able to generalize to the test data: this happens when the model is too complex with respect to the size of the training data, that is to say when the size of the training data is too small in comparison with the model complexity.

Examples:

  • if your data is in two dimensions, you have 10,000 points in the training set, and the model is a line, you are likely to under-fit.
  • if your data is in two dimensions, you have 10 points in the training set, and the model is a 100-degree polynomial, you are likely to over-fit.


From a theoretical standpoint, the amount of data you need to properly train your model is a crucial yet far-from-answered question in machine learning. One approach to answering this question is the VC dimension. Another is the bias-variance tradeoff.

From an empirical standpoint, people typically plot the training error and the test error on the same plot and make sure that they don't reduce the training error at the expense of the test error.
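As a rough sketch of such a plot (not from the original answer; the sine-plus-noise data and the degree range are illustrative assumptions), one can sweep the polynomial degree and record both errors:

```python
# Training error keeps falling with model complexity while test error turns back up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test) + rng.normal(scale=0.2, size=x_test.size)

degrees = range(1, 13)
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)          # fit a degree-d polynomial
    train_err.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

plt.plot(degrees, train_err, label="training error")
plt.plot(degrees, test_err, label="test error")
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("mean squared error")
plt.legend()
plt.show()
```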


I would advise watching Coursera's Machine Learning course, section "10: Advice for applying Machine Learning".

(PS: please go here to ask for TeX support on this SE.)

Franck Dernoncourt

Posted 2014-05-14T18:09:01.940

Reputation: 2 888

8

No one seems to have posted the XKCD overfitting comic yet.

[xkcd comic on overfitting]

Jeremy Miles

Posted 2014-05-14T18:09:01.940

Reputation: 181

5

That's because of something called the bias-variance dilemma. An overfitted model has a more complex decision boundary because we allow the model more variance. The thing is, not only overly simple models but also overly complex models are likely to misclassify unseen data. Consequently, an over-fitted model is no better than an under-fitted model. That's why overfitting is bad and we need to fit the model somewhere in the middle.

Kim

Posted 2014-05-14T18:09:01.940

Reputation: 151

+1 Thanks, as a result of your answer, I've posted a followup to the question above, "When is a Model Underfitted?"

blunders 2014-06-13T16:59:09.830

3

What got me to understand the problem of overfitting was imagining what the most overfit model possible would be. Essentially, it would be a simple look-up table.

You tell the model what attributes each piece of data has and it simply remembers it and does nothing more with it. If you give it a piece of data that it's seen before, it looks it up and simply regurgitates what you told it earlier. If you give it data it hasn't seen before, the outcome is unpredictable or random. But the point of machine learning isn't to tell you what happened, it's to understand the patterns and use those patterns to predict what's going on.

So think of a decision tree. If you keep growing your decision tree bigger and bigger, eventually you'll wind up with a tree in which every leaf node is based on exactly one data point. You've just found a backdoor way of creating a look-up table.
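A quick sketch of that backdoor look-up table (added here for illustration; the synthetic dataset and the scikit-learn calls are assumptions, not part of the original answer):

```python
# A tree grown without limits ends up with roughly one training point per leaf,
# i.e. a lookup table that nails the training data but stumbles on new data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=0)
tree.fit(X_train, y_train)

print("leaves:", tree.get_n_leaves(), "training points:", len(X_train))
print("train accuracy:", tree.score(X_train, y_train))  # ~1.0: the data is memorized
print("test accuracy: ", tree.score(X_test, y_test))    # considerably lower
```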

In order to generalize your results to figure out what might happen in the future, you must create a model that generalizes what's going on in your training set. Overfit models do a great job of describing the data you already have, but descriptive models are not necessarily predictive models.

The No Free Lunch Theorem says that no model can outperform any other model on the set of all possible instances. If you want to predict what will come next in the sequence of numbers "2, 4, 16, 32" you can't build a model more accurate than any other if you don't make the assumption that there's an underlying pattern. A model that's overfit isn't really evaluating the patterns - it's simply modeling what it knows is possible and giving you the observations. You get predictive power by assuming that there is some underlying function and that if you can determine what that function is, you can predict the outcome of events. But if there really is no pattern, then you're out of luck and all you can hope for is a look-up table to tell you what you know is possible.

Ram

Posted 2014-05-14T18:09:01.940

Reputation: 323

1

You are erroneously conflating two different entities: (1) bias-variance and (2) model complexity.

(1) Over-fitting is bad in machine learning because it is impossible to collect a truly unbiased sample of the population for any data. The over-fitted model results in parameters that are biased towards the sample instead of properly estimating the parameters for the entire population. This means there will remain a difference between the estimated parameters $\hat{\phi}$ and the optimal parameters $\phi^{*}$, regardless of the number of training epochs $n$.

$|\phi^{*} - \hat{\phi}| \rightarrow e_{\phi} \mbox{ as }n\rightarrow \infty$, where $e_{\phi}$ is some bounding value

(2) Model complexity is, in simplistic terms, the number of parameters in $\phi$. If the model complexity is low, then there will remain a regression error regardless of the number of training epochs, even when $\hat{\phi}$ is approximately equal to $\phi^{*}$. The simplest example would be learning to fit a line, $y = mx + c$ where $\phi = \{m, c\}$, to data lying on a curve (a quadratic polynomial).

$E[|y-M(\hat{\phi})|] \rightarrow e_{M} \mbox{ as } n \rightarrow \infty$, where $e_{M}$ is some regression fit error bounding value
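A small numeric sketch of point (2) (illustrative only, not part of the original answer): no matter how many samples $n$ you add, a straight line fitted to quadratic data keeps a non-vanishing error $e_{M}$.

```python
# Fitting y = m*x + c to quadratic data: the residual error plateaus rather than vanishing.
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):                 # ever more training data
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(scale=0.05, size=n)    # data actually lies on a quadratic curve
    m, c = np.polyfit(x, y, 1)                     # best straight-line fit
    mse = np.mean((m * x + c - y) ** 2)
    print(f"n={n:>9,}  mse={mse:.4f}")             # stays near the irreducible bound e_M
```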

Summary: Yes, both sample bias and model complexity contribute to the 'quality' of the learnt model, but they don't directly affect each other. If you have biased data, then regardless of having the correct number of parameters and infinite training, the final learnt model would have error. Similarly, if you had fewer than the required number of parameters, then regardless of perfectly unbiased sampling and infinite training, the final learnt model would have error.

Dynamic Stardust

Posted 2014-05-14T18:09:01.940

Reputation: 633

0

There have been a lot of good explanations about overfitting. Here are my thoughts. Overfitting happens when your variance is too high and bias is too low.

Let's say you have training data, which you divide into N parts. Now, if you train a model on each of those parts, you will have N models. Now find the mean model, and then use the variance formula to compute how much each model deviates from the mean. For overfitted models, this variance will be really high. This is because each model would have estimated parameters that are very specific to the small dataset we fed it. Similarly, if you take the mean model and then find how different it is from the original model that would have given the best accuracy, it wouldn't be very different at all. This signifies low bias.
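Here is a rough sketch of that procedure (illustrative only; the sine-plus-noise data, the polynomial models, and the chunking into N parts with NumPy are assumptions, not part of the original answer):

```python
# Train one model per data chunk and measure how much their predictions vary.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

x_grid = np.linspace(-1, 1, 50)                      # fixed points at which to compare models
N = 10                                               # number of parts

for degree in (1, 12):                               # simple vs. flexible model
    preds = []
    for xs, ys in zip(np.array_split(x, N), np.array_split(y, N)):
        coeffs = np.polyfit(xs, ys, degree)          # one model per chunk
        preds.append(np.polyval(coeffs, x_grid))
    preds = np.array(preds)
    spread = preds.var(axis=0).mean()                # average variance around the mean model
    print(f"degree {degree:>2}: mean prediction variance = {spread:.3f}")
```

The flexible (high-degree) model's predictions scatter far more around their mean than the simple model's, which is the high variance this answer describes.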

To find whether your model has overfitted or not, you could construct the plots mentioned in the previous posts.

Finally, to avoid overfitting you could regularize the model or use cross validation.

Ram

Posted 2014-05-14T18:09:01.940

Reputation: 221