Why is it wrong to train and test a model on the same dataset?

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images by heart instead of understanding the underlying logic?

karalis1

Posted 2020-12-13T14:11:58.530

Precisely. See overfitting https://en.wikipedia.org/wiki/Overfitting

– njzk2 – 2020-12-14T11:12:24.567

If you were to train your model to predict heads and tails of a coin flip based on the sequence so far, it shouldn't be able to learn anything and can't provide a meaningful prediction of your next flip. But if you test it on the sequence it's already seen, it will be perfect. Would you trust someone who was selling you a "coin flip predictor" with data showing that perfect prediction? – Joel – 2020-12-15T00:37:10.253

The principle is similar to someone saying "I am a very good person": they may simply be overconfident about themselves. You should only conclude that they are good if others say so too. – TrungDung – 2020-12-15T21:39:00.063

Answers

Yes, you put it quite correctly.

As a teacher, you wouldn’t give your students an exam that’s got the exact same exercises you have provided as homework: you want to find out whether they (a) have actually understood the intuition behind the methods you taught them and (b) make sure they haven’t just memorised the homework exercises.

hH1sG0n3

An infamous example I've seen was a classifier designed to detect fish. The training dataset had many pictures of a fisherman holding a fish versus not holding a fish. The classifier seemed to work perfectly on the training data, but later failed to detect any fish under any other conditions. It turned out the neural network focused solely on the fingers of the fisherman, because they were in a consistent position every time he held a fish. – vsz – 2020-12-14T07:07:54.287

Another one: A classifier to detect enemy tanks ... worked perfectly in training, but effectively only learned that (pictures of) friendly tanks were shot during the day, enemy tanks were shot at night. – fho – 2020-12-14T14:01:29.150

I remember reading one about telling wolves apart from dogs, and the model learned that wolves had snow in the background, not that wolves look like wolves. – Dave – 2020-12-14T15:52:39.387

@Dave That one is often misremembered: the researchers were well aware of the snow = wolves (and grass = dogs) association, as it was an intentional flaw they put into the training data set. They weren't building a model to tell wolves from dogs, though: they were looking to see how much trust (certain types of) people would retain in the (trained) algorithm's soundness after watching it perform. It was a psychological experiment. – zibadawa timmy – 2020-12-15T11:08:06.683

None of the three examples offered in the comments so far are good examples of training and testing the model on the same data. They're examples of testing models on data that is substantially different to training, but if you know that sort of data exists you should include some in the training set! A model trained on objects in multiple settings can still overfit if you test it on exactly those images. It might not be as simple as snow/no snow, but the color of a couple of pixels is all you need for a perfect score if you never have to care about out-of-sample classification. – Will – 2020-12-15T12:51:07.823

I think I agree with @Will's comment above. – hH1sG0n3 – 2020-12-15T12:57:05.763

@fho Another similar case - in the early 40s, the Soviets trained dogs to run at tanks while strapped with explosives. One of the many problems was that they had trained the dogs on Soviet tanks - so the dogs would run at their own lines rather than the Germans'. Because Soviet tanks used diesel and German tanks used gasoline, it is thought that the dogs learned "run towards diesel fumes" rather than "run towards tanks".

– Richard Ward – 2020-12-15T15:58:10.923

Another one: a classifier to detect skin cancer actually detected the ruler, which only occurred in the examples of cancer. – Michael Mior – 2020-12-15T21:20:03.333

It is wrong because:

  • it is fundamentally incorrect (a theoretical concern)
  • it leads to bad results (a practical concern)

It is fundamentally incorrect because the usual objective of testing a model is to estimate how well it will perform when predicting on data that it didn't see during training.

It's quite hard to come up with good estimates of real-world performance even when you do everything correctly. If you use training data to estimate performance, the result is worse than useless: it's actively misleading.

There are several ways that doing this can lead to bad results.

Overfitting

If you're training a complex model with a small amount of data, your model is very likely to overfit. In simplified terms, if the model has a lot of "memory" (parameters), it memorizes the training data and fails to understand its underlying structure.

Imagine that you're building a model that predicts house price based on the floor area. Your training set looks like this:

area  price
30    100001
50    150002
80    200003

You train your model, then ask it to predict the price for a house of area=50, and it tells you that the price should be 150002. Is that impressively accurate? Not really. It's just memorizing the training data.

Overfitting is commonly detected through a large difference in performance between the training and test set. If you test on the training set, you're unable to detect overfitting.
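As a minimal sketch of this, using the hypothetical house prices from the table above: here the "model" is a 1-nearest-neighbour lookup, standing in for any high-capacity learner that memorizes its training data. The unseen prices are made-up illustrative numbers.

```python
# A "model" that simply memorises its training table
train = {30: 100001, 50: 150002, 80: 200003}  # area -> price, as above

def predict(area):
    # Return the price of the training house with the closest area
    nearest = min(train, key=lambda a: abs(a - area))
    return train[nearest]

# "Testing" on the training set: every prediction is exact
train_error = max(abs(predict(a) - p) for a, p in train.items())
print(train_error)  # 0

# Testing on unseen houses (hypothetical true prices) tells another story
unseen = {40: 125000, 65: 175000}
test_error = max(abs(predict(a) - p) for a, p in unseen.items())
print(test_error > 10000)  # True: the memorised table generalises badly
```

The perfect training score says nothing about the model; only the error on held-out data reveals the problem.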

Concept drift

Even if you train a very simple model on a large amount of data and there's no overfitting, it's common for models to suffer from concept drift.

This basically means that the underlying structure of the data can change over time. For example, trying to predict how many sales a store is going to make on the weekend after training on data from Monday to Friday.

If your test data is not diverse enough along the time dimension vs the training set, you won't catch that problem.
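A toy sketch of the store example, with hypothetical numbers: a model that only ever saw weekday sales looks accurate when tested on more weekdays, and is badly wrong on the weekend.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily sales: weekdays average ~100 units, weekends ~180
weekday_sales = rng.normal(100, 5, size=50)
weekend_sales = rng.normal(180, 5, size=20)

# "Model": predict the mean of the data it was trained on (Mon-Fri)
prediction = weekday_sales.mean()

weekday_error = abs(prediction - weekday_sales.mean())  # ~0 by construction
weekend_error = abs(prediction - weekend_sales.mean())  # roughly 80 units off
print(weekday_error < weekend_error)  # True
```

Unless the test set includes weekend days, the large weekend error stays invisible.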

goncalopp

Exactly. It is fundamentally incorrect. It is a mis-application of statistics. – Stian Yttervik – 2020-12-14T14:12:20.727

It can happen that the model you train learns "too much" or memorizes the training data, and then it performs poorly on unseen data. This is called "overfitting".

The problem of training and testing on the same dataset is that you won't realize that your model is overfitting, because the performance of your model on the test set is good. The purpose of testing on data that has not been seen during training is to allow you to properly evaluate whether overfitting is happening.

noe

What if the sets have very similar images? Can overfitting be avoided? – karalis1 – 2020-12-13T14:31:17.023

There is something called "data leakage" that consists of having part of your training data inadvertently leaked to the validation/test set. This can give you a false evaluation of overfitting, as you will think that there is no overfitting, but due to the overlap, you may be having overfitting without realizing it. This is why it is important to properly avoid overlaps between the training and test data. – noe – 2020-12-13T14:49:08.267

A typical example of data leakage in image classification is when you do data augmentation (e.g. creating new images by rescaling, rotating, or applying other transformations to images) before splitting into training and test data.

– noe – 2020-12-13T14:49:27.583
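A sketch of that failure mode with stand-in "images" (just names, no real image processing; the 3-variant `augment` helper is hypothetical): augmenting before the split puts variants of the same source image on both sides, while splitting first keeps them apart.

```python
import random

random.seed(0)
images = [f"img{i}" for i in range(10)]

def augment(img):
    # Stand-in for rotations/rescales: each image yields 3 variants
    return [f"{img}_rot{k}" for k in (0, 90, 180)]

def sources(batch):
    # Which original images the variants in a batch came from
    return {name.split("_")[0] for name in batch}

# WRONG: augment first, then split the augmented pool
augmented = [v for img in images for v in augment(img)]
random.shuffle(augmented)
train, test = augmented[:24], augmented[24:]
leaked = sources(train) & sources(test)
print(len(leaked))  # almost certainly nonzero: sources span the split

# RIGHT: split the original images, then augment each side separately
random.shuffle(images)
train_imgs, test_imgs = images[:7], images[7:]
train2 = [v for i in train_imgs for v in augment(i)]
test2 = [v for i in test_imgs for v in augment(i)]
print(len(sources(train2) & sources(test2)))  # 0, by construction
```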

And overfitting can only be detected with a separate test set, not avoided. In order to avoid overfitting there are techniques like "regularization". – noe – 2020-12-13T14:50:52.100

Even in the "regular" world of data processing, QA doesn't test with the same data that the programmer developed against, because the program is -- by nature -- biased towards the test data he programmed against. – RonJohn – 2020-12-14T06:30:37.767

Just to make it clear - the problem with overfitting is not just that the model becomes too good on the training data, it's (also) that it gets better on the training data but worse on other data. – kutschkem – 2020-12-14T16:06:25.570

Right, I clarified it. Thanks – noe – 2020-12-14T16:09:13.507

@ncasas Regarding data leakage: I once created a code recommendation engine based on the name of the function the programmer is currently writing. The idea was that inside getXXX methods, calls to other getXXX methods would be more likely than calls to setXXX. It worked surprisingly well, but the big catch is that in my training (and test!) data, a very big proportion was actually generated code (from ANTLR, I think). So I suspect the model just learned what ANTLR-generated code looks like, but I didn't have time to clean up the data and try again, so I will never know for sure. – kutschkem – 2020-12-14T16:14:47.493

@ncasas Another example: I worked on a project classifying Wikipedia articles, and it was important to exclude the comment section from the features, because otherwise we would be training on discussions about the very issues we were trying to detect. – kutschkem – 2020-12-14T16:17:08.597

Simple answer: circular reasoning. The fact that your model "knows" the answer to something you've already told it the answer to really doesn't prove anything.

Put another way: the entire point of testing is to get some sense of how well your model would do with data it hasn't seen yet, and testing it with data that it has already seen doesn't do that.

EJoshuaS - Reinstate Monica

Is it possible that the model starts to learn the images by heart instead of understanding the underlying logic?

If the model memorizes the training data when that same data is used for the "test" set, it would still memorize the training data when different data was used for the "test" set. Using a separate "test" set cannot prevent that memorization from happening. More generally, the "test" set has no direct impact on model training.

However, the separate "test" set does allow the researchers to identify that the model is indeed memorizing the individual data samples and targets instead of learning the underlying patterns. When the researchers see the loss decreasing on the training set but increasing on the "test" set, they know overfitting is taking place. At that point they can tune the model's hyperparameters, specifically trying to lower the model's capacity (i.e. the number of nodes and/or layers), and then retrain the model to see if the issue has been resolved.

Because the researchers use the "test" set to tune the model's hyperparameters, the "test" data can have an indirect impact on the final model's performance. It could be that the researchers pick hyperparameters that work well for the "test" set, but not for the data in general. For that reason, it is sometimes recommended to use 3 distinct data sets: the training set used to train the model, the initial "test" set used to address overfitting and other issues by tuning hyperparameters (this is more commonly known as the validation set), and a final test set which is only used to evaluate the finalized model (and has no impact, direct or indirect, on model training).
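A minimal sketch of that three-way split (the 60/20/20 ratio is just an assumed convention, not a rule):

```python
import random

random.seed(0)
data = list(range(100))  # stand-ins for (sample, label) pairs
random.shuffle(data)

train = data[:60]          # used to fit model parameters
validation = data[60:80]   # used to tune hyperparameters / detect overfitting
test = data[80:]           # used exactly once, to report final performance

# The three sets must be mutually disjoint
assert not set(train) & set(validation)
assert not set(validation) & set(test)
assert not set(train) & set(test)
print(len(train), len(validation), len(test))  # 60 20 20
```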

Leland Hepworth

To express it in a different way, that might be more useful when explaining to impatient stakeholders:

Imagine that you go to a travelling fair and a lady with many shawls and a crystal ball tells you, "I can look at a person and tell them if they are married or not." You are not sure if this is for real.

If she starts pointing at her colleagues from the fair and tells you, "he is married, she isn't, the other woman also isn't" - what does this tell you? Nothing. She already knows these people, she knows who is married! To start trusting her ability, you want her to make her guesses about people she's never seen.

In data science, you always face the question of whether people (including you!) should trust the model. It can prove itself by showing that it can find information it didn't know beforehand. It has to know its training data by definition, so your only option is to keep some data "hidden" from it (the test set).

In fact, if you suspect your data is too uniform, it is ideal to do a second round of testing with a different dataset created in a different way, to confirm the model works in general. This is done mostly in science, when data is available; e.g. if you trained and tested on patients from one hospital, you would ideally try it on patients from a different hospital, in case data was coded differently, or you had selection bias, or whatever.

rumtscho

To give a simple illustration of how bad overfitting can be, consider fitting (training) a polynomial of order equal to the number of data points you have. In this case I've generated data with a slope and some normally distributed random noise added. If you test the fit with exactly the same x and y values that you used to generate it, by looking at the residuals, all you see is numerical error, and you might naively say it's a good fit, or at least better than the linear fit (plotted in green), which has much larger residuals. But if you plot the actual polynomial you get (in red), you'll see that it does a terrible job of interpolating between these test points, since we know that the underlying process is simply a straight line:

[Figure: training data with the "perfect" degree-14 polynomial fit (red) and a linear fit (green); the polynomial's residuals are zero up to numerical error]

If you generate a new set of data with the same x-values, you see that as well as failing at interpolating, this performs about the same as the linear fit in terms of residuals:

[Figure: the same fits tested on new samples; the polynomial's residuals are now about as large as the linear fit's]

And perhaps worst of all, it fails spectacularly when attempting to extrapolate as the polynomial predictably blows up in both directions:

[Figure: extrapolation beyond the training range; the polynomial blows up in both directions while the linear fit tracks the data]

So if "prediction" for your model means interpolating, overfitting makes it bad at that, and you won't detect the problem unless you test on non-training data. If prediction means extrapolating, it's most likely even worse at that than at interpolating, and again you won't be able to tell unless you test on the right kind of data.

Python code used to generate these plots:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Generate linear data with a small amount of Gaussian noise
nSamples = 15
slope = 10
xvals = np.arange(nSamples)
yvals = slope*xvals + np.random.normal(scale=slope/10.0, size=nSamples)

plt.figure(1)
plt.clf()
plt.subplot(211)
plt.title('"Perfect" polynomial fit')
plt.plot(xvals, yvals, '.', markersize=10)

# Fit a degree-(nSamples-1) polynomial: enough parameters to pass
# exactly through every training point
polyCoeffs = np.polyfit(xvals, yvals, nSamples-1)
poly_model = np.poly1d(polyCoeffs)
# Ordinary linear fit for comparison
linearCoeffs = np.polyfit(xvals, yvals, 1)
linear_model = np.poly1d(linearCoeffs)
# Dense grid for plotting smooth fit curves
xfit = np.linspace(0, nSamples-1, num=nSamples*50)
plt.plot(xfit, poly_model(xfit), 'r')
plt.plot(xfit, linear_model(xfit), 'g')

plt.subplot(212)
plt.plot(xvals, poly_model(xvals) - yvals, 'r.')
plt.plot(xvals, linear_model(xvals) - yvals, 'g.')
plt.title('Fit residuals for training data (nonzero only due to numerical error)')

#%% Testing interpolation
plt.figure(2)
plt.clf()
# New samples from the same process (note: larger noise scale than training)
test_yvals = slope*xvals + np.random.normal(scale=slope, size=nSamples)
plt.subplot(211)
plt.title('Testing "perfect" polynomial fit with new samples')
plt.plot(xvals, test_yvals, '.', markersize=10)
plt.plot(xfit, poly_model(xfit), 'r')
plt.plot(xfit, linear_model(xfit), 'g')

plt.subplot(212)
plt.title('Fit residuals for test data')
plt.plot(xvals, poly_model(xvals) - test_yvals, 'r.')
plt.plot(xvals, linear_model(xvals) - test_yvals, 'g.')

#%% Testing extrapolation
extrap_xmin = -5
extrap_xmax = nSamples + 5
xvals_extrap = np.arange(extrap_xmin, extrap_xmax)
yvals_extrap = slope*xvals_extrap + np.random.normal(scale=slope, size=len(xvals_extrap))

plt.figure(3)
plt.clf()
plt.subplot(211)
plt.title('Testing "perfect" polynomial fit extrapolation')
plt.plot(xvals_extrap, yvals_extrap, '.', markersize=10)
plt.plot(xvals_extrap, poly_model(xvals_extrap), 'r')
plt.plot(xvals_extrap, linear_model(xvals_extrap), 'g')

plt.subplot(212)
plt.title('Fit residuals for extrapolation')
plt.plot(xvals_extrap, poly_model(xvals_extrap) - yvals_extrap, 'r.')
plt.plot(xvals_extrap, linear_model(xvals_extrap) - yvals_extrap, 'g.')

llama

There is nothing wrong with testing the model on data you trained on, but there is something wrong with not testing your model on data it has not seen before.

As other answers have explained, tests on the data that the model was trained on are by no means a substitute for tests on new data, and if your model is overfitting these results can be very different. However, testing the model on the data it trained on is still valuable. For example, if your model does not fit its own training data well, then you know you have an underfitting problem: your model is too simple.

In other words, if your out-of-sample accuracy is 0.6, you would proceed differently depending on your in-sample accuracy: if it is 0.999, you are overfitting; if it is 0.62, you are underfitting.

Usually you look at both, as together they help guide the direction of model improvement.
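That rule of thumb can be written down as a small diagnostic. The 0.1 gap and 0.7 floor are assumed illustrative thresholds, not universal constants:

```python
def diagnose(in_sample_acc, out_of_sample_acc, gap=0.1, floor=0.7):
    # Large train/test gap -> the model memorised its training data
    if in_sample_acc - out_of_sample_acc > gap:
        return "overfitting"
    # Poor fit even on the training data -> the model is too simple
    if in_sample_acc < floor:
        return "underfitting"
    return "looks reasonable"

print(diagnose(0.999, 0.6))  # overfitting
print(diagnose(0.62, 0.6))   # underfitting
```

Note that neither diagnosis is possible if the only number you have is accuracy on the training set.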

Akavall
