Why use both validation set and test set?

Consider a neural network:

For a given set of data, we divide it into training, validation and test sets. Suppose we do it in the classic 60:20:20 ratio. We then prevent overfitting by validating the network, i.e. by checking it on the validation set. So what is the need to also test it on the test set to check its performance?
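For concreteness, the split I mean looks roughly like this (just a sketch using scikit-learn's train_test_split; the arrays X and y holding the features and labels are assumed):

    # Sketch of a 60:20:20 split; X and y are assumed to hold the features and labels.
    from sklearn.model_selection import train_test_split

    # First hold out 20% of the data as the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Then split the remaining 80% into 60% training / 20% validation (0.25 of 80% is 20%).
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)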

Won't the error on the test set be roughly the same as on the validation set? To the network, both are unseen data, and both sets are the same size.

Instead, can't we enlarge the training set by merging the test set into it, so that we have more training data and the network trains better, and then use the validation set to prevent overfitting? Why don't we do this?

user1825567

Posted 2017-04-13T19:33:53.090

Reputation: 1 106

You'd like it to be the same but you can't be sure because you've touched it for hyperparameter optimization and early stopping, so you need a virgin test set. – Emre – 2017-04-13T20:01:16.827

@Emre But the weights will get adjusted based on the training set and not on the validation set, so the results on the test and validation sets shouldn't be too different. – user1825567 – 2017-04-14T04:24:20.143

No, they don't (get adjusted according to the training set). That's for regular parameters. – Emre – 2017-04-14T04:36:11.107

Answers

Let's assume that you are training a model whose performance depends on a set of hyperparameters. In the case of a neural network, these hyperparameters may be, for instance, the learning rate or the number of training iterations.

Given a choice of hyperparameter values, you use the training set to train the model. But, how do you set the values for the hyperparameters? That's what the validation set is for. You can use it to evaluate the performance of your model for different combinations of hyperparameter values (e.g. by means of a grid search process) and keep the best trained model.
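For illustration, a rough sketch of such a search (assuming train/validation splits named X_train, y_train, X_val, y_val; the model and the hyperparameter grid are example choices, not from the question):

    # Sketch: choose hyperparameter values by evaluating each candidate on the validation set.
    from sklearn.neural_network import MLPClassifier

    best_score, best_model, best_params = -1.0, None, None
    for learning_rate in (1e-3, 1e-2, 1e-1):      # candidate learning rates
        for n_iter in (100, 300, 1000):           # candidate numbers of training iterations
            model = MLPClassifier(learning_rate_init=learning_rate, max_iter=n_iter, random_state=0)
            model.fit(X_train, y_train)           # fit on the training set only
            score = model.score(X_val, y_val)     # evaluate on the validation set
            if score > best_score:
                best_score, best_model, best_params = score, model, (learning_rate, n_iter)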

But how does your selected model compare to other, different models? Is your neural network performing better than, let's say, a random forest trained on the same combination of training/test data? You cannot compare based on the validation set, because that validation set was part of the fitting of your model. You used it to select the hyperparameter values!

The test set allows you to compare different models in an unbiased way, by basing your comparisons on data that were not used in any part of your training/hyperparameter selection process.
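As a sketch of that final, unbiased comparison (assuming best_model is the tuned network from the sketch above, the second model type has been tuned in the same way, and X_test, y_test have never been used before this point):

    # Sketch: the test set is touched exactly once, to compare the finished models.
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)   # in practice the forest's hyperparameters would also be chosen on the validation set

    print("Neural network test accuracy:", best_model.score(X_test, y_test))
    print("Random forest test accuracy: ", forest.score(X_test, y_test))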

Pablo Suau

Posted 2017-04-13T19:33:53.090

Reputation: 1 507

I don't see why you couldn't just treat the model type as another hyperparameter and use the validation data set to perform all model selection activities. Why differentiate between the hyperparameters of a model and different model types? The test set just seems useful for monitoring the performance of the final production model. – SriK – 2020-09-30T10:52:06.487

The test set and cross validation set have different purposes. If you drop either one, you lose its benefits:

  • The cross validation set is used to help detect over-fitting and to assist in hyper-parameter search.

  • The test set is used to measure the performance of the model.

You cannot use the cross validation set to measure performance of your model accurately, because you will deliberately tune your results to get the best possible metric, over maybe hundreds of variations of your parameters. The cross validation result is therefore likely to be too optimistic.

For the same reason, you cannot drop the cross validation set and use the test set for selecting hyper parameters, because then you are pretty much guaranteed to be overestimating how good your model is. In the ideal world you use the test set just once, or use it in a "neutral" fashion to compare different experiments.
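To see why, here is a toy simulation (purely illustrative, with made-up numbers): many hyper-parameter variations that are all equally good in truth, where you keep whichever one scores best on the cross validation set.

    # Sketch: 200 hyper-parameter variations, all with a true accuracy of 0.80,
    # each scored on a validation set of 1000 examples.
    import numpy as np

    rng = np.random.default_rng(0)
    true_accuracy, n_val, n_variations = 0.80, 1000, 200

    # Measured validation accuracy of each variation (binomial sampling noise only).
    val_scores = rng.binomial(n_val, true_accuracy, size=n_variations) / n_val

    print("true accuracy:           ", true_accuracy)
    print("mean validation accuracy:", round(val_scores.mean(), 3))
    print("best validation accuracy:", round(val_scores.max(), 3))   # the figure you would be tempted to report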

If you cross validate, find the best model, then add in the test data to train, it is possible (and in some situations perhaps quite likely) your model will be improved. However, you have no way to be sure whether that has actually happened, and even if it has, you do not have any unbiased estimate of what the new performance is.

From witnessing many Kaggle competitions, my experience is that tuning to the test set by over-using it is a real thing, and it impacts those competitions in a large way. There is often a group of competitors who have climbed the public leaderboard and selected their best model in test (the public leaderboard is effectively a test set), whilst not being quite so thorough on their cross validation . . . these competitors drop down the leaderboard when a new test set is introduced at the end.

One approach that is reasonable is to re-use (train + cv) data to re-train using the hyper-params you have found, before testing. That way you do get to train on more data, and you still get an independent measure of performance at the end.
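A sketch of that re-training step (assuming the splits and the hyper-parameters found earlier; the names best_learning_rate and best_max_iter are just placeholders for whatever your search produced):

    # Sketch: refit on the combined train + cv data with the chosen hyper-parameters,
    # then measure performance once on the untouched test set.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X_train_cv = np.concatenate([X_train, X_val])
    y_train_cv = np.concatenate([y_train, y_val])

    final_model = MLPClassifier(learning_rate_init=best_learning_rate, max_iter=best_max_iter, random_state=0)
    final_model.fit(X_train_cv, y_train_cv)
    print("final test accuracy:", final_model.score(X_test, y_test))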

If you want to get more out of cross validation, the usual approach is k-fold cross validation. A common trick in Kaggle competitions is to use k-fold cross validation, and instead of re-combining the data into a larger (train + cv) training set, to ensemble or stack the cv results into a meta-model.
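Roughly, that out-of-fold trick looks like this (a sketch assuming numpy arrays X and y with binary labels; the base model and meta-model are arbitrary example choices, not a recommendation):

    # Sketch: k-fold cross validation where each fold's out-of-fold predictions
    # become training features for a simple meta-model (stacking).
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neural_network import MLPClassifier
    from sklearn.linear_model import LogisticRegression

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    oof_preds = np.zeros(len(X))                  # out-of-fold prediction for every example

    for train_idx, val_idx in kf.split(X):
        base = MLPClassifier(max_iter=300, random_state=0)
        base.fit(X[train_idx], y[train_idx])
        oof_preds[val_idx] = base.predict_proba(X[val_idx])[:, 1]

    # The meta-model only ever sees predictions made on data the base model was
    # not trained on, so it is not rewarded for the base model's overfitting.
    meta = LogisticRegression()
    meta.fit(oof_preds.reshape(-1, 1), y)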

Finally, always check that your splits for validation and test are robust against possible correlation within your data set.

Neil Slater

Posted 2017-04-13T19:33:53.090

Reputation: 24 613

What do you mean by "robust against possible correlation within your data set"? – user6903745 – 2018-04-13T10:09:51.257

Why couldn't the k-fold cross validation performance results be used as the performance measure of the final model? Why have a separate test set, which anyway provides only a single performance estimate that may be biased since it's based on a specific split of the data? K-fold provides a distribution of performance measures, which is better than a single point estimate. – SriK – 2020-09-30T11:04:49.673

@SriK: You still have the problem of maximisation bias though if the k-fold results have been used to select a model with the best metrics. Perhaps k-fold is more robust to this, but the gold standard is still an independent test set. – Neil Slater – 2020-09-30T11:08:28.860

@NeilSlater Thanks Neil! Just to confirm the term maximization bias: does it mean that if you used a dataset to optimize some parameter, that dataset is somehow tainted and is less reliable as an independent performance measure for the optimized system? Is this right? In practice though, data is precious and I want the most data to train and the least data to reliably evaluate performance and identify the ideal model/hyperparameter (no difference between the two). So can I just use the k-fold CV test score to identify the best model, and also treat that score as a reasonable measure of model performance? – SriK – 2020-09-30T16:34:20.123

@SriK: Difficult to answer all that in a comment. The simplest way to think of maximisation bias: you have 20 models, all with the same true performance, but sample variance in behaviour against the cv set means there is uncertainty when you measure the performance. Selecting the best performing model will likely over-estimate its performance by an expected 2 standard errors over the true performance measure. K-fold cv could give you some understanding of how much you may have over-estimated; you could also estimate it from a metric like accuracy and the size of the cv set. – Neil Slater – 2020-09-30T17:52:46.490

@SriK: In practice of course you don't have 20 equal models, so the logic is harder. I am not sure if there is a way to get meaningful bounds on the bias (maybe ask a new question about that). However, if you are really confident that the standard error in the metric based on your k-fold cv is low, you may feel happy to ignore the bias. You couldn't report the cv result as a true measure in a scientific sense (so it's a bit weak for a published research paper), but you may not care if this is a production system. – Neil Slater – 2020-09-30T17:59:06.963