## What is the difference between test set and validation set?

315

226

I found this confusing when I used the neural network toolbox in Matlab.
It divided the raw data set into three parts:

1. training set
2. validation set
3. test set

I notice that in many training or learning algorithms, the data is often divided into two parts: the training set and the test set.

My questions are:

1. What is the difference between the validation set and the test set?
2. Is the validation set really specific to neural networks, or is it optional?
3. To go further, is there a difference between validation and testing in the context of machine learning?

@mpiktas Are you referring to the chapter "Model Assessment and Selection"? – Celdor – 2015-06-01T07:16:11.973

1Yes. The page number was from 5th print edition. – mpiktas – 2015-06-01T07:20:27.720

You might want to also see: http://stats.stackexchange.com/questions/9357/why-only-three-partitions-training-validation-test/9364#9364, where the question was "Why not more than three?"

– Wayne – 2015-09-29T14:28:35.817

44

The question is answered in the book Elements of Statistical Learning, page 222. The validation set is used for model selection; the test set is used to estimate the prediction error of the final model (the one chosen by the selection process).

– mpiktas – 2011-11-28T11:47:26.637

6@mpiktas is spot on. Here is the actual text: `The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis.` – arun – 2016-07-15T18:01:42.007

The book "Elements of Statistical Learning" is now available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf

– moi – 2017-07-12T08:15:14.623

@mpiktas There is some logic that I am missing: If the validation set is used for model selection, i.e., choose the model that has the best performance on the validation set (rather than the model that has the best performance on the training set), then is it just another overfitting? i.e., overfitting on the validation set? Then how can we expect that the model with the best performance on the validation set will also have best performance on the test set among all the models you are comparing? If the answer is no, then what's the point of the validation set? – KevinKim – 2018-02-28T16:34:28.923

201

Normally to perform supervised learning you need two types of data sets:

1. In one dataset (your "gold standard") you have the input data together with the correct/expected output. This dataset is usually carefully prepared, either by humans or by collecting data in a semi-automated way. It is important that you have the expected output for every data row here, because supervised learning needs it.

2. The data you are going to apply your model to. In many cases this is the data for which you are interested in the output of your model, and thus you don't have any "expected" output here yet.

While performing machine learning you do the following:

1. Training phase: you present your data from your "gold standard" and train your model, by pairing the input with expected output.
2. Validation/Test phase: you apply the trained model to gold-standard data that was not used for training, in order to estimate how well your model has been trained (this depends on the size of your data, the value you would like to predict, the input, etc.) and to estimate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision for IR models, etc.).
3. Application phase: now you apply your freshly-developed model to the real-world data and get the results. Since you normally don't have any reference value in this type of data (otherwise, why would you need your model?), you can only speculate about the quality of your model output using the results of your validation phase.

The validation phase is often split into two parts:

1. In the first part you just look at your models and select the best performing approach using the validation data (=validation)
2. Then you estimate the accuracy of the selected approach (=test).

Hence the common 50/25/25 split into training, validation and test sets.

If you don't need to choose an appropriate model from several rival approaches, you can re-partition your data so that you have only a training set and a test set, without a separate validation step. I personally partition them 70/30 in that case.
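As a minimal sketch of the two partitioning schemes mentioned above (50/25/25 and 70/30), the following uses only the Python standard library. The helper name `split_dataset` and the toy `rows` list are assumptions made for illustration; the fractions are the ones given in the answer.

```python
import random

def split_dataset(rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle rows and cut them into consecutive parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random assignment, reproducible via seed
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes whatever is left, so no row is lost to rounding.
        if i == len(fractions) - 1:
            end = len(shuffled)
        else:
            end = start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

rows = list(range(100))  # stand-in for 100 labelled examples

# Several rival models to choose from: training / validation / test.
train, validation, test = split_dataset(rows, (0.5, 0.25, 0.25))

# No model selection needed: training / test only.
train_only, test_only = split_dataset(rows, (0.7, 0.3))

print(len(train), len(validation), len(test))
print(len(train_only), len(test_only))
```

A purely random shuffle is only one option; grouped or time-ordered data may call for a different way of assigning rows to the parts.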

11Why wouldn't I choose the best performing model based on the test set, getting rid of the validation set altogether? – Sebastian Graf – 2014-11-09T14:31:06.863

1Is it because of overfitting? Or because we want some independent statistics based on the test result, just for error estimation? – Sebastian Graf – 2014-11-09T14:42:45.453

7@Sebastian [If you only use the test set: ]"The test set error of the final chosen model will underestimate the true test error, sometimes significantly" [Hastie et al] – user695652 – 2015-06-02T20:09:37.277

12The validation set is often used to tune hyper-parameters. For example, in the deep learning community, tuning the network layer size, the number of hidden units, and the regularization term (whether L1 or L2) is done on the validation set – xiaohan2012 – 2015-10-13T10:52:02.300

2What is the correct way to split the sets? Should the selection be random? What if you have pictures that are similar? Won't this damage your ability to generalize? If you have two sets taken in separate locations wouldn't it be better to take one as training set and the other as the test set? – Yonatan Simson – 2016-02-03T10:36:41.503

@user695652 I see you quote the Elements of Statistical Learning. But I don't understand intuitively why this is true? When I train my model on the training data set, I did not use any data in the test data set. Also, if I didn't do any feature engineering, i.e., I just use the original set of features in my data set, then there shouldn't be any information leakage. So in this case, why I still need the validation set? Why if I just use the test set, it will underestimate the true test error? – KevinKim – 2017-03-26T04:22:56.193

Is it like validation is testing against the known, and 'testing' is against the unknown? – Sudip Bhandari – 2017-06-23T11:18:37.240

1@YonatanSimson Models don't usually generalize well enough that you could train in only one location and have it work well in the other one, so the only reason you would do that is if you don't care about your model working as well as possible, but do care about testing how well your model generalizes. When your test set comes from the same distribution as the training set, it still tells you how much you overfit because the data isn't exactly the same, and overfitting is about working only on the exact data in your training set. – alltom – 2017-07-20T13:49:56.730

1@KevinKim user695652 is saying that you will underestimate the true test error if you use the test set to train hyperparameters (size of model, feature selection, etc) instead of using a validation set for that. If you're saying that you don't train any hyperparameters, then you also don't need a validation data set. – alltom – 2017-07-20T13:56:01.613

Is it possible to use the validation set for testing? – Aadnan Farooq A – 2017-09-26T05:49:32.373

@alltom I see. But there is still some logic that I am missing: If the validation set is used for model selection, i.e., choose the model that has the best performance on the validation set (rather than the model that has the best performance on the training set), then is it just another overfitting? i.e., overfitting on the validation set? Then how can we expect that the model with the best performance on the validation set will also have best performance on the test set among all the models I am comparing? If the answer is no, then what's the point of the validation set? – KevinKim – 2018-02-28T16:37:56.037

@KevinKim You train a model with examples from the training set, then evaluate the model with examples from the validation set—which it has never seen—to choose the model that generalizes the best. The model that does the best on the validation data with no additional training is most likely to do the best on other data sets (such as the test set), so long as they're all drawn from the same distribution. – alltom – 2018-03-05T04:45:56.053

185

Training set: a set of examples used for learning, that is, to fit the parameters of the classifier. In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule.

Validation set: a set of examples used to tune the parameters of a classifier. In the MLP case, we would use the validation set to find the “optimal” number of hidden units or to determine a stopping point for the back-propagation algorithm.

Test set: a set of examples used only to assess the performance of a fully-trained classifier. In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further!

Why separate test and validation sets? The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model.

Source: Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University
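The three roles can be sketched in a few lines. This is an illustration under stated assumptions, not the source's method: a 1-D k-nearest-neighbour classifier stands in for the MLP, the number of neighbours `k` plays the part of the "number of hidden units", and the data are synthetic. For k-NN, "fitting" on the training set just means storing it.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (labels are 0/1)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, data, k):
    """Fraction of examples in `data` that are misclassified."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)

def make_point():
    # Synthetic example: the label is 1 when x > 0, flipped with probability 0.15.
    x = rng.uniform(-1, 1)
    noisy = rng.random() < 0.15
    return x, int(x > 0) ^ int(noisy)

data = [make_point() for _ in range(300)]
train, validation, test = data[:150], data[150:225], data[225:]

# Validation set: choose the hyper-parameter k (odd values avoid tied votes).
candidates = [1, 3, 5, 9, 15, 25]
val_errors = {k: error_rate(train, validation, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# Test set: used once, after the choice is final, and not fed back into tuning.
test_error = error_rate(train, test, best_k)

print("validation errors:", val_errors)
print("chosen k:", best_k, "test error:", test_error)
```

The validation error of the chosen `k` is the minimum over several candidates, which is why it tends to look better than the error on the untouched test set.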

27+1 for "YOU MUST NOT tune the model any further!" – stmax – 2014-05-27T09:51:03.373

3What is the difference between "fit the parameters" and "tune the parameters"? – Metariat – 2015-08-06T16:21:27.157

Very clear explanation: +1 – VolAnd – 2015-10-05T13:52:47.887

7@stmax Not to be pedantic, but once we have our final test error and we are NOT satisfied with the result, what do we do, if we can't tune our model any further?... I have often wondered about this case. – Spacey – 2016-10-09T01:50:04.353

4@Tarantula you can continue tuning the model, but you'll have to collect a new test set. Of course no one does that ;) but violating that (especially when you repeat it several times) might lead to your model fitting the test set - which results in unrealistic / too optimistic scores. – stmax – 2016-10-11T07:39:08.160

@stmax say I use the default hyper parameters in my model (e.g., lambda in Lasso) and I just train it on the train set to get my weight parameters (those linear coefficients), then I skip the validation set and directly apply my model to the test set, then the error should be an unbiased estimator of the true error right? But if in the second time, I tune the lambda in validation set, that gives me a lower validation error than the default one, then I apply the tuned model to the test set, see it gives a higher test error than the default model, then what should I do? Can I improve anything? – KevinKim – 2017-03-26T04:35:17.920

I think this nomenclature is confusing. You are correct to say "YOU MUST NOT tune the model any further" after using the test set, but... what are you meant to do? Stop work on it? In reality you need a whole hierarchy of test sets. 1: Validation set - used for tuning a model, 2: Test set, used to evaluate a model and see if you should go back to the drawing board, 3: Super-test set, used on the final-final algorithm to see how good it is, 4: hyper-test set, used after researchers have been developing MNIST algorithms for 10 years to see how crazily overfit they are... etc. etc. – Timmmm – 2017-12-23T18:11:34.803

49

My five years of experience in Computer Science have taught me that nothing is better than simplicity.

The concept of 'Training/Cross-Validation/Test' Data Sets is as simple as this. When you have a large data set, it's recommended to split it into 3 parts:

- Training set (60% of the original data set): This is used to build up our prediction algorithm. Our algorithm tries to tune itself to the quirks of the training data set. In this phase we usually create multiple algorithms in order to compare their performances during the Cross-Validation Phase.

- Cross-Validation set (20% of the original data set): This data set is used to compare the performances of the prediction algorithms that were created based on the training set. We choose the algorithm that has the best performance.

- Test set (20% of the original data set): Now we have chosen our preferred prediction algorithm but we don't know yet how it's going to perform on completely unseen real-world data. So, we apply our chosen prediction algorithm on our test set in order to see how it's going to perform so we can have an idea about our algorithm's performance on unseen data.

Notes:

- It's very important not to skip the test phase: the algorithm that performed well during the cross-validation phase is not necessarily the best one, because the algorithms were compared on the cross-validation set, with its quirks and noise.

- During the test phase, the purpose is to see how our final model is going to fare in the wild, so if its performance is very poor we should repeat the whole process starting from the training phase.

1it is easy and confusing to refer to the sets as phases and vice versa. – Matt O'Brien – 2015-03-28T20:51:24.077

@MattO'Brien Yeah you are right. It should have been better to use only one word. – innovIsmail – 2015-06-03T02:49:47.000

So I need to set aside the test set in the beginning to avoid contamination of data. Then from the remaining data I can run cross-validation multiple times, each time selecting training set and cross-validation set randomly. Would that be correct? – Fazzolini – 2016-12-09T07:47:11.407

@innovIsmail What if I skip the validation step? Say I have many algorithms and I trained them on the train set, then I just apply all of them to the test set, then I pick the one that has the best perform on the test set – KevinKim – 2017-03-26T04:42:44.637

1It sounds to me like you're then just skipping the test step. – Mihai Danila – 2017-05-05T03:03:02.547

> compare the performances of the prediction algorithms - what is "an algorithm" in this context? Isn't your model an algorithm? Does one have to build several models and train them separately to get several phases to validate? – Boppity Bop – 2017-05-28T12:12:22.923

If you repeat the whole process you're going to need another test set. – Timmmm – 2017-12-23T18:12:33.887

28

At each step that you are asked to make a decision (i.e. choose one option among several options), you must have an additional set/partition to gauge the accuracy of your choice so that you do not simply pick the most favorable result of randomness and mistake the tail-end of the distribution for the center (see note 1).

(Image not reproduced here: a bell curve, see note 2.) The left is the pessimist. The right is the optimist. The center is the pragmatist. Be the pragmatist.

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc). For each of your algorithms, you must pick one option. That’s why you have a training set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That’s why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters or you have chosen not to use them and that is the source of your confusion.

1 It is often helpful to go into each step with the assumption (null hypothesis) that all options are the same (e.g. all parameters are the same or all algorithms are the same), hence my reference to the distribution.

2 This image is not my own. I have taken it from this site: http://www.teamten.com/lawrence/writings/bell-curve.png
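The "tail-end mistaken for the center" effect can be simulated under the null hypothesis from note 1. In this sketch every candidate model has the same true accuracy; all constants are illustrative assumptions, and for simplicity each model's validation score is drawn independently, whereas real models scored on one shared validation set would have correlated scores.

```python
import random

rng = random.Random(0)

TRUE_ACCURACY = 0.7   # every candidate is equally good (null hypothesis)
N_MODELS = 50         # candidates compared on the validation set
N_VAL = 100           # validation examples per evaluation
N_TEST = 100          # test examples per evaluation
N_TRIALS = 200        # repetitions of the whole selection procedure

def observed_accuracy(n_examples):
    """Accuracy measured on n_examples for a model whose true accuracy is TRUE_ACCURACY."""
    hits = sum(rng.random() < TRUE_ACCURACY for _ in range(n_examples))
    return hits / n_examples

winner_val, winner_test = [], []
for _ in range(N_TRIALS):
    # Pick the candidate with the best validation score.
    val_scores = [observed_accuracy(N_VAL) for _ in range(N_MODELS)]
    winner_val.append(max(val_scores))
    # Score the winner on fresh data that played no part in the choice.
    winner_test.append(observed_accuracy(N_TEST))

mean_val = sum(winner_val) / N_TRIALS
mean_test = sum(winner_test) / N_TRIALS
print("winner's mean validation accuracy:", round(mean_val, 3))
print("winner's mean test accuracy:      ", round(mean_test, 3))
```

The winner's validation score is the maximum of many noisy measurements, so it sits in the upper tail, while its score on fresh test data falls back toward the true accuracy.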

2I think the first sentence captures the fundamental answer to this question better than any of the other answers. "At each step that you are asked to make a decision (i.e. choose one option among several options), you must have an additional set/partition to gauge the accuracy of your choice..." – kobejohn – 2016-04-06T23:25:24.853

One question: If I want to find the best RandomForest (RF), pretending there is only one hyper-parameter of RF, which is the number of trees (N), then in step 1, I run many RFs with different N to build the forests; in step 2, I apply them to the validation set and pick the RF with the N that gives the lowest error on the validation set; then in step 3, I apply the RF with that N to the test set and get an unbiased estimate of the true test error of this RF. But I could apply all my RFs to the test set and pick the one with the lowest test error, which may not be the one with that N. Then what is the point of doing validation step 2? – KevinKim – 2017-03-26T04:53:58.730

1@KevinKim : If you apply your test set to all RFs and use the results to make a further choice (pick another model), then you've just repeated the validation step. You have set your mind on "I need to get the lowest error with a model!". That is the point of training and validating, NOT testing. Testing is only about: I've trained and picked a model, now let's see how it performs "in general". Obviously the "general" test set is just another slice of data that may or may not be overfit, but the point is that YOU haven't knowingly overfit your model to it by choices. – Honeybear – 2018-03-01T10:49:59.747

The three-wise split is just a very common approach (A) to give you an idea of how the model generalizes (B) with limited effort and (C) limited observed data. If you want to do better in terms of (B), you can do what you are suggesting: Use different validation sets to finetune for generalization. With limited data that is called cross-validation: Repeat the training and validation with varying training and test sets (for neural networks where training may take weeks this is not a thing). – Honeybear – 2018-03-01T11:55:05.223

In terms of (C): With more data you can get better at training (more generally trained models), validation (picking a more general model) and testing (a better idea of the model's generalization), depending which set you expand... but designing the ML cycle is usually not about getting more data. Doing a three-way split ONCE is a trade-off between (A), (B) and (C). – Honeybear – 2018-03-01T11:55:08.900

@Honeybear Now I am thinking about: will the "winner" in the validation step also be the winner in the test step? To be more concrete, if in the validation step, I found that my RF with N=N_1 has the best performance, then will this RF (with N=N_1) also beat my other RFs on the test set? I am just talking about the relative ranking order, not the estimation of the actual performance of the model – KevinKim – 2018-03-01T16:13:51.307

@KevinKim: Maybe - ideally it should. But the test set is just another slice of data... if your model N* (the winner, say RF with N=N_1) performs well on the training set AND the validation set, it is just a well reasoned assumption that it will perform well on a test set, another (unseen) set of related data. That's what you're testing. If another model N' performs worse on the validation set, it may still perform better on the test set. But usually you don't "test" for that, since if you choose to do anything with it, it's a (strange) validation step. – Honeybear – 2018-03-01T20:33:06.400

Also, if it happens and there are drastic performance differences, there probably is something wrong with your data. A reason can be skewed data, e.g. by chance exactly those data points that are well represented by N* are in the validation set and those data points that are well represented by N' are in the test set. Something like this can be avoided by cross-validation or stratified sampling (by moderating how the sets are composed). By doing only one random split it may always be a bad one and cases like you describe may occur. – Honeybear – 2018-03-01T20:35:20.767

BUT this problem gets smaller with a larger dataset and if you have thousands of samples, a skew like that is unlikely and a one-time three-way split is "good enough" to still learn well and get representative validation / test results. ... I hope that answers your question? – Honeybear – 2018-03-01T20:35:52.683

Having a model N* that performs well on the validation set and badly on the test set and a model N' that performs badly on the validation set and well on the test set shows that you have two models that specialize ("overfit") on one aspect of your problem. If that should ever be the result, it wouldn't be clear which one to pick and you might add another ensemble strategy ("another" as RF already is an ensemble of decision trees), decide that your data split is garbage, or decide that your model is too high-dimensional for your problem space. – Honeybear – 2018-03-01T20:44:26.860

@Honeybear I originally think that the whole point of picking the winner of the validation set is that it will also be the winner of the test set (assume the validation set and the test set all have same distribution). But my second thought is: picking the winner on the validation set is just overfitting your model on the validation set, which is essentially the same as the overfitting issue on the training set. Then I don't understand why we still need the validation set. I feel cross validation (e.g., divide all your training set in to 10 chunks randomly) is a better way for model selection – KevinKim – 2018-03-01T21:57:30.517

@KevinKim You hope that the winner of the validation set will be the winner on the test set, but that's not always the case, which is exactly why the test set exists. When you deploy your best performing validation-set model to production in the present you don't know for sure that it will actually provide the best predictions in the future. Revealing the test set only once, ever, simulates live production data where you don't have the benefit of hindsight. – Ryan Zotti – 2018-03-01T22:21:40.313

@RyanZotti Then why not skip the validation set step, i.e., directly use the best model on the training set, then hope it will be a good model, then use a test set only once. I know this is definitely overfitting. But in essence, you use the winner of the validation set is also overfitting the validation set, right? (Or I miss something very important about the overfitting issue on the training and validation set?) I would still say that the n-fold CV (which uses all the training set) is a better way for model selection, though the CV error is an optimistic estimation of the true error rate – KevinKim – 2018-03-01T23:53:23.783

@KevinKim Yes, the winner of the validation set is overfitting to the validation set, hence the test set. I suppose cross validation could be viewed as a substitute for the validation set. The most important point is that the final, test set be used only once and should not influence any decisions (other than to not deploy any model). – Ryan Zotti – 2018-03-02T02:08:09.853

@KevinKim Yes, I think you are missing an important point: that (for most models) there is a huge difference between overfit during training and "overfit" during validation. Training is tailoring and changing a model (curve control points, NN weights, tree splits...), and that is where overfit - i.e. tailoring to specific data points - can quickly happen. If "testing" with unseen data for overfit on the training set (validation) produces good results, it is likely a general model and you pick it. Now you want to test again (test set) to give a "final" result on data you haven't based decisions on. – Honeybear – 2018-03-02T09:13:50.003

Sure, cross-validation is better, since you do the first steps 10 times, but it is a lot more expensive since it implies training N models 10 times. As I already pointed out, doing it only once is a compromise, since you can't always afford that much resources for training. But doing CV, ALL your data has been used for both training and validation at some point, so you should still keep a test set that you apply in the end, since only then you have data that didn't play any role during model training and selection and will give "independent" results. – Honeybear – 2018-03-02T09:15:52.913

An example: Green, black and yellow (not shown) are 3 models G, B and Y, trained on the shown data (training set). Judging by training alone, you'd say G is the best, since it fits 100%. Hence a validation set, where you'll get something like G: 80%, B: 97%, Y: 89%. Those results indicate that G overfit while B and Y didn't. B performs best, so you'd pick it. One might argue that you just generated models until one fit your validation set perfectly. Hence a test set, where you can say: This is the independent performance of the model I chose.

– Honeybear – 2018-03-02T09:46:42.410

1BUT: How the model will perform "in the real world" is still unknown. It is just a validated and tested assumption, that it will perform well on unseen data and for scientific purposes this is usually deemed enough. If you now go again and generate and pick models, until one fits validation set AND test set perfectly, then you degenerated your test set to a validation set. Better do cross-validation for that. In case your performance is constantly significantly worse on the test set, it always is an option that your data is just split badly and you'd want to restart with re-shuffled sets. – Honeybear – 2018-03-02T09:53:30.083

@Honeybear I agree that there is some difference between overfitting on those "parameters" (e.g., weights in the NN) and overfitting on those "hyper-parameters" (e.g., number of hidden layers in NN). But with one validation set, picking the winner is still overfitting (on the hyper-parameters). In the G,B,Y example (these colors are hyper-parameters), it could be that in one validation set, B is the best, but in another validation set, Y may be the best. Ideally, if we could generate many validation sets independently and compute the average performance of G,B,Y, then that average... – KevinKim – 2018-03-02T13:56:24.697

@Honeybear ...performance would be a very robust and accurate estimate of the true performance of G,B,Y. So, this is a "clean" version of the traditional n-fold CV. So n-fold CV is trying to mimic this ideal situation, right? But if n-fold CV requires too much computation in a problem, then maybe the only practically meaningful thing we can do is use this one validation set. Though it is not as "accurate" as n-fold CV, it is still much better than just picking the winner based on the training performance – KevinKim – 2018-03-02T14:28:58.657

12

It does not follow that you need to split the data in any way. The bootstrap can provide smaller mean squared error estimates of prediction accuracy using the whole sample for both developing and testing the model.
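The answer does not say which bootstrap variant it has in mind. As a sketch only, the following assumes one common choice, the optimism-corrected bootstrap: fit on the whole sample, then use resamples to estimate how much the apparent (in-sample) error flatters the model. The straight-line model, the synthetic data and the replicate count are all illustrative assumptions.

```python
import random

def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # guard against a degenerate resample
    return slope, mean_y - slope * mean_x

def mse(model, points):
    """Mean squared prediction error of a (slope, intercept) model on points."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
sample = []
for _ in range(40):
    x = rng.uniform(0, 5)
    sample.append((x, 2.0 * x + rng.gauss(0, 1)))  # synthetic data, noise sd 1

# Apparent error: the model is fitted and evaluated on the same whole sample.
apparent = mse(fit_line(sample), sample)

# Optimism: for each resample, compare the resample-fitted model's error on the
# original sample with its error on the resample it was fitted to.
optimism = []
for _ in range(500):
    boot = [rng.choice(sample) for _ in sample]  # draw with replacement
    model = fit_line(boot)
    optimism.append(mse(model, sample) - mse(model, boot))

corrected = apparent + sum(optimism) / len(optimism)
print("apparent MSE:", round(apparent, 3))
print("optimism-corrected MSE:", round(corrected, 3))
```

Every observation contributes to both the fit and the error estimate, which is the property the answer highlights; the correction term is what stands in for a held-out set.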

1So you don't advocate cross-validation through splitting of large data-sets for predictive model testing / validation? – OFish – 2014-12-15T03:42:01.097

6No, unless the dataset is huge or the signal:noise ratio is high. Cross-validation is not as precise as the bootstrap in my experience, and it does not use the whole sample size. In many cases you have to repeat cross-validation 50-100 times to achieve adequate precision. But if your datasets have > 20,000 subjects, simple approaches such as split-sample validation are often OK. – Frank Harrell – 2014-12-15T04:17:39.607

1That's really good to know! Thanks. And coming from you, that's a great "source" of info. Cheers! – OFish – 2014-12-15T04:43:50.300

Could you provide a link to what you think is a good starting point for the bootstrapping method? – kobejohn – 2016-04-06T23:29:53.077

See Chapter 5 of my course notes at http://biostat.mc.vanderbilt.edu/rms

– Frank Harrell – 2016-04-06T23:59:58.357

6

Most supervised data mining algorithms follow these three steps:

1. The training set is used to build the model. This contains a set of data that has preclassified target and predictor variables.
2. Typically a hold-out dataset or test set is used to evaluate how well the model does with data outside the training set. The test set contains the preclassified results data but they are not used when the test set data is run through the model until the end, when the preclassified data are compared against the model results. The model is adjusted to minimize error on the test set.
3. Another hold-out dataset or validation set is used to evaluate the adjusted model in step #2 where, again, the validation set data is run against the adjusted model and results compared to the unused preclassified data.

3

A typical machine learning task can be visualized as the following nested loop:

```
while (error in validation set > X) {
    tune hyper-parameters
    while (error in training set > Y) {
        tune parameters
    }
}
```

Typically the outer loop is performed by a human, on the validation set, and the inner loop by the machine, on the training set. You then need a third set, the test set, to assess the final performance of the model.

In other words, the validation set is the training set for the human.
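A runnable version of the nested loop might look like the sketch below. It departs from the pseudocode in two labelled ways: the outer loop walks a fixed grid of hyper-parameter values instead of looping until an error threshold X is met, and the inner loop runs a fixed number of gradient-descent steps instead of stopping at a threshold Y. The ridge-penalised line fit, the data and all constants are illustrative assumptions.

```python
import random

def train(train_set, penalty, steps=500, lr=0.05):
    """Inner loop: gradient descent tunes the parameters (w, b) on the training set."""
    w, b = 0.0, 0.0
    n = len(train_set)
    for _ in range(steps):
        # Gradient of mean squared error plus penalty * w**2.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / n + 2 * penalty * w
        grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def mean_squared_error(model, data):
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def make_point():
    x = rng.uniform(-1, 1)
    return x, 3.0 * x + rng.gauss(0, 0.3)  # synthetic data

data = [make_point() for _ in range(120)]
train_set, validation_set, test_set = data[:60], data[60:90], data[90:]

# Outer loop: tune the hyper-parameter (penalty) against the validation set.
val_errors = {}
best_penalty, best_model, best_val_error = None, None, float("inf")
for penalty in [0.0, 0.01, 0.1, 1.0, 10.0]:
    model = train(train_set, penalty)
    val_errors[penalty] = mean_squared_error(model, validation_set)
    if val_errors[penalty] < best_val_error:
        best_penalty, best_model, best_val_error = penalty, model, val_errors[penalty]

# Third set: assess the chosen model once, outside both loops.
test_error = mean_squared_error(best_model, test_set)
print("chosen penalty:", best_penalty, "test MSE:", round(test_error, 3))
```

Here the outer loop is automated, but it plays the same role as the human in the answer: it only ever sees validation error, never the test set.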

I like this explanation :) very concise – Honeybear – 2018-03-01T11:54:30.807

2

I would like to add to the other very good answers here by pointing to a relatively new approach in machine learning called "differential privacy" (see the papers by Dwork and the Win Vector Blog for more). The idea makes it possible to actually reuse the test set without compromising the final model performance. In a typical setting the test set is only used to estimate the final performance; ideally one is not even allowed to look at it.

As is well described in this Win Vector blog (see other entries as well), it is possible to "use" the test set without biasing the estimate of the model's performance. This is done using a special procedure based on differential privacy, in which the learner does not have direct access to the test set.

1

One way to think of these three sets is that two of them (`training` and `validation`) come from the past, whereas the `test` set comes from the "future". The model should be built and tuned using data from the "past" (`training`/`validation` data), but never `test` data which comes from the "future".

To give a practical example, let's say we are building a model to predict how well baseball players will do in the future. We will use data from 1899-2014 to create a `training` and `validation` set. Once the model is built and tuned on those data, we will use data from 2015 (actually in the past!) as a `test` set, which from the perspective of the model appears like "future" data and in no way influenced the model creation. (Obviously, in theory, we could wait for data from 2016 if we really want!)

Obviously I'm using quotes everywhere, because the actual temporal order of the data may not coincide with the actual future (by definition all of the data generation probably took place in the actual past). In reality, the `test` set might simply be data from the same time period as the `training`/`validation` sets, that you "hold out". In this way, it had no influence on tuning the model, but those held-out data are not actually coming from the future.
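A small sketch of the "past versus future" split, using hypothetical season records. The field names, the player values and the shortened year range (2010 to 2015 rather than 1899 to 2015) are invented for illustration; only the idea of holding out the final year comes from the answer.

```python
import random

# Hypothetical season records; fields and values are invented for illustration.
records = [
    {"year": year, "player": f"player_{i}", "batting_avg": round(0.200 + 0.01 * i, 3)}
    for year in range(2010, 2016)
    for i in range(10)
]

# "Future" data: held out by year and not touched until the very end.
test = [r for r in records if r["year"] == 2015]

# "Past" data: shuffled, then divided into training and validation sets.
past = [r for r in records if r["year"] <= 2014]
random.Random(0).shuffle(past)
cut = len(past) * 4 // 5          # 80% of the past for training
train, validation = past[:cut], past[cut:]

print(len(train), len(validation), len(test))
```

If the held-out rows came from the same years as the rest, the code would be the same except for the filter; what matters is that the `test` rows never influence fitting or tuning.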

After reading all the other answers, this answer made it "click" for me! You train with the train set, check that you're not overfitting with the validation set (and that the model and hyperparameters work with "unknown data"), and then you assess with the test set - "new data" - whether you now have any predictive powers..! – stolsvik – 2017-03-15T22:17:41.963

0

My idea is that this option in the neural network toolbox is there to avoid overfitting. In an overfitting situation the weights are fitted to the training data only and don't capture the global trend. With a validation set, training can be stopped at the iteration where further decreases in the training error no longer bring decreases in the validation error. An increase in validation error alongside a decrease in training error demonstrates the overfitting phenomenon.
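The stopping rule described above can be sketched as a small function. The name `best_epoch_with_patience`, the `patience` parameter and the two error curves are assumptions for illustration; the curves are made-up numbers shaped like a typical overfitting run, not measurements from any toolbox.

```python
def best_epoch_with_patience(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    Training stops once the validation error has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    if not validation_errors:
        raise ValueError("need at least one epoch")
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1

# Made-up curves: training error keeps falling, validation error turns upward.
training_errors = [0.90, 0.60, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10]
validation_errors = [0.95, 0.70, 0.50, 0.45, 0.47, 0.50, 0.55, 0.60]

best_epoch, stop_epoch = best_epoch_with_patience(validation_errors, patience=2)
print("keep weights from epoch", best_epoch, "and stop at epoch", stop_epoch)
```

In this made-up run the validation error bottoms out at epoch 3 while the training error is still falling, so the weights from epoch 3 would be kept.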