My five years of experience in Computer Science taught me that nothing beats simplicity.
The concept of 'Training/Cross-Validation/Test' data sets is as simple as this: when you have a large data set, it's recommended to split it into three parts (a short code sketch follows this list):
++Training set (60% of the original data set): This is used to build our prediction algorithm. The algorithm tunes itself to the quirks of the training data. In this phase we usually train several candidate algorithms so that we can compare their performance during the cross-validation phase.
++Cross-Validation set (20% of the original data set): This data set is used to compare the performance of the prediction algorithms that were trained on the training set. We choose the algorithm with the best performance.
++Test set (20% of the original data set): Now we have chosen our preferred prediction algorithm, but we don't yet know how it will perform on completely unseen real-world data. So we apply it to the test set to get an estimate of its performance on unseen data.
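To make the workflow concrete, here is a minimal sketch in Python using scikit-learn. The synthetic data set, the two candidate models, and the exact 60/20/20 split are illustrative assumptions, not part of the original answer.

```python
# Minimal sketch of a train / cross-validation / test workflow (60/20/20).
# Assumptions: scikit-learn is available; the synthetic data and the two
# candidate models below are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Carve off 60% for training, then split the remaining 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Training phase: fit every candidate algorithm on the training set only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Cross-validation phase: compare the candidates on the validation set
# and keep the best performer.
best_name = max(candidates, key=lambda name: candidates[name].score(X_val, y_val))
best_model = candidates[best_name]

# Test phase: one final estimate of performance on unseen data.
print(f"chosen model: {best_name}")
print(f"test accuracy: {best_model.score(X_test, y_test):.3f}")
```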
Notes:
-It's very important to keep in mind that skipping the test phase is not recommended: the fact that an algorithm performed best during the cross-validation phase doesn't mean it's truly the best one, because the candidates are compared on the cross-validation set with all of its quirks and noise (the sketch after these notes demonstrates this).
-The purpose of the test phase is to see how our final model will behave in the wild; if its performance turns out to be very poor, we should repeat the whole process starting from the training phase.
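To illustrate the point about the test phase, here is a sketch under stated assumptions: with purely random labels there is no real signal, yet the model selected for its validation performance still tends to look better than chance on the validation set; only the untouched test set reveals its true, chance-level performance. The decision-tree hyperparameter grid and the data shapes are hypothetical.

```python
# Sketch of why the test phase matters: on pure-noise data, the model that
# "wins" on the validation set still looks better than chance there, but the
# untouched test set reveals its true (chance-level) performance.
# Assumptions: scikit-learn is available; the grid of tree depths is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = rng.integers(0, 2, size=600)  # labels are random: no real signal exists

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Compare many candidate models on the validation set and keep the winner.
best_val, best_model = -1.0, None
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val:
        best_val, best_model = val_score, model

# The winner's validation score is optimistically biased; the test score is not.
print(f"best validation accuracy: {best_val:.3f}")                          # typically > 0.5
print(f"test accuracy of winner:  {best_model.score(X_test, y_test):.3f}")  # roughly 0.5
```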
@mpiktas Are you referring to the chapter "Model Assessment and Selection"? – Celdor – 2015-06-01T07:16:11.973
Yes. The page number was from the 5th print edition. – mpiktas – 2015-06-01T07:20:27.720
You might want to also see: http://stats.stackexchange.com/questions/9357/why-only-three-partitions-training-validation-test/9364#9364, where the question was "Why not more than three?" – Wayne – 2015-09-29T14:28:35.817
The question is answered in the book Elements of Statistical Learning, page 222. The validation set is used for model selection, the test set for estimating the prediction error of the final model (the model that was selected by the selection process). – mpiktas – 2011-11-28T11:47:26.637
@mpiktas is spot on. Here is the actual text:
The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis.
– arun – 2016-07-15T18:01:42.007
The book "Elements of Statistical Learning" is now reachable under: https://web.stanford.edu/~hastie/Papers/ESLII.pdf
– moi – 2017-07-12T08:15:14.623
@mpiktas There is some logic that I am missing: if the validation set is used for model selection, i.e., we choose the model that has the best performance on the validation set (rather than the one that has the best performance on the training set), isn't that just another kind of overfitting, i.e., overfitting to the validation set? Then how can we expect the model with the best performance on the validation set to also have the best performance on the test set among all the models being compared? If the answer is no, then what's the point of the validation set? – KevinKim – 2018-02-28T16:34:28.923