Should I perform cross validation only on the training set?



I am working with a dataset that I downloaded from Kaggle. The dataset is already split into two CSVs, Train and Test.

I imported the Train CSV into a Jupyter Notebook, built a model on it, and then made predictions on the same Train CSV. Now I would like to perform cross validation. Should I perform cross validation on the Train CSV and split it into two parts again, Train and Test? Or should I import the Test CSV and combine both CSVs into one?

Amit Yadav

Posted 2019-08-17T05:16:24.687

Reputation: 41

I am not sure if such questions are acceptable here; it doesn't look like a data science question. You should read their description. Whether to take the CV split from their train file or test file should be specified by them, presuming it matters. It's their concern. But in any case, split the train file into train & validation sets, or use k-fold cross-validation on the train set itself. – Mr. Sigma. – 2019-08-17T06:21:13.097

I think you don't need to perform cross validation if the dataset is already split into train and test sets, but if you want to do it, there are two options in your case: 1. combine both train and test sets and then do cross validation; 2. split the train set into several parts, treat the part that is not fed to training as a validation set, and finally test on the test set that you already have. – Hunar – 2019-08-17T07:13:07.220

Thank you very much Mr. Ahmed. – Amit Yadav – 2019-08-17T08:50:01.080



Chapter 5 of "Introduction to Statistical Learning" covers cross-validation and the bootstrap. I strongly recommend reading this chapter, since resampling methods are extremely relevant in practice.

Cross validation (CV) usually means splitting a training dataset into k pieces in order to generate different train/validation sets. By doing so you can see how well a model learns (and is able to make predictions) on different samples of the training data.

During training and model tuning, your model should not see the test data! The idea is that you reserve a truly "exogenous" dataset, never used during training, to test how well your model does in the end. If you use your test set during training, information may leak from the test set into the model, and you can no longer demonstrate the external validity of your model.

In the image below, you would make four different splits on your training set, where blue would represent training cases and orange the validation cases. You run the same regression/classification on each of the four blue parts, and obtain predictions on each of the four orange parts.

[Figure: four cross-validation splits of the training set; blue = training folds, orange = validation folds]
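For illustration, the four splits described above can be sketched with sklearn's KFold; the toy data below is made up, not from the question's dataset:

```python
# Sketch of four-fold cross-validation with scikit-learn's KFold.
# The data here is a placeholder, purely for illustration.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # "blue" part = train_idx, "orange" part = val_idx
    print(f"fold {fold}: train={len(train_idx)} samples, validate={len(val_idx)} samples")
```

Each sample lands in the validation fold exactly once, so the four orange parts together cover the whole training set.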

There are different strategies to deal with data characteristics. To keep groups or class proportions balanced across folds, you can use a stratified sampling strategy. See the sklearn implementation here.

Since the sampling strategy can matter, it is advisable to let a tool like sklearn do the CV splits. Most tools/methods come with CV options where relevant.
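Along those lines, a minimal sketch of letting sklearn handle the splits; the dataset and model here are placeholders, not from the question. Note that for classifiers, cross_val_score uses stratified folds by default:

```python
# Let scikit-learn choose and apply the CV splitting strategy.
# Iris and logistic regression are stand-ins for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average validation accuracy across the 5 folds
```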


Posted 2019-08-17T05:16:24.687

Reputation: 4 724

Thank you very much, Peter! – Amit Yadav – 2019-08-17T08:52:22.547


You shouldn't touch the test data until you are finished. To do cross validation, split the training dataset into train and validation sets, or use k-fold cross-validation (or any other resampling method).


But never use the test data to train or tune your model. If you do, your model will be fitted to your test data and your results won't be valid.
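One way to follow this advice, sketched with sklearn's train_test_split; the arrays below are stand-ins for the real training data, and the Kaggle test file is never touched:

```python
# Carve a validation set out of the training data only.
# The arrays are placeholders for the real train CSV contents.
import numpy as np
from sklearn.model_selection import train_test_split

X_train_full = np.arange(200).reshape(100, 2)  # stand-in for train CSV features
y_train_full = np.array([0, 1] * 50)           # stand-in for train CSV target

# 80% train / 20% validation; the test set plays no part in this split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=0,
    stratify=y_train_full,  # keep class proportions in both parts
)
print(X_tr.shape, X_val.shape)  # prints: (80, 2) (20, 2)
```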


Posted 2019-08-17T05:16:24.687

Reputation: 11

Okay, Thank you – Amit Yadav – 2019-08-17T08:51:39.317


There is a great book by Andrew Ng called "Machine Learning Yearning". You can download it for free here.

Now, about the Kaggle dataset: the files you have are the train and test sets. Even if you wanted to use the test set, you cannot. If you open it, you will see that only the features are there; the target column is missing, because this is the file you run your model on to predict values and upload them to Kaggle.

Reading chapters 5, 6 and 7 of the linked book, you will get a lot of great advice on how to handle the training data. In short:

You should split the training dataset into three parts:

  • Training set — Which you run your learning algorithm on.
  • Dev (development) set — Which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the hold-out cross validation set.
  • Test set — which you use to evaluate the performance of the algorithm, but not to make any decisions regarding what learning algorithm or parameters to use.

Quoted from chapter 5, page 15.

Keep in mind that the above Test set has nothing to do with the test file that you got from Kaggle.
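A hedged sketch of that three-way split, assuming sklearn and applied to the Kaggle train file only (the arrays are placeholders, and the split ratios are just one common choice):

```python
# Split the labeled Kaggle train data into train / dev / (internal) test.
# The arrays stand in for the real train CSV contents.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # stand-in for the Kaggle train CSV
y = np.array([0, 1] * 50)

# First hold out 20% as the internal test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into train and dev sets (60/20 overall).
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_dev), len(X_test))  # prints: 60 20 20
```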


Posted 2019-08-17T05:16:24.687

Reputation: 3 340

Thank you, Tasos! I just downloaded this book and went through it. It seems very useful. Thanks a lot. – Amit Yadav – 2019-08-17T08:50:49.890

If you are happy with the answer, you can either upvote or mark it as best answer, or both. P.S. this is one of my favorite books :) – Tasos – 2019-08-17T08:51:56.453