Metrics to determine K in K-fold cross-validation



Consider a scenario where the dataset at hand is quite large, say 50,000 samples, quite well balanced between two classes. What metrics can be used to decide the value of K in K-fold cross-validation? In other words, is a 5-fold CV enough, or should I go for a 10-fold CV?

The rule of thumb is: the higher K, the better. But, putting aside the computational costs, what can be used to decide the value of K? Should we look at the overall performance, e.g. average accuracy? That is, if accuracy(5-fold CV) ≈ accuracy(10-fold CV), can we opt for 5-fold CV? Is the standard deviation of the performance across folds important, i.e. the lower the better?
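
The comparison described above can be run directly. A minimal sketch, assuming scikit-learn and using a synthetic dataset as a stand-in for the real 50,000-sample data:

```python
# Sketch: compare mean and standard deviation of accuracy for 5-fold
# vs 10-fold CV. Dataset and model are placeholders for the real ones.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"{k}-fold CV: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the means are close, that is the situation the question describes; note that the per-fold standard deviations are not directly comparable across K, since the test folds have different sizes.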


Posted 2018-11-05T21:03:41.980

Reputation: 101



The rule of thumb is the higher K, the better.

I think a better rule of thumb is: the larger your dataset, the less important $k$ becomes.

However, it is useful to have a general understanding of the impact of $k$ on the performance estimator (leaving aside computational costs):

  • Increasing $k$ decreases the bias of the estimator, because each training set covers more of the data and thus better represents it
  • Increasing $k$ increases the variance of the estimator, because the training sets overlap more and the per-fold estimates become more correlated

Also note that there is no unbiased estimator for the variance of the $k$-fold CV. Together this means that there is no metric that can tell you the best $k$ if you leave computational costs aside. Some empirical studies suggest that 10 is a reasonable default.
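
The second bullet can be made concrete: as $k$ grows, any two training sets share a larger fraction of their samples. A small sketch, assuming scikit-learn's `KFold`:

```python
# Sketch: fraction of samples shared by the first two training sets
# as k grows -- the overlap that makes fold estimates correlated.
import numpy as np
from sklearn.model_selection import KFold

n = 1000
for k in (2, 5, 10):
    trains = [set(train) for train, _ in KFold(n_splits=k).split(np.arange(n))]
    overlap = len(trains[0] & trains[1]) / len(trains[0])
    print(f"k={k}: two training sets share {overlap:.0%} of their samples")
```

With $k=2$ the two training sets are disjoint; with $k=10$ they share roughly 89% of their samples.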

And to be clear, $k$ is not a hyper-parameter you want to tune to find the best accuracy. If you find yourself performing $k_2$-fold CV to find the best $k_1$, something should feel wrong.


Posted 2018-11-05T21:03:41.980

Reputation: 5 477


First of all, choosing K is basically a heuristic: it depends on the data and the model. In my opinion, 5 is a good choice most of the time, since it doesn't need too much computation power or time. But you need to try and see which value works better for your data. There is no free lunch!

I would suggest another CV idea. For example, if you use 5-fold CV (without stratifying or shuffling), you simply divide your data into 5 equal folds. "Equal" here means every fold has the same shape; each fold can still have a different distribution of the target. So you can choose your folds deliberately: plot the distribution of the target variable and check that the folds follow similar patterns before settling on them.
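
The problem described above is easy to see by inspecting the class balance inside each fold; stratified splitting fixes it automatically. A sketch, assuming scikit-learn:

```python
# Sketch: per-fold class ratio with plain vs stratified 5-fold CV
# on a sorted, imbalanced toy target. Plain KFold leaves some folds
# with no positives at all; StratifiedKFold equalizes the ratio.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 900 + [1] * 100)  # imbalanced, sorted toy target
X = np.zeros((len(y), 1))            # dummy features

for name, splitter in [("plain", KFold(5)), ("stratified", StratifiedKFold(5))]:
    ratios = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, [f"{r:.2f}" for r in ratios])
```

The plain split puts all the positives into the last fold, while the stratified split keeps the 10% class ratio in every fold.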

You can also select among models evaluated with different K based on a criterion, for example AIC.


Posted 2018-11-05T21:03:41.980

Reputation: 124


You should ask yourself: why are we even doing cross-validation? It's not to get better accuracy. It's to get a better estimate of the accuracy (or another metric) on unseen data. You want to know how well the model generalizes.

If you try to grid search for the "best K", you're going to either waste some data or get a worse estimate of the metric.

Wasting data: you split your data into two sets, grid search over K on one of them, and then run cross-validation (with the "best K") on the other. Don't do this.

Getting a worse estimate: you grid search for the "best K" and choose the one that gives you the best result according to your chosen metric. But now you have brought in information you shouldn't have, and your estimate is too optimistic. That's the exact opposite of what you wanted when you started with cross-validation. Don't do this either.

So what should you do? Pick the largest K that makes sense for the problem you are trying to solve. Don't put the computational cost aside: the computational cost should determine K.
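
Since the cost of K-fold CV scales roughly with K (K fits, each on a fraction $(K-1)/K$ of the data), a practical step is to time the candidates before committing. A sketch, assuming scikit-learn and a placeholder model and dataset:

```python
# Sketch: measure the wall-clock cost of each candidate K and let the
# time budget decide. Model and dataset sizes are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (5, 10):
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k}: {time.perf_counter() - start:.2f}s")
```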


Posted 2018-11-05T21:03:41.980

Reputation: 11


If necessary, you could try the leave-one-out approach, which is really just k-fold CV with k equal to the number of instances in the data. This will perhaps give you the most realistic estimate of the accuracy to expect on unseen data after you retrain your model on the entire dataset.
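
As a sketch of this, assuming scikit-learn (and a small toy dataset, since with 50,000 samples leave-one-out would mean 50,000 fits):

```python
# Sketch: leave-one-out CV is k-fold with k = n; each fit is scored on
# a single held-out sample, so each per-fit score is 0 or 1.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(f"{len(scores)} fits, estimated accuracy {scores.mean():.3f}")
```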

However, as others have stated, the computational cost of performing leave-one-out is high, and the relative gain in the expected-accuracy estimate is probably immaterial. For example, if 5-fold CV estimates 95% accuracy and 10-fold CV estimates 97%, and you have 15,000 instances, what is the benefit of leave-one-out estimating, say, 97.5%? In some cases it may be necessary to get the most accurate possible estimate of the model's performance, but in most cases it is not, and if you are training on AWS, for instance, the cost is real money and can be significant.


Posted 2018-11-05T21:03:41.980

Reputation: 898