If I use the weights from the previous iteration of k-fold cross-validation to seed a neural network classifier, would I be overfitting?


As is traditionally done, I used k-fold cross-validation to select and optimize the hyperparameters of my neural network classifier. When it was time to store the final model for future predictions, I discovered that using the weights from the previous k-fold CV iteration to seed the initial weights of the model in the subsequent iteration helps improve the accuracy (which seems obvious). I can then use the model from the final iteration to perform future predictions on unseen data.

  • Would this approach result in overfitting?

(Please note, I am using all available data in this process and I do not have any holdout data for validation.)
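The procedure described above can be sketched with scikit-learn's `MLPClassifier`, whose `warm_start=True` option reuses the weights from the previous `fit()` call. The dataset and network size here are made up purely for illustration; the point is that each fold's model starts from weights that were already fitted on data which includes earlier folds' test samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                    warm_start=True,  # keep weights between successive fit() calls
                    random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Each fit() resumes from the weights left by the previous fold, so
    # later folds' models have indirectly "seen" earlier test folds.
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(scores)  # per-fold accuracies; later folds benefit from leaked information
```

With `warm_start=False` (the default) each fold would train an independent model, which is the standard CV setup.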


Posted 2018-06-20T17:39:55.893

Reputation: 31



To answer your question directly: estimating a model's performance on data that was previously used for fitting will overestimate that performance.

When your dataset is "small", you face a bias-variance dilemma with regard to how much data should go into the training and test sets:

  • Too much training data, and you end up with few test samples, so your performance estimate has high variance.

  • Too much test data, and your training samples do not represent well the population you are trying to model, so your average performance will be much lower than what could be achieved.

K-fold CV is a compromise for evaluating the performance of a given procedure. Once you have settled on a model and its hyperparameters, when moving to production you can:

  • Take one of the k models you have trained, perhaps using the "one standard error" rule of thumb.

  • Re-train on all of your data and you can expect the resulting model to be at least as good as you have estimated.

  • Use all k models to form an ensemble and you can expect the resulting model to be at least as good as you have estimated.
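The last option, keeping all k fold models as an ensemble, can be sketched as follows. `LogisticRegression` stands in for the neural network and the dataset is synthetic; the averaging pattern is what matters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Train one independent model per fold, exactly as ordinary k-fold CV does.
models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    models.append(LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx]))

def ensemble_predict(models, X_new):
    # Average the per-class probabilities across the k fold models,
    # then pick the most probable class for each sample.
    probs = np.mean([m.predict_proba(X_new) for m in models], axis=0)
    return probs.argmax(axis=1)

preds = ensemble_predict(models, X[:5])
print(preds.shape)  # (5,)
```

Together the k models have been trained on all of the data, which is why the ensemble's performance can be expected to be at least as good as the cross-validated estimate.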

Surya Sg


Reputation: 555

Thank you for your response. Some additional clarification: I am not evaluating the performance of my model; I used K-fold CV in the traditional way during training. For production use, I have a choice between the two options you provided in the last paragraph, or a third option, which is what I have described. I suspect that since both approaches (your option 1 and my option 3) use all the available data, the models coming out of this process would be similar. But do such models deviate significantly from the initial models trained using only part of the data? – SingularRiver – 2018-06-21T21:45:39.847