Training and cross validation error curves



I have a graph that plots training set size on the x axis and accuracy on the y axis. I plotted the curves using sklearn's learning_curve.


As the training set grows, the accuracy on the training set decreases while the accuracy on the validation set increases. I am not able to explain this behavior.

  1. Normally, as the training dataset grows, the training accuracy is supposed to increase, right?
  2. Suppose the dataset is very noisy, so that training accuracy decreases as the dataset size increases. That still doesn't explain why validation accuracy increases, because noise is supposed to affect that too.

Hima Varsha

Posted 2016-09-26T06:12:26.650

Reputation: 2 146



I think what you're seeing is normal behaviour:

With only a few samples (like 2000) it's easy for a model to (over)fit the data - but it doesn't generalize well. So you get high training accuracy, but the model might not work well with new data (i.e. low validation/test accuracy).

As you add more samples (like 9000) it becomes harder for the model to fit the data - so you get a lower training accuracy, but the model will work better with new data (i.e. validation/test accuracy starts to rise).


  1. As the training dataset increases, the training accuracy is supposed to decrease because more data is harder to fit well.

  2. As the training dataset increases, the validation/test accuracy is supposed to increase, since less overfitting means better generalization.
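To see both effects in isolation, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the 10% label noise and the depth-5 decision tree are all assumptions picked for illustration; they are not the setup from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption: flip_y flips ~10% of labels).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)

# A moderately flexible model: it can almost memorise a small sample,
# but not a large noisy one.
model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Each row of the score arrays is one training size, each column one CV fold.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="accuracy",
    shuffle=True, random_state=0,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean  # distance between the two curves

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):5d}  train acc={tr:.3f}  validation acc={va:.3f}")
```

On data like this, the printed training accuracy should generally fall and the validation accuracy rise as n grows, so the gap between them narrows.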

Andrew Ng has a video about learning curves. Note that he plots the error on the y axis, whereas you have the accuracy on the y axis, so your y axis is flipped.
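If it helps when comparing against error-based plots, the conversion is just error = 1 - accuracy. A tiny sketch follows; the helper name `accuracy_to_error` and the numbers are made up for illustration and are not read from the graph in the question.

```python
def accuracy_to_error(accuracies):
    """Convert accuracies in [0, 1] to error rates (error = 1 - accuracy)."""
    return [1.0 - a for a in accuracies]

# Hypothetical accuracies at three increasing training set sizes.
train_acc = [0.95, 0.90, 0.85]
val_acc = [0.60, 0.70, 0.75]

train_err = accuracy_to_error(train_acc)
val_err = accuracy_to_error(val_acc)

# A falling training accuracy becomes a rising training error,
# and a rising validation accuracy becomes a falling validation error.
print(train_err)
print(val_err)
```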

Also take a look at the second half of the video. It explains high bias and high variance problems.

Your model seems to have high variance (due to the big "gap" between the two curves) - it's still too complex for the small amount of data you've got. Either getting more data or using a simpler model (or more regularization on the same model) might improve the results.
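As a rough illustration of the "simpler model" suggestion, the sketch below compares the train/validation gap of an unrestricted decision tree with a depth-limited one. The synthetic data, the choice of decision trees and the helper `train_val_gap` are assumptions for the example only; your own model and data may behave differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (assumption, not the dataset from the question).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)

def train_val_gap(model):
    """Return (train accuracy, validation accuracy, gap) averaged over 5 folds."""
    scores = cross_validate(
        model, X, y, cv=5, scoring="accuracy", return_train_score=True,
    )
    train_acc = scores["train_score"].mean()
    val_acc = scores["test_score"].mean()
    return train_acc, val_acc, train_acc - val_acc

# Unrestricted tree: very flexible, so it tends to overfit (high variance).
complex_result = train_val_gap(DecisionTreeClassifier(random_state=0))
# Depth-limited tree: a simpler model with less room to memorise noise.
simple_result = train_val_gap(DecisionTreeClassifier(max_depth=3, random_state=0))

print("complex: train=%.3f val=%.3f gap=%.3f" % complex_result)
print("simple:  train=%.3f val=%.3f gap=%.3f" % simple_result)
```

A smaller gap alone doesn't mean a better model; what you ultimately care about is the validation accuracy itself.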


Posted 2016-09-26T06:12:26.650

Reputation: 1 536

So I got this graph for another dataset. What would this imply?

– Hima Varsha – 2016-09-26T08:11:51.570

Interesting example - I see you've posted it as a new question already. I'll look into it later. – stmax – 2016-09-26T12:53:30.723