Ideal difference in the training accuracy and testing accuracy


In a classification problem (with supervised learning), what should the ideal difference between training set accuracy and testing set accuracy be? What is an acceptable range? Is a difference of 5% between training and testing accuracy okay, or does it signify overfitting?


Posted 2017-07-08T06:12:49.910




In an ideal scenario, the training and test sets both accurately represent the distribution of your problem, so there should be no significant difference between training and testing accuracy. This becomes increasingly true as you have more data.

A difference of 5% is perfectly fine; in practice, it is common for training accuracy to be slightly higher than test accuracy. That said, the size of the gap may not be the best indicator. What you should look at is how the two accuracies move together: as long as training and testing accuracy improve at a similar rate, you are in the clear, regardless of how far apart they are. You can investigate this by training and evaluating on increasingly large subsets of the data. Ideally, both accuracies should improve as you add data; if test accuracy starts decreasing, you have overfitting.
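The diagnostic described above is essentially a learning curve. Here is a minimal sketch of the idea; the synthetic two-cluster data and the nearest-centroid classifier are hypothetical stand-ins for your own dataset and model:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic two-class data: Gaussian clusters at (-1,-1) and (+1,+1)."""
    X0 = rng.normal(-1.0, 1.0, size=(n // 2, 2))
    X1 = rng.normal(+1.0, 1.0, size=(n - n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n - n // 2))
    idx = rng.permutation(n)
    return X[idx], y[idx]

def fit(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(model, X, y):
    """Fraction of points whose nearest centroid matches their label."""
    classes = sorted(model)
    centroids = np.stack([model[c] for c in classes])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = np.array(classes)[dists.argmin(axis=1)]
    return float((pred == y).mean())

X_train, y_train = make_data(400)
X_test, y_test = make_data(200)

# Learning curve: train on growing subsets, score on both sets.
# If train and test accuracy rise together, the gap itself is harmless;
# if test accuracy falls while train accuracy rises, that is overfitting.
for n in (20, 50, 100, 200, 400):
    model = fit(X_train[:n], y_train[:n])
    tr = accuracy(model, X_train[:n], y_train[:n])
    te = accuracy(model, X_test, y_test)
    print(f"n={n:3d}  train={tr:.2f}  test={te:.2f}")
```

With a real model you would do the same thing, for instance with scikit-learn's `learning_curve` utility, and plot the two accuracy series against subset size.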

Valentin Calomme

