Interpretation of accuracy score on subset of data points


I have a multi-class problem that I am building a classifier for. I have N total data points I would like to predict on. If I instead predict on n < N data points and get an accuracy score, is there a way I can say (with some degree of confidence) what I think the same model's accuracy score would be on the remaining data points?

Can somebody point me to an article discussing this, or suggest a formula to research?

Jake Morris

Posted 2018-09-27T13:16:20.367

Reputation: 131

If the smaller sample is representative of the larger one, then yes. Usually you would train a model until the accuracy score stabilizes and no longer changes much, and take that score as the approximate expected score. – user2974951 – 2018-09-27T13:23:32.733



Usually when working with classification problems, one tries to have 3 subsets of data:

  • A training set: this subset is usually the biggest one and can take up to ~80% of the available data. It is used to train the chosen algorithm, using the known labels of each data sample.
  • A validation set: this subset is much smaller. It will typically be ~5-10% of the available data. It is used to evaluate the performance of the algorithm trained on the training set. Typically, one will tune the parameters of the algorithm in order to reach the best performance on the validation set.
  • A testing set: this subset is about the same size as the validation set, or bigger. Very important: it should NEVER be used for training purposes! Once the model is trained and tuned using the training and validation sets, the testing set allows one to get the accuracy (or any other performance metric) on unseen data. If the model generalizes well, the score will be close to the ones seen on the validation set, often a tiny bit worse.

In order for this to work properly, it is important that all the subsets are representative of the available data. For example, the proportion of each class should be approximately the same across the subsets.
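As a rough illustration of keeping class proportions similar across subsets, here is a minimal sketch of a stratified split in plain Python. The function name `stratified_split`, its parameters, and the toy labels are made up for this example. It only handles a two-way train/test split, so a validation set would need a second pass.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class for the test part.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels (hypothetical data): 60% "a", 30% "b", 10% "c".
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that does this for you.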

Given how widely this process is used, most algorithms are tuned and tested on only a fraction of the available data. As long as the test set has a class balance similar to the training and validation sets, and has not been used at all during the training or tuning of the model, there is no reason why the performance scores measured on the n test samples would not generalize well to the full N > n samples.
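To attach a number to "generalize well", one common approach is to treat each test prediction as an independent correct/incorrect trial and compute a binomial confidence interval for the accuracy. The sketch below uses the Wilson score interval. It assumes the n test points are a random sample from the same distribution as the remaining points, and the function name `wilson_interval` and the example counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, n, confidence=0.95):
    """Wilson score interval for a proportion (here: accuracy on n test points)."""
    if n <= 0:
        raise ValueError("n must be positive")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = correct / n

    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical result: 850 correct predictions out of n = 1000 test points.
low, high = wilson_interval(correct=850, n=1000)
print(f"95% interval for accuracy: [{low:.3f}, {high:.3f}]")
```

The interval narrows roughly with the square root of n, so quadrupling the number of test points about halves its width. Note that it describes the model's underlying accuracy rate. The observed accuracy on a finite set of remaining points will scatter around that rate with some additional sampling noise of its own.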


Posted 2018-09-27T13:16:20.367

Reputation: 406

Thanks. I am aware of train/validation/test best practices, but your statement "there is no reason why the performance scores would not generalize well" is in line with my actual question. Is it accurate to say there is no way to calculate the uncertainty associated with extrapolating the accuracy on the testing set to accuracy on a larger set drawn from the same distribution? – Jake Morris – 2018-09-28T13:14:50.830


Use cross-validation. You split the data into K subsets, then train and test K times, each time holding out a different subset for validation and training on the remaining K-1. The average cross-validation score is generally a better estimate of the model's performance on unseen data than the score from a single standard 80/10/10 split that you'd use when training, validating, and testing your final model.

Many machine learning libraries, such as Python's scikit-learn, have a module for this.
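For example, a minimal sketch with scikit-learn's `cross_val_score` might look like the following. The iris dataset and the logistic regression model are stand-ins chosen only to make the snippet self-contained. Substitute your own data and classifier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder multi-class data and model (assumptions for this example).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds, each preserving the class proportions of the full data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean_score = scores.mean()
spread = scores.std()
print(f"accuracy per fold: {scores}")
print(f"mean = {mean_score:.3f}, std across folds = {spread:.3f}")
```

The standard deviation across folds gives a rough sense of how much the score varies from one subset to another, though the folds share training data, so it is not a formal confidence interval.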

Jay Speidell

Posted 2018-09-27T13:16:20.367

Reputation: 561