Given a machine learning algorithm, what is the minimum size of the training set for it?



I understand that the more data we have, the more reliable our model trained on that data is. I also understand that the more parameters a machine learning model has, the more training data it requires. For example, deep neural networks need more data than linear regression or support vector machines.

Successful application of machine learning to texts and images requires millions of data points. I read about this attempt (link) to generate new quotes from a data set of 5000 quotes by thinkers. The author had several failed attempts until he realized that his data set was too small for the task. He succeeded only by using transfer learning with an AWD-LSTM model developed by Salesforce, pre-trained on 100 million tokens from Wikipedia.

On the other hand, I often see scientific articles that train machine learning models on just a few hundred data points (or even fewer) and boast about their "great discoveries". For example, this paper (link) applies decision trees to a data set of only 46 samples. This situation exists because it is extremely rare in science that one can run millions of experiments on millions of materials to obtain large training sets. Yet machine learning is fashionable, so many research groups apply it to their tiny datasets just to get published. How reliable are their predictions given the size of their data sets?

Is there a formula or an algorithm into which I can insert the size of my training set, the number of parameters of my model, and perhaps a few other numbers that distinguish between models (e.g. feed-forward neural networks, SVM, kernel ridge regression, decision trees), and which tells me whether my training data size is sufficient or insufficient for the chosen method?

Is there statistical research that shows quantitatively that a certain machine learning model requires at least a certain amount of data?

Vladislav Gladkikh

Posted 2019-09-04T05:57:26.470

Reputation: 821

I don't have the answer for you, but I just wanted to comment that I think this is a really excellent question. – Dan Scally – 2019-09-04T07:52:29.117



You can have a model with a single training example. The real question is: how good is your model with only a single training example? A learning algorithm produces a hypothesis $h(\theta)$ that approximates the actual relationship between your features and your target. However, how good $h(\theta)$ is depends on good data. Your model is only as good as your data. If your data shows clear patterns, you can have a good model with just a few data points. That's why Exploratory Data Analysis is essential.
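As a toy illustration of the point above (a minimal sketch on made-up, noiseless data; the numbers are placeholders): when the pattern in the data is perfectly clear, even a handful of points is enough to recover it, but the same fit would be highly sensitive to any noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy, noiseless data following y = 2x + 1: a perfectly clear pattern.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = 2 * X_train.ravel() + 1

model = LinearRegression().fit(X_train, y_train)

# Three clean points are enough to recover the relationship exactly.
print(model.coef_[0], model.intercept_)  # ~2.0, ~1.0

# With so few points, though, a single mislabeled example would
# change the fit drastically - tiny datasets leave no room for noise.
```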

I'd recommend watching the Statistical Learning Theory lectures on YouTube, specifically the lecture "Is Learning Feasible?". The course is offered by Caltech.

Benj Cabalona Jr.

Posted 2019-09-04T05:57:26.470

Reputation: 261

Is it this course? youtube link

– Vladislav Gladkikh – 2019-09-04T10:34:51.173

Yes. That's a great course. – Benj Cabalona Jr. – 2019-09-04T10:40:34.207


As Benj said, there's no general answer, since it depends not only on the algorithm but also, to a large extent, on the data. It's easy to find examples where the exact same amount of data with the same algorithm performs terribly in one case and perfectly in the other.

Given a particular dataset and a particular algorithm, there are experimental methods which can help determine the relationship between data size and performance:

  • Ablation study: train a model using, say, 10%, 20%, 30%, ..., 100% of the training data, evaluate each subset (preferably with cross-validation), then plot the performance at each stage. The evolution of the performance across sizes shows how much performance is gained with each step of additional data, and by extrapolation one can roughly predict how much more would be gained with more data.
  • Features: the complexity of the data depends a lot on the number and diversity of the features, so to get a full picture of the relationship between data size and performance it's important to also study how different subsets of features perform. It's possible that a certain size of data gives poor performance with a large set of features, while the same set of instances with fewer features performs perfectly well.
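The first study above can be sketched with scikit-learn's `learning_curve` utility (a minimal sketch; the synthetic dataset and the decision tree are placeholders for your own data and model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train on 10%, 20%, ..., 100% of the training data,
# scoring each subset with 5-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
)

# If the validation curve is still rising at 100%, more data would
# likely help; if it has flattened, it probably would not.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} samples -> CV accuracy {score:.3f}")
```

Plotting the mean validation score against `sizes` gives the learning curve described above, from which you can extrapolate the value of collecting more data.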


Posted 2019-09-04T05:57:26.470

Reputation: 12 600

I like the "Ablation Study", but I'm curious, isn't that method prone to overfitting? – Benj Cabalona Jr. – 2020-02-03T03:13:07.797

@BenjCabalonaJr. Yes, there's certainly a bigger risk when the training data is small (that's why evaluating with cross-validation is recommended, btw). Feature selection can be used to mitigate overfitting, but in any case, in an ablation study the goal is not to obtain an optimal model, just to see the level of performance. – Erwan – 2020-02-03T12:03:52.620