I understand that the more data we have, the more reliable a model trained on that data will be. I also understand that the more parameters a machine learning model has, the more training data it requires. For example, deep neural networks need more data than linear regression or support vector machines.

Successful application of machine learning to text and images requires millions of data points. I read about this attempt (link) to generate new quotes from a data set of 5,000 quotes by famous thinkers. The author had several failed attempts before realizing that his data set was too small for the task. He succeeded only by using transfer learning, employing the AWD-LSTM model developed by Salesforce, pretrained on 100 million tokens from Wikipedia.

On the other hand, I often see scientific articles that train machine learning models on just a few hundred data points (or even fewer) and boast about their "great discoveries". For example, this paper (link) applies decision trees to a data set of only 46 samples. This situation exists because it is extremely rare in science to be able to run millions of experiments on millions of materials and so obtain a large training set. Yet machine learning is fashionable, so many research groups apply it to their tiny data sets just to get published. How reliable are their predictions, given the size of their data sets?

Is there a formula or an algorithm into which I can insert the size of my training set, the number of parameters of my model, and maybe a few other quantities that distinguish between model families (e.g. feed-forward neural networks, SVM, kernel ridge regression, decision trees), and which tells me whether my training set is sufficient or insufficient for the chosen method?
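The closest thing I know how to do myself is purely empirical: plot a learning curve and check whether the validation score has plateaued. A minimal sketch with scikit-learn on a synthetic regression problem (the data set, model, and thresholds here are just placeholders for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real (possibly tiny) scientific data set.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Fit the model on increasing fractions of the data and cross-validate.
train_sizes, train_scores, val_scores = learning_curve(
    KernelRidge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="r2",
)

val_mean = val_scores.mean(axis=1)
# If the validation score is still climbing at the largest training size,
# the data set is probably too small for this model.
still_improving = val_mean[-1] - val_mean[-2] > 0.01
print(train_sizes)
print(val_mean.round(3))
```

But this only diagnoses a data set I already have; it does not give the kind of a-priori formula I am asking about.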

Is there statistical research that shows quantitatively that a given machine learning model requires at least a certain amount of data?
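To illustrate the kind of quantitative statement I mean: classical PAC learning theory gives sample-complexity bounds such as, for a finite hypothesis class $H$ in the realizable case,

$$ m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right), $$

which guarantees generalization error at most $\epsilon$ with probability at least $1-\delta$. I am asking whether comparable, practically usable results exist for the concrete model families listed above.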

I don't have the answer for you, but I just wanted to comment that I think this is a really excellent question. – Dan Scally – 2019-09-04T07:52:29.117