Is there a way to define the boundaries of the optimal size of a training set?


In a related question on Computer Science SE, a user said:

Neural networks typically require a large training set.

Is there a way to define the boundaries of the "optimal" size of a training set in the general case?

When I was learning about fuzzy logic, I heard some rules of thumb that involved examining the mathematical composition of the problem and using that to decide the number of fuzzy sets.

Is there a similar method that can be applied to an already defined neural network topology?

Zoltán Schmidt

Posted 2016-08-04T15:49:13.793

Reputation: 593



For a finite value to be 'optimal,' you typically need some benefit from having more of it paired with some cost of having more of it. Because the marginal benefit decreases while the cost keeps increasing, the two lines eventually cross.

For most models, error falls as training data is added and asymptotically approaches the best the model can do. See this image (from here) as an example:

Decreasing error with increasing training set size
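
A toy illustration of that shape, as a minimal sketch: the data-generating process (y = 2x + 1 plus unit-variance Gaussian noise), the least-squares line fit, and the function names are all assumptions made up for this example, not anything from the linked figure. The held-out error should fall quickly at first and then level off near the noise variance, which is the best this model can do.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sample(n, rng):
    """Draw n points from the assumed toy process y = 2x + 1 + N(0, 1)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def learning_curve(sizes, trials=30, seed=0):
    """Mean held-out squared error for each training set size."""
    rng = random.Random(seed)
    test_xs, test_ys = sample(2000, rng)
    curve = []
    for n in sizes:
        total = 0.0
        for _ in range(trials):
            # Refit on a fresh training set of size n, then score on the test set.
            slope, intercept = fit_line(*sample(n, rng))
            total += sum((slope * x + intercept - y) ** 2
                         for x, y in zip(test_xs, test_ys)) / len(test_xs)
        curve.append(total / trials)
    return curve

sizes = [5, 20, 100, 1000]
errors = learning_curve(sizes)
for n, err in zip(sizes, errors):
    print(n, round(err, 3))
```

Averaging over several trials per size smooths out the luck of any single small training set, which matters most at the left end of the curve.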

The costs of training data are also somewhat obvious; data is costly to obtain, to store, and to move. (Assuming model complexity stays constant, the actual cost of storing, moving, and using the model remains the same, since the weights in the model are just being tuned.)

So at some point the error-reduction curve becomes flat enough that additional data points cost more than they are worth, and that point is the optimal amount of training data.
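
To make the crossing point concrete, here is a minimal sketch under strong assumptions: the error is taken to follow a power law, error(n) = floor + scale * n^(-decay), each sample has a fixed cost, and each unit of error has a fixed monetary value. None of these forms or parameter names come from the answer above; they are hypothetical and would have to be estimated for a real problem. Under those assumptions the optimum is where the marginal benefit of one more sample equals its cost.

```python
def error(n, floor, scale, decay):
    """Assumed power-law learning curve: error falls toward `floor` as n grows."""
    return floor + scale * n ** (-decay)

def net_value(n, floor, scale, decay, value_per_error, cost_per_sample):
    """Value of the achieved error minus the cost of n samples (higher is better)."""
    return -value_per_error * error(n, floor, scale, decay) - cost_per_sample * n

def optimal_size(scale, decay, value_per_error, cost_per_sample):
    """Training set size where marginal benefit equals marginal cost.

    Setting d/dn of net_value to zero gives
    value_per_error * scale * decay * n^(-decay - 1) = cost_per_sample,
    which is solved for n below. The floor does not affect the optimum.
    """
    return (value_per_error * scale * decay / cost_per_sample) ** (1.0 / (decay + 1.0))

# Hypothetical numbers, chosen only for illustration.
params = dict(floor=0.05, scale=1.0, decay=0.5)
value_per_error = 1000.0
cost_per_sample = 0.01

n_star = optimal_size(params["scale"], params["decay"], value_per_error, cost_per_sample)
print(round(n_star))
```

Cheaper data or a more valuable error reduction both push the optimum toward larger training sets, while a faster-flattening curve pulls it back.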

Matthew Graves

Posted 2016-08-04T15:49:13.793

Reputation: 3 957


In general, the larger the training set, the better. See The Unreasonable Effectiveness of Data, though this article is quite dated (written in 2009). Xavier Amatriain, a researcher at Netflix, has a Quora answer in which he discusses how more data can sometimes hurt algorithms.

For deep neural networks in particular, it does not seem that we have hit these limits yet.


Posted 2016-08-04T15:49:13.793

Reputation: 1 056