Are there any rules for choosing the size of a mini-batch?

30

11

When training neural networks, one hyperparameter is the size of a mini-batch. Common choices are 32, 64, and 128 elements per mini-batch.

Are there any rules or guidelines on how big a mini-batch should be? Are there any publications which investigate the effect of batch size on training?
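For context, here is a minimal sketch (plain NumPy on a toy linear model; all numbers are placeholders, not a recommendation) of where the mini-batch size enters a training loop: it decides how many examples each gradient update averages over, and therefore how many updates you get per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                 # toy data: 1000 examples, 10 features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

w = np.zeros(10)                                # model parameters
batch_size = 64                                 # the hyperparameter in question
learning_rate = 0.01

for epoch in range(5):
    perm = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # mini-batch gradient of the MSE
        w -= learning_rate * grad
    print(epoch, np.mean((X @ w - y) ** 2))     # full-data MSE after each epoch
```

With 1000 examples, batch size 64 gives 16 updates per epoch; batch size 8 would give 125 updates per epoch, and batch size 1000 a single full-batch update.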

Martin Thoma

Posted 2017-04-17T16:18:22.793

Reputation: 15 590

Other than fitting in memory? – Ehsan M. Kermani – 2017-04-17T16:28:01.643

Yes. For example, is there any publication with says "the bigger the batch size, the better" (as long as it fits in memory)? – Martin Thoma – 2017-04-17T16:29:24.263

@EhsanM.Kermani I think it does matter. I made a couple of runs on CIFAR-100 and I get different results depending on the batch size (with early stopping so that overfitting is hopefully not a problem) – Martin Thoma – 2017-04-17T16:45:12.487

Also, there is a trade-off: smaller batches mean more updates per epoch and hence probably fewer epochs until convergence, while larger batches mean more time per update but more meaningful (less noisy) gradient updates. – Martin Thoma – 2017-04-17T16:49:50.830

3

Bigger computes faster (is efficient), smaller converges faster, generalizes better; cf. Efficient Mini-batch Training for Stochastic Optimization and this RNN study. There is a sweet spot that you find empirically for your problem.

– Emre – 2017-04-17T17:00:03.987

@Emre May I add your comment to my answer? (I would then make it community wiki) – Martin Thoma – 2017-04-17T17:49:59.883

Sure, feel free; that's what they're for. – Emre – 2017-04-17T18:09:38.087

3

This most insightful paper by Blei et al. just came out: Stochastic Gradient Descent as Approximate Bayesian Inference

– Emre – 2017-04-17T21:04:11.307

Interesting observations! – Ehsan M. Kermani – 2017-04-17T21:18:01.377

Answers

31

In On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, there are a couple of interesting statements:

It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize [...]

large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

From my master's thesis: the choice of the mini-batch size influences:

  • Training time until convergence: There seems to be a sweet spot. If the batch size is very small (e.g. 8), this time goes up. If the batch size is huge, it is also higher than the minimum.
  • Training time per epoch: Bigger computes faster (is efficient)
  • Resulting model quality: the smaller the batch, the better the quality, presumably due to better generalization (?)

It is important to note hyper-parameter interactions: Batch size may interact with other hyper-parameters, most notably learning rate. In some experiments this interaction may make it hard to isolate the effect of batch size alone on model quality. Another strong interaction is with early stopping for regularisation.
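To illustrate that interaction with the learning rate (this sketch is not from the thread; the linear learning-rate scaling is a common heuristic from the large-minibatch SGD literature, and all numbers are made up), one can sweep the batch size while scaling the learning rate proportionally:

```python
import numpy as np

def train(batch_size, base_lr=0.01, base_batch=32, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1000, 10))             # same toy regression setup as above
    y = X @ rng.normal(size=10)
    w = np.zeros(10)
    lr = base_lr * batch_size / base_batch      # linear learning-rate scaling heuristic
    for _ in range(epochs):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            w -= lr * 2 * xb.T @ (xb @ w - yb) / len(xb)
        # (early stopping would hook in here in a real experiment)
    return np.mean((X @ w - y) ** 2)            # final training MSE as a crude proxy

for bs in (8, 32, 128, 512):
    print(bs, train(bs))
```

Without such a scaling (or some other compensation), a sweep over batch sizes partly measures the effect of an effectively different learning rate rather than the effect of the batch size itself.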


Martin Thoma

Posted 2017-04-17T16:18:22.793

Reputation: 15 590

@NeilSlater Do you want to add your comment to my (now community wiki) answer? – Martin Thoma – 2017-04-17T18:17:39.767

I like the answer as a general one. Moreover, I would appreciate concrete numbers for what counts as a very small, normal, and huge mini-batch in a specific example. – So S – 2019-04-19T10:32:44.357

@SoS "Mini-batch" is just a term. The "mini" does not refer to a specific size; it only means more than 1 example and less than the full training set. I consider "very small" to be <= 8 (I've just edited the answer); I also measured an extreme (more than 5x) increase in wall-clock training time for that case. Normal is something like 64 or 128. I'm not sure what "huge" is; I think it might depend on the hardware. – Martin Thoma – 2017-04-19T11:20:23.903

This answer asks more questions than it answers. Where is this sweet spot (maybe a graph would help)? How does it interact with learning rate and early stopping? – xjcl – 2019-09-03T01:13:04.463

The answer depends on the network and the dataset, so it doesn't make sense to give specific numbers, and a graph would not help either. About interactions with other hyperparameters: I don't know for sure. Try it and publish your results :-) – Martin Thoma – 2019-09-03T05:22:18.080