

When training neural networks, one hyperparameter is the mini-batch size. Common choices are 32, 64, and 128 examples per mini-batch.

Are there any rules or guidelines for how big a mini-batch should be? Are there any publications that investigate its effect on training?

Other than fitting in memory? – Ehsan M. Kermani – 2017-04-17T16:28:01.643

Yes. For example, is there any publication with says "the bigger the batch size, the better" (as long as it fits in memory)? – Martin Thoma – 2017-04-17T16:29:24.263

@EhsanM.Kermani I think it does matter. I made a couple of runs on CIFAR-100 and I get different results depending on the batch size (with early stopping so that overfitting is hopefully not a problem) – Martin Thoma – 2017-04-17T16:45:12.487

Also, there is a trade-off: a smaller batch gives more updates per epoch (and hence probably fewer epochs until convergence), while a larger batch gives less time per update and less noisy, more meaningful gradients. – Martin Thoma – 2017-04-17T16:49:50.830


Bigger batches compute faster (more efficient); smaller batches converge faster and generalize better; cf. Efficient Mini-batch Training for Stochastic Optimization and this RNN study. There is a sweet spot that you find empirically for your problem.
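The trade-off above can be observed with a small experiment. The following is a hypothetical sketch (NumPy only, linear regression as a stand-in for a neural network): `run_sgd` and its parameters are illustrative names, not from any of the cited papers. Smaller batches perform many more parameter updates per epoch; larger batches do fewer, cheaper-per-example updates.

```python
import numpy as np

# Toy experiment: minibatch SGD on a linear regression problem,
# sweeping the batch size to observe the efficiency/convergence
# trade-off. (Illustrative sketch; not from the cited papers.)

def run_sgd(batch_size, epochs=50, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = 512, 5
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.01 * rng.normal(size=n)  # small label noise

    w = np.zeros(d)
    n_updates = 0
    for _ in range(epochs):
        perm = rng.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # mean gradient of squared error over the mini-batch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
            n_updates += 1
    mse = float(np.mean((X @ w - y) ** 2))
    return mse, n_updates

for bs in (32, 64, 128):
    mse, n_updates = run_sgd(bs)
    print(f"batch_size={bs:3d}  updates={n_updates:4d}  final MSE={mse:.5f}")
```

With a fixed number of epochs, batch size 32 performs four times as many updates as batch size 128; whether that buys better generalization on a real network is exactly the empirical question the papers above study.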

– Emre – 2017-04-17T17:00:03.987

@Emre May I add your comment to my answer? (I would then make it community wiki) – Martin Thoma – 2017-04-17T17:49:59.883

Sure, feel free; that's what they're for. – Emre – 2017-04-17T18:09:38.087


This insightful paper by Blei et al. just came out: Stochastic Gradient Descent as Approximate Bayesian Inference

– Emre – 2017-04-17T21:04:11.307

Interesting observations! – Ehsan M. Kermani – 2017-04-17T21:18:01.377