How do I choose the optimal batch size?



In machine learning, the batch size is the number of training examples used in one iteration. It falls into one of three regimes:

  1. Batch mode: the batch size equals the total dataset size, so each iteration is a full epoch.
  2. Mini-batch mode: the batch size is greater than one but less than the total dataset size, usually a number that divides the dataset size evenly.
  3. Stochastic mode: the batch size equals one, so the gradient and the network parameters are updated after every single sample.
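The three modes above differ only in how many samples feed each parameter update. A minimal NumPy sketch (all names here are illustrative, not from the question) makes this concrete — the same loop implements all three modes by varying `batch_size`:

```python
import numpy as np

# Hypothetical setup: a tiny linear-regression problem, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 samples, 3 features
y = rng.normal(size=6)
w = np.zeros(3)
lr = 0.1

def grad_fn(Xb, yb, w):
    # Gradient of mean squared error for a linear model on one batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train_one_epoch(batch_size):
    # One pass over the data; returns the number of parameter updates made.
    global w
    updates = 0
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        w = w - lr * grad_fn(Xb, yb, w)
        updates += 1
    return updates

print(train_one_epoch(len(X)))  # batch mode: 1 update per epoch
print(train_one_epoch(2))       # mini-batch mode: 3 updates per epoch
print(train_one_epoch(1))       # stochastic mode: 6 updates per epoch
```

Note how batch mode makes the iteration and epoch counts coincide, while stochastic mode updates once per sample.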

How do I choose the optimal batch size, for a given task, neural network or optimization problem?

If you hypothetically didn't have to worry about computational issues, what would the optimal batch size be?

Sebastian Nielsen

Posted 2018-10-21T17:09:04.533

Reputation: 211



From this blog post:

How to Configure Mini-Batch Gradient Descent

Mini-batch gradient descent is the recommended variant of gradient descent for most applications, especially in deep learning.

Mini-batch sizes, commonly called "batch sizes" for brevity, are often tuned to an aspect of the computational architecture on which the implementation is being executed, such as a power of two that fits the memory requirements of the GPU or CPU hardware: 32, 64, 128, 256, and so on.

Batch size is a slider on the learning process.

  • Small values give a learning process that converges quickly, at the cost of noise in the training process.
  • Large values give a learning process that converges slowly, with accurate estimates of the error gradient.

A good default for batch size might be 32.
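The power-of-two sizes mentioned above also determine how many updates one epoch produces. A quick back-of-envelope calculation (the dataset size of 50,000 is a made-up example, not from the answer):

```python
import math

# Iterations (parameter updates) per epoch for common power-of-two batch
# sizes, assuming a hypothetical dataset of 50,000 training examples.
n_samples = 50_000
for batch_size in (32, 64, 128, 256):
    iters = math.ceil(n_samples / batch_size)  # last batch may be partial
    print(batch_size, iters)
```

This prints 1563, 782, 391, and 196 iterations respectively: doubling the batch size roughly halves the number of updates per epoch, which is the slow-but-accurate versus fast-but-noisy trade-off described above.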


Posted 2018-10-21T17:09:04.533

Reputation: 371


Here are a few guidelines, inspired by the deep learning specialization course, for choosing the mini-batch size:

  • If you have a small training set (m < 200 examples), use batch gradient descent.

In practice:

  • Batch mode: long iteration times
  • Mini-batch mode: faster learning
  • Stochastic mode: loses the speed-up from vectorization

Typical mini-batch sizes are 64, 128, 256 or 512.

And, in the end, make sure the mini-batch fits in CPU/GPU memory.
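Whether a mini-batch fits in memory can be estimated before training. A rough sketch (the batch size, sample shape, and dtype here are hypothetical assumptions, and this counts only the input tensor, not activations or gradients):

```python
# Back-of-envelope check that an input mini-batch fits in device memory.
batch_size = 256
sample_shape = (3, 224, 224)   # e.g. an RGB image (assumed for illustration)
bytes_per_value = 4            # float32

values_per_sample = 1
for dim in sample_shape:
    values_per_sample *= dim

batch_bytes = batch_size * values_per_sample * bytes_per_value
print(f"{batch_bytes / 2**20:.1f} MiB just for the input batch")
# prints "147.0 MiB just for the input batch"
```

In practice the activations and gradients of a deep network dominate this figure, so if even the raw input batch is close to the device's memory limit, the batch size is too large.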

Also have a look at the paper Practical Recommendations for Gradient-Based Training of Deep Architectures (2012) by Yoshua Bengio.


Posted 2018-10-21T17:09:04.533

Reputation: 42