How do I choose the optimal batch size?



Batch size is a term used in machine learning and refers to the number of training examples utilised in one iteration. The batch size can be one of three options:

  1. batch mode: where the batch size is equal to the total dataset thus making the iteration and epoch values equivalent
  2. mini-batch mode: where the batch size is greater than one but less than the total dataset size. Usually, it is a number that divides the total dataset size evenly.
  3. stochastic mode: where the batch size is equal to one. Therefore the gradient and the neural network parameters are updated after each sample.
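
To make the three modes concrete, here is a minimal sketch that counts how many parameter updates one epoch would involve in each mode. It assumes NumPy and a toy dataset of 1000 examples; the helper name `iterate_batches` and the mini-batch size of 50 are illustrative choices, not taken from any particular library.

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the dataset once (one epoch)."""
    n = X.shape[0]
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # toy data: 1000 examples, 5 features (assumption)
y = rng.normal(size=1000)

updates_per_epoch = {}
for name, batch_size in [("batch", len(X)), ("mini-batch", 50), ("stochastic", 1)]:
    # One gradient update would happen per yielded batch.
    updates_per_epoch[name] = sum(1 for _ in iterate_batches(X, y, batch_size, rng))
    print(name, batch_size, updates_per_epoch[name])
```

In batch mode there is a single update per epoch, which is why iteration and epoch counts coincide there.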

How do I choose the optimal batch size, for a given task, neural network or optimization problem?

If you hypothetically didn't have to worry about computational issues, what would the optimal batch size be?

From this awesome blog post:

How to Configure Mini-Batch Gradient Descent

Mini-batch gradient descent is the recommended variant of gradient descent for most applications, especially in deep learning.

Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to an aspect of the computational architecture on which the implementation is being executed. Such as a power of two that fits the memory requirements of the GPU or CPU hardware like 32, 64, 128, 256, and so on.

Batch size is a slider on the learning process.

  • Small values give a learning process that converges quickly at the cost of noise in the training process.
  • Large values give a learning process that converges slowly with accurate estimates of the error gradient.

A good default for batch size might be 32.
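
The noise side of this trade-off can be checked empirically. The sketch below is my own illustration, not code from the blog: it assumes a synthetic linear-regression problem and measures how far mini-batch gradients fall from the full-batch gradient for a few batch sizes. The names `mse_gradient` and `gradient_noise` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 10
X = rng.normal(size=(n, d))                  # synthetic inputs (assumption)
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)    # noisy linear targets
w = np.zeros(d)                              # point at which gradients are evaluated

def mse_gradient(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb) ** 2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_gradient(X, y, w)

def gradient_noise(batch_size, trials=200):
    """Mean distance between sampled mini-batch gradients and the full-batch gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(np.linalg.norm(mse_gradient(X[idx], y[idx], w) - full_grad))
    return float(np.mean(errors))

noise = {b: gradient_noise(b) for b in (1, 32, 256)}
for b, e in noise.items():
    print(f"batch_size={b:4d}  mean gradient error={e:.3f}")
```

Larger batches should give gradient estimates closer to the full-batch gradient, at the cost of more computation per update. This only illustrates gradient noise; it says nothing by itself about which batch size generalizes best.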


Here are a few guidelines, inspired by the deep learning specialization course, to choose the size of the mini-batch:

  • If you have a small training set (m < 200, where m is the number of training examples), use batch gradient descent

In practice:

  • Batch mode: long iteration times
  • Mini-batch mode: faster learning
  • Stochastic mode: loses the speed-up from vectorization

Typical mini-batch sizes are 64, 128, 256 or 512.

And, in the end, make sure the mini-batch fits in CPU/GPU memory.
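
As a rough sketch of that last check, the snippet below picks the largest power-of-two batch size that fits a given memory budget. The function `largest_power_of_two_batch`, the 64 MiB budget and the per-example size are all assumptions for illustration. Real memory use also includes activations, parameters and optimizer state, so treat the result as an upper bound to try first, not a guarantee.

```python
def largest_power_of_two_batch(bytes_per_example, memory_budget_bytes, max_batch=512):
    """Return the largest power-of-two batch size (up to max_batch) that fits the budget."""
    if bytes_per_example <= 0:
        raise ValueError("bytes_per_example must be positive")
    if bytes_per_example > memory_budget_bytes:
        raise ValueError("even a single example does not fit in the budget")
    batch = 1
    # Keep doubling while the next power of two still fits.
    while batch * 2 <= max_batch and batch * 2 * bytes_per_example <= memory_budget_bytes:
        batch *= 2
    return batch

# Hypothetical example: float32 RGB images of 224x224 pixels, 64 MiB reserved for inputs.
bytes_per_image = 3 * 224 * 224 * 4
budget = 64 * 1024 * 1024
chosen = largest_power_of_two_batch(bytes_per_image, budget)
print(chosen)
```

In practice, people often just try a batch size, and halve it if training runs out of memory.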

Also have a look at the paper Practical Recommendations for Gradient-Based Training of Deep Architectures (2012) by Yoshua Bengio.


