**tl;dr:** The batch size is the number of samples a network sees before updating its weights. This number can range from a single sample to the whole training set. Empirically, there is a sweet spot somewhere between 1 and a few hundred samples, where people experience the fastest training speeds. Check this article for more details.

### A more detailed explanation...

If you have a small enough number of samples, you can let the network see all of the samples before updating its weights; this is called (full-batch) Gradient Descent. The benefit is that, for a sufficiently small learning rate, each update moves the weights in a direction that reduces the training loss for the whole dataset. The downside is that it is computationally expensive and in most cases infeasible for deep neural nets.

What is done in practice is that the network sees only a batch of the training data, instead of the whole dataset, before updating its weights. However, this technique does not guarantee that the network updates its weights in a way that reduces the dataset's training loss; instead it reduces the batch's training loss, which might not be the same thing. This adds noise to the training process, which can in some cases be a good thing, but it means the network needs more steps to converge (this usually isn't a problem, since each step is much faster).

What you're describing is essentially training the network on a single sample at a time. This is formally called Stochastic Gradient Descent, although the term is used more broadly to include any case where the network is trained on a subset of the whole training set. The problem with the single-sample approach is that it adds a lot of noise to the training process, so it requires many more steps to actually converge.
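To make the three regimes concrete, here is a minimal sketch in plain Python. It is not taken from any library: the function `minibatch_gd`, the toy data, and the learning rate are all made up for illustration. It fits a line with a mean squared error loss, and the only thing that changes between the runs is `batch_size`.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size == len(xs)  -> full-batch Gradient Descent
    batch_size == 1        -> single-sample SGD
    anything in between    -> mini-batch SGD
    (Hypothetical helper, written for this illustration only.)
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the loss averaged over this batch only,
            # not over the whole dataset.
            errors = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
            grad_w = sum(2 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2 * e for e, _ in errors) / len(batch)
            # One weight update per batch.
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

# Toy, noise-free data generated from y = 3x + 1 (an assumption for the demo).
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]

for bs in (len(xs), 5, 1):
    w, b, updates = minibatch_gd(xs, ys, batch_size=bs)
    print(f"batch_size={bs:2d}  updates={updates:5d}  w={w:.3f}  b={b:.3f}")
```

For the same number of epochs, a smaller batch size means more (and cheaper) weight updates, each based on a noisier estimate of the full gradient. On a toy problem like this all three settings end up near the same line; the trade-offs described above only really show up on larger models and datasets.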

"...and in most cases infeasible for deep neural nets" Do you mean that it would take a long time to train (e.g. one year instead of one week) or is there some other infeasibility? – Mateen Ulhaq – 2020-03-02T06:02:41.203

You're restricted in terms of hardware. Typically a neural network is trained on a GPU which has limited memory and computational units. The bigger the batch size, the more memory and computation the network requires. Therefore after a certain batch size, you cannot train anymore on the GPU. – spurra – 2020-03-02T12:41:52.767

Stochastic Gradient Descent feels like another method to back propagation. Maybe both use partial derivatives but is it true that backpropagation and Gradient Descent are totally different things? – MScott – 2020-03-02T18:10:51.173

@MScott these two are often confused with one another. Backpropagation is simply an algorithm for efficiently computing the gradient of the loss function w.r.t. the model's parameters. Gradient Descent is an algorithm for using these gradients to update the parameters of the model, in order to minimize this loss. Algorithms like this are called optimization algorithms. SGD is just an extension of Gradient Descent and there are many others out there (e.g. Adam, Adadelta, Adagrad, RMSProp). – Djib2011 – 2020-03-02T23:05:49.583
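The split between the two can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: a tiny model `y_hat = w2 * tanh(w1 * x)` with a squared error loss, and two hypothetical helpers, `forward_backward` (which plays the role of backpropagation) and `gradient_descent_step` (which plays the role of the optimizer).

```python
import math

def forward_backward(w1, w2, x, y):
    """Backpropagation: compute the loss and its gradients via the chain rule.

    Model: y_hat = w2 * tanh(w1 * x), loss = (y_hat - y)^2.
    """
    # Forward pass
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward pass: propagate d(loss) back through each operation
    d_yhat = 2 * (y_hat - y)          # d loss / d y_hat
    d_w2 = d_yhat * h                 # d loss / d w2
    d_h = d_yhat * w2                 # d loss / d h
    d_w1 = d_h * (1 - h ** 2) * x     # d loss / d w1, since tanh' = 1 - tanh^2
    return loss, (d_w1, d_w2)

def gradient_descent_step(params, grads, lr=0.1):
    """Gradient Descent: use the gradients to update the parameters."""
    return tuple(p - lr * g for p, g in zip(params, grads))

params = (0.5, 0.5)   # initial (w1, w2), chosen arbitrarily
x, y = 1.0, 0.8       # a single made-up training sample
losses = []
for _ in range(50):
    loss, grads = forward_backward(*params, x, y)   # compute gradients
    losses.append(loss)
    params = gradient_descent_step(params, grads)   # apply the update

print(f"first loss={losses[0]:.4f}  last loss={losses[-1]:.6f}")
```

Swapping the optimizer (say, for Adam) would only change `gradient_descent_step`; the gradient computation in `forward_backward` stays exactly the same.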