tl;dr: The batch size is the number of samples a network processes before updating its weights. It can range from a single sample to the whole training set. Empirically, there is a sweet spot between 1 and a few hundred samples where people experience the fastest training. Check this article for more details.
A more detailed explanation...
If you have a small enough number of samples, you can let the network see all of them before updating its weights; this is called (full-batch) gradient descent. The benefit is that each update is guaranteed to move the weights in the direction that reduces the training loss over the whole dataset. The downside is that it is computationally expensive, and in most cases infeasible for deep neural nets.
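As a concrete sketch (my own toy example, not from the article): full-batch gradient descent fitting a single weight to noisy data with `y ≈ 3x`. Every update uses the gradient averaged over the entire dataset.

```python
import numpy as np

# Toy dataset: y = 3x + noise, 200 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + 0.1 * rng.normal(size=200)

w = 0.0   # single weight to learn
lr = 0.1  # learning rate
for step in range(100):
    # Gradient of the mean squared error over the *entire* dataset:
    # each step is expensive, but guaranteed to reduce the full training loss.
    grad = 2.0 * np.mean((w * X - y) * X)
    w -= lr * grad

print(w)  # converges close to the true slope of 3
```

Each of the 100 steps touches all 200 samples; with millions of samples this per-step cost is exactly what makes full-batch descent impractical.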
What is done in practice is that the network sees only a batch of the training data, instead of the whole dataset, before updating its weights. This technique does not guarantee that an update reduces the dataset's training loss; it reduces the batch's training loss, which might not be the same thing. That adds noise to the training process, which can in some cases be a good thing, and it means the network needs more steps to converge; this usually isn't a problem, since each step is much faster.
What you're describing is essentially training the network on a single sample at a time. This is formally called stochastic gradient descent (SGD), although the term is used more broadly to include any case where the network is trained on a subset of the whole training set. The problem with this approach is that it adds a great deal of noise to the training process, so it requires many more steps to actually converge.
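The single-sample case is just the loop above with a batch of one (again an illustrative sketch): each gradient comes from one sample, so the updates are extremely cheap but extremely noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + 0.1 * rng.normal(size=200)

w = 0.0
lr = 0.05
for epoch in range(10):
    for i in rng.permutation(len(X)):
        # Gradient from a single sample: very cheap, very noisy.
        grad = 2.0 * (w * X[i] - y[i]) * X[i]
        w -= lr * grad

print(w)  # hovers around 3, but never settles exactly
```

Even in this one-parameter example the weight keeps jittering around the optimum rather than settling on it, which is the noise cost of single-sample updates that the answer describes.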