The key advantage of using minibatches, as opposed to the full dataset, goes back to the fundamental idea of stochastic gradient descent [1].
In batch gradient descent, you compute the gradient over the entire dataset, averaging over a potentially vast amount of information. That takes a lot of memory. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point).
In pure SGD, on the other hand, you update your parameters by subtracting the gradient computed on a single instance of the dataset (scaled by the learning rate). Since it's based on one random data point, the update is very noisy and may point in a direction far from the batch gradient. That noisiness, however, is exactly what you want in non-convex optimization, because it helps you escape saddle points and local minima (Theorem 6 in [2]). The disadvantage is that it's terribly inefficient: you need to loop over the entire dataset many times to find a good solution.
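To make the "noisy estimate" point concrete, here is a minimal sketch (not from either cited paper) on a toy least-squares problem: the gradient computed from one random sample points in roughly the same direction as the full-batch gradient on average, but any single draw can deviate a lot. The dataset, the grad helper, and all numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # toy dataset: 1000 samples, 5 features
y = X @ np.array([1., -2., 0.5, 3., 0.]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(5)                             # current parameters

def grad(w, Xb, yb):
    """Gradient of mean squared error averaged over the rows in (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(w, X, y)                   # batch gradient: averages all 1000 samples
i = rng.integers(len(y))
single_grad = grad(w, X[i:i+1], y[i:i+1])   # pure-SGD gradient: one random sample

# The gap between the two is the "noise" that pure SGD injects into every step.
print(np.linalg.norm(single_grad - full_grad))
```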
The minibatch methodology is a compromise: it injects enough noise into each gradient update, while still achieving relatively speedy convergence.
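A self-contained sketch of that compromise on the same kind of toy least-squares problem, assuming plain NumPy; the learning rate, epoch count, and batch size are arbitrary illustrative choices (32 happens to be the upper end of the advice quoted below).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy dataset, as above
y = X @ np.array([1., -2., 0.5, 3., 0.]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(5)
lr, batch_size = 0.01, 32

for epoch in range(5):
    perm = rng.permutation(len(y))                  # fresh random order every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]        # indices of one minibatch
        g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # averaged minibatch gradient
        w -= lr * g                                 # cheap, moderately noisy update
```

Each update averages over only 32 samples, so it is far cheaper than a full-batch step but much less noisy than a single-sample step.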
[1] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010 (pp. 177-186). Physica-Verlag HD.
[2] Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points: Online stochastic gradient for tensor decomposition. In COLT (pp. 797-842).
I just saw this comment on Yann LeCun's Facebook page, which gives a fresh perspective on this question (sorry, I don't know how to link to a Facebook post):
Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends don't let friends use minibatches larger than 32.
Let's face it: the only reason people have switched to minibatch sizes larger than one since 2012 is because GPUs are inefficient for batch sizes smaller than 32. That's a terrible reason. It just means our hardware sucks.
He cited this paper, which had just been posted on arXiv a few days earlier (Apr 2018) and is worth reading:
Dominic Masters, Carlo Luschi, Revisiting Small Batch Training for Deep Neural Networks, arXiv:1804.07612v1
From the abstract:
While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance ...
The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.