Why should the data be shuffled for machine learning tasks



In machine learning tasks it is common to shuffle the data and normalize it. The purpose of normalization is clear (to put feature values in the same range). But, after struggling a lot, I have not found any convincing reason for shuffling the data.

I have read this post here discussing when we need to shuffle data, but it is not obvious why we should shuffle it. Furthermore, in algorithms such as Adam or SGD we often use mini-batch gradient descent (the data is split into mini-batches and a batch size has to be specified). According to this post, it is vital to shuffle the data for each epoch so that each batch contains different data. So the data is shuffled and, more importantly, the batches change from epoch to epoch.

Why do we do this?


Posted 2017-11-09T07:42:15.517

Reputation: 12 077

It might be useful to state exactly why the answer in the first link didn't help you. Otherwise, we risk repeating content already said there with little improvement. – E_net4 wants more flags – 2017-11-09T11:01:45.500

As I have stated, I want to know why, not when. Do you know why? Is that really explained there? I have not seen any paper on this at all. – Media – 2017-11-09T12:20:24.200


For more information on the impact of example ordering read Curriculum Learning [pdf].

– Emre – 2017-11-09T18:38:21.630


I posted this on CrossValidated and I think it's relevant. https://stats.stackexchange.com/a/311318/89653

– Josh – 2017-11-09T19:03:59.697

@Emre Actually, this paper argues against shuffling. Thanks, I had not heard about this kind of learning. – Media – 2017-11-09T20:40:08.050



Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).

Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.

The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.

Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.

But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.

A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum.

Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$, that is, $ℒ_{X_{i+1}}(W_i)$ may take on a very different value from $ℒ_{X_i}(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training.

To be clear: the shape of $ℒ_{X_{i+1}}$ will -- in general -- be different from that of $ℒ_{X_{i}}$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
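As a minimal sketch of this mechanism (plain NumPy, with a hypothetical helper name), per-epoch shuffling re-permutes the row indices before slicing mini-batches, so the solver sees a different sequence of mini-batch loss surfaces $ℒ_{X_i}$ on every epoch:

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield index arrays for one epoch, re-shuffled on every call."""
    order = rng.permutation(n_samples)  # new row order each epoch
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
epoch1 = [b.tolist() for b in minibatch_indices(8, 2, rng)]
epoch2 = [b.tolist() for b in minibatch_indices(8, 2, rng)]
# Both epochs cover all 8 rows exactly once, but almost certainly group
# them into different mini-batches, i.e. different loss surfaces.
```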

One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
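This invariance is easy to check numerically. A small NumPy sketch (arbitrary synthetic data) evaluates a mean-squared-error loss before and after permuting the rows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # 100 rows, 3 features
y = rng.normal(size=100)
w = rng.normal(size=3)          # a fixed weight vector

def mse(X, y, w):
    # the mean over rows is a commutative reduction, so row order cannot matter
    return np.mean((X @ w - y) ** 2)

perm = rng.permutation(100)
loss_original = mse(X, y, w)
loss_shuffled = mse(X[perm], y[perm], w)
# equal up to floating-point summation order
```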



Reputation: 413

+1. This answer is explained better technically than the other answer with a higher number of upvotes. – Gokul NC – 2019-10-12T17:11:59.030


Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.

The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.

For mini-batch gradient descent, the same logic applies. The idea behind it is that by calculating the gradient on a single mini-batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.

You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.

In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its gradient will be biased by whatever updates data point 16 made to the model. By shuffling your data, you ensure that each data point produces an "independent" update to the model, without being biased by the same points preceding it.
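A quick sketch of this point (toy index arrays, no actual model training): without shuffling, sample 17 always follows sample 16, while a fresh permutation each epoch changes which sample follows 16 from epoch to epoch:

```python
import numpy as np

n, epochs = 100, 5
rng = np.random.default_rng(0)

# Fixed order: index 17 comes right after index 16, every single epoch
fixed_order = list(range(n))
after_16_fixed = fixed_order[fixed_order.index(16) + 1]

# Per-epoch shuffling: the successor of index 16 varies across epochs
after_16_shuffled = []
for _ in range(epochs):
    order = rng.permutation(n).tolist()
    pos = order.index(16)
    if pos < n - 1:                  # 16 might land at the very end
        after_16_shuffled.append(order[pos + 1])
```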

Valentin Calomme


Reputation: 4 666

As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values, for instance. Shuffling is mostly a safeguard; worst case, it's not useful, but you lose nothing by doing it.

For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data; to avoid that, you shuffle. – Valentin Calomme – 2017-11-09T13:19:52.273

I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance. – Josh – 2017-11-09T19:10:20.007

By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance). – Josh – 2017-11-09T19:11:42.140


Suppose the data is sorted in a specific order, for example a dataset sorted by class. If you then select data for training, validation, and testing without taking this into account, each split may end up containing only certain classes, and the process will fail.

Hence, to prevent this kind of problem, a simple solution is to shuffle the data before splitting it into training, validation, and test sets.
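A toy illustration of this (synthetic labels, a hypothetical 80/20 split): on class-sorted data, a naive contiguous split puts only one class in the test set, while shuffling first mixes the classes across the splits:

```python
import numpy as np

# Toy dataset sorted by class: first 50 labels are 0, last 50 are 1
y = np.array([0] * 50 + [1] * 50)

# Naive contiguous 80/20 split on the sorted data
test_naive = y[80:]                 # contains only class 1

# Shuffle first, then split
rng = np.random.default_rng(0)
y_shuffled = y[rng.permutation(100)]
train_shuffled, test_shuffled = y_shuffled[:80], y_shuffled[80:]
# with this seed, both classes almost surely appear on both sides
```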

Regarding mini-batches, the answers to this post may address your question.



Reputation: 1 001

@Media The most related answer in the provided link is: "Shuffling mini-batches makes the gradients more variable, which can help convergence because it increases the likelihood of hitting a good direction" – OmG – 2017-11-09T13:14:35.717

Actually, I have seen this in the SGD paper, but as the authors claimed, it is the reason for convergence, not the shuffling. I saw the link and I doubt it a bit. For more clarity, look at this amazing paper. The authors have mentioned the point there, but as you will see, no exact reason for shuffling is given.

– Media – 2017-11-09T13:21:45.307


Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent

Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching. Otherwise, you are getting the same finite number of surfaces.

Gerardo Consuelos


Reputation: 51

Thank you and welcome to our community. – Media – 2019-03-24T03:36:30.870


We need to shuffle only for mini-batch/stochastic gradient descent; there is no need to shuffle for full-batch gradient descent.

Without shuffling, the data may be sorted, or similar data points will lie next to each other, which leads to slow convergence:

  • Similar samples produce similar surfaces (one loss surface per sample) -> the gradients will point in similar directions, but this direction rarely points to the minimum -> it may drive the weights very far from the minimum
  • "Best direction": the average of the gradients over all surfaces (full-batch gradient descent), which points directly toward the minimum
  • "Mini-batch direction": the average of a variety of directions will point closer to the minimum, although none of them points to it exactly
  • "1-sample direction": points farther from the minimum compared to the mini-batch direction

I drew the plot of the L2 loss function for linear regression on y = 2x here
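To make the "average of all gradients" point concrete, here is a small sketch for linear regression on y = 2x with squared-error loss (synthetic data, arbitrary starting weights): the full-batch gradient is exactly the mean of the per-sample gradients, while any single-sample or mini-batch gradient is only a noisy estimate of that average:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=32)
y = 2.0 * x                          # data on the line y = 2x
w1, w0 = 0.5, 0.5                    # current weights, far from (2, 0)

def grad_one(xi, yi):
    # gradient of the squared error (w1*xi + w0 - yi)^2 w.r.t. (w1, w0)
    err = w1 * xi + w0 - yi
    return np.array([2 * err * xi, 2 * err])

per_sample = np.array([grad_one(xi, yi) for xi, yi in zip(x, y)])
full_batch = per_sample.mean(axis=0)     # the "best direction"
minibatch = per_sample[:8].mean(axis=0)  # one mini-batch estimate of it
```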



Reputation: 131


For the best model accuracy, it is always recommended that the training data contain all varieties of the data.

Shuffling the training data helps us achieve this goal.

vipin bansal


Reputation: 1 322


By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$

Ankit DEsai


Reputation: 11


Since the SGD algorithm selects subsets of instances randomly, it is quite possible that it picks some instances many times per epoch, which may bring the cost function to a global minimum.

If the training instances are shuffled, then the chance of repeatedly selecting the same instances is much lower.

Source of information: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

Chirag Palan


Reputation: 11


Well, after years, I now really know why we shuffle the data! The idea is very simple, but I do not know why we never really considered it.

When constructing the cost function, we explicitly assume that the samples are i.i.d. For instance, in binary cross-entropy you can easily see that we have a summation. That summation was originally a product, and after taking the logarithm it became a sum. In the formulation of that cost function, we have discarded the joint probability because it is difficult to compute; with the i.i.d. assumption we obtain the current cost function. Now suppose our task is learning with different mini-batches and these mini-batches are not identical: shuffling makes each mini-batch look like an i.i.d. draw from the data distribution, which is exactly the assumption the cost function relies on.
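The product-to-sum step can be written out explicitly. Under the i.i.d. assumption the joint likelihood factorizes, and taking the negative logarithm turns the product into the familiar binary cross-entropy sum (here $\hat{y}_i$ denotes the model's predicted probability for sample $i$):

```latex
% i.i.d. assumption: the joint likelihood factorizes over samples
P(y_1,\dots,y_N \mid x_1,\dots,x_N; W) = \prod_{i=1}^{N} p(y_i \mid x_i; W)

% negative log-likelihood: the product becomes a sum (binary cross-entropy)
\mathcal{L}(W) = -\sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\big]
```

Shuffling before each epoch keeps every mini-batch looking like an i.i.d. draw from the data, so the mini-batch sum remains an unbiased estimate of the full sum.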



Reputation: 12 077