Does the model learn from the average of all the data points in the mini-batch?


I used the example at - - to create my own classification model. I used different data but the basic outline of datasets was used.

It was important for my data type to shuffle the data and then create the training and testing sets. The problem, however, comes as a result of the shuffling.

When I train my model with the shuffled train set I get roughly 80% accuracy on the training set and roughly 70% on the test set. I then want to feed the full dataset (i.e. the set from which the training and test sets were made) into the model to view the model's predictions for the entire dataset.

If this data set is shuffled as the training and testing set was I get an accuracy of around 77% which is as expected, but then, if I input the unshuffled data (as I required to view the predictions), I get a 45% accuracy. How is this possible?

I assume it's due to the fact that the model is learning incorrectly and that it learns that the order of the data points plays a role in the prediction of those data points. But this shouldn't be happening as I am simply trying to (like the MNIST example) predict each data point separately. This could be a mini-batch training problem.

In the example mentioned above, using data sets and batches to train, does the model learn from the average of all the data points in the mini-batch or does it think one mini-batch is one data point and learn in that manner (which would mean order matters of the data)?

Or if there are any other suggestions.

Emile Engelbrecht

Posted 2018-05-22T09:49:42.907

Reputation: 61

I was using batch_normalization which seemed to provide a very high accuracy but then caused the model to make the order of the data important. – Emile Engelbrecht – 2018-05-24T19:18:23.210



if I input the unshuffled data (as I required to view the predictions) I get a 45% accuracy. How is this possible?

When you build your dataset and then split it into training, test, and validation sets, you have to make sure that the training set covers the same aspects of the data that appear in the test and validation sets. That's why shuffling the data is important. If you don't shuffle your dataset, your NN is confronted with objects it never saw during training (i.e. they never appeared in the training set) and therefore cannot classify them properly.
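A minimal NumPy sketch of this failure mode, using a hypothetical toy dataset (not the question's actual data) that is stored ordered by class label:

```python
import numpy as np

# Hypothetical toy dataset, ordered by class label:
# all class-0 points first, then all class-1 points.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Unshuffled 70/30 split: the test set contains only class 1,
# and the training set over-represents class 0.
y_train_seq, y_test_seq = y[:70], y[70:]
print(set(y_train_seq), set(y_test_seq))  # {0, 1} vs. {1}

# Shuffled split: both classes appear on both sides of the split.
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
y_train_shuf, y_test_shuf = y[idx[:70]], y[idx[70:]]
print(set(y_train_shuf), set(y_test_shuf))
```

With the unshuffled split, the model is evaluated almost entirely on a class it barely trained on, which is exactly the kind of accuracy collapse described above.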

does the model learn from the average of all the data points in the mini-batch or does it think one mini-batch is one data point and learn in that manner (which would mean order matters of the data)?

When using mini-batches, you choose the size of each mini-batch (b_size) and the number of mini-batches (n_batch). Then, n_batch times, you draw b_size random indices into the X_train and y_train arrays, build X_batch and y_batch from those indices, and tune your model's parameters on these two arrays.
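The sampling loop described above can be sketched as follows (using made-up toy data in place of X_train and y_train, and a commented-out placeholder where the actual parameter update would go):

```python
import numpy as np

# Hypothetical toy data standing in for X_train / y_train.
X_train = np.random.randn(1000, 8)
y_train = np.random.randint(0, 2, size=1000)

b_size, n_batch = 32, 10
rng = np.random.default_rng(42)

for _ in range(n_batch):
    # Draw b_size random indices into the training arrays.
    idx = rng.integers(0, len(X_train), size=b_size)
    X_batch, y_batch = X_train[idx], y_train[idx]
    # model.train_on_batch(X_batch, y_batch)  # tune parameters here
```

Because the indices are drawn at random each time, any ordering present in the original arrays has no influence on which points end up in a batch together.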

So shuffling your dataset and randomly sampling parts of it to train your model removes correlations between neighbouring points in your dataset and leads to better generalization.


Posted 2018-05-22T09:49:42.907

Reputation: 93


The example followed in the question uses a relatively straightforward Convolutional Neural Network. These are not stateful, so the order in which predictions for instances from a test set are queried should have no influence on those predictions.

In a comment on their own question, the author mentioned that the use of Batch Normalization appears to be the cause of the issue. Given this info, one possible explanation is incorrect usage of the training flag of TensorFlow's Batch Normalization implementation. The official documentation says the following about this flag:

training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). Whether to return the output in training mode (normalized with statistics of the current batch) or in inference mode (normalized with moving statistics). NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.

If this is incorrectly set to True rather than False outside of the training phase (i.e. when evaluating performance), predictions can be expected to be poor. This alone doesn't explain why specifically the order of the test data would matter, though: if this alone were the issue, we'd expect test performance to be poor regardless of order.

A different possible explanation is that a mistake in the code still causes the moving_mean and moving_variance ops of the Batch Normalization to be updated during testing/evaluation. These should only be updated during the training phase, as explained in the documentation linked above. If they keep getting updated during the test phase, and if there is meaningful structure in the unshuffled ordering of the test set (e.g. a test set ordered by class, or ordered according to certain features), then we would expect precisely the issue described in the question to occur.
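A hedged NumPy sketch of this second explanation (a simplified stand-in for Batch Normalization, not the actual TensorFlow code): if the moving statistics keep updating at inference time, a batch of all-class-0 points drags them away from the training-time values, whereas a shuffled batch barely moves them.

```python
import numpy as np

def buggy_bn(x_batch, stats, momentum=0.9):
    """Batch norm that WRONGLY keeps updating moving statistics at inference."""
    stats["mean"] = momentum * stats["mean"] + (1 - momentum) * x_batch.mean()
    stats["var"] = momentum * stats["var"] + (1 - momentum) * x_batch.var()
    return (x_batch - stats["mean"]) / np.sqrt(stats["var"] + 1e-5)

rng = np.random.default_rng(0)
class0 = rng.normal(-2.0, 1.0, size=50)  # test set ordered by class:
class1 = rng.normal(+2.0, 1.0, size=50)  # all class-0 points come first

# Unshuffled evaluation: after the first (all-class-0) batch, the moving
# mean has drifted toward class 0, away from its training-time value of 0.
stats_unshuffled = {"mean": 0.0, "var": 1.0}
buggy_bn(class0, stats_unshuffled)
drift_unshuffled = abs(stats_unshuffled["mean"])

# Shuffled evaluation: the first batch mixes both classes, so its mean is
# close to 0 and the moving statistics barely move.
mixed = rng.permutation(np.concatenate([class0, class1]))
stats_shuffled = {"mean": 0.0, "var": 1.0}
buggy_bn(mixed[:50], stats_shuffled)
drift_shuffled = abs(stats_shuffled["mean"])

print(drift_unshuffled, drift_shuffled)
```

The same inputs end up normalized differently depending on the order in which batches arrive, so a downstream classifier sees shifted features, which is consistent with the order-dependent accuracy drop in the question.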

Dennis Soemers

Posted 2018-05-22T09:49:42.907

Reputation: 7 644


Problem Statement

These are the features of the runs.

  • CNN class prediction using two 2D convolutions with associated max pooling
  • mini-batch execution approach
  • fixed shuffling process used

These are the results.

  • shuffling obtained 80%/70% train/test accuracy
  • shuffling for full set obtained 77% accuracy
  • no shuffling for full set obtained 45% accuracy

Listed Causes

These are the potential causes for the apparent anomaly that were listed in the question.

  • model is learning incorrectly
  • learns that the order of the data points plays a role in their prediction
  • data point not predicted separately because of mini-batching

Other Causal Possibilities not Listed

Notice that both of these additional possibilities are related to an insufficiency in the simulation of randomness, just as can be the case in cryptographic protocols.

  • The CNN learns the shuffling system or some aspect of it so that when the shuffling is removed, the training no longer applies to the input patterns
  • How the training and testing samples are drawn is not sufficiently random

Additional Questions

Does the model learn from the average of all the data points in the mini-batch? — Yes.

Does it think one mini-batch is one data point? — No. It doesn't think, and the loop does not average the data points themselves before the forward propagation. Mini-batch training simply averages the per-example results (the correction signal) before back-propagating the update to the parameter tensors.
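A small NumPy sketch of that point, using a made-up linear model with squared loss: the mini-batch gradient equals the average of the per-example gradients, so the inputs themselves are never averaged together.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=3)           # toy linear model parameters
X = rng.normal(size=(4, 3))      # one mini-batch of 4 data points
y = rng.normal(size=4)

def grad(w, x, t):
    # Gradient of 0.5 * (w.x - t)^2 with respect to w, for one example.
    return (x @ w - t) * x

# Per-example gradients, averaged afterwards:
per_example = np.stack([grad(w, X[i], y[i]) for i in range(4)])
avg_grad = per_example.mean(axis=0)

# Whole-batch gradient computed in one shot (as frameworks do):
batch_grad = X.T @ (X @ w - y) / len(y)

print(np.allclose(avg_grad, batch_grad))  # True: same update either way
```

Because the update depends only on the set of points in the batch, not their order within it, mini-batching by itself does not make data order matter.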

Does order matter? — Order cannot matter in a stateless system, but often does if there is state remembered between discrete events. Mini-batch requires averaging, which requires statefulness to accumulate the addends. But that is not the likely cause. How the batches are selected from the sample is a more likely factor affecting accuracy.

Principles to Comprehend

The convergence of artificial networks in general is based on the statistical characteristics of the training scenario matching the statistical characteristics of the usage scenario. In other words, to use PAC (probably approximately correct) framework terminology, how the training sample is drawn from the total population must be identical to how the validation sample is drawn from the total population. Therefore, if the training sample is not drawn with sufficient randomness from the total population, convergence cannot be guaranteed.

Questions to Consider

  • How am I deciding the individual operations within the shuffling?
  • How am I drawing the train and test samples?
  • How am I deciding what samples go in what batch?
  • What natural order is in the data examples, and is it really a sequence rather than a set?
  • If a sequence, then is a classic CNN, not designed out of the box to handle temporal sequences, the correct network design to apply?

Answering these questions and gaining a full conceptual understanding of the probability and statistics aspects of the approach should occur prior to thinking about normalization, which could fix your problem accidentally, but cannot be the root cause of the anomaly.

Douglas Daseeco

Posted 2018-05-22T09:49:42.907

Reputation: 7 174