## Ordered elements of feature vectors for autoencoders?

Here is a newbie question: when one trains an autoencoder or a variational autoencoder, does the order of the elements in the training vector $x$ matter?

Suppose I take an MNIST image $(28\times28)$ and turn it into a feature vector $x \in \mathbb{R}^{1\times784}$. Does it matter if I, e.g., flatten the whole image vertically, horizontally, or in some other fancy way? And if I were to scramble the order of the elements in the feature vector, would that make the VAE or AE mess up?

For a fully-connected network, the precise order of the features does not matter initially (i.e. before you start to train), as long as it is consistent across examples. This is true whether you are training an autoencoder or any other fully-connected network, and processing images with pixels as features does not change it.
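
To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer): any fixed permutation of the input can be absorbed into the first layer's weight matrix, so the set of functions a fully-connected network can represent is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(784)        # one flattened 28x28 "image"
W = rng.standard_normal((64, 784))  # weights of a first fully-connected layer
perm = rng.permutation(784)         # one fixed pixel shuffle

# Permuting the input and permuting the weight columns in the same way
# yields exactly the same pre-activations, because perm is a bijection:
assert np.allclose(W @ x, W[:, perm] @ x[perm])
```

An untrained network therefore sees a consistent shuffle as a mere relabelling of its inputs, and training can compensate by learning correspondingly permuted weights.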

Some caveats:

• To succeed in training, you will need the pixel order to be the same for each example. So it can be randomly shuffled, but only if you keep the same shuffle for each and every example.

• As an aside, you would still get some training effect even from a fully random, per-example shuffle, because, for example, a written "8" has more filled pixels on average than a written "1". But performance would be very poor, with accuracy only a little better than guessing, for most interesting problem domains.
• To visualise what the autoencoder has learned, your output needs to be unscrambled. You can actually feed in a scrambled image (same shuffle for each example) and train the autoencoder to unscramble it; in theory this reaches the same accuracy as training to match the scrambled input, which again shows that pixel order is not important. You could also train the autoencoder to match scrambled input to scrambled output and visualise the result by reversing the scrambling (again, this must be a consistent scramble, the same for each example). A sketch of both variants follows this list.
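
The answer above names no framework, so here is one possible PyTorch sketch of the consistent-shuffle experiment; the model size, hyperparameters, and training setup are illustrative assumptions, not prescribed by the answer.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

torch.manual_seed(0)
perm = torch.randperm(784)  # ONE fixed shuffle, reused for every example

# Plain fully-connected autoencoder (illustrative layer sizes)
model = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

data = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=128, shuffle=True)

for x, _ in loader:                  # one epoch is enough for a demo
    x = x.view(x.size(0), -1)        # flatten each image to 784 values
    x_shuf = x[:, perm]              # the same pixel shuffle every time
    # Scrambled-to-scrambled variant; for the unscrambling variant from
    # the caveat above, use loss_fn(model(x_shuf), x) instead.
    loss = loss_fn(model(x_shuf), x_shuf)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

To visualise reconstructions from the scrambled-to-scrambled variant, invert the permutation (e.g. `inv = torch.argsort(perm)`) and reorder the output back into a $28\times28$ image.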

In a fully-connected neural network, nothing in the model represents the local relationships between pixels, or even that they are related at all. So the network will learn relations (such as edges) regardless of how the image is presented. But it will also suffer from poor generalisation: just because an edge between pixels 3 and 4 is important, the network will not learn that the same edge between pixels 31 and 32 is similar, unless many examples of both occur in the training data.

Addressing this poor generalisation, caused by the loss of locality information in the model, is one of the motivations for convolutional neural networks (CNNs). You can have CNN autoencoders, and for those you intentionally preserve the 2D structure and local relationships between pixels; if you did not, the network would function very poorly or not at all.
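
For contrast, a minimal convolutional autoencoder sketch (layer sizes are my own illustrative choices) keeps the input as a $28\times28$ grid so the convolutions can exploit locality; this is exactly the structure that pixel shuffling would destroy.

```python
import torch.nn as nn

# Convolutional autoencoder: the input stays a (1, 28, 28) image, so
# each filter sees genuinely neighbouring pixels.
conv_ae = nn.Sequential(
    # encoder: 28x28 -> 14x14 -> 7x7
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # decoder: 7x7 -> 14x14 -> 28x28
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.Sigmoid(),
)
```

Feeding this model consistently shuffled pixels would break the local structure the convolutions rely on, which is the answer's point: for a CNN, unlike a fully-connected network, pixel order matters a great deal.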