The purpose of having a prior distribution $p(z)$ in any generative adversarial network is to be able to smoothly match a latent code $z$ in a known distribution to an input $x$ in the domain and vice versa. The encoder of a simple autoencoder, with no additional measures beyond the typical pipeline
$$x \rightarrow E \rightarrow z \rightarrow D \rightarrow x'$$
would only require $x$ to approach $x' = D(E(x))$, and for that purpose the decoder may simply learn to reconstruct $x$ regardless of the distribution obtained from $E$. This means that $p(z)$ can be very irregular, making generation of new samples less feasible. Even with slight changes to the bottleneck vector, we cannot be sure that the encoder would ever be able to produce that code with any $x$.
In an adversarial autoencoder (AAE), however, the encoder's job is twofold: it encodes inputs from $p(x)$ into the respective codes in $q(z)$ so that:
- it minimizes the reconstruction cost $f(x, D(E(x)))$ (where $f$ is a distance metric between samples, such as the mean squared error);
- while it learns to adapt $q(z)$ to the prior distribution $p(z)$.
The latter task is effectively enforced because the discriminator receives:
- positive feedback from codes in $p(z)$;
- and negative feedback from codes in $q(z)$.
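As a rough sketch, the two tasks can be written as a pair of objectives. The symbol $C$ for the discriminator is introduced here only because $D$ already denotes the decoder, and the adversarial term assumes a deterministic encoder and the standard GAN log-loss, which is one common choice rather than the only one:

$$\min_{E, D} \; \mathbb{E}_{x \sim p(x)} \big[ f(x, D(E(x))) \big]$$

$$\min_{E} \max_{C} \; \mathbb{E}_{z \sim p(z)} \big[ \log C(z) \big] + \mathbb{E}_{x \sim p(x)} \big[ \log \big( 1 - C(E(x)) \big) \big]$$

The first objective is the reconstruction cost. In the second, $C$ is rewarded for scoring codes from $p(z)$ high and codes from $q(z)$ low, while $E$ is rewarded for making its codes indistinguishable from prior samples.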
Even if the discriminator knows nothing about either distribution at the beginning, it learns to tell them apart after enough training iterations. The ideal encoder will manage to fool the discriminator to the point where its accuracy is approximately 50%, that is, no better than chance.
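The 50% figure can be illustrated with a small, self-contained sketch. It assumes a 1-D standard normal prior and a Gaussian $q(z)$, and it uses the closed-form optimal discriminator $p(z) / (p(z) + q(z))$ from standard GAN theory instead of a trained network. All function names are hypothetical.

```python
import math
import random


def gaussian_pdf(z, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(z, q_mean, q_std):
    """Probability that z was drawn from the prior p(z) = N(0, 1) rather than q(z)."""
    p = gaussian_pdf(z, 0.0, 1.0)
    q = gaussian_pdf(z, q_mean, q_std)
    return p / (p + q)


def discriminator_accuracy(q_mean, q_std, n=20000, seed=0):
    """Monte Carlo accuracy of the optimal discriminator on a balanced batch.

    Scores above 0.5 are classified as "prior"; ties count as "encoder code".
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        z_prior = rng.gauss(0.0, 1.0)      # sample from p(z), true label "prior"
        z_code = rng.gauss(q_mean, q_std)  # sample from q(z), true label "encoder code"
        correct += int(optimal_discriminator(z_prior, q_mean, q_std) > 0.5)
        correct += int(optimal_discriminator(z_code, q_mean, q_std) <= 0.5)
    return correct / (2 * n)


# q(z) far from the prior: the discriminator separates the two easily.
print(discriminator_accuracy(q_mean=3.0, q_std=1.0))
# q(z) identical to the prior: the score is 0.5 everywhere, so accuracy is chance.
print(discriminator_accuracy(q_mean=0.0, q_std=1.0))
```

When $q(z)$ matches the prior, even the best possible discriminator can do no better than guessing, which is the state the encoder is pushed toward.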
Also note that $p(z)$ need not be just a Gaussian or uniform distribution (that is, some sort of noise).
Quoting from Goodfellow's Deep Learning book (chapter 20):
> When developing generative models, we often wish to extend neural networks to implement stochastic transformations of $x$. One straightforward way to do this is to augment the neural network with extra inputs $z$ that are sampled from some simple probability distribution, such as a uniform or Gaussian distribution. The neural network can then continue to perform deterministic computation internally, but the function $f(x, z)$ will appear stochastic to an observer who does not have access to $z$.
Although denoising autoencoders rely on this aspect to learn a model that ignores noise from a sample, the same paper on AAEs (section 2.3) shows that combining noise with a one-hot encoded vector of classes can be used to incorporate label information about the sample. This information is only provided to the discriminator, but it still influences how the encoder produces $q(z)$.
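A minimal sketch of what "combining with a one-hot vector" could look like at the discriminator's input. The layout (label first, then code), the 2-D prior with one Gaussian blob per class, and every function name are assumptions made for illustration, not details taken from the paper:

```python
import math
import random


def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def sample_prior(label, num_classes, rng, radius=4.0, std=0.5):
    """Hypothetical 2-D prior: one Gaussian blob per class, centred on a circle."""
    angle = 2.0 * math.pi * label / num_classes
    return [
        rng.gauss(radius * math.cos(angle), std),
        rng.gauss(radius * math.sin(angle), std),
    ]


def discriminator_input(code, label, num_classes):
    """Concatenate the one-hot label with a latent code (assumed layout)."""
    return one_hot(label, num_classes) + list(code)


rng = random.Random(0)
label, num_classes = 2, 4
encoder_code = [0.1, -0.2]  # stands in for E(x); the encoder never sees the label

# Positive example: label plus a code drawn from that label's part of the prior.
positive = discriminator_input(sample_prior(label, num_classes, rng), label, num_classes)
# Negative example: the same label paired with the encoder's code.
negative = discriminator_input(encoder_code, label, num_classes)
print(positive)
print(negative)
```

Because the label is attached only on the discriminator's side, the encoder can fool it only by placing codes for each class where the prior puts that class.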