In a variational autoencoder (VAE), the encoder's outputs $\mu$ and $\sigma$ are deterministic functions of the input data $x$: you feed $x$ into the encoder neural network and it produces $\mu$ and $\sigma$; there is nothing random here.

The hidden representation $z$ (the posterior sample) is then drawn from the Gaussian distribution parameterized by $\mu$ and $\sigma$. So notice that we do not talk about the distribution of $\mu$ and $\sigma$ (they are deterministic functions of $x$, not random variables), but we do talk about the distribution of $z$.
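A minimal numerical sketch of this two-step picture, with made-up encoder outputs standing in for a real network (the values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder's deterministic outputs for one input x
# (a real VAE computes these with a neural network).
mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])   # encoders usually output log(sigma^2)

# The only randomness enters here: z ~ N(mu, sigma^2), drawn via the
# reparameterization z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps
```

Drawing many such $z$ for the same $x$ and averaging would recover $\mu$, which makes explicit that $\mu$ and $\sigma$ are fixed given $x$ while $z$ is random.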

### "why do we want to feed forward latent matrix with Gaussian distribution?"

To be specific, we feed a Gaussian-distributed variable $z$ into the decoder network. This is the basic modeling assumption of a VAE about how samples are generated. Note that one of the main differences between a VAE and a standard autoencoder is that a VAE is a *generative* model, in the sense that you can generate new samples at random, rather than only using an input $x$ to reconstruct something similar to $x$.

Imagine a VAE trained on the MNIST handwritten-digit dataset: we randomly sample $z \sim p(z) = \mathcal{N}(0,1)$ and feed this $z$ into the decoder network, which gives us an output distribution (again a Gaussian); sampling from that output distribution yields an image of a digit.
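As a sketch of this generation step, with a toy linear-plus-sigmoid stand-in for the trained decoder (the weights below are random, not learned, so the "image" is noise; the point is only the control flow):

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n_pixels = 2, 784   # 784 = 28x28 MNIST pixels

# Stand-in for a trained decoder: one linear layer plus a sigmoid.
# In a real VAE this would be a learned deep network.
W = rng.standard_normal((latent_dim, n_pixels))

def decoder(z):
    return 1.0 / (1.0 + np.exp(-(z @ W)))  # per-pixel means in (0, 1)

# Generation needs no input image: sample z from the prior N(0, I)
# and decode it into the parameters of the output distribution.
z = rng.standard_normal(latent_dim)
pixel_means = decoder(z)
```

Contrast this with a standard autoencoder, which has no prior to sample from and can only map an existing input to a reconstruction.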

### "Why compare a distribution of Mu and Sigma matrices?"

We do not compare the distribution of $\mu$ and $\sigma$. As explained above, we are talking about the Gaussian distributed hidden variable $z$ (posterior $q_{\theta}(z|x)$) parameterized by $\theta=(\mu,\sigma)$. We compare the posterior distribution of $z$ with its prior ($p(z)=\mathcal{N}(0,1)$) and the difference (measured by their KL-divergence) is used as a regularization term in the loss function.
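For diagonal Gaussians this KL divergence has a closed form, so no sampling is needed to compute the regularization term. A small sketch (the function name is mine):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions:
    -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# The KL term is zero exactly when the posterior equals the prior:
assert kl_to_standard_normal(np.zeros(2), np.zeros(2)) == 0.0
# ...and grows as mu drifts from 0 or sigma from 1:
assert kl_to_standard_normal(np.array([2.0, 0.0]), np.zeros(2)) == 2.0
```

Minimizing this term is what pulls each posterior $q_{\theta}(z|x)$ toward the prior $\mathcal{N}(0,1)$.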

The reason is that we want the latent representation $z$ to stay close to its prior (the standard Gaussian). This matters for how we generate samples from the VAE: we first sample from the standard Gaussian prior. If we encode an $x$ to somewhere far from this prior, it is unlikely that sampling from the prior will ever land there, so we could not generate such samples.

For example, a well-trained posterior (first two dimensions) for MNIST looks like this (photo from here):

Notice that the posterior for each digit (e.g. dark blue for digit "0") is distributed like a Gaussian and lies not far from $\mathcal{N}(0,1)$, but different digits do occupy different areas.

Many online tutorials (e.g. here or here) explain the concepts behind VAEs in more detail; I would suggest reading them.

Thank you for your time and your reply. I have kept studying, and your answer gives me a good example.

Still I don't get one thing. This confuses me a lot:

in many tutorials you see this:

`kl_loss = - 0.5 * mean(1 + z_log_sigma - square(z_mean) - exp(z_log_sigma), axis=-1)` Isn't this a KL divergence between the mu and sigma matrices? Or do I need to take a math course as well? Thanks – Stenga – 2018-09-25T10:27:00.790

It's not. The expression you mention computes the KL divergence between the Gaussian parameterized by `z_mean` and `z_log_sigma` and the standard Gaussian. You can find the closed-form equation here: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions

– user12075 – 2018-09-25T16:41:11.490
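This can also be checked numerically. A sketch (taking `z_log_sigma` to mean $\log\sigma^2$, as in the Keras VAE example such tutorials follow, and using made-up values):

```python
import numpy as np

# The tutorials' expression, evaluated per latent dimension:
z_mean = np.array([0.5, -1.0])
z_log_sigma = np.array([0.0, np.log(0.5)])   # interpreted as log(sigma^2)

kl_per_dim = -0.5 * (1 + z_log_sigma - np.square(z_mean) - np.exp(z_log_sigma))

# Cross-check against the textbook closed form for
# KL( N(mu, sigma^2) || N(0, 1) ) written with sigma^2 directly:
sigma2 = np.exp(z_log_sigma)
kl_ref = 0.5 * (sigma2 + z_mean**2 - 1.0 - np.log(sigma2))
assert np.allclose(kl_per_dim, kl_ref)   # same quantity, just rearranged
```

So the expression is a divergence between two distributions over $z$ (posterior vs. standard-normal prior), not a divergence "between mu and sigma".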