What do the mu and sigma vectors really mean in a VAE?


In a standard autoencoder, we encode the data into a bottleneck and then decode it, using the original input as the target to compute the loss. Matrix multiplications and activations are applied throughout the network, and if training goes well, the output should be close to the original input. I completely understand this.

But with a VAE I have a few problems, especially in understanding the latent space vector. I think I know how to create it, but I am not sure what its purpose is. Below are a few ideas; please let me know if I am correct.

  1. We first create the mu and sigma matrices, which are just matrix multiplications of the previous hidden layer and random weights.

  2. To create Z (the latent matrix), we use the reparameterization trick: mu + exp(0.5*log_sigma)*epsilon, where epsilon is a random matrix with 0 mean and 1 std. I have seen that this Z always follows a Gaussian distribution, no matter what the distribution of the mu and sigma vectors is. And here is my question: why do we want to feed forward latent matrix with Gaussian distribution?

  3. At the end of decoding, we compute the loss function and penalize the network with the KL divergence of the mu and sigma matrices. I don't understand why it is important to compare the distributions of the mu and sigma matrices in the first place. I assume that during back-propagation mu and sigma become closer in terms of distribution, but why is this important? Why must those two matrices be close to each other in terms of distribution?

I would really appreciate very simple answers with simple examples, if possible.


Posted 2018-09-24T15:21:47.983




In a variational autoencoder (VAE), the outputs of the encoder, $\mu$ and $\sigma$, are deterministic functions of the input data $x$ (you put $x$ into your encoder neural network and it generates $\mu$ and $\sigma$; there is nothing random here).

Then the hidden representation $z$ (the posterior) is sampled from the Gaussian distribution parameterized by $\mu$ and $\sigma$. So notice that we don't talk about the distribution of $\mu$ and $\sigma$ (they are deterministic given $x$, not random variables), but we do talk about the distribution of $z$.
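As a minimal sketch of that sampling step (not taken from any particular VAE implementation), the NumPy snippet below draws $z$ as $\mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$. The function name `sample_z` and the values of `mu` and `sigma` are made up for illustration; in a real VAE they would come from the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)  # the only source of randomness
    return mu + sigma * eps

# Hypothetical encoder outputs for a single input x (2 latent dimensions).
mu = np.array([1.5, -0.5])
sigma = np.array([0.3, 2.0])

# mu and sigma stay fixed for this x; only z changes from draw to draw.
samples = np.stack([sample_z(mu, sigma, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
print(sample_mean, sample_std)
```

Over many draws the empirical mean and standard deviation of $z$ should approach `mu` and `sigma`, which is the sense in which $\mu$ and $\sigma$ parameterize the distribution of $z$ rather than having a distribution themselves.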

"why do we want to feed forward latent matrix with Gaussian distribution?"

To be specific, we feed a Gaussian-distributed variable $z$ into the decoder network. This is the basic model assumption of a VAE regarding how samples are generated. Notice that one of the main differences between a VAE and a standard autoencoder is that a VAE is a generative model, in the sense that you can randomly generate samples, instead of using an input $x$ to reconstruct something similar to $x$.

If we imagine a VAE trained on the MNIST handwritten digits dataset, we want to randomly sample $z \sim p(z)=\mathcal{N}(0,1)$ and feed this $z$ into the decoder network, which gives us the output distribution (which is again a Gaussian); sampling from that output distribution results in an image of a digit.
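The generation procedure can be sketched roughly as follows. Everything here is an assumption made for illustration: `decode` is a stand-in with random, untrained weights `W` and `b`, so it produces noise rather than digits, and it returns only the mean of the output distribution, skipping the final sampling step.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_pixels = 2, 28 * 28

# Stand-in for a trained decoder: one linear layer followed by a sigmoid.
# The weights are random here, so the outputs are not real digits.
W = rng.standard_normal((latent_dim, n_pixels))
b = np.zeros(n_pixels)

def decode(z):
    """Map latent vectors to pixel intensities in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))

# Generation needs no input x: sample z from the prior N(0, I) and decode it.
z = rng.standard_normal((5, latent_dim))
images = decode(z).reshape(5, 28, 28)
print(images.shape)
```

The point is only the data flow: at generation time the encoder is not used at all, so the decoder only ever sees values of $z$ drawn from the prior.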

"Why compare a distribution of Mu and Sigma matrices? "

We do not compare the distribution of $\mu$ and $\sigma$. As explained above, we are talking about the Gaussian distributed hidden variable $z$ (posterior $q_{\theta}(z|x)$) parameterized by $\theta=(\mu,\sigma)$. We compare the posterior distribution of $z$ with its prior ($p(z)=\mathcal{N}(0,1)$) and the difference (measured by their KL-divergence) is used as a regularization term in the loss function.
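For reference, assuming the posterior has a diagonal covariance (the usual choice, though not stated above) and $d$ latent dimensions, this KL term has a closed form:

$$D_{KL}\big(q_{\theta}(z|x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right)$$

It is zero exactly when $\mu_j = 0$ and $\sigma_j = 1$ for every $j$, and it grows as $\mu$ moves away from $0$ or as $\sigma$ moves away from $1$.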

The reason for that is we want the latent representation $z$ to be close to its prior (standard Gaussian). This is relevant to how we generate samples from the VAE (we first sample from the standard Gaussian prior). If we encode an $x$ to somewhere very far from this prior, it will be unlikely that we can generate such samples.

For example, a well-trained posterior (first two dimensions) for MNIST looks like this (photo from here):
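To get a feel for the size of this penalty, here is a small sketch of the KL divergence between a diagonal Gaussian and the standard Gaussian. The function name `kl_to_standard_normal` and the parameterization by `log_var` (the log of the variance $\sigma^2$) are choices made for this example, and the values of `mu` are invented.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.square(mu) + np.exp(log_var) - log_var - 1.0, axis=-1)

log_var = np.zeros(2)  # sigma = 1 in both latent dimensions

# Posterior identical to the prior, slightly shifted, and far away.
kl_at_prior = kl_to_standard_normal(np.zeros(2), log_var)
kl_near = kl_to_standard_normal(np.array([0.5, -0.5]), log_var)
kl_far = kl_to_standard_normal(np.array([10.0, -10.0]), log_var)
print(kl_at_prior, kl_near, kl_far)
```

An encoding that sits far from the origin pays a much larger penalty than one close to it, so training pushes the posteriors toward the region where the prior actually places samples.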

posterior on MNIST

Notice that the posterior for each digit (e.g. dark blue for digit "0") is distributed like a Gaussian and is not far from $\mathcal{N}(0,1)$, but different digits do occupy different areas.

Many online tutorials (e.g. here or here) have more detailed explanations of the concepts in VAEs. I suggest reading them.


Posted 2018-09-24T15:21:47.983


Thank you for your time and your reply. I keep studying, and your answer gives me a good example.

Still, there is one thing I don't get, and it confuses me a lot.

In many tutorials you see this:

kl_loss = - 0.5 * mean(1 + z_log_sigma - square(z_mean) - exp(z_log_sigma), axis=-1)

Isn't this the KL divergence between the mu and sigma matrices? Or do I need to take a math course as well? Thanks – Stenga – 2018-09-25T10:27:00.790

It's not. The equation you mentioned calculates the KL divergence between the Gaussian parameterized by z_mean and z_log_sigma and the standard Gaussian. You can find the equation here: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions

– user12075 – 2018-09-25T16:41:11.490