Variational auto-encoders (VAE): why the random sample?


Why do people train variational auto-encoders (VAE) to encode means and variances (regularised towards 0 and 1), and then sample a random Gaussian, rather than simply encode latent vectors and regularise them to follow a standard $N(0, I)$, which would seem a more natural choice?

Antoine Savine

Posted 2019-03-15T23:54:04.297

Reputation: 181

Answers


To have a common mental image of the AE and the VAE, please take a look at this answer first.

Let's go through this "why not?" thought process step by step:

  1. Why not deterministic? Let's directly encode the latent vector $z$ inside a layer of the neural network. But this way, $z$ would be deterministic, meaning a fixed input $x$ always produces a fixed latent vector $z$; thus, $z$ would not have a distribution $q(z|x)$. This is the ordinary auto-encoder. Sampling $z$ from $q(z|x)$ means that if we try the same input $x$ twice, we should get two different values for $z$. Deterministic computations (layers) cannot achieve this, therefore we need to inject a random element into the calculation of $z$; otherwise the same $x$ always gives the same $z$. Note that getting a different $z$ for a different training point $x$ implies the existence of the distribution $p(z)$, which exists for the ordinary auto-encoder too, not the existence of the conditional distribution $q(z|x)$.

  2. Why not only the mean? OK, let's add these random elements to a layer to get our beloved probabilistic $z$; let's call this layer $\mu$. Suppose $\mu$ is calculated from the previous layer $y$ as $\mu=\mbox{tanh}(y)$; we then draw random elements $\epsilon_d \sim N(0,1)$ per dimension of $\mu$ and calculate $z = \mu + \epsilon$, which is the same as sampling $z_d \sim N(\mu_d, 1)$. Now our $z$ is probabilistic, and we have got rid of the standard deviation $\sigma$ and its regularization altogether; we only need to regularize the mean $\mu$ towards $0$.

  3. Why not only random elements? Let's go one step further: throw $\mu$ away too, set $z_d = \epsilon_d \sim N(0,1)$, and get rid of all the regularizations. But wait a minute! Now $z$ is completely disconnected from the previous layers, so no information is delivered from the input $x$ to the latent variable $z$. (A minimal sketch of all three steps follows right after this list.)
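Here is a minimal sketch of the three steps above, assuming a toy PyTorch encoder (the layer sizes and names are illustrative, not from the answer):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 8)        # a fixed input x
enc = torch.nn.Linear(8, 4)  # stand-in for the previous encoder layers

y = enc(x)

# Step 1 -- deterministic (ordinary AE): the same x always gives the same z.
z_det = torch.tanh(y)

# Step 2 -- mean only: mu = tanh(y) with fixed unit variance.
# z = mu + eps is a sample from N(mu, I); re-running gives a different z.
mu = torch.tanh(y)
eps = torch.randn_like(mu)
z_mean_only = mu + eps

# Step 3 -- no parameters at all: z = eps ~ N(0, I).
# z is now disconnected from x, so nothing about x reaches the decoder.
z_disconnected = torch.randn_like(mu)
```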

Going from the VAE with two parameters $(\mu, \sigma)$ to (2) with one parameter $\mu$, and then to (3) with zero parameters, is the same as saying: instead of using a parameter $w$ and regularizing it via $\parallel w \parallel^2$, let's set $w=0$ and get rid of the regularization. Even though we want parameters to be close to zero, we still want them to be non-zero and to carry information. The same goes for $(\mu, \sigma)$ in the VAE: we want them to carry information, although being closer to $(0, 1)$ is favorable. $(\mu, \sigma)$ are two channels through which information is distilled from $x$ to $z$. We may even add a third variable (channel) to our distribution if the complexity-performance trade-off is favorable. It is safe to say that researchers first experimented with only $\mu$ and observed that adding another channel for the variance was beneficial, so here we are with two channels $(\mu, \sigma)$ and a crucial random vector $\epsilon$.
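Concretely, in the standard Gaussian VAE this pull towards $(0, 1)$ is the closed-form KL divergence between the encoder's diagonal Gaussian and the $N(0, I)$ prior, summed over the dimensions of $z$:

$$\mathrm{KL}\big(N(\mu, \sigma^2) \,\|\, N(0, I)\big) = \frac{1}{2}\sum_d \left(\mu_d^2 + \sigma_d^2 - \log \sigma_d^2 - 1\right)$$

Each summand is minimized at $\mu_d = 0$, $\sigma_d = 1$, which is exactly why these are the values the two channels are regularized towards.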

Why would a VAE work better than an AE?

The main difference a VAE introduces compared to an AE is that $z$ is now only loosely (probabilistically) related to $x$. This creates an additional regularization effect that throws some information about $x$ away, by not obeying the exact network computations for $z$; this extra regularization can improve performance in practice. Of course, this is not a complete dominance: in some tasks, this extra regularization may work against the VAE.
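As a rough sketch of this difference (the function names, the `decoder` argument, and the weight `beta` are illustrative assumptions, not part of the answer), the two training objectives differ only in the sampling step and the KL term:

```python
import torch
import torch.nn.functional as F

def ae_loss(decoder, x, z):
    # Ordinary AE: reconstruct from the deterministic code z.
    return F.mse_loss(decoder(z), x)

def vae_loss(decoder, x, mu, log_var, beta=1.0):
    # VAE: draw a reparameterized sample z = mu + sigma * eps, eps ~ N(0, I) ...
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    recon = F.mse_loss(decoder(z), x)
    # ... and add the KL term pulling (mu, sigma) towards (0, 1).
    kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1.0)
    return recon + beta * kl
```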

In my opinion, this is about as far as the why's can be answered; beyond this point they reach a level similar to "full Bayesian vs. regularized least-squares regression".

Esmailian

Posted 2019-03-15T23:54:04.297

Reputation: 7 434

Thank you, but I guess my question is really this: why do we need $z$ to be random given $x$? If $z$ is a deterministic function $f$ of $x$, and $f$ is such that $q(z) = q[f(x)]$ is $N(0,1)$, then we have achieved our goal, i.e. we can sample $z \sim N(0,1)$ and decode it as $x = f^{-1}(z)$. So it would seem natural to make a standard auto-encoder with the additional condition that the empirical distribution of $z$ over the training set is as close as possible to $N(0,1)$. Instead, VAEs make $z$ a stochastic function of $x$, and I guess what I don't understand is why this should work better. – Antoine Savine – 2019-03-16T12:25:36.733

Maybe it was just shown to work better in practice, but I don't get the theoretical point behind it, if any? – Antoine Savine – 2019-03-16T12:27:18.773

@AntoineSavine I added updates for the "why" – Esmailian – 2019-03-16T13:03:02.563

Thank you @esmailian – Antoine Savine – 2019-03-16T15:04:07.233