To build a common mental image of the AE and VAE, please take a look at this answer first.

Let's go through this "why not?" thought process step by step:

**Why not deterministic?** Let's directly encode the latent vector $z$ as a layer of the neural network. But then $z$ would be deterministic: a fixed input $x$ always produces the same latent vector $z$, so $z$ would not have a distribution $q(z|x)$. This is the ordinary auto-encoder. Sampling $z$ from $q(z|x)$ means that if we feed in the same input $x$ twice, we should get two different values of $z$; deterministic computations (layers) cannot achieve this. Therefore, we need to *inject* a random element into the calculation of $z$, otherwise the same $x$ always gives the same $z$. Note that getting a different $z$ for a different training point $x$ only implies the existence of the marginal distribution $p(z)$, which exists for the ordinary auto-encoder too, not the existence of the conditional distribution $q(z|x)$.
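As a minimal sketch of this point (the encoder weights below are made up for illustration), a purely deterministic encoder layer maps the same $x$ to the same $z$ every time, so there is nothing to sample:

```python
import numpy as np

# Hypothetical one-layer "encoder": a fixed deterministic map from x to z.
W = np.array([[0.5, -0.3],
              [0.2,  0.8]])

def encode(x):
    # Deterministic computation: no randomness anywhere in the layer.
    return np.tanh(W @ x)

x = np.array([1.0, 2.0])
z1 = encode(x)
z2 = encode(x)
# The same x always yields the same z, so there is no q(z|x) to sample from.
assert np.allclose(z1, z2)
```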

**Why not only mean?** OK, let's add these random elements to a layer to get our beloved probabilistic $z$; call this layer $\mu$. Suppose $\mu$ is calculated from the previous layer $y$ as $\mu=\mbox{tanh}(y)$. We then draw a random element $\epsilon_d \sim N(0,1)$ per dimension of $\mu$ and calculate $z= \mu + \epsilon$, which is the same as sampling $z_d \sim N(\mu_d, 1)$. Now our $z$ is probabilistic, and we have gotten rid of the standard deviation $\sigma$ and its regularization altogether; we only need to regularize the mean $\mu$ toward $0$.
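The mean-only sampling above can be sketched as follows (the previous-layer activations $y$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_layer(y):
    # Mean computed deterministically from the previous layer's activations y.
    return np.tanh(y)

y = np.array([0.3, -1.2, 0.7])
mu = mean_layer(y)

# Inject unit-variance noise per dimension: z = mu + eps, with eps_d ~ N(0, 1).
eps = rng.standard_normal(mu.shape)
z = mu + eps
# Distributionally this is identical to sampling z_d ~ N(mu_d, 1):
# the same x gives the same mu, but a different z on each forward pass.
```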

**Why not only random elements?** Let's go one step further: throw $\mu$ away too and set $z_d = \epsilon_d \sim N(0,1)$, getting rid of all the regularizations. But wait a minute! Now $z$ is completely disconnected from the previous layers, so no information is delivered from the input $x$ to the latent variable $z$.

Going from the VAE with two parameters $(\mu, \sigma)$ to the second case with one parameter $\mu$, and then to the third with zero parameters, is the same as saying: instead of using a parameter $w$ and regularizing it via $\parallel w \parallel^2$, let's set $w=0$ and get rid of the regularization. Even though we want parameters to be *close* to zero, we still want them to be nonzero and carry information. The same goes for $(\mu, \sigma)$ in a VAE: we want them to carry information, although being closer to $(0, 1)$ is favorable. $(\mu, \sigma)$ are two channels through which information is distilled from $x$ to $z$. We could even add a third variable (channel) to our distribution if the complexity-performance trade-off were favorable. It is safe to say that researchers have experimented with only $\mu$ and observed that adding another channel for the variance is beneficial, so here we are with two channels $(\mu, \sigma)$ and a crucial random vector $\epsilon$.
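The full two-channel version is the standard reparameterization trick, $z = \mu + \sigma \odot \epsilon$. A minimal sketch (the encoder outputs below are made up for illustration; parameterizing $\log \sigma^2$ rather than $\sigma$ directly is a common practical choice, not something the argument above requires):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input x: two channels per dimension.
mu = np.array([0.4, -0.1])         # mean channel
log_var = np.array([-1.0, 0.5])    # log-variance channel (keeps sigma > 0)
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so z ~ N(mu, diag(sigma^2)) while gradients can flow through mu and sigma.
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps
```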

**Why would VAE work better than AE?**

The main difference the VAE introduces compared to the AE is that $z$ is now only loosely (probabilistically) related to $x$. This creates an additional **regularization effect** that throws away some information about $x$ by not obeying the exact network computations for $z$; this extra regularization can improve performance in practice. Of course, this is not a complete dominance: in some tasks the extra regularization may work against the VAE.

In my opinion, the whys can only be answered this far, as they reach a level similar to "full Bayesian vs. regularized least squares regression".

Thank you but I guess my question is really this: why do we need z to be random given x? If z is a deterministic function f of x, and f is such that q(z) = q[f(x)] is a N(0,1), then we have achieved our goal, i.e. we can sample z in N(0,1) and decode it as x=(f-1)(z). So it would seem natural to make a standard auto-encoder with the additional condition that the empirical distribution of z over the training set is as close as possible to a N(0,1). Instead, VAE make z a stochastic function of x and I guess what I don't understand is why this should work better. – Antoine Savine – 2019-03-16T12:25:36.733

Maybe it was just shown to work better in practice but I don't get the theoretical point behind it, if any? – Antoine Savine – 2019-03-16T12:27:18.773

@AntoineSavine I added updates for the "why" – Esmailian – 2019-03-16T13:03:02.563

Thank you @esmailian – Antoine Savine – 2019-03-16T15:04:07.233