Yes. Two changes are required to convert an AE to a VAE, and they shed light on the differences between the two as well. Note that converting an already-trained AE to a VAE requires re-training, because of the following changes in the structure and the loss function.

**Network of AE** can be represented as $$x \overbrace{\rightarrow \cdots \rightarrow y \overset{f}{\rightarrow}}^{\mbox{encoder}} z \overbrace{\rightarrow \cdots \rightarrow}^{\mbox{decoder}}\hat{x},$$

where

- $x$ denotes the input (vector, matrix, etc.) to the network, and $\hat{x}$ denotes the output (reconstruction of $x$),
- $z$ denotes the latent output that is calculated from its previous layer $y$ as $z=f(y)$,
- $f$, $g$, and $h$ denote non-linear functions such as $f(y) = \mbox{sigmoid}(Wy+B)$, where the non-linearity could also be $\mbox{ReLU}$, $\mbox{tanh}$, etc. ($g$ and $h$ are introduced below).
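
To make the notation concrete, here is a minimal sketch of such an AE in PyTorch; the layer sizes (784, 128, 32 for $x$, $y$, $z$) and the choice of ReLU/sigmoid non-linearities are hypothetical choices of mine, not from the original:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, x_dim=784, y_dim=128, z_dim=32):
        super().__init__()
        # encoder: x -> ... -> y -> z, with z = f(y)
        self.to_y = nn.Sequential(nn.Linear(x_dim, y_dim), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(y_dim, z_dim), nn.Sigmoid())
        # decoder: z -> ... -> x_hat
        self.decoder = nn.Sequential(nn.Linear(z_dim, y_dim), nn.ReLU(),
                                     nn.Linear(y_dim, x_dim))

    def forward(self, x):
        y = self.to_y(x)
        z = self.f(y)              # deterministic latent code
        return self.decoder(z)     # x_hat
```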

These two changes are:

**Structure**: we need to add a layer between $y$ and $z$. This new layer represents the mean $\mu=g(y)$ and the standard deviation $\sigma=h(y)$ of Gaussian distributions. Both $\mu$ and $\sigma$ must have the same dimension as $z$. Every dimension $d$ of these vectors corresponds to a Gaussian distribution $N(\mu_d, \sigma_d^2)$, from which $z_d$ is sampled. That is, for each input $x$ to the network, we take the corresponding $\mu$ and $\sigma$, pick a random $\epsilon_d$ from $N(0, 1)$ for every dimension $d$, and finally compute $z=\mu+\sigma \odot \epsilon$, where $\odot$ is the element-wise product. In comparison, $z$ in the AE was computed deterministically as $z=f(y)$; now it is computed probabilistically as $z=g(y)+h(y)\odot \epsilon$, i.e. $z$ would be different if the same $x$ were fed in again. The rest of the network remains unchanged.

**Network of VAE** can be represented as $$x \overbrace{\rightarrow \cdots \rightarrow y \overset{g,h}{\rightarrow}(\mu, \sigma) \overset{\mu+\sigma\odot \epsilon}{\rightarrow} }^{\mbox{encoder}} z \overbrace{\rightarrow \cdots \rightarrow}^{\mbox{decoder}}\hat{x}.$$
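
A minimal sketch of this new layer in PyTorch; the module name `GaussianLatent` and the use of Softplus to keep $\sigma$ positive are my own assumptions (the more common $\log\sigma$ trick is covered in the side notes below):

```python
import torch
import torch.nn as nn

class GaussianLatent(nn.Module):
    """Replaces z = f(y) with z = mu + sigma * eps, eps ~ N(0, I)."""
    def __init__(self, y_dim=128, z_dim=32):
        super().__init__()
        self.g = nn.Linear(y_dim, z_dim)                       # mu = g(y)
        self.h = nn.Sequential(nn.Linear(y_dim, z_dim),
                               nn.Softplus())                  # sigma = h(y) > 0

    def forward(self, y):
        mu, sigma = self.g(y), self.h(y)
        eps = torch.randn_like(mu)    # one eps_d ~ N(0, 1) per dimension
        z = mu + sigma * eps          # element-wise product
        return z, mu, sigma           # mu, sigma are needed for the KL term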

**Objective function**: we want to enforce our assumption (prior) that the distribution of each factor $z_d$ is centered around $0$ and has a constant variance (this assumption plays a role similar to parameter regularization). To this end, we add a penalty per dimension $d$ that punishes any deviation of the latent distribution $q(z_d|x) = N(\mu_d, \sigma_d^2) = N(g_d(y), h_d(y)^2)$ from the unit Gaussian $p(z_d)=N(0, 1)$. In practice, the KL-divergence is used for this penalty. In the end, the **loss function** of VAE becomes: $$L_{VAE}(x,\hat{x},\mu,\sigma) = L_{AE}(x, \hat{x}) + \overbrace{\frac{1}{2} \sum_{d=1}^{D}(\mu_d^2 + \sigma_d^2 - 2\log\sigma_d - 1)}^{KL(q \parallel p)},$$
where $D$ is the dimension of $z$.
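
A sketch of this loss in PyTorch; the choice of mean-squared error for $L_{AE}$ is my assumption (binary cross-entropy is another common choice), while the KL term follows the formula above:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, sigma):
    recon = F.mse_loss(x_hat, x, reduction="sum")   # L_AE(x, x_hat), assumed MSE
    # KL(q || p) = 0.5 * sum_d (mu_d^2 + sigma_d^2 - 2*log(sigma_d) - 1)
    kl = 0.5 * torch.sum(mu**2 + sigma**2 - 2.0 * torch.log(sigma) - 1.0)
    return recon + kl
```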

**Side notes**

- In practice, since $\sigma_d$ can get very close to $0$, $\log\sigma_d$ in the objective function can explode to large negative values, so we let the network generate $\sigma'_d = \log\sigma_d = h_d(y)$ instead, and then use $\sigma_d = \exp(h_d(y))$. This way, both $\sigma_d=\exp(h_d(y))$ and $\log\sigma_d=h_d(y)$ are numerically stable (see the sketch after this list).
- The name "variational" comes from the fact that we assumed (1) each latent factor $z_d$ is independent of the other factors, i.e. we ignore the other $(\mu_{d'}, \sigma_{d'})_{d' \neq d}$ when we sample $z_d$, and (2) $z_d$ follows a Gaussian distribution. In other words, $q(z|x)$ is a *simplified variation* of the true (and probably more complex) distribution $p(z|x)$.
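
Here is a sketch of the $\log\sigma$ parameterization from the first side note (again PyTorch; the function names are hypothetical):

```python
import torch

def sample_z(mu, log_sigma):
    """z = mu + exp(log_sigma) * eps; exp keeps sigma positive for any h(y)."""
    return mu + torch.exp(log_sigma) * torch.randn_like(mu)

def kl_term(mu, log_sigma):
    # log(sigma) is the raw network output h(y), so it stays finite
    # even when sigma = exp(log_sigma) is very close to 0
    return 0.5 * torch.sum(mu**2 + torch.exp(2.0 * log_sigma)
                           - 2.0 * log_sigma - 1.0)
```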

Thank you for this explanation! – Kahina – 2019-03-12T09:12:30.060

Can we use any h(y) and g(y) for this? Also, you convert it to normal by multiplying with an epsilon term which is normal... how does that work? – DuttaA – 2019-03-12T20:02:24.617

@DuttaA Yes, the same as other layers, no constraints. Of course, in practice, one type of non-linear function may perform better than the others. And yes, u + s x N(0, 1) gives N(u, s^2), no trick. – Esmailian – 2019-03-12T20:07:09.703

So if I understand correctly, \mu_d and \sigma_d are just our assumption that the underlying variable is normal, which we enforce by multiplying with N(0, 1), and finally, to make it constant in a dimension, we use the KL loss. – DuttaA – 2019-03-14T19:26:25.953

Why do we want the underlying distribution centred around 0? I mean, if we use g(y) as a sigmoid, isn't it impossible to make it centred around 0? – DuttaA – 2019-03-14T19:38:03.303

@DuttaA No problem: N(0, 1) can produce large negative values to make the total negative. But, again, tanh(.) may perform better in practice. – Esmailian – 2019-03-14T19:44:58.803