## Transform an Autoencoder to a Variational Autoencoder?


I would like to compare training an autoencoder and a variational autoencoder. I have already run the training using the AE. I would like to know if it's possible to transform this AE into a VAE while keeping the same inputs and outputs.

Thank you.


Yes. Two changes are required to convert an AE to a VAE, and they shed light on the differences between the two. Note that converting an already-trained AE to a VAE requires re-training, because of the following changes to the structure and the loss function.

The network of an AE can be represented as $$x \overbrace{\rightarrow .. \rightarrow y \overset{f}{\rightarrow}}^{\mbox{encoder}} z \overbrace{\rightarrow .. \rightarrow}^{\mbox{decoder}}\hat{x},$$

where

1. $$x$$ denotes the input (vector, matrix, etc.) to the network, and $$\hat{x}$$ denotes the output (reconstruction of $$x$$),
2. $$z$$ denotes the latent output, calculated from the previous layer $$y$$ as $$z=f(y)$$,
3. $$f$$, $$g$$, and $$h$$ denote non-linear functions such as $$f(y) = \mbox{sigmoid}(Wy+B)$$, $$\mbox{ReLU}$$, $$\mbox{tanh}$$, etc.
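The AE forward pass above can be sketched in a few lines of numpy. This is an illustrative toy, not the asker's network: the layer sizes and the choice of $$\mbox{tanh}$$ for the hidden layer and sigmoid for $$f$$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy dimensions (hypothetical): input 8, hidden layer y 4, latent z 2.
W_enc, b_enc = rng.normal(size=(4, 8)), np.zeros(4)   # x -> y
W_f,   b_f   = rng.normal(size=(2, 4)), np.zeros(2)   # y -> z = f(y)
W_dec, b_dec = rng.normal(size=(8, 2)), np.zeros(8)   # z -> x_hat

x = rng.normal(size=8)
y = np.tanh(W_enc @ x + b_enc)          # encoder hidden layer
z = sigmoid(W_f @ y + b_f)              # deterministic latent: z = f(y)
x_hat = W_dec @ z + b_dec               # decoder reconstruction
recon_loss = np.mean((x - x_hat) ** 2)  # L_AE(x, x_hat), here mean squared error
```

Note that $$z$$ here is a deterministic function of $$x$$: feeding the same $$x$$ always yields the same $$z$$. That is exactly what the first change below removes.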

These two changes are:

1. Structure: we need to add a layer between $$y$$ and $$z$$. This new layer represents the mean $$\mu=g(y)$$ and standard deviation $$\sigma=h(y)$$ of Gaussian distributions. Both $$\mu$$ and $$\sigma$$ must have the same dimension as $$z$$. Every dimension $$d$$ of these vectors corresponds to a Gaussian distribution $$N(\mu_d, \sigma_d^2)$$, from which $$z_d$$ is sampled. That is, for each input $$x$$ to the network, we take the corresponding $$\mu$$ and $$\sigma$$, pick a random $$\epsilon_d$$ from $$N(0, 1)$$ for every dimension $$d$$, and finally compute $$z=\mu+\sigma \odot \epsilon$$, where $$\odot$$ is the element-wise product. Whereas $$z$$ in the AE was computed deterministically as $$z=f(y)$$, it is now computed probabilistically as $$z=g(y)+h(y)\odot \epsilon$$, i.e. $$z$$ would differ if the same $$x$$ were fed in again. The rest of the network remains unchanged. The network of a VAE can be represented as $$x \overbrace{\rightarrow .. \rightarrow y \overset{g,h}{\rightarrow}(\mu, \sigma) \overset{\mu+\sigma\odot \epsilon}{\rightarrow} }^{\mbox{encoder}} z \overbrace{\rightarrow .. \rightarrow}^{\mbox{decoder}}\hat{x}.$$

2. Objective function: we want to enforce our assumption (prior) that the distribution of factor $$z_d$$ is centered around $$0$$ and has a constant variance (this assumption is equivalent to parameter regularization). To this end, we add a penalty per dimension $$d$$ that punishes any deviation of the latent distribution $$q(z_d|x) = N(\mu_d, \sigma_d^2)= N(g_d(y), h_d(y)^2)$$ from the unit Gaussian $$p(z_d)=N(0, 1)$$. In practice, the KL-divergence is used for this penalty. In the end, the loss function of the VAE becomes: $$L_{VAE}(x,\hat{x},\mu,\sigma) = L_{AE}(x, \hat{x}) + \overbrace{\frac{1}{2} \sum_{d=1}^{D}(\mu_d^2 + \sigma_d^2 - 2\mbox{log}\sigma_d - 1)}^{KL(q \parallel p)}$$ where $$D$$ is the dimension of $$z$$.
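Both changes can be sketched together in numpy: a reparameterized latent layer plus the KL penalty from the loss above. The linear parametrization of $$g$$ and $$h$$ (and predicting $$\mbox{log}\sigma$$ instead of $$\sigma$$, as in the side notes below) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_latent_and_kl(y, W_g, b_g, W_h, b_h):
    """Reparameterized sampling z = mu + sigma * eps, plus KL(q || p)."""
    mu = W_g @ y + b_g                     # mu = g(y)
    log_sigma = W_h @ y + b_h              # h(y) predicts log(sigma) for stability
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)    # eps_d ~ N(0, 1) per dimension
    z = mu + sigma * eps                   # element-wise product
    # KL(q || p) = 1/2 * sum_d (mu_d^2 + sigma_d^2 - 2*log(sigma_d) - 1)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * log_sigma - 1.0)
    return z, kl

# Toy call with hypothetical shapes: hidden y of size 4, latent z of size 2.
y = np.tanh(rng.normal(size=4))
W_g, b_g = rng.normal(size=(2, 4)), np.zeros(2)
W_h, b_h = 0.1 * rng.normal(size=(2, 4)), np.zeros(2)
z, kl = vae_latent_and_kl(y, W_g, b_g, W_h, b_h)
```

Adding `kl` to the usual reconstruction loss gives $$L_{VAE}$$; each per-dimension term is zero exactly when $$\mu_d=0$$ and $$\sigma_d=1$$, i.e. when $$q$$ matches the prior.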

Side notes

1. In practice, since $$\sigma_d$$ can get very close to $$0$$, $$\mbox{log}\sigma_d$$ in the objective function can explode to large values, so we let the network generate $$\sigma'_d = \mbox{log}\sigma_d = h_d(y)$$ instead, and then use $$\sigma_d = \mbox{exp}(h_d(y))$$. This way, both $$\sigma_d=\mbox{exp}(h_d(y))$$ and $$\mbox{log}\sigma_d=h_d(y)$$ are numerically stable.
2. The name "variational" comes from the fact that we assumed (1) each latent factor $$z_d$$ is independent of the other factors, i.e. we ignore the other $$(\mu_{d'}, \sigma_{d'})_{d' \neq d}$$ when we sample $$z_d$$, and (2) $$z_d$$ follows a Gaussian distribution. In other words, $$q(z|x)$$ is a simplified variation of the true (and probably more complex) distribution $$p(z|x)$$.
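As a quick sanity check of the closed-form penalty above, we can compare it against a Monte Carlo estimate of $$KL(q \parallel p)$$ for a single dimension; the particular values of $$\mu$$ and $$\sigma$$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.7, 1.5

# Closed-form per-dimension KL from the loss above:
kl_closed = 0.5 * (mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)

# Monte Carlo estimate of E_q[log q(z) - log p(z)],
# with q = N(mu, sigma^2) and p = N(0, 1).
z = mu + sigma * rng.standard_normal(200_000)
log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
kl_mc = np.mean(log_q - log_p)
```

The two values agree to within Monte Carlo noise, which supports the closed-form expression used in the loss.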

Thank you for this explanation! – Kahina – 2019-03-12T09:12:30.060

Can we use any h(y) and g(y) for this? Also, you convert it to a normal distribution by multiplying by an epsilon term which is normal... how does that work? – DuttaA – 2019-03-12T20:02:24.617

@DuttaA Yes, the same as other layers, no constraints. Of course, in practice, one type of non-linear function may perform better than the others. Yes, u + s x N(0, 1) gives N(u, s^2), no trick. – Esmailian – 2019-03-12T20:07:09.703

So if I understand correctly, $$\mu_d$$ and $$\sigma_d$$ are just our assumption that the underlying variable is normal, which we enforce by multiplying with N(0,1), and finally to make it constant in a dimension we use the KL loss. – DuttaA – 2019-03-14T19:26:25.953

Why do we want the underlying distribution centred around 0? I mean, if we use a sigmoid for g(y), isn't it impossible to make it centred around 0? – DuttaA – 2019-03-14T19:38:03.303

@DuttaA No problem: N(0, 1) can produce large negative values that make the total negative. But, again, tanh(.) may perform better in practice. – Esmailian – 2019-03-14T19:44:58.803


Yes, it is possible. You need to:

1. Convert the bottleneck into a stochastic bottleneck. In VAEs, the bottleneck values are not deterministically generated by the encoder. Instead, the encoder generates the parameters defining some random variables. These random variables normally follow independent Gaussian distributions, and the encoder generates a vector with the means and the standard deviations of the Gaussians. To connect this output to the decoder, you first sample noise from a standard Normal distribution $$\mathcal{N}(0, 1)$$, multiply it by the standard deviation $$\sigma$$, and add the mean $$\mu$$; this is called the reparameterization trick. The result is then passed to the decoder, which remains the same as before.

2. Add a KL divergence term to the reconstruction loss. With a VAE, you impose a prior distribution on your latent variables. This prior is normally a standard Normal distribution $$\mathcal{N}(0, 1)$$. You need to add an extra term to the loss to make the stochastic bottleneck similar to the prior, normally the KL divergence between both. The derivation of the KL divergence between two Gaussians can be found here. Taking into account that the prior has $$\sigma=1$$ and $$\mu=0$$, the expression is $$\frac{1}{2}\sum_i \left(\sigma_i^2 + \mu_i^2 - 2\log(\sigma_i) - 1\right)$$.

Note: Given the logarithm of $$\sigma$$ in the KL divergence, it may be a good idea to have the stochastic bottleneck generate the log standard deviation rather than the standard deviation itself, in order to improve numerical stability.
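A small numpy sketch of this note (the helper name `kl_term` is hypothetical): letting the network output $$s = \log\sigma$$ keeps the KL term finite even when $$\sigma$$ is effectively zero, since $$\log\sigma$$ is used directly and never recomputed from a tiny $$\sigma$$.

```python
import numpy as np

def kl_term(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over dimensions, from log(sigma)."""
    sigma2 = np.exp(2.0 * log_sigma)  # sigma^2; no log(0) can ever occur
    return 0.5 * np.sum(mu**2 + sigma2 - 2.0 * log_sigma - 1.0)

mu = np.zeros(3)
kl_at_prior = kl_term(mu, np.zeros(3))          # sigma = 1: matches the prior, KL = 0
kl_tiny_sigma = kl_term(mu, np.full(3, -30.0))  # sigma ~ 1e-13: large but finite
```

Had the network output $$\sigma$$ directly, a near-zero value would push $$\log\sigma$$ toward $$-\infty$$ and destabilize training; with the log parameterization the penalty just grows smoothly.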

Thanks for your answer – Kahina – 2019-03-12T09:12:55.623