## Intractability in Variational Autoencoders

I'm having difficulty understanding when integrals are intractable in variational inference problems.

In a variational autoencoder with observation $$x$$ and latent variable $$z$$, we want to maximize the data likelihood $$p_\theta (x) = \prod_{i=1}^N p_\theta (x_i)$$, where the marginal for a single observation is

$$p_\theta (x) = \int p_{\theta_1} (z) p_{\theta_2} (x|z) dz$$

We know $$p_{\theta_1} (z)$$ (usually a Gaussian with mean $$\mu$$ and covariance matrix $$\Sigma$$), and we also know $$p_{\theta_2} (x|z)$$, which is usually a Gaussian distribution with mean $$\mu_z$$ and covariance matrix $$\Sigma_z$$ modeled by a neural network. We want to maximize the likelihood with respect to $$\theta = \{\theta_1, \theta_2\}$$. I understand that it is not possible to compute this integral analytically in order to optimize for $$\theta$$. It is, however, possible to obtain a sample-based approximation of the integral for a stochastic gradient ascent update. Denote by $$z_l$$ samples from $$p_{\theta_1}(z)$$ and write the gradient of the log-likelihood:

$$\nabla_{\theta_1} \log p_\theta (x_i) = \frac{\int \nabla_{\theta_1} p_{\theta_1}(z) p_{\theta_2}(x_i|z)dz}{\int p_{\theta_1}(z) p_{\theta_2}(x_i|z)dz} \approx \frac{\frac{1}{L} \sum_{l=1}^L p_{\theta_2}(x_i|z_l) \nabla_{\theta_1} \log p_{\theta_1}(z_l)}{\frac{1}{L} \sum_{l=1}^L p_{\theta_2}(x_i|z_l)}$$

where we used the following general property: $$p_\theta (x) \nabla_\theta \log p_\theta (x) = \nabla_\theta p_\theta (x)$$. As I understand, we don't like this approach because it requires sampling $$L$$ samples $$z_l$$ for every observation which is inefficient. I'm confused to see that this kind of sampling is still used later.
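To make the sample-based approximation concrete, here is a minimal sketch in plain Python. It assumes a toy 1D model that is not from the paper: prior $$z \sim N(0, 1)$$ and likelihood $$x|z \sim N(z, \sigma^2)$$, chosen only because the marginal $$N(0, 1 + \sigma^2)$$ is then known in closed form and the estimate can be checked. The names `normal_pdf` and `naive_marginal_estimate` are hypothetical helpers.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def naive_marginal_estimate(x, num_samples, obs_std, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x|z)] with samples from the prior N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)              # z_l ~ p(z), toy prior (assumption)
        total += normal_pdf(x, z, obs_std)   # p(x | z_l), toy likelihood (assumption)
    return total / num_samples

rng = random.Random(0)
x_obs, obs_std = 2.0, 0.5
# Closed-form marginal, available only because the toy model is linear-Gaussian.
exact = normal_pdf(x_obs, 0.0, math.sqrt(1.0 + obs_std ** 2))
estimate = naive_marginal_estimate(x_obs, 20000, obs_std, rng)
print(f"exact p(x) = {exact:.5f}, Monte Carlo estimate = {estimate:.5f}")
```

The estimator is unbiased, but most prior samples land where $$p(x|z_l)$$ is small, so in higher dimensions or with a sharper likelihood many more samples are needed for the same accuracy.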

Now, we assume to have an encoder network $$q_\phi (z|x)$$ (again typically a Gaussian) and write the following objective:

$$E_{z \sim q_\phi(z|x)} \big[ \log p_{\theta_2} (x|z) \big] - D_{KL}(q_\phi(z|x)||p_{\theta_1}(z))$$.
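As a sanity check on this objective, the sketch below evaluates it in a toy 1D setting. Everything model-specific here is an assumption for illustration and not from the paper: prior $$N(0, 1)$$, likelihood $$N(z, \sigma^2)$$, and a Gaussian $$q(z|x) = N(m, s^2)$$ with hand-picked $$m, s$$ instead of an encoder network. The function names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def kl_to_standard_normal(m, s):
    """Closed-form KL( N(m, s^2) || N(0, 1) )."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))

def elbo_estimate(x, m, s, obs_std, num_samples, rng):
    """Monte Carlo reconstruction term minus the analytic KL term."""
    recon = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, s)                     # z ~ q(z|x)
        recon += log_normal_pdf(x, z, obs_std)  # log p(x|z)
    return recon / num_samples - kl_to_standard_normal(m, s)

rng = random.Random(0)
x_obs, obs_std = 2.0, 0.5
# Exact log-marginal and exact posterior exist only for this linear-Gaussian toy model.
log_px = log_normal_pdf(x_obs, 0.0, math.sqrt(1.0 + obs_std ** 2))
post_mean = x_obs / (1.0 + obs_std ** 2)
post_std = math.sqrt(obs_std ** 2 / (1.0 + obs_std ** 2))

elbo_exact_q = elbo_estimate(x_obs, post_mean, post_std, obs_std, 20000, rng)
elbo_prior_q = elbo_estimate(x_obs, 0.0, 1.0, obs_std, 20000, rng)
print(f"log p(x) = {log_px:.3f}")
print(f"objective with q = exact posterior: {elbo_exact_q:.3f}")
print(f"objective with q = prior:           {elbo_prior_q:.3f}")
```

In this toy case the objective matches $$\log p(x)$$ when $$q$$ equals the true posterior and falls below it otherwise, which is the usual lower-bound behaviour.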

In the paper, the authors say that it is possible to calculate gradient updates for $$\theta$$, and they use a reparametrization trick to calculate updates for $$\phi$$. I tried to derive the updates for $$\theta$$ below. To optimize with respect to $$\theta_1$$ we can sample $$z_l$$ from $$q_\phi(z|x_i)$$:

$$\nabla_{\theta_1} \Big(\int q_\phi(z|x_i) [\log p_{\theta_1} (z) + \log p_{\theta_2} (x_i|z)] dz \Big) = \int q_\phi(z|x_i) \frac{\nabla_{\theta_1}p_{\theta_1}(z)} {p_{\theta_1}(z)} dz \approx \frac{1}{L} \sum_{l=1}^L \frac{\nabla_{\theta_1}p_{\theta_1}(z_l)} {p_{\theta_1}(z_l)}$$

and similarly for $$\theta_2$$

$$\nabla_{\theta_2} \Big(\int q_\phi(z|x_i) [\log p_{\theta_1} (z) + \log p_{\theta_2} (x_i|z)] dz \Big) = \int q_\phi(z|x_i) \frac{\nabla_{\theta_2}p_{\theta_2} (x_i|z)} {p_{\theta_2} (x_i|z)} dz \approx \frac{1}{L} \sum_{l=1}^L \frac{\nabla_{\theta_2}p_{\theta_2} (x_i|z_l)} {p_{\theta_2} (x_i|z_l)}$$
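The $$\theta_2$$ estimator can be checked numerically, using $$\nabla p / p = \nabla \log p$$. The sketch below is only an illustration under assumptions that are not from the paper: a 1D decoder $$p(x|z) = N(wz, \sigma^2)$$ with a single scalar weight $$w$$ playing the role of $$\theta_2$$, and a fixed Gaussian $$q(z|x) = N(m, s^2)$$. The function names are hypothetical.

```python
import random

def grad_w_log_likelihood(x, z, w, obs_std):
    """d/dw log N(x; w*z, obs_std^2) = (x - w*z) * z / obs_std^2."""
    return (x - w * z) * z / obs_std ** 2

def decoder_gradient_estimate(x, w, m, s, obs_std, num_samples, rng):
    """(1/L) * sum_l grad_w log p(x | z_l), with z_l ~ q(z|x) = N(m, s^2)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, s)
        total += grad_w_log_likelihood(x, z, w, obs_std)
    return total / num_samples

rng = random.Random(0)
x_obs, w, m, s, obs_std = 2.0, 1.0, 1.5, 0.5, 0.5
# Closed form of E_q[(x - w z) z] / obs_std^2, using E[z] = m and E[z^2] = m^2 + s^2.
exact_grad = (x_obs * m - w * (m ** 2 + s ** 2)) / obs_std ** 2
mc_grad = decoder_gradient_estimate(x_obs, w, m, s, obs_std, 20000, rng)
print(f"exact gradient = {exact_grad:.3f}, Monte Carlo gradient = {mc_grad:.3f}")
```

Because the samples come from $$q(z|x_i)$$, which is fitted to each observation, they tend to fall where $$p(x_i|z)$$ is not negligible, unlike samples drawn from the prior.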

I don't understand why the last two updates are ok but the original update is not. I'd appreciate it if you'd point out my mistakes/misinterpretation.

I believe that you got bogged down by this thought:

> I understand that it is not possible to analytically compute this integral to optimize for $$\theta$$.

From the paper I understood that:

$$p_\theta (x) = \int p_{\theta} (z) p_{\theta} (x|z) dz$$

can be solved analytically if we simplify $$p_{\theta} (x|z)$$, i.e. in the case where Bayes' rule gives an easy posterior. The integral is instead intractable if the posterior obtained from $$p_{\theta} (x|z)$$ via Bayes' rule turns out to be a very complex function, e.g. when $$p_{\theta}(z|x)$$ is very complex. In general, integrating a simple distribution is easy, whereas conditional distributions can be (very) complex to integrate.
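To spell out the link between the marginal and the posterior, here is Bayes' rule in the notation above (the standard identity, nothing specific to the paper):

$$p_\theta(z|x) = \frac{p_\theta(z)\, p_\theta(x|z)}{p_\theta(x)} = \frac{p_\theta(z)\, p_\theta(x|z)}{\int p_\theta(z')\, p_\theta(x|z')\, dz'}$$

The normalizer of the posterior is exactly the integral in question, so having one of them in closed form gives the other.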

The reparametrization trick, on the other hand, is the main point of the paper. They do not try to compute the integrals as you derived them; instead they attempt to replace every $$(z|x)$$ with a function $$g_{\phi}$$ (parametrized by $$\phi$$) which builds $$z$$ from $$x$$ and $$\epsilon$$.

The idea is that we choose $$p(\epsilon)$$ and, instead of estimating $$q_{\phi}(z|x)$$ directly, we estimate $$\phi$$ in $$g_{\phi}(\epsilon, x)$$. Estimating a value for the parameters of a vector function with SGD is not difficult.
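A minimal sketch of that idea in plain Python follows. It assumes the common Gaussian form $$z = m + s \epsilon$$ with $$\epsilon \sim N(0, 1)$$, with $$m$$ and $$s$$ standing in for the encoder outputs at a fixed $$x$$, and a toy objective $$f(z) = \log N(x; z, \sigma^2)$$. These choices and the function names are illustrative assumptions, not the paper's setup.

```python
import random

def df_dz(x, z, obs_std):
    """Derivative of f(z) = log N(x; z, obs_std^2) with respect to z."""
    return (x - z) / obs_std ** 2

def reparam_gradients(x, m, s, obs_std, num_samples, rng):
    """Pathwise estimates of d/dm and d/ds of E_{z ~ N(m, s^2)}[f(z)]."""
    grad_m, grad_s = 0.0, 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of the parameters
        z = m + s * eps            # z = g(eps; m, s), differentiable in m and s
        slope = df_dz(x, z, obs_std)
        grad_m += slope * 1.0      # chain rule: dz/dm = 1
        grad_s += slope * eps      # chain rule: dz/ds = eps
    return grad_m / num_samples, grad_s / num_samples

rng = random.Random(0)
x_obs, m, s, obs_std = 2.0, 0.5, 0.8, 0.5
# Closed forms for this toy objective, used only to check the estimates.
exact_grad_m = (x_obs - m) / obs_std ** 2
exact_grad_s = -s / obs_std ** 2
est_grad_m, est_grad_s = reparam_gradients(x_obs, m, s, obs_std, 20000, rng)
print(f"d/dm: exact {exact_grad_m:.2f}, estimate {est_grad_m:.2f}")
print(f"d/ds: exact {exact_grad_s:.2f}, estimate {est_grad_s:.2f}")
```

The randomness lives entirely in $$\epsilon$$, so the gradient passes through $$z$$ by the ordinary chain rule, and the same sampled $$\epsilon$$ values give gradients for both parameters.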

I understood (by analogy, and very loosely) the reparametrization they do as not very different from what we do with, say, integrals over sphere-like or cylinder-like surfaces: we reparametrize the variables and recalculate the interval. Here we want the interval to have not the two sums of the KL (including both distributions) but one sum.

I read the paper a little loosely, but when I compared it with what you are trying to achieve in the question, I thought you were going down the wrong path. – grochmal – 2019-06-13T23:38:16.627