I'm having difficulty understanding when integrals are intractable in variational inference problems.

In a variational autoencoder with observations $x = \{x_i\}_{i=1}^N$ and latent variable $z$, we want to maximize the data likelihood $p_\theta (x) = \prod_{i=1}^N p_\theta (x_i)$, where each factor is

$p_\theta (x_i) = \int p_{\theta_1} (z)\, p_{\theta_2} (x_i|z)\, dz$

We know $p_{\theta_1} (z)$ (usually a Gaussian with mean $\mu$ and covariance matrix $\Sigma$), and we also know $p_{\theta_2} (x|z)$, which is usually a Gaussian distribution with mean $\mu_z$ and covariance matrix $\Sigma_z$ produced by a neural network. We want to maximize the likelihood with respect to $\theta = \{\theta_1, \theta_2\}$. I understand that it is not possible to compute this integral analytically in order to optimize for $\theta$. It is, however, possible to obtain a sample-based approximation of the integral for a stochastic gradient ascent update. Denote by $z_l$ samples from $p_{\theta_1}(z)$ and write the gradient of the log-likelihood to optimize:

$\nabla_{\theta_1} \log p_\theta (x_i) = \frac{\int \nabla_{\theta_1} p_{\theta_1}(z)\, p_{\theta_2}(x_i|z)\, dz}{\int p_{\theta_1}(z)\, p_{\theta_2}(x_i|z)\, dz} \approx \frac{\frac{1}{L} \sum_{l=1}^L \nabla_{\theta_1} \log p_{\theta_1}(z_l)\, p_{\theta_2}(x_i|z_l)}{\frac{1}{L} \sum_{l=1}^L p_{\theta_2}(x_i|z_l)}$

where we used the following general identity: $\nabla_\theta p_\theta (x) = p_\theta (x)\, \nabla_\theta \log p_\theta (x)$. As I understand it, we don't like this approach because it requires drawing $L$ samples $z_l$ for every observation, which is inefficient. I'm confused to see that this kind of sampling is still used later.
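As a sanity check on this ratio estimator (my own hypothetical 1-D sketch, not from the paper): with a scalar Gaussian prior and likelihood, the marginal $p_\theta(x)$ is available in closed form, so the Monte Carlo estimate can be compared against the exact gradient.

```python
import numpy as np

# Hypothetical 1-D check: prior p(z) = N(mu, 1), likelihood p(x|z) = N(z, 1).
# Then p(x) = N(x; mu, 2), so d/dmu log p(x) = (x - mu) / 2 exactly.
rng = np.random.default_rng(0)
mu, x, L = 0.0, 1.0, 200_000

z = rng.normal(mu, 1.0, size=L)                          # z_l ~ p_{theta_1}(z)
lik = np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)   # p_{theta_2}(x | z_l)
score = z - mu                                           # grad_mu log N(z_l; mu, 1)

grad_mc = np.mean(score * lik) / np.mean(lik)            # the ratio estimator above
grad_exact = (x - mu) / 2.0
```

Even in this trivial case the likelihood weights $p_{\theta_2}(x|z_l)$ vary a lot across samples, which hints at why the estimator needs many samples per observation.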

Now, we assume we have an encoder network $q_\phi (z|x)$ (again, typically a Gaussian) and write the following objective:

$E_{z \sim q_\phi(z|x)} \big[ \log p_{\theta_2} (x|z) \big] - D_{KL}(q_\phi(z|x)||p_{\theta_1}(z))$.
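For intuition on this objective (my own hypothetical 1-D sketch, not from the paper): in the linear-Gaussian case the true posterior is Gaussian, and if $q_\phi$ equals it, the objective attains $\log p_\theta(x)$ exactly. The expectation term is estimated by sampling from $q$ (here via the reparametrization $z = m + s\,\epsilon$), while the KL term between two Gaussians is closed-form.

```python
import numpy as np

# Hypothetical 1-D case: p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(x; 0, 2)
# and the exact posterior is q(z|x) = N(x/2, 1/2). With this q, ELBO = log p(x).
rng = np.random.default_rng(0)
x, L = 1.0, 200_000
m, s2 = x / 2.0, 0.5                          # posterior mean and variance

z = m + np.sqrt(s2) * rng.normal(size=L)      # reparametrized samples z ~ q(z|x)
recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2)  # E_q[log p(x|z)]
kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))   # KL(N(m, s2) || N(0, 1)), closed form

elbo = recon - kl
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0          # log N(x; 0, 2)
```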

In the paper, the authors say that it is possible to calculate gradient updates for $\theta$, and they use the reparametrization trick to calculate updates for $\phi$. I tried to derive the updates for $\theta$ below. To optimize with respect to $\theta_1$, we can sample $z_l$ from $q_\phi(z|x_i)$:

$\nabla_{\theta_1} \Big(\int q_\phi(z|x_i) [\log p_{\theta_1} (z) + \log p_{\theta_2} (x_i|z)]\, dz \Big) = \int q_\phi(z|x_i) \frac{\nabla_{\theta_1}p_{\theta_1}(z)} {p_{\theta_1}(z)}\, dz \approx \frac{1}{L} \sum_{l=1}^L \frac{\nabla_{\theta_1}p_{\theta_1}(z_l)} {p_{\theta_1}(z_l)}$

and similarly for $\theta_2$

$\nabla_{\theta_2} \Big(\int q_\phi(z|x_i) [\log p_{\theta_1} (z) + \log p_{\theta_2} (x_i|z)]\, dz \Big) = \int q_\phi(z|x_i) \frac{\nabla_{\theta_2}p_{\theta_2} (x_i|z)} {p_{\theta_2} (x_i|z)}\, dz \approx \frac{1}{L} \sum_{l=1}^L \frac{\nabla_{\theta_2}p_{\theta_2} (x_i|z_l)} {p_{\theta_2} (x_i|z_l)}$
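What these two estimators have in common is that the sampling distribution $q_\phi$ does not depend on $\theta$, so the gradient can be moved inside the expectation. A hypothetical 1-D check of the $\theta_1$ update (my own sketch), with prior $N(\mu, 1)$ and a fixed Gaussian encoder:

```python
import numpy as np

# Hypothetical 1-D check of the theta_1 estimate: prior p(z) = N(mu, 1),
# encoder q(z|x) = N(0.5, 0.5). Since q does not depend on mu, we may sample
# z_l ~ q and average grad_mu log p(z_l) = (z_l - mu); exactly E_q[z] - mu.
rng = np.random.default_rng(0)
mu, m, s2, L = 0.0, 0.5, 0.5, 200_000

z = rng.normal(m, np.sqrt(s2), size=L)   # z_l ~ q(z|x), fixed w.r.t. mu
grad_mc = np.mean(z - mu)                # (1/L) sum_l grad_mu log p_{theta_1}(z_l)
grad_exact = m - mu                      # analytic gradient of E_q[log N(z; mu, 1)]
```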

I don't understand why the last two updates are fine but the original update is not. I'd appreciate it if you could point out my mistakes/misinterpretations.

I did read the paper a little sloppily, but when I compared it with what you are trying to achieve in the question, I thought you were going down the wrong path. – grochmal – 2019-06-13T23:38:16.627