- For the actual loss function of a VAE, we use $−\mathcal{L}$, more or less. Of course, it's expensive to actually calculate the expectation, which is why we use a single sample each time, yes?

Yes. It turns out a single sample MC estimate has fairly low variance in this case. However, the *Importance Weighted Autoencoder* shows that taking multiple samples can be useful.
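
As a toy illustration (not from the thread; the model and all numbers here are made up), here is the single-sample ELBO estimator next to the $k$-sample importance-weighted bound used by the IWAE, on a 1-D Gaussian model where $\log p(x)$ has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative numbers): prior p(z) = N(0,1),
# likelihood p(x|z) = N(z, 0.5^2), and a deliberately imperfect
# variational posterior q(z|x) = N(0.8*x, 0.6^2).
SIG = 0.5

def log_norm(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

def iwae_bound(x, k, n_rep=20000):
    """MC estimate of the k-sample importance-weighted bound L_k.
    k=1 is the ordinary single-sample ELBO; L_k rises toward log p(x)."""
    mu_q, sd_q = 0.8 * x, 0.6
    z = rng.normal(mu_q, sd_q, size=(n_rep, k))
    log_w = (log_norm(x, z, SIG)          # log p(x|z)
             + log_norm(z, 0.0, 1.0)      # log p(z)
             - log_norm(z, mu_q, sd_q))   # log q(z|x)
    m = log_w.max(axis=1, keepdims=True)  # stable log-mean-exp over the k samples
    return np.mean(m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1)))

x = 1.3
log_px = log_norm(x, 0.0, np.sqrt(1 + SIG**2))  # exact log p(x) in this model
print(iwae_bound(x, 1), iwae_bound(x, 50), log_px)
```

With this imperfect $q$, the $k=1$ bound sits a gap of $\mathrm{KL}(q \,\|\, p(z|x))$ below $\log p(x)$, and $k=50$ nearly closes it; that is the sense in which multiple samples help.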

- It's usually explained that we treat $p(z)$ as being $\mathcal{N}(0,1)$. But simply plugging $\mathcal{N}(0,1)$ into the KL divergence in the loss (as we do) just ensures that $q_\phi(z|x)$ gets close to it. We know nothing about $q_\phi(z)$, right (besides the fact that if all $q_\phi(z|x)$ are close, then so is $q_\phi(z)$)?

I prefer to think of it this way. There are in fact *two* models: a stochastic encoder (inference model) $q_\phi(x,z)=q_\phi(z|x) q(x)$ and a probabilistic decoder (generative model) $p_\theta(x,z)=p_\theta(x|z) p(z)$. Our goal is to enforce that $ q_\phi(x,z) $ and $ p_\theta(x,z) $ are close (small KL divergence).

Note that $q(x)$ and $p(z)$ are the empirical data distribution and the Bayesian prior, respectively, and are fixed ahead of time (not learned), at least in the vanilla VAE. They are very different from the aggregate marginals:
$$
q_\phi(z) = \int q_\phi(z|x) q(x) \, dx
\;\;\;\;\&\;\;\;\;
p_\theta(x) = \int p_\theta(x|z) p(z) \, dz
$$
In other words, $p(z)$ is something we *choose* as part of the model; our job is then to ensure that the inference marginal $q_\phi(z)$ is close to it. We *know* $p(z)$, but we *learn* $q_\phi(z)$ (also called the *inferred prior* or *aggregate posterior*), and we would like the latter to match the former.
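
A quick sketch of what the aggregate posterior means operationally (toy numbers and a made-up linear `encoder`, purely for illustration): sample $x \sim q(x)$ from the training set, then $z \sim q_\phi(z|x)$. The resulting $z$'s are distributed as $q_\phi(z)$, and nothing forces them to match $p(z) = \mathcal{N}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in training set and a hypothetical Gaussian encoder q_phi(z|x)
# (both entirely made up for illustration).
data = rng.normal(0.0, 2.0, size=5000)

def encoder(x):
    return 0.4 * x, 0.7  # (mean, std) of q_phi(z|x)

xs = rng.choice(data, size=100000)  # x ~ q(x): uniform draws from the training set
mus, sd = encoder(xs)
zs = rng.normal(mus, sd)            # z ~ q_phi(z|x)

# `zs` is now a sample from the aggregate posterior q_phi(z).
# Its moments need not match p(z) = N(0,1): here the mean is ~0
# but the standard deviation is not 1.
print(zs.mean(), zs.std())
```

Each conditional $q_\phi(z|x)$ here is a narrow Gaussian, yet the aggregate is a broad mixture over the whole dataset; this is exactly the gap between "each $q_\phi(z|x)$ is close to $p(z)$" and "$q_\phi(z)$ equals $p(z)$".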

- How should we think about that KL divergence in the second formula? It's usually pointed out that for fixed $p_\theta$, maximizing $\mathcal{L}$ is equivalent to minimizing the KL divergence. But $p_\theta$ is not fixed; it's being learned. We could also improve $\log p_\theta(x)$ (our ultimate goal) by making that divergence worse, couldn't we? What justifies optimizing $\mathcal{L}$ alone (other than tractability)?

It's important to note that while $p_\theta(z|x)$ is being learned (well, an approximation $q_\phi(z|x)$ is being learned), there is actually a "correct" one that can be computed via Bayes' rule:
$$ p_\theta(z|x) = \frac{p_\theta(x|z) p(z) }{p_\theta(x)} $$
when $ p_\theta(x|z) $ is fixed.
In other words, we don't *have to* learn an inference model; there is an optimal one given by Bayes rule. In theory we can just learn the generative model and use this rule to create the encoder.
But this is of course intractable to compute.
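
To see both halves of this, here is a sketch in a 1-D toy model (illustrative numbers, not from the thread) where the Bayes-rule encoder *is* tractable by brute-force integration over a grid; the same grid approach becomes hopeless once $z$ has more than a few dimensions:

```python
import numpy as np

# Toy model (illustrative): p(z) = N(0,1), p(x|z) = N(z, 0.5^2).
SIG = 0.5

def log_norm(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

x = 1.3
zgrid = np.linspace(-6, 6, 4001)
dz = zgrid[1] - zgrid[0]

joint = np.exp(log_norm(x, zgrid, SIG) + log_norm(zgrid, 0, 1))  # p(x|z) p(z)
p_x = joint.sum() * dz        # p(x) = \int p(x|z) p(z) dz, by Riemann sum
posterior = joint / p_x       # p(z|x), by Bayes' rule: no q was needed

# In this conjugate model the posterior is N(x/(1+SIG^2), SIG^2/(1+SIG^2)),
# and the grid answer matches it. But a comparable grid over a
# 100-dimensional z would need ~4001**100 points, hence "intractable".
post_mean = (zgrid * posterior).sum() * dz
print(p_x, post_mean)
```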

How should we think about that KL divergence in the second formula?

We are doing *variational inference*. We cannot compute the true posterior $p_\theta(z|x)$. So instead we compute an approximation $q_\phi(z|x)$. The second KL says the approximation should be good. Notice something important: *when the variational posterior approximation is perfect, then the ELBO equals the marginal likelihood*. In other words, when our $q_\phi$ is perfect, then we are indeed optimizing the true marginal likelihood, as you noted we want to do.
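
This can be checked numerically in a conjugate toy model (illustrative, not from the thread) where the true posterior is known in closed form: plugging it in as $q$ makes the ELBO equal $\log p(x)$ exactly (with zero sampling variance, since the importance weight is constant), while any other $q$ falls below it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model (illustrative): p(z) = N(0,1), p(x|z) = N(z, 0.5^2), for which
# the true posterior is exactly N(x/(1+SIG^2), SIG^2/(1+SIG^2)).
SIG = 0.5

def log_norm(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

def elbo(x, mu_q, sd_q, n=200000):
    """MC estimate of E_q[log p(x|z) + log p(z) - log q(z|x)]."""
    z = rng.normal(mu_q, sd_q, size=n)
    return np.mean(log_norm(x, z, SIG) + log_norm(z, 0, 1)
                   - log_norm(z, mu_q, sd_q))

x = 1.3
mu_star = x / (1 + SIG**2)                 # exact posterior mean
sd_star = np.sqrt(SIG**2 / (1 + SIG**2))   # exact posterior std
log_px = log_norm(x, 0, np.sqrt(1 + SIG**2))

print(elbo(x, mu_star, sd_star), log_px)   # perfect q: these coincide
print(elbo(x, 0.5, 1.0), log_px)           # imperfect q: ELBO strictly below
```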

We could also improve $\log p_\theta(x)$ (our ultimate goal) by making that divergence worse, couldn't we?

No; I expect that making that divergence worse will also make the marginal likelihood worse. I don't have a proof, but I conjecture it is so.

What justifies optimizing $\mathcal{L}$ alone (other than tractability)?

Well, it is a *lower bound*, so maximizing it guarantees we are "pushing up" the true marginal likelihood. As noted above, when the variational approximation is good, we are guaranteed to be doing the right thing.
There is a lot of literature on variational inference theory; I suspect one can say more under more constrained conditions.

Basically, I'm confused by the common explanation that what we really want is to minimize that second KL divergence, and also that the best way to do that is to maximize the ELBO.

I don't know if it is the *best* way. It is certainly an efficient one. But in some cases, it may be better to perform inference more directly, e.g. Hamiltonian Monte Carlo methods.
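
For a flavor of "more direct" inference, here is a sketch using random-walk Metropolis (a simpler relative of HMC; toy 1-D model with illustrative numbers) that samples the true posterior $p(z|x) \propto p(x|z)\,p(z)$ without ever introducing a $q$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (illustrative): p(z) = N(0,1), p(x|z) = N(z, 0.5^2).
SIG, x = 0.5, 1.3

def log_joint(z):
    """log p(x|z) + log p(z), up to an additive constant (enough for MCMC)."""
    return -(x - z) ** 2 / (2 * SIG**2) - z**2 / 2

z, chain = 0.0, []
for _ in range(50000):
    prop = z + rng.normal(0, 0.8)  # symmetric random-walk proposal
    # Metropolis accept/reject; only the unnormalized joint is needed,
    # so the intractable p(x) never has to be computed.
    if np.log(rng.uniform()) < log_joint(prop) - log_joint(z):
        z = prop
    chain.append(z)

samples = np.array(chain[5000:])   # discard burn-in
# For this conjugate model the posterior is N(x/(1+SIG^2), SIG^2/(1+SIG^2)),
# so the chain's mean should approach x/1.25 = 1.04.
print(samples.mean(), samples.std())
```

The price is that sampling runs at inference time for every new $x$, whereas the VAE amortizes inference into a single encoder pass.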

Thanks. I am going to have to refresh my understanding to make sense of all this. One thing that jumps out is that I may have been confusing $p_{\theta}$ and $p$ (and similarly for $q$). I take it they're different? Also, in what sense are $p$ and $q$ empirical? We don't have any empirical information about latent variables in the real world, do we? – A_P – 2019-10-03T19:27:58.963

Ah, "empirical" was only about q(x) (the distribution of samples, which we have). Another question: when you say "In theory we can just learn the generative model," how would we do that? – A_P – 2020-02-05T18:28:34.560

@A_P $p_\theta$ and $p$ only denote whether or not there are any parameters to the distribution. $p(z)$ is fixed and parameterless, while the other $p$ distributions are not (and hence have a $\theta$). When I wrote $q(x)$ is the empirical distribution, it is usually defined to mean the distribution of the data (over data space) - we can trivially sample from this via drawing from the uniform categorical distribution over the training set. – user3658307 – 2020-02-24T23:38:30.033

@A_P Regarding empirical info about latent variables, usually no (by definition, latent variables model *unobserved* variables). However, sometimes we might have some. This is common in semi-supervised learning. For instance, in computer vision, I might have a set of images, and know the lighting variables that generated some of the images. The lighting would generally be considered a latent variable, but we may know it (or estimate it) by some other means. – user3658307 – 2020-02-24T23:41:31.570

@A_P I was saying just learn $p_\theta(x,z) = p_\theta(x|z) p(z)$ (the generative model) *without* learning $q$ (the inference model). Then use Bayes Rule as the inference model instead (e.g., some MCMC-based models do this). Then no $q$ was ever needed! But it turns out that using Bayes rule is computationally intractable in many cases. Hence why VAEs do NOT do this, and instead learn a variational approximation (i.e., $q$). Notice that learning $p$ alone without $q$ is quite common: this is what most GANs do (i.e., they have no encoder). – user3658307 – 2020-02-24T23:45:58.810

Thanks again! Let me try to summarize. The original problem is to find a $p(x, z)$ where $p(z)$ is $\mathcal{N}(0,1)$ so that we can draw from $p(x|z)$, subject to maximizing $\prod p(x)$ over our dataset. Then for some reason we find ourselves trying to solve for $p(z|x)$, but that's intractable, so we use V.I. I guess the "some reason" is that we notice that *if* we knew $p(z|x)$ (and the corresponding $q(z|x)$), we could use the formulas in my question to provide a possibly good estimate of $p(x)$? – A_P – 2020-02-27T21:28:33.153

@A_P Yeah, that seems right. The only thing I'd clarify is that *if* we knew $p(z|x)$ exactly, there would be no need for $q(z|x)$ (the variational distribution), because the latter is merely a (usually poor) approximation of the former. We could then compute $p(x)$ directly. In this case, the ELBO is literally the log likelihood $\log p(x)$, because the KL term disappears (if I recall correctly). PS: are you familiar with normalizing flow models? Like GLOW? Those might help you understand. :) – user3658307 – 2020-02-28T21:17:02.170

I had not heard of normalizing flows. Thanks, reading about them now! – A_P – 2020-03-04T02:33:41.707