I'm trying to implement the VQ-VAE model. In it, a continuous variable $x$ is encoded as an array $z$ of discrete latent variables $z_i$, each of which is mapped to an embedding vector $e_i$. These vectors can then be used to generate an $\hat{x}$ that approximates $x$.
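To make my setup concrete, here is a minimal NumPy sketch of the quantization step as I understand it (the codebook size $K=512$, embedding dimension, and grid size are hypothetical choices of mine, not from the paper):

```python
import numpy as np

K, D = 512, 64  # hypothetical codebook size and embedding dimension
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))  # embedding vectors e_1, ..., e_K

def quantize(z_e):
    """Map continuous encoder outputs of shape (n, n, D) to discrete
    codes z (shape (n, n)) and their embedding vectors e (shape (n, n, D))."""
    n = z_e.shape[0]
    flat = z_e.reshape(-1, D)
    # squared Euclidean distance from each position to every codebook entry
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    z = d2.argmin(axis=1)            # discrete codes in [0, K)
    e = codebook[z]                  # corresponding embedding vectors
    return z.reshape(n, n), e.reshape(n, n, D)

z_e = rng.normal(size=(8, 8, D))     # pretend encoder output on an 8x8 grid
z, e = quantize(z_e)
```

So after encoding I have both the discrete grid $z$ and the continuous tensor $e$, and my question is which of the two the prior network should consume.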

In order to obtain a reasonable generative model $p_\theta(x)=\sum_z p_\theta(x|z)\,p(z)$, one needs to learn the prior distribution of the code $z$. However, it is not clear to me from the paper, or its second version, what the input of the network that learns the prior should be. Is it $z=[z_i]$ or $e=[e_i]$? The paper seems to indicate that it is $z$, but if that's the case, I don't understand how to encode $z$ properly. For example, a sample of $z$ might be an $n\times n$ matrix with discrete values between $0$ and $511$. It seems unreasonable to use a one-hot encoding, and also unreasonable to treat the discrete numbers as if they were continuous, given that there is no defined order among them. On the other hand, using $e$ avoids this problem, since it is a tensor with continuous entries, but then the required network would be much bigger.
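For reference, this is the one-hot representation I'm hesitant about (sizes are again hypothetical): an $n\times n$ grid of codes blows up into an $n\times n\times 512$ tensor, which is what makes this option seem impractical to me.

```python
import numpy as np

n, K = 32, 512  # hypothetical grid size and codebook size
rng = np.random.default_rng(0)
z = rng.integers(0, K, size=(n, n))          # discrete codes in [0, 512)
one_hot = np.eye(K, dtype=np.float32)[z]     # shape (32, 32, 512)
```

Compared to the $n\times n\times D$ tensor $e$ (with $D\ll K$), this representation is much larger, yet it is the only order-free encoding of $z$ I can think of.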

So, what should be the input for the prior model? $z$ or $e$? If it is $z$, how should I represent it? If it is $e$, how should I implement the network?