I have several thousand text documents, and I am currently working on obtaining latent feature representations of words and generating sentences using a variational autoencoder (VAE). The main obstacle I am facing is how to feed such large textual vectors into a VAE (or even into a simple autoencoder). I know there are other techniques, such as adding LSTM layers, but I want to keep the model simple, as just a VAE.

One option is to use one-hot-encoded vectors or bag-of-words, but this is not the most efficient representation: for a vocabulary of 100K unique words, each document becomes a 100K-dimensional input vector, and we also lose the sentence structure. For small datasets, however, there is no problem training an autoencoder on this type of input. Another option is to use word embeddings from a pre-trained Word2Vec model. This is what I have been trying to do, and the Python notebook which can be DOWNLOADED HERE uses this technique. The code is long and has multiple pre-processing steps, so I am unable to embed it in this post. The following are my questions:
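For concreteness, the embedding step in my notebook is roughly a lookup like the following sketch (a toy random table stands in for the real pre-trained Word2Vec vectors, and the tiny `vocab` dict is made up for illustration):

```python
import numpy as np

# Toy embedding table standing in for a pre-trained Word2Vec model;
# in practice these vectors would be loaded from the trained model.
embedding_dim = 4
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_matrix = np.random.RandomState(0).randn(len(vocab), embedding_dim)

def embed(tokens):
    """Map a tokenized document to a (num_words, embedding_dim) matrix,
    skipping out-of-vocabulary tokens."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return embedding_matrix[ids]

doc = ["the", "cat", "sat", "purred"]   # "purred" is out of vocabulary
vectors = embed(doc)
print(vectors.shape)  # (3, 4)
```

Each document thus becomes a matrix whose number of rows depends on its word count, which is exactly what leads to the first question below.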
Each sentence (or document) has a different number of words, so the sequence of word embeddings for each document has a different length. Unfortunately, Keras requires all inputs to be of the same length (if I am right). So, how should such varying input lengths be handled? Currently, in the fifth block of the notebook, you can see the statement `data = [x for x in vect if len(x) == 10]`; that is, I only consider documents that have exactly 10 words, to sidestep this problem. Of course, this is not practical. Can we pad with zero vectors?
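To make that last question concrete, here is a sketch (plain numpy, with a made-up helper name and toy dimensions) of the zero-padding I have in mind:

```python
import numpy as np

def pad_embeddings(docs, max_len, dim):
    """Pad each document's (num_words, dim) embedding matrix with
    zero vectors up to max_len rows, truncating longer documents."""
    out = np.zeros((len(docs), max_len, dim))
    for i, doc in enumerate(docs):
        n = min(len(doc), max_len)
        out[i, :n] = doc[:n]
    return out

# Two toy documents with 2 and 3 word vectors of dimension 4
docs = [np.ones((2, 4)), np.ones((3, 4))]
padded = pad_embeddings(docs, max_len=10, dim=4)
print(padded.shape)  # (2, 10, 4)
```

Every document then has the same fixed shape, but I am unsure whether the zero rows distort what the VAE learns.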
The VAE example on the Keras blog uses MNIST data as its example. They therefore use a sigmoid activation in the final reconstruction layer and, consequently, `binary_crossentropy` as the loss function (along with the KL divergence). Since my inputs are word embeddings, where the embedded vectors even contain negative values, I believe I should not use a sigmoid activation in the final reconstruction layer. Is that right? Accordingly, I have also changed the loss from `binary_crossentropy` to `mean_squared_error` in the attached code.
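To make the loss change concrete, the objective I intend is the following (a plain numpy sketch, not the Keras code itself; `z_mean` and `z_log_var` denote the encoder outputs, and the reconstruction term is mean squared error instead of binary cross-entropy):

```python
import numpy as np

def vae_loss(x, x_recon, z_mean, z_log_var):
    """VAE objective with an MSE reconstruction term (suited to
    real-valued inputs such as word embeddings) plus the usual
    KL divergence between q(z|x) and the standard normal prior."""
    recon = np.mean(np.square(x - x_recon))
    kl = -0.5 * np.mean(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var))
    return recon + kl

x = np.array([[-0.5, 1.2], [0.3, -0.8]])   # embeddings can be negative
x_recon = x.copy()                          # perfect reconstruction
z_mean = np.zeros((2, 2))                   # posterior equal to the prior
z_log_var = np.zeros((2, 2))
print(vae_loss(x, x_recon, z_mean, z_log_var))  # 0.0
```

With a linear (identity) activation in the final layer, the reconstruction can take negative values, which `sigmoid` plus `binary_crossentropy` cannot handle.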
It would be great if someone who has worked on VAEs and autoencoders for text data could share their thoughts on the questions above.
Note: the attached code is a simplified version of the code in THIS LINK