I am trying to reduce the dimensionality of topic vectors (300, 1) to a two-dimensional space. This has been done with various methods (e.g. t-SNE and autoencoders). A published example of reducing topic vectors to a 2-dim space can be found here.
To train a test autoencoder, I took the top 10k words from Google's word vector model and tried to encode/decode them. My autoencoder (300, 100, 2) with tanh activation functions (due to the nature of the vector elements of the word2vec model) seems to learn the vectors quickly (the loss function quickly goes negative).
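For concreteness, the (300, 100, 2) setup can be sketched roughly as below. This is a minimal, untrained forward pass in plain Python, not the actual training code: the mirrored decoder (2, 100, 300), the random initialisation, the random stand-in input, and the names `make_layer`, `dense_tanh`, `autoencode`, and `mse` are all assumptions made for illustration.

```python
import math
import random

# Encoder 300 -> 100 -> 2 as described; the mirrored decoder is an assumption.
SIZES = [300, 100, 2, 100, 300]
BOTTLENECK_LAYER = 1  # index of the layer whose output is the 2-dim code


def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_tanh(x, layer):
    """Fully connected layer followed by a tanh activation."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def autoencode(x, layers):
    """Return (2-dim code, 300-dim reconstruction) for one input vector."""
    h, code = x, None
    for i, layer in enumerate(layers):
        h = dense_tanh(h, layer)
        if i == BOTTLENECK_LAYER:
            code = h
    return code, h


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(SIZES[:-1], SIZES[1:])]

# Stand-in for one 300-dim word vector (random values, not real word2vec data).
x = [rng.uniform(-1.0, 1.0) for _ in range(300)]
code, reconstruction = autoencode(x, layers)
error = mse(x, reconstruction)
```

A mean squared reconstruction error like `error` can never be negative, so a loss that goes negative implies the training objective is something other than plain MSE.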
However, when I compare the closest words under the 2-dim autoencoder reduction (using cosine and Euclidean distance between the word representations), I notice that they don't match the similar words provided by the original w2v model (output of
Most similar words based on the autoencoded representation (2-dim, calculated via cosine distance). Results look random and it seems that the original distances didn't get preserved.
>>> extract_from_embed('car')  # extract method for my own embedding
[(u'football', 0.91979183998553593),
 (u'\xa9', 0.9515135906035519),
 (u'Thank', 0.96150527440098321),
 (u'innings', 0.96893565858480013),
 (u'thank', 0.9699300787802696),
 (u'balls', 0.97004978663463826),
 (u'Admission', 0.97050270191042776),
 (u'thanked', 0.97098601610349322),
 (u'Announces', 0.97186348679591361),
 (u'drills', 0.97214077185079129)]
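The lookup behind a result like this can be sketched as follows. It is a hypothetical stand-in, not the actual `extract_from_embed` implementation: the function names `cosine_distance` and `nearest_words` and the toy 2-dim vectors are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_words(query, embedding, k=10):
    """Return the k words closest to `query`, sorted by ascending distance."""
    query_vec = embedding[query]
    scored = [(word, cosine_distance(query_vec, vec))
              for word, vec in embedding.items() if word != query]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dim embedding (invented values, not output of any trained model).
toy_embedding = {
    "car": [1.0, 0.1],
    "truck": [0.9, 0.2],
    "vehicle": [0.8, 0.0],
    "thank": [-0.2, 1.0],
}
neighbours = nearest_words("car", toy_embedding, k=2)
```

In two dimensions the cosine distance depends only on the angle between the vectors, so it ignores their lengths entirely; comparing against the Euclidean ranking is a cheap sanity check.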
Most similar words from Word2Vec, based on the original (300, 1) vectors:
>>> word_vectors.most_similar(positive=['car'])  # std gensim method
[(u'vehicle', 0.7821096777915955),
 (u'cars', 0.7423831224441528),
 (u'SUV', 0.7160963416099548),
 (u'truck', 0.6735789775848389),
 (u'Car', 0.667760968208313),
 ...
 (u'automobile', 0.5838367938995361)]
Is there a way to "preserve" the relative distances between vectors when I use an autoencoder? I understand that the absolute distances differ between vectors in a 2-dim and a 300-dim space. But can I preserve the relative distances between the vectors? Is this even possible?
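To make "preserved" measurable, one option is to check how many of each word's k nearest neighbours are the same in both spaces. The sketch below is only an illustration under assumptions: it uses Euclidean distance, toy 3-dim vectors in place of the 300-dim ones, and hypothetical helper names (`top_k`, `neighbour_overlap`).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(word, vectors, k):
    """Set of the k words closest to `word` in the given space."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: euclidean(vectors[word], vectors[w]))
    return set(others[:k])


def neighbour_overlap(high, low, k):
    """Mean fraction of k nearest neighbours shared between the two spaces."""
    scores = [len(top_k(w, high, k) & top_k(w, low, k)) / k for w in high]
    return sum(scores) / len(scores)


# Toy "original" space (3-dim stand-in for the 300-dim vectors).
high = {
    "car": [1.0, 0.0, 0.0],
    "truck": [0.9, 0.1, 0.0],
    "thank": [0.0, 1.0, 0.1],
    "thanks": [0.0, 0.9, 0.0],
}
# A reduction that keeps the neighbourhoods (drops the third coordinate).
faithful = {w: v[:2] for w, v in high.items()}
# A reduction that scrambles them (invented values).
scrambled = {
    "car": [0.0, 0.0],
    "truck": [1.0, 1.0],
    "thank": [0.1, 0.0],
    "thanks": [1.0, 0.9],
}

good = neighbour_overlap(high, faithful, k=1)
bad = neighbour_overlap(high, scrambled, k=1)
```

A score near 1 means the local neighbourhoods survived the reduction; a score near 0 matches the "looks random" behaviour described above.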