Retain similarity distances when using an autoencoder for dimensionality reduction



I am trying to reduce the dimensionality of topic vectors of shape (300, 1) to a two-dimensional space. This has been done with various methods (e.g. t-SNE and autoencoders). A published example of reducing topic vectors to a 2-dim space can be found here.

To train a test autoencoder, I took the top 10k words from Google's word vector model and tried to encode/decode them. My autoencoder (layer sizes 300, 100, 2), with tanh activation functions (due to the nature of the vector elements of the word2vec model), seems to learn the vectors quickly (the loss function quickly goes negative).

However, when I look up the closest words in the 2-dim representation produced by the autoencoder (using cosine and Euclidean distance between the word representations), they don't match the similar words returned by the original w2v model (the output of most_similar).

Most similar words based on the autoencoded representation (2-dim, calculated via cosine distance). The results look random, and it seems that the original distances were not preserved:

>>> extract_from_embed('car')  # extract method for my own embedding
[(u'football', 0.91979183998553593), (u'\xa9', 0.9515135906035519), (u'Thank', 0.96150527440098321), (u'innings', 0.96893565858480013), (u'thank', 0.9699300787802696), (u'balls', 0.97004978663463826), (u'Admission', 0.97050270191042776), (u'thanked', 0.97098601610349322), (u'Announces', 0.97186348679591361), (u'drills', 0.97214077185079129)]

Word2Vec most similar words based on the original (300, 1) vectors:

>>> word_vectors.most_similar(positive=['car'])  # std gensim method
[(u'vehicle', 0.7821096777915955),
 (u'cars', 0.7423831224441528),
 (u'SUV', 0.7160963416099548),
 (u'truck', 0.6735789775848389),
 (u'Car', 0.667760968208313),
 (u'automobile', 0.5838367938995361)]
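
The mismatch between two neighbour lists like these can be put into a single number by checking, for every word, how many of its k nearest neighbours in the original space are still among its k nearest neighbours in the reduced space. The following is only a minimal sketch of that idea: the function names and the toy vectors are hypothetical, and the dictionaries stand in for word-to-vector mappings that would have to be built from the real 300-dim and 2-dim embeddings.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(vectors, word, k):
    """Return the set of the k words closest to `word` by cosine distance."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine_distance(vectors[word], vectors[w]))
    return set(others[:k])

def neighbour_overlap(original, reduced, k):
    """Mean fraction of k nearest neighbours shared between the two spaces."""
    shared = [
        len(nearest(original, w, k) & nearest(reduced, w, k)) / k
        for w in original
    ]
    return sum(shared) / len(shared)

# Hypothetical toy data: 3-dim "original" vectors and two 2-dim reductions.
original = {
    "car": (1.0, 0.0, 0.0), "truck": (0.9, 0.1, 0.0),
    "ball": (0.0, 1.0, 0.0), "bat": (0.0, 0.9, 0.1),
}
faithful = {   # keeps each word next to its original neighbour
    "car": (1.0, 0.0), "truck": (0.9, 0.1),
    "ball": (0.0, 1.0), "bat": (0.0, 0.9),
}
scrambled = {  # pairs each word with the wrong neighbour
    "car": (1.0, 0.0), "truck": (0.0, 1.0),
    "ball": (0.9, 0.1), "bat": (0.1, 0.9),
}

print(neighbour_overlap(original, faithful, k=1))   # 1.0
print(neighbour_overlap(original, scrambled, k=1))  # 0.0
```

A score near 1 means the local neighbourhoods survived the reduction, while a score near 0 corresponds to the "random looking" neighbours shown above. The brute-force search is quadratic in the vocabulary size, so for 10k words it would be worth sampling a subset of query words.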

Is there a way to "preserve" the relative distances between vectors when I use an autoencoder? I understand that the absolute distances differ between vectors in a 2-dim and a 300-dim space. But can I preserve the relative distances between the vectors? Is this even possible?


Posted 2017-04-02T23:37:59.717




No, in general it is not possible to preserve relative distances when reducing the dimensionality of arbitrary data. This is not a limitation specific to autoencoders as opposed to, e.g., PCA or t-SNE; it is a consequence of geometry.

You can see this relatively easily by considering a reduction from 3 dimensions to 2 and examining a regular tetrahedron whose four corner points are all 1 unit apart. There is no way to place four mutually equidistant points in a two-dimensional plane (unless perhaps you consider non-Euclidean spaces). Relative distances between points near those corner vertices are necessarily distorted as well, so this special shape demonstrates a general property of dimension reduction.
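
The tetrahedron argument can be checked numerically. The sketch below (variable names are illustrative only) fixes three of the points as an equilateral triangle with side 1 in the plane and looks for a place to put the fourth. The only point in the plane that is equidistant from all three corners is the circumcentre, and it is too close to them; the missing distance has to come from a third axis.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three corners of an equilateral triangle with side length 1, in the plane.
a = (0.0, 0.0)
b = (1.0, 0.0)
c = (0.5, math.sqrt(3) / 2)

# The only in-plane point equidistant from a, b and c is the circumcentre,
# which for an equilateral triangle coincides with the centroid.
centre = ((a[0] + b[0] + c[0]) / 3, (a[1] + b[1] + c[1]) / 3)
radius = dist(centre, a)

# To reach distance 1 from all three corners, the fourth point must be
# lifted out of the plane above the centre by this height.
height = math.sqrt(1 - radius ** 2)

print(round(radius, 4))  # about 0.5774, not 1: no valid spot in the plane
print(round(height, 4))  # about 0.8165: the required third coordinate
```

Any 2-D layout of the four points therefore has to stretch or shrink at least one of the six pairwise distances, whichever method produces the layout.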

Neil Slater

Posted 2017-04-02T23:37:59.717
