What are some standard ways of computing the distance between documents?



When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.


Posted 2014-07-05T16:10:21.580

Reputation: 783



There are a number of different ways of going about this depending on exactly how much semantic information you want to retain and how easy your documents are to tokenize (HTML documents would probably be pretty difficult to tokenize, but you could conceivably do something with tags and context).

Some of them have been mentioned by ffriend, and the paragraph vectors suggested by user1133029 are a really solid option, but I just figured I would go into some more depth about the pluses and minuses of the different approaches.

  • Cosine Distance - Tried and true, cosine distance is probably the most common distance metric used generically across multiple domains. With that said, there's very little information in cosine distance that can actually be mapped back to anything semantic, which seems non-ideal for this situation.
  • Levenshtein Distance - Also known as edit distance, this is usually just used on the individual token level (words, bigrams, etc.). In general I wouldn't recommend this metric, as it not only discards any semantic information but also tends to treat very different word alterations very similarly. Still, it is an extremely common metric for this kind of thing.
  • LSA - Is part of a large arsenal of techniques for evaluating document similarity called topic modeling. LSA has gone out of fashion fairly recently, and in my experience it's not quite the strongest topic modeling approach, but it is relatively straightforward to implement and has a few open-source implementations.
  • LDA - Is also a technique used for topic modeling, but it differs from LSA in that it actually learns internal representations that tend to be smoother and more intuitive. In general, the results you get from LDA are better for modeling document similarity than LSA, but not quite as good for learning how to discriminate strongly between topics.
  • Pachinko Allocation - Is a really neat extension on top of LDA. In general, this is just a significantly improved version of LDA, with the only downsides being that it takes a bit longer to train and open-source implementations are a little harder to come by.
  • word2vec - Google has been working on a series of techniques for intelligently reducing words and documents to more reasonable vectors than the sparse vectors yielded by techniques such as Count Vectorizers and TF-IDF. Word2vec is great because it has a number of open source implementations. Once you have the vector, any other similarity metric (like cosine distance) can be used on top of it with significantly more efficacy.
  • doc2vec - Also known as paragraph vectors, this is the latest and greatest in a series of papers by Google, looking into dense vector representations of documents. The gensim library in Python has an implementation of word2vec that is straightforward enough that it can pretty reasonably be leveraged to build doc2vec, but make sure to keep the license in mind if you want to go down this route.
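To make the cosine-distance bullet concrete, here is a minimal sketch on raw bag-of-words count vectors (the whitespace tokenization is a simplifying assumption; a real pipeline would use proper tokenization and usually TF-IDF weighting first):

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw bag-of-words count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Cosine *distance* is then just `1 - cosine_similarity(a, b)`; note that with no overlapping vocabulary the similarity is exactly zero, which is precisely the "no semantics" limitation described above.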

Hope that helps, let me know if you've got any questions.


Posted 2014-07-05T16:10:21.580

Reputation: 3 979


There are a number of semantic distance measures, each with its pros and cons. Here are just a few of them:

  • cosine distance, inner product between document feature vectors;
  • LSA, another vector-based model, but utilizing SVD for de-noising original term-document matrix;
  • WordNet-based measures, human-verified, though hardly extensible.

Start with the simplest approach and then move on based on the issues you hit in your specific case.
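As a concrete sketch of the LSA bullet above (the toy term-document matrix and truncation rank are made up for illustration; real use would go through a library such as gensim or scikit-learn):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
X = np.array([
    [2, 1, 0, 0],   # "cat"
    [1, 2, 0, 0],   # "dog"
    [0, 0, 2, 1],   # "stock"
    [0, 0, 1, 2],   # "market"
], dtype=float)

# Truncated SVD de-noises the matrix: keep only the top k singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dim LSA vector per document

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

The pet-themed documents (columns 0 and 1) end up with nearly parallel LSA vectors, while document 2 (finance-themed) is nearly orthogonal to them, which is the de-noising effect SVD buys you.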


Posted 2014-07-05T16:10:21.580

Reputation: 2 751

1Note that when doing LSA, typically you use the cosine distance on the LSA projections of the original dataset. Just to clarify. – Simon – 2014-07-07T18:09:18.477


Empirically I have found LSA vastly superior to LDA every time, on every dataset I have tried it on. I have talked to other people who have said the same thing. It has also been used to win a number of the SemEval competitions for measuring semantic similarity between documents, often in combination with a WordNet-based measure. So I wouldn't say it's going out of fashion, or that it is definitely inferior to LDA, which in my experience is better for topic modelling and not for semantic similarity, contrary to what some responders have stated.

If you use gensim (a Python library), it has LSA, LDA and word2vec, so you can easily compare the three. doc2vec is a cool idea, but it does not scale very well and you will likely have to implement it yourself, as I am unaware of any open-source implementations. It does not scale well because, for each document, a new and separate model has to be built using SGD, a slow machine learning algorithm. But it will probably give you the most accurate results. LSA and LDA also don't scale well (word2vec does, however), and LDA scales worse in general. Gensim's implementations are very fast, however, as its LSA uses iterative SVD.

One other note: if you use word2vec, you will still have to determine a way to compose document vectors from word vectors, as it gives you a different vector per word. The simplest way to do this is to normalize each word vector and take the mean over all word vectors in the document, or to take a mean weighted by the idf of each word. So it's not as simple as 'use word2vec'; you will need to do something further to compute document similarity.
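A minimal sketch of that composition step (the 3-d word vectors and idf table here are made up for illustration; in practice they would come from a trained word2vec model and your own corpus statistics):

```python
import numpy as np

def doc_vector(tokens, word_vecs, idf=None):
    """Mean of length-normalized word vectors, optionally idf-weighted."""
    parts = []
    for tok in tokens:
        if tok in word_vecs:
            v = word_vecs[tok] / np.linalg.norm(word_vecs[tok])
            parts.append((idf.get(tok, 1.0) if idf else 1.0) * v)
    return np.mean(parts, axis=0) if parts else None

# Hypothetical tiny "word2vec" vectors, just for illustration.
word_vecs = {
    "cat":  np.array([1.0, 0.2, 0.0]),
    "dog":  np.array([0.9, 0.3, 0.1]),
    "bank": np.array([0.0, 0.1, 1.0]),
}
d1 = doc_vector(["cat", "dog"], word_vecs)
d2 = doc_vector(["bank"], word_vecs)
```

Once every document is a single dense vector like `d1`, any similarity metric (cosine being the usual choice) applies directly.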

I would personally go with LSA, as I have seen it work well empirically, and gensim's library scales very well. However, there's no free lunch, so preferably try each method and see which works better for your data.


Posted 2014-07-05T16:10:21.580

Reputation: 916

How exactly have you used LSA? It's worth noting that LDA is actually a pretty thin wrapper around pLSA (it's pLSA with a Dirichlet prior) that has been empirically shown to greatly improve generalization. You would almost certainly see better accuracies with LSA, but that's generally a result of overfitting, which is a very notable problem with LSA. Also, what exactly do you mean by scaling here? doc2vec does not actually require a new model for each document, and for computation there's no notable difference between LSA and LDA, both being very scalable. – Slater Victoroff – 2014-07-07T19:36:26.987

I have not observed overfitting with LSA, and like I said, I have met multiple other people who have seen better performance than with LDA. Also, I have seen LSA used in many winning entries in SemEval competitions; I have never seen LDA used in a winning entry. That is the academic conference for comparing semantic similarity between documents, so I assume they know what they are doing. doc2vec, if you are referring to Mikolov's paragraph vector implementation, does SGD on each document separately, so it's very slow. – Simon – 2014-07-08T20:48:55.170

@SlaterVictoroff I think it's overstating things to say it's overfitting. LDA is known to be poor for search / information retrieval and recommendation cases; empirically LSA has been shown to work much better, and that matches my own experience too, as I like to validate these findings against our own data. Versions of Doc2Vec do a gradient descent per document; it depends on the algorithm used, as Doc2Vec generally refers to a lot of different algorithms. – Simon – 2018-07-21T19:12:43.200


State of the art appears to be the "paragraph vectors" introduced in a recent paper. Cosine/Euclidean distance between paragraph vectors would likely work better than any other approach. This probably isn't feasible yet due to the lack of open-source implementations.

The next best thing is cosine distance between LSA vectors, or cosine distance between raw BOW vectors. Sometimes choosing a different weighting scheme, like TF-IDF, works better.
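That weighting-scheme point can be sketched as follows (a minimal unsmoothed TF-IDF over token lists; real code would typically use a library vectorizer with smoothed idf):

```python
import math
from collections import Counter

def tfidf(docs):
    """Turn a list of token lists into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    idf = {tok: math.log(n / count) for tok, count in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
            for doc in docs]

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["stock", "fell"]]
vecs = tfidf(docs)
```

The effect is that a term appearing in many documents ("cat" here) is down-weighted relative to a rarer, more discriminative one ("mat"), which is exactly what you want before taking cosine distances.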


Posted 2014-07-05T16:10:21.580

Reputation: 121

Note my comments below about paragraph vector scalability. This technique looks very promising, but is hard to implement, and does not scale well at all, as you are doing a separate SGD for each document, which is very costly, if I remember the paper correctly – Simon – 2014-07-08T20:52:43.710


It is useful to have the family of locality-sensitive hashing algorithms in your bag of tools. This family is not semantic at all; it actually treats the text as a sequence of bits. It is useful on dirty data sets where the same text appears many times with slight differences.

You can use ssdeep (a fuzzy hash in the same family as the Nilsimsa hash) for identifying such documents. ssdeep was originally designed for the spam domain: spammers often make small changes to a message (e.g., adding a space) in order to prevent detection by an exact signature (e.g., MD5).

Since many versions of almost the same document in the same data set will wreak havoc on any statistical method applied to it, doing such a cleanup can be very beneficial.
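ssdeep itself ships as a C library with Python bindings; as a self-contained illustration of the same locality-sensitive idea, here is a tiny simhash-style fingerprint over character n-grams (the n-gram size and 64-bit width are arbitrary choices for this sketch, not ssdeep's actual algorithm):

```python
import hashlib

def simhash(text: str, ngram: int = 4, bits: int = 64) -> int:
    """Locality-sensitive 64-bit fingerprint: similar texts -> close hashes."""
    weights = [0] * bits
    for i in range(max(len(text) - ngram + 1, 1)):
        chunk = text[i:i + ngram].encode("utf-8")
        h = int.from_bytes(hashlib.md5(chunk).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits; small for near-duplicate texts."""
    return bin(a ^ b).count("1")
```

A spam-style tweak (an added space, say) flips only a handful of fingerprint bits, while an unrelated document lands roughly half the bits away, so a simple Hamming-distance threshold flags the near-duplicates for cleanup.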


Posted 2014-07-05T16:10:21.580

Reputation: 2 463