Best practical algorithm for sentence similarity



I have two sentences, S1 and S2, both of which (usually) have a word count below 15.

What are the most practically useful and successful (machine learning) algorithms that are also reasonably easy to implement? (A neural network is fine, unless the architecture is as complicated as Google Inception etc.)

I am looking for an algorithm that will work fine without putting too much time into it. Are there any algorithms you've found successful and easy to use?

This can, but does not have to, fall into the category of clustering. My background is in machine learning, so any suggestions are welcome :)


Posted 2017-11-23T14:40:25.603

Reputation: 363

What did you implement? I am facing the same problem: I have to come up with a solution for finding 'k' related articles in a corpus that keeps updating. – Dileepa – 2018-08-15T03:07:12.737



Cosine similarity in a vector space could be your answer.
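As a minimal sketch of what cosine similarity looks like in practice, the snippet below represents each sentence as a bag-of-words count vector and compares the two vectors. The helper names `bag_of_words` and `cosine_similarity` are made up for this illustration, and the tokenization is deliberately naive. The two sentences differ only by the word "not".

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


s1 = bag_of_words("This is a tree")
s2 = bag_of_words("This is not a tree")
score = cosine_similarity(s1, s2)
print(round(score, 3))  # 0.894
```

The score is high because the sentences share four of their words, even though one negates the other. Pure word-overlap measures cannot see that difference.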

Or you could calculate the eigenvector of each sentence. But the problem is: what is similarity? Compare these two sentences, which share almost every word yet mean opposite things:

"This is a tree", "This is not a tree"

If you want to compare the semantic meaning of the sentences, you will need a word-vector dataset. With word vectors you will be able to check the relationships between words. Example: King - Man + Woman = Queen.
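To make the King/Queen arithmetic concrete, here is a toy sketch. The three-dimensional vectors in `toy_vectors` are hand-made for illustration (roughly "royalty", "gender", "plant") and are not taken from any trained model. Real word vectors have hundreds of dimensions and come from a pretrained dataset. The `analogy` helper is likewise hypothetical.

```python
import math

# Hand-made toy vectors: (royalty, gender, plant). Not from a trained model.
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "tree":  [0.0,  0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```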

Siraj Raval has a good Python notebook for creating word-vector datasets.

Christian Frei

Posted 2017-11-23T14:40:25.603

Reputation: 331


One approach you could try is averaging the word vectors generated by word embedding algorithms (word2vec, GloVe, etc.). These algorithms create a vector for each word, and the cosine similarity between two vectors represents the semantic similarity between the corresponding words. For sentences, you average the vectors of the words in each sentence and take the cosine similarity between the two averages. A good starting point for learning more about these methods is this paper: How Well Sentence Embeddings Capture Meaning. It discusses some sentence embedding methods. I also suggest you look into Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features; the authors claim their approach beats state-of-the-art methods. They also provide the code and some usage instructions in this GitHub repo.
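The averaging idea can be sketched in a few lines. The `toy_embeddings` table below is invented purely for illustration; in practice you would load pretrained word2vec or GloVe vectors instead. The function names are hypothetical, and words missing from the table are simply skipped, which is one of several possible choices.

```python
import math

# Invented 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "cat":    [0.9, 0.1, 0.0],
    "dog":    [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0],
    "rests":  [0.2, 0.8, 0.0],
    "stocks": [0.0, 0.1, 0.9],
    "fell":   [0.0, 0.2, 0.8],
}


def sentence_vector(sentence, embeddings):
    """Average the vectors of all words that have an embedding."""
    words = [w for w in sentence.lower().split() if w in embeddings]
    if not words:
        raise ValueError("no known words in sentence")
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


v1 = sentence_vector("the cat sleeps", toy_embeddings)
v2 = sentence_vector("a dog rests", toy_embeddings)
v3 = sentence_vector("stocks fell", toy_embeddings)

similar = cosine(v1, v2)    # sentences about related things
unrelated = cosine(v1, v3)  # sentences about unrelated things
print(similar > unrelated)  # True
```

Note that the first two sentences share no words at all, so a word-overlap measure would rate them as completely different. The averaged embeddings still place them close together.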

Dani Mesejo

Posted 2017-11-23T14:40:25.603

Reputation: 2 086


bert-as-service offers just that solution.

To answer your question: implementing it yourself from scratch would be quite hard, as BERT is not a trivial neural network, but with this solution you can simply plug it into your algorithm that uses sentence similarity.


Posted 2017-11-23T14:40:25.603

Reputation: 31


You should check out fuzzywuzzy. It is an awesome library for string/text matching that gives a number from 0 to 100 based on how similar two sentences are. It uses Levenshtein distance to calculate the differences between sequences in a simple-to-use package. Also, check out this blog post, written by the fuzzywuzzy author, for a detailed explanation of how fuzzywuzzy does the job.
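For intuition about what an edit-distance score measures, here is a plain-Python sketch of Levenshtein distance mapped onto a 0 to 100 scale. This is not fuzzywuzzy's code, and the normalisation in `similarity_score` is an assumption made for this example, so the numbers will not necessarily match what the library returns.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity_score(a, b):
    """Map edit distance to a 0-100 score (hypothetical normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))       # 3
print(similarity_score("kitten", "sitting"))  # 57
```

Because this works on characters rather than meaning, it suits typos and near-duplicate strings better than paraphrases.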

karthikeyan mg

Posted 2017-11-23T14:40:25.603

Reputation: 728


This blog has a solution for short-text similarity. They mainly use the BERT neural network model to find similarities between sentences.

vimal Dharmalingam

Posted 2017-11-23T14:40:25.603

Reputation: 11

Hi, welcome to Data Science Stack Exchange! When referencing a solution from an outside website, please consider writing a summary in your answer. Indeed, this will be easier to read, and prevents your answer from becoming obsolete if the target page changes or the link breaks. – Romain Reboulleau – 2019-11-11T11:55:17.607

Nice, this is really good stuff. So they basically use BERT? @RomainReboulleau is definitely right though! – DaveTheAl – 2019-11-13T10:34:44.257


Depending on the representation of your sentences, you have different similarity metrics available. Some might be more suited to the representation you are using than others.

One of the most popular metrics is the cosine distance.

However, there are others available in the literature, such as:

You can experiment with these alternatives and see what works best for your use-case.


Posted 2017-11-23T14:40:25.603

Reputation: 181


Python now has a sent2vec library:


Posted 2017-11-23T14:40:25.603

Reputation: 692