Similarity between two words



I'm looking for a Python library that helps me identify the similarity between two words or sentences.

I will be doing audio-to-text conversion, which will produce English dictionary or non-dictionary words (these could be person or company names). After that, I need to compare the result to a known word or words.


1) Audio-to-text result: "Thanks for calling America Expansion" will be compared to "American Express".

Both sentences are somewhat similar but not the same.

It looks like I may need to look into how many characters they share. Any ideas would be great. I'm after functionality like the "did you mean" feature in Google search.


Posted 2016-07-04T06:00:34.093




The closest match would be, as Jan has mentioned in his answer, the Levenshtein distance (also popularly called the edit distance).

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

It is a very commonly used metric for identifying similar words. NLTK already has an implementation of the edit-distance metric, which can be invoked in the following way:

import nltk
nltk.edit_distance("humpty", "dumpty")

The above code would return 1, as only one letter is different between the two words.
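The raw distance depends on word length, so for comparing scores across different pairs it can help to normalize it. A minimal sketch (the normalization is mine, not part of NLTK):

```python
import nltk

def edit_similarity(a, b):
    # Normalize the edit distance by the longer string's length:
    # 1.0 means identical, 0.0 means nothing in common.
    dist = nltk.edit_distance(a, b)
    return 1 - dist / max(len(a), len(b), 1)
```

For example, edit_similarity("humpty", "dumpty") is about 0.83, since one of six characters differs.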



Levenshtein distance is the worst algorithm you can use if NLP is what you intend to do. If two synonyms have different character sets, LD will perform very poorly in those cases. – It's a trap – 2018-02-10T05:24:22.313


Apart from the very good responses here, you may try SequenceMatcher from Python's difflib library.

import difflib

a = 'Thanks for calling America Expansion'
b = 'Thanks for calling American Express'

seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
### OUTPUT: 87.323943

Now consider the code below:

a = 'Thanks for calling American Expansion'
b = 'Thanks for calling American Express'

seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
### OUTPUT: 88.88888

Now you may compare the d value to evaluate the similarity.
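Building on that, picking the closest phrase from a list of known candidates is a one-liner with max() (a sketch; the candidate strings are just example data):

```python
import difflib

def best_match(query, candidates):
    # Return the candidate with the highest SequenceMatcher ratio.
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, query, c).ratio())

best_match("Thanks for calling America Expansion",
           ["Thanks for calling American Express", "Completely unrelated text"])
```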



If you feel that seq.ratio() is slow, you can use seq.quick_ratio() – Nabin – 2019-08-18T03:22:08.237


If your dictionary is not too big a common approach is to take the Levenshtein distance, which basically counts how many changes you have to make to get from one word to another. Changes include changing a character, removing a character or adding a character. An example from Wikipedia:

lev(kitten, sitting) = 3

  • kitten -> sitten (substitute "k" with "s")
  • sitten -> sittin (substitute "e" with "i")
  • sittin -> sitting (insert "g")

Here are some Python implementations on Wikibooks.
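For reference, the dynamic-programming version fits in a few lines (a sketch in the spirit of those Wikibooks implementations):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    # prev holds the previous row of the distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

levenshtein("kitten", "sitting") returns 3, matching the example above.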

The algorithm to compute these distances is not cheap, however. If you need to do this at scale, there are ways to use cosine similarity on bigram vectors that are a lot faster and easy to distribute if you need to find matches for a lot of words at once. They are, however, only an approximation of the true distance.
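A minimal sketch of that bigram idea (the function name is mine): represent each string as a bag of character bigrams and take the cosine of the two count vectors.

```python
import math
from collections import Counter

def bigram_cosine(a, b):
    # Bag of character bigrams for each string.
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    # Cosine of the two count vectors.
    dot = sum(va[bg] * vb[bg] for bg in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Unlike edit distance, the bigram vectors can be precomputed for the whole dictionary, which is what makes this approach easy to scale and distribute.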

Jan van der Vegt


(+1) for the Lev. distance metric. nltk comes with a ready-made implementation. Cosine similarity isn't a good string-similarity measure IMHO :) – Dawny33 – 2016-07-04T08:29:51.697

I agree that it's much worse than the Levenshtein distance, but if you need fuzzy matching between two datasets of millions of entries it can run in a reasonable time, thanks to some tricks plus matrix multiplication – Jan van der Vegt – 2016-07-04T11:48:59.540

@Dawny33 I would disagree. Not only has cosine similarity worked very quickly for me but also very accurately, given that the right n-gram was used. – Mohit Motwani – 2019-07-01T13:44:24.627


An old and well-known technique for comparison is the Soundex algorithm. The idea is to compare not the words themselves but approximations of how they are pronounced. To what extent this actually improves the quality of the results I don't know.
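As a sketch, the classic American Soundex code can be computed like this (first letter kept; consonants mapped to digits; vowels dropped; adjacent duplicate codes collapsed, with "h" and "w" not separating duplicates):

```python
def soundex(name):
    # Consonant-to-digit table from the classic American Soundex.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h and w do not separate duplicate codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                   # vowels (empty code) reset the duplicate check
        if len(result) == 4:
            break
    return (result + "000")[:4]       # pad or truncate to letter + 3 digits
```

With this, "Robert" and "Rupert" both map to "R163", so they compare as equal even though their spellings differ.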

However it feels a bit strange to apply something like Soundex to results from a speech-to-text recognition engine. First you throw away information about how the words are pronounced, then you try to add it back again. It would be better to combine these two phases.

Hence, I expect the state of the art in this area to do exactly that, and to be some form of adaptive classification, e.g. based on neural networks. Google does return recent research on speech recognition with neural networks.




I think the function get_close_matches in the difflib module could be more suitable for such a requirement.

get_close_matches(word, possibilities, n=3, cutoff=0.6)

  • possibilities is the list of words to match against
  • n is the maximum number of close matches to return
  • cutoff is the minimum similarity ratio (between 0 and 1) a match must reach



from difflib import get_close_matches

# data is assumed to be a dict keyed by the candidate words
matches = get_close_matches(w, data, n=3, cutoff=0.7)
if matches:
   return data[matches[0]]

This piece of code calls get_close_matches once and returns the value stored for the best (first) match.
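A self-contained example along the same lines (the company names here are just made-up candidates):

```python
from difflib import get_close_matches

known = ["American Express", "America Online", "Bank of America"]

# Best fuzzy match for the transcribed phrase; the cutoff is tunable.
get_close_matches("America Expansion", known, n=1, cutoff=0.6)
```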

JP Chauhan
