How to fix spelling mistakes in data?


I have an input data file which contains a list of drug names.

I have more than 1000 unique drug names. However, the drug names have spelling mistakes and spacing issues.


You can see how the 3 terms above differ in representation (due to spelling mistakes) but actually indicate the same drug, ISONIAZID 300MG TAB (which is the right spelling).

But the problem is that there are several other drugs with such spelling mistakes, and I am not sure how I can group all of them into one (meaning rename them with the right spelling). For example, all 3 terms above should be renamed to ISONIAZID 300MG TAB.

I am posting here to ask whether there is any medical dictionary or automated approach that can take my raw CSV file as input and output proper drug names.

The Great

Posted 2020-04-21T04:32:20.287

Reputation: 1 539

Try this library.

– ashutosh singh – 2020-04-21T14:27:07.437

It contains multiple methods to correct spellings.

  1. Contextual correction
  2. Edit Distances
  3. Phonetic Correction.

It doesn't support the word split problem though. – ashutosh singh – 2020-04-21T14:27:12.910



There are several general approaches to this, but almost all of them compare the values to some baseline and decide whether each individual value is close enough.

You can compare the similarity of strings using different methods. For example, I often use the Levenshtein distance, which measures how many single-character edits (insertions, deletions, or substitutions) you would need to convert word a into word b. By grouping all words with a low enough Levenshtein distance, you already identify the words that should be the same.
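As a rough illustration of the grouping idea, here is a minimal pure-Python sketch. The function names `levenshtein` and `group_similar`, the threshold of 2, and the misspelled sample names are all assumptions made up for this example; only ISONIAZID 300MG TAB comes from the question. The grouping is deliberately naive (each name joins the first group whose first member is close enough), so treat it as a starting point rather than a finished method.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def group_similar(names, max_distance=2):
    """Greedy grouping: a name joins the first group whose first
    member is within max_distance edits, else starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if levenshtein(name, group[0]) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical sample data; the misspellings are invented for illustration.
names = [
    "ISONIAZID 300MG TAB",
    "ISONIAZID 300 MG TAB",
    "ISONIAZIDE 300MG TAB",
    "PARACETAMOL 500MG TAB",
]
groups = group_similar(names)
for group in groups:
    print(group)
```

The threshold matters a lot for drug names: set it too high and different strengths of the same drug (e.g. 300MG vs 100MG) may end up merged, so it is worth reviewing the groups by hand.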

It is even easier if you have a dictionary of "correct values", in which case you would compare each value to all entries in the dictionary and assign it the entry with the lowest Levenshtein distance.
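A dictionary lookup could be sketched like this with Python's standard library. Note that `difflib.get_close_matches` ranks candidates by a similarity ratio, not by Levenshtein distance, so this is a stand-in for the same idea rather than the exact measure described above. The reference list `KNOWN_DRUGS`, the helper name `correct_name`, and the cutoff of 0.8 are assumptions for illustration.

```python
import difflib

# Hypothetical reference list; in practice this would be loaded from
# a curated source of correct drug names.
KNOWN_DRUGS = ["ISONIAZID 300MG TAB", "PARACETAMOL 500MG TAB"]


def correct_name(raw, reference=KNOWN_DRUGS, cutoff=0.8):
    """Return the closest reference name, or None if nothing is
    similar enough (similarity ratio below cutoff)."""
    # Uppercase and collapse repeated whitespace before comparing.
    cleaned = " ".join(raw.upper().split())
    matches = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(correct_name("isoniazid  300 mg tab"))
print(correct_name("XYZ"))
```

Returning `None` for values without a close match keeps unknown drugs visible for manual review instead of silently forcing them onto the nearest entry.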

Using some basic text analysis, you could also identify the most common spelling mistakes in your data and eliminate them with replacement rules.
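Replacement rules can be as simple as a few regular expressions applied after basic cleanup. The sketch below is only an example: the rules in `REPLACEMENTS`, the unit list, and the helper name `normalize` are hypothetical, and the real rules would come from inspecting which mistakes occur most often in your own data.

```python
import re

# Hypothetical rules: regex pattern -> replacement. Build the real
# list from the mistakes that actually show up in your data.
REPLACEMENTS = {
    r"\bTAB(?:LET)?S?\b": "TAB",     # TABLET, TABLETS, TABS -> TAB
    r"\bISONIAZIDE\b": "ISONIAZID",  # one invented misspelling
}


def normalize(raw):
    """Uppercase, fix spacing issues, then apply the replacement rules."""
    # Collapse repeated whitespace and trim the ends.
    text = " ".join(raw.upper().split())
    # Remove the space between a number and its unit: "300 MG" -> "300MG".
    text = re.sub(r"(\d+)\s+(MG|ML|MCG)\b", r"\1\2", text)
    for pattern, replacement in REPLACEMENTS.items():
        text = re.sub(pattern, replacement, text)
    return text


print(normalize("isoniazide  300 mg tablet"))
```

Running a normalization step like this before any distance-based matching tends to remove the spacing and casing noise, so the fuzzy comparison only has to deal with genuine spelling differences.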


Posted 2020-04-21T04:32:20.287

Reputation: 1 300

Hi @Fnguyen, Can you share any tutorial or example on how this is implemented? – The Great – 2020-06-12T03:37:17.697