Algorithms and techniques for spell checking



Can anyone suggest algorithms and techniques for spell checking? After some googling, I found some interesting ones, such as this one from Peter Norvig and a few others. However, most of them were written many years ago, so I am trying to find out whether there is any newer or improved approach to this problem.


Posted 2017-01-06T06:05:56.000

Reputation: 323



Here is what I built...

  • Step 1: Store all the words in a trie data structure (see the Wikipedia article on tries).

  • Step 2: Train an RNN or RNTN to get seq2seq mapping for words and store the model

  • Step 3: Retrieve the top n words by Levenshtein distance (LD; see the Wikipedia article on LD) from the word that you are trying to correct.

    • Naive Approach: Calculate the edit distance between the query term and every dictionary term. Very expensive.
    • Peter Norvig's Approach: Derive all possible terms with an edit distance<=2 from the query term, and look them up in the dictionary. Better, but still expensive (114,324 terms for word length=9 and edit distance=2). Check this.
    • Faroo's Approach: Derive deletes only, with an edit distance<=2, both from the query term and from each dictionary term. Three orders of magnitude faster.
  • Step 4: Based on the previous 2, 3, or 4 words, predict which of the words retrieved in Step 3 fits best. How many candidate words you consider depends on how much accuracy you want (of course we want 100% accuracy) and on the processing power of the machine running the code.
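Steps 3 and 4 can be sketched roughly as follows. This is only a minimal illustration of the deletes-only idea, not Faroo's actual implementation: deletes are precomputed for every dictionary word, the query's deletes are looked up in that index, and the matches are verified with a plain Levenshtein distance. The names `build_index`, `candidates`, and `rank_by_context` are hypothetical, and the context ranking uses simple bigram counts as a stand-in for the RNN from Step 2.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def deletes(word, max_dist=2):
    """All strings obtained by deleting up to max_dist characters from word."""
    out = {word}
    for k in range(1, min(max_dist, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out


def build_index(dictionary, max_dist=2):
    """Map every delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for w in dictionary:
        for d in deletes(w, max_dist):
            index[d].add(w)
    return index


def candidates(query, index, max_dist=2):
    """Dictionary words within max_dist of query, as sorted (distance, word)."""
    found = set()
    for d in deletes(query, max_dist):
        found |= index.get(d, set())
    # A shared delete is necessary but not sufficient, so verify each match.
    scored = ((levenshtein(query, w), w) for w in found)
    return sorted((dist, w) for dist, w in scored if dist <= max_dist)


def rank_by_context(cands, prev_word, bigram_counts):
    """Order by edit distance, then by how often (prev_word, word) was seen.

    bigram_counts is an assumed dict {(previous_word, word): count}; a trained
    language model would replace this lookup in a real system.
    """
    return sorted(
        cands,
        key=lambda dw: (dw[0], -bigram_counts.get((prev_word, dw[1]), 0)),
    )
```

For example, with the dictionary `["spelling", "spell", "smelling", "selling", "checking"]`, calling `candidates("speling", build_index(dictionary))` returns "spelling" at distance 1, followed by "selling" and "smelling" at distance 2. The index costs extra memory up front, but each lookup then touches only the query's own deletes instead of the whole dictionary.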

Check out this approach too. It's an ACL paper from 2009. They call it language-independent auto-correction.

Rahul Reddy Vemireddy

Posted 2017-01-06T06:05:56.000

Reputation: 675

Please elaborate how these steps work together: What is the input and the output in the NN in Step 2? Do you mean with "previous 2-4 words" the words in the text being corrected? In step 4 you predict the words from step 3 with the model from step 2. I guess you calculate their probability, is this right? What do you mean with "select the number of words depending on how much accuracy you want"? Is this an interactive approach where the final choice is left to the user? – Claude – 2017-01-17T08:19:29.693


A character-level recurrent neural network comes to the rescue here. The main idea is to build a character-level language model using a recurrent neural network. Look into this GitHub project and also this Medium blog post to get a good understanding of the process.
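To give a feel for what a character-level language model does, here is a deliberately simplified sketch. It is not an RNN: it swaps in a count-based character bigram model with add-alpha smoothing, which is far weaker but needs no deep learning library. The class name `CharBigramModel` and the markers `^` and `$` are assumptions made for this illustration. The point is only that such a model assigns lower scores to unlikely character sequences, which is the signal a spell checker can exploit.

```python
import math
from collections import Counter


class CharBigramModel:
    """Count-based character bigram model (a simple stand-in for a char RNN)."""

    def __init__(self, words, alpha=1.0):
        self.alpha = alpha                 # add-alpha smoothing constant
        self.pair_counts = Counter()       # counts of (char, next_char)
        self.context_counts = Counter()    # counts of char as a left context
        self.vocab = {"$"}                 # characters that can be predicted
        for w in words:
            chars = "^" + w + "$"          # ^ marks word start, $ marks word end
            self.vocab.update(w)
            for a, b in zip(chars, chars[1:]):
                self.pair_counts[(a, b)] += 1
                self.context_counts[a] += 1

    def log_prob(self, word):
        """Average log-probability per character transition in word."""
        chars = "^" + word + "$"
        v = len(self.vocab) + 1            # one extra slot for unseen characters
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            num = self.pair_counts[(a, b)] + self.alpha
            den = self.context_counts[a] + self.alpha * v
            total += math.log(num / den)
        return total / (len(chars) - 1)
```

Trained on a small word list such as `["spelling", "spell", "selling", "telling", "checking", "check"]`, the model scores "spelling" higher than a string like "spxlqing", because the latter contains character pairs it has never seen. A recurrent network plays the same role but conditions on the whole prefix rather than only the previous character.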

Himanshu Rai

Posted 2017-01-06T06:05:56.000

Reputation: 1 608


I would assume you want to do this for the cleaning phase of your project. Microsoft Cognitive Services provides a spell check API.

If you are trying to group words into embeddings, then I might suggest Word2Vec or GloVe. Pre-trained GloVe vectors are available in several dimensionalities. Both can potentially account for misspellings as well as similarities.

Samuel Sherman

Posted 2017-01-06T06:05:56.000

Reputation: 186