NLP: What are some popular packages for multi-word tokenization?



I intend to tokenize a number of job description texts. I have tried standard tokenization using whitespace as the delimiter. However, I noticed that some multi-word expressions are split by whitespace, which may well cause accuracy problems in subsequent processing. So I want to get all the most interesting/informative collocations in these texts.

Are there any good packages for multi-word tokenization, regardless of programming language? For example, "He studies Information Technology" ===> "He" "studies" "Information Technology".

I've noticed NLTK (Python) has some related functionalities.


The MWETokenizer class in the nltk.tokenize.mwe module seems to work towards my objective. However, MWETokenizer seems to require me to use its constructor and its .add_mwe method to add multi-word expressions. Is there a way to use an external multi-word expression lexicon to achieve this? If so, are there any multi-word lexicons available?
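For reference, a minimal sketch of how MWETokenizer could be fed from an external lexicon (the lexicon contents below are invented for illustration; in practice they would come from a file, one expression per line):

```python
from nltk.tokenize import MWETokenizer

# Hypothetical external lexicon: one multi-word expression per line.
lexicon = ["Information Technology", "Machine Learning"]

tokenizer = MWETokenizer(separator=' ')
for mwe in lexicon:
    tokenizer.add_mwe(tuple(mwe.split()))

# MWETokenizer operates on an already-tokenized sentence and merges
# any sequence of tokens found in its lexicon.
tokens = tokenizer.tokenize("He studies Information Technology".split())
print(tokens)  # ['He', 'studies', 'Information Technology']
```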



Posted 2017-03-02T07:04:41.123

Reputation: 321



From what I understood from the API documentation, the multiword tokenizer nltk.tokenize.mwe basically merges a string that has already been divided into tokens, based on a lexicon.

One thing you can do is tokenize and tag all words with their associated part-of-speech (PoS) tags, and then define regular expressions over the PoS-tags to extract interesting key-phrases.

For instance, an example adapted from the NLTK Book Chapter 7 and this blog post:

import nltk
from nltk import pos_tag, word_tokenize

def extract_phrases(my_tree, phrase):
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))

    for child in my_tree:
        if type(child) is nltk.Tree:
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)

    return my_phrases

def main():
    sentences = ["The little yellow dog barked at the cat",
                 "He studies Information Technology"]

    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)

    for x in sentences:
        sentence = pos_tag(word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([x[0] for x in phrase.leaves()]))

You defined a grammar based on regex over PoS-tags:

grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
cp = nltk.RegexpParser(grammar)

Then you applied it to a tokenized and tagged sentence, generating a Tree:

sentence = pos_tag(word_tokenize(x))
tree = cp.parse(sentence)

Then you use extract_phrases(my_tree, phrase) to recursively traverse the Tree and extract the sub-trees labeled NP. The example above would extract the following noun-phrases:

Noun phrases:
(NP The/DT little/JJ yellow/JJ dog/NN) The_little_yellow_dog
(NP the/DT cat/NN) the_cat

Noun phrases:
(NP Information/NNP Technology/NNP) Information_Technology

There is a great blog post by Burton DeWilde about many more ways to extract interesting keyphrases: Intro to Automatic Keyphrase Extraction

David Batista



The tokenization process shouldn't be changed even when you are interested in multi-word expressions. After all, the words are still the basic tokens. What you should do is find a way to combine the proper words into terms.

A simple way to do so is to look for terms in which the probability of the term is higher than that of its independent tokens, e.g. P("White House") > P("White") * P("House"). The proper thresholds for the lift and the number of occurrences (and for term classification) can be deduced if you have a dataset of terms from the domain. If you don't have such a dataset, requiring at least 10 occurrences and a lift of at least 2 (usually it is much higher, since each token's probability is low) will work quite well.
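A minimal sketch of this counting approach (the thresholds match the numbers above; the toy corpus is invented for illustration):

```python
from collections import Counter

def extract_terms(tokens, min_count=10, min_lift=2.0):
    """Keep bigrams whose observed probability exceeds the product of
    the individual token probabilities by at least min_lift, and that
    occur at least min_count times."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    terms = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        # lift = P(a, b) / (P(a) * P(b))
        lift = (count / n) / ((unigrams[a] / n) * (unigrams[b] / n))
        if lift >= min_lift:
            terms[(a, b)] = lift
    return terms

corpus = ["white", "house"] * 10 + ["green", "tree"] * 10
print(extract_terms(corpus))  # {('white', 'house'): 4.0, ('green', 'tree'): 4.0}
```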

In your case you can also extract terms by looking at contexts relevant to your domain (e.g., "studied X", "practiced Y").

Again, you can build complex and elegant models for that, but usually just looking at the few words following the context indicators will get you most of the benefit.
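A minimal regex sketch of that idea (the indicator list and the sample text are invented examples, not a fixed lexicon):

```python
import re

# Hypothetical context indicators for the job-description domain.
indicators = r"(?:studied|practiced|experience in)"
# Capture up to three following capitalized words as the candidate term.
pattern = re.compile(indicators + r"\s+((?:[A-Z]\w+\s?){1,3})")

text = "She studied Information Technology and has experience in Machine Learning."
print([m.strip() for m in pattern.findall(text)])
# ['Information Technology', 'Machine Learning']
```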




For your problem, I think gensim can be very useful: what can be implemented with the Gensim library is phrase detection. It is similar to n-grams, but instead of getting all n-grams by sliding a window, it detects frequently used phrases and sticks them together. It statistically walks through the text corpus and identifies words that commonly occur side by side.
It scores each candidate pair of words with (roughly, gensim's default scorer) score(a, b) = (count(a, b) − min_count) × N / (count(a) × count(b)), where N is the vocabulary size; pairs scoring above a threshold are joined into a single token.

Following is the code to use it; it builds two-word tokens.

from gensim.models.phrases import Phrases, Phraser

tokenized_train = [t.split() for t in x_train]
phrases = Phrases(tokenized_train)
bigram = Phraser(phrases)

Applying the trained Phraser to a tokenized sentence replaces the detected pairs with single merged tokens.

Notice the token "new_york": the words "new" and "york" were concatenated because there was significant statistical evidence in the corpus of the two occurring together.

Moreover, you can go up to n-grams, not just bi-grams. Here is the article which explains it in detail.

Qaisar Rajput



This extension of Stanford CoreNLP to capture multi-word expressions (MWEs) worked like a charm for one of my tasks. For Python users, gear up to write some connector code or to hack the Java code.




Use the Stanford CoreNLP library for multi-word tokenization. I found it when I was working on a similar task, and it worked pretty well!

Updated: you can use the Stanford CoreNLP pipeline, which includes a multi-word tokenization (MWT) model. A demo for training neural networks with your own data is linked here.



Maybe you could add also a small snippet of code to let the OP knows how to use it. – Tasos – 2019-02-07T11:06:04.920

Yes, but detailed information on how to use it and train with own data is given in the link mentioned. I also added a specific link which gives demo for training their model for multi-word tokenization – Noman – 2019-02-08T12:11:51.653

It's a common thing that link-only answers should be avoided for various reasons. For example, the content of the page might change, or the link might not be reachable in the future. When a new user visits this page, they should be able to get the information the OP asked for. Just as best practice. – Tasos – 2019-02-08T12:14:47.447

Oh now I get it, thanks for pointing out. I will try to find my code for using CoreNLP mwt model or I will code it again and paste it here (in couple of days) for the sake of information of community! – Noman – 2019-02-08T12:19:52.550


Gensim is one of the best libraries for NLP tasks on text data: Gensim Python library

