Complex Chunking with NLTK

I am trying to figure out how to use NLTK's cascading chunker as per Chapter 7 of the NLTK book. Unfortunately, I'm running into a few issues when performing non-trivial chunking measures.

Let's start with this phrase:

"adventure movies between 2000 and 2015 featuring performances by daniel craig"

I am able to find all the relevant NPs when I use the following grammar:

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

However, I am not sure how to build nested structures with NLTK. The book gives the following format, but a few things are clearly missing (e.g., how does one actually specify multiple rules?):

grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

In my case, I'd like to do something like the following:

grammar = "MEDIA: {<DT>?<JJ>*<NN.*>+}
           RELATION: {<V.*>}{<DT>?<JJ>*<NN.*>+}
           ENTITY: {<NN.*>}"

It occurs to me that a CFG might be a better fit for this, but I only became aware of NLTK's support for this function about 5 minutes ago (from this question), and it does not appear that much documentation for the feature exists.

So, assuming that I'd like to use a cascaded chunker for my task, what syntax would I need to use? Additionally, is it possible for me to specify specific words (e.g. "directed" or "acted") when using a chunker?

grill

Posted 2015-05-16T00:15:37.807

Reputation: 234

Answers

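Your grammar is nearly correct. Write it as a triple-quoted string so that it can span several lines, and put each rule for a label on its own line: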

grammar = """MEDIA: {<DT>?<JJ>*<NN.*>+}
           RELATION: {<V.*>}
                     {<DT>?<JJ>*<NN.*>+}
           ENTITY: {<NN.*>}"""

By specifying

RELATION: {<V.*>}
          {<DT>?<JJ>*<NN.*>+}

you are indicating that there are two ways to generate a RELATION chunk: {<V.*>} or {<DT>?<JJ>*<NN.*>+}.
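To see that behaviour in isolation, here is a minimal sketch with a grammar that contains only the RELATION stage. The POS tags are written by hand (my assumption, so that no tagger model is needed; nltk.pos_tag may tag some words differently). With these tags, both rules should produce RELATION chunks.

```python
import nltk

# Hand-written POS tags (an assumption, not the output of nltk.pos_tag).
tagged = [
    ("adventure", "NN"), ("movies", "NNS"), ("between", "IN"),
    ("2000", "CD"), ("and", "CC"), ("2015", "CD"),
    ("featuring", "VBG"), ("performances", "NNS"), ("by", "IN"),
    ("daniel", "NN"), ("craig", "NN"),
]

# One stage with two rules: a line that has no "LABEL:" prefix
# is treated as another rule for the label above it.
grammar = r"""
RELATION: {<V.*>}
          {<DT>?<JJ>*<NN.*>+}
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the tokens of every RELATION chunk, in sentence order.
relations = [st.leaves() for st in tree.subtrees() if st.label() == "RELATION"]
for leaves in relations:
    print("RELATION:", leaves)
```

Keep comments inside the grammar string free of colons: the parser starts a new stage at the first unescaped colon it finds on a line.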

So

import nltk

# word_tokenize and pos_tag need NLTK's tokenizer and tagger data (see nltk.download()).
grammar = """MEDIA: {<DT>?<JJ>*<NN.*>+}
           RELATION: {<V.*>}
                     {<DT>?<JJ>*<NN.*>+}
           ENTITY: {<NN.*>}"""
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize("adventure movies between 2000 and 2015 featuring performances by daniel craig"))

tree = chunkParser.parse(tagged)

# Print the tokens of every RELATION chunk in the parse tree.
for subtree in tree.subtrees():
    if subtree.label() == "RELATION":
        print("RELATION: " + str(subtree.leaves()))

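This gives only the verb, since the MEDIA stage runs first and has already chunked the noun phrases that the second RELATION rule would otherwise match: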

RELATION: [('featuring', 'VBG')]
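On the nested-structure part of the question: a later stage can refer to a label produced by an earlier stage, which is how the book's PP: {<IN><NP>} rule works. The sketch below is an illustration rather than a full solution to the MEDIA/RELATION/ENTITY task. It uses hand-written POS tags (an assumption, so the example does not depend on a tagger model) and a two-stage grammar in which PP is built on top of NP.

```python
import nltk

# Hand-written POS tags (an assumption, not the output of nltk.pos_tag).
tagged = [
    ("adventure", "NN"), ("movies", "NNS"), ("between", "IN"),
    ("2000", "CD"), ("and", "CC"), ("2015", "CD"),
    ("featuring", "VBG"), ("performances", "NNS"), ("by", "IN"),
    ("daniel", "NN"), ("craig", "NN"),
]

# Stage 1 builds NP chunks; stage 2 matches a preposition followed by
# an NP chunk from stage 1, so the resulting PP contains a nested NP.
grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}
PP: {<IN><NP>}
"""

tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Pull out the PP chunks to inspect the nesting.
pp_chunks = [st for st in tree.subtrees() if st.label() == "PP"]
```

Stages run in the order they are written, so a label has to be defined above any rule that refers to it.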

AbtPst

Posted 2015-05-16T00:15:37.807

Reputation: 358