I have been working with NLTK in Python for a while. The problem I am facing is that there is no help available on training NLTK's NER with my own custom data. They have used MaxEnt and trained it on the ACE corpus. I have searched the web a lot, but I could not find any way to train NLTK's NER.

Can anyone point me to a link, article, or blog post that describes the training dataset format used for NLTK's NER, so that I can prepare my datasets in that format? I would also appreciate any link, article, or blog post that shows how to train NLTK's NER on my own data.

This question is widely searched and rarely answered. An answer might help someone working with NER in the future.


Training models for information extraction in general, and for named entity recognition/resolution (NER) in particular, is described in detail in Chapter 7 of the NLTK Book, which is available online.

Additionally, you might find my related answer on the Cross Validated site useful. It has many references to relevant sources on NER and related topics, as well as to various related software tools.

They do not mention how to train the NER model on custom data. Can you tell me how to do it? – Hima Varsha – 2016-12-28T12:29:33.997


@HimaVarsha I'm not an expert in this area. However, ... I think that the NLTK NER model comes pre-trained on the conll2000 corpus, hence no info in the NLTK book. Check the following resources: 1. (most likely what you need; probably the Training IOB Chunkers section). 2. (might be useful as well). 3. (in case you use or decide to use the Stanford NER software). – Aleksandr Blekh – 2016-12-28T21:22:18.173

I think the Stanford CRF implementation does take custom data, but the NLTK NER just comes pre-trained. Training IOB Chunkers is just chunking, right? Or does it even do NER? – Hima Varsha – 2016-12-29T05:47:27.630

@HimaVarsha Please pay more attention to the advice you're getting. If you'd read the post at link #2 above carefully, you'd see that the code there does both NER model training and running. I don't think I can help you beyond the advice above. – Aleksandr Blekh – 2016-12-29T08:37:49.590


Is this article good enough?

It explains what the corpus should look like.

Your data needs to be in IOB format (one token per line: word, POS tag, chunk tag) for this to work, for example:
is VB O
the AT B-NP
of IN O
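
As a minimal sketch of handling that format, the plain-Python snippet below reads `word tag chunktag` lines into triples and groups the tagged tokens into chunks. The names `parse_iob` and `extract_chunks` are hypothetical helpers, not part of NLTK, and the blank-line sentence separator is an assumption (it is the usual CoNLL convention). For NER the chunk tags would be entity labels such as `B-PER` rather than `B-NP`.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, IOB chunk tag)

# The three sample lines from above, treated as one sentence.
SAMPLE = """\
is VB O
the AT B-NP
of IN O
"""


def parse_iob(text: str) -> List[List[Token]]:
    """Parse 'word tag chunktag' lines; blank lines separate sentences (assumed)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected 'word tag chunktag', got: {line!r}")
        current.append((parts[0], parts[1], parts[2]))
    if current:
        sentences.append(current)
    return sentences


def extract_chunks(sentence: List[Token]) -> List[Tuple[str, List[str]]]:
    """Group B-/I- tagged tokens into (label, [words]) spans."""
    spans: List[Tuple[str, List[str]]] = []
    open_label = None  # label of the chunk currently being built, if any
    for word, _pos, iob in sentence:
        if iob.startswith("I-") and open_label == iob[2:]:
            # Continuation of the chunk that is currently open.
            spans[-1][1].append(word)
        elif iob.startswith(("B-", "I-")):
            # B- starts a chunk; a stray I- is treated as a start as well.
            open_label = iob[2:]
            spans.append((open_label, [word]))
        else:
            # 'O' (outside) closes any open chunk.
            open_label = None
    return spans


if __name__ == "__main__":
    for sent in parse_iob(SAMPLE):
        print(extract_chunks(sent))  # [('NP', ['the'])]
```

NLTK itself ships `nltk.chunk.conllstr2tree`, which turns a string in this format into a chunk tree and may be worth a look before writing your own parser. Its `chunk_types` argument controls which labels are kept.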


It would be ideal to post a short summary of the article in this answer. – sheldonkreger – 2015-03-02T23:53:40.360


I found this tutorial quite helpful: Complete guide to build your own Named Entity Recognizer with Python. The author uses the Groningen Meaning Bank (GMB) corpus to train his NER chunker.

After that, you can check another tutorial by the same author: Training a NER System Using a Large Dataset, where he uses scikit-learn to improve the performance of his system.
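
To give a rough idea of what training on your own data involves, here is a minimal sketch of the part you normally write yourself: a feature function that turns each token of an IOB-tagged sentence into a feature dictionary. This is not the tutorials' code. The feature set, the `<start>`/`<end>` sentinels, the helper `to_training_pairs`, and the `B-PER`/`B-GEO` labels are all illustrative assumptions. The `(tokens, index, history)` signature is chosen to match what NLTK's `ClassifierBasedTagger` expects from its `feature_detector`.

```python
from typing import Dict, List, Sequence, Tuple

TaggedWord = Tuple[str, str]   # (word, POS tag)
Features = Dict[str, object]

# Hypothetical custom training sentence as (word, POS tag, IOB entity tag).
EXAMPLE = [
    ("Mark", "NNP", "B-PER"),
    ("lives", "VBZ", "O"),
    ("in", "IN", "O"),
    ("Paris", "NNP", "B-GEO"),
]


def ner_features(tokens: Sequence[TaggedWord], index: int,
                 history: Sequence[str]) -> Features:
    """Describe tokens[index] using the word, its neighbours, and the previous IOB tag.

    `history` holds the IOB tags already assigned to tokens[:index].
    """
    word, pos = tokens[index]
    prev_word, prev_pos = tokens[index - 1] if index > 0 else ("<start>", "<start>")
    next_word, next_pos = (tokens[index + 1] if index + 1 < len(tokens)
                           else ("<end>", "<end>"))
    return {
        "word": word.lower(),
        "pos": pos,
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        "prev_word": prev_word.lower(),
        "prev_pos": prev_pos,
        "next_word": next_word.lower(),
        "next_pos": next_pos,
        "prev_iob": history[index - 1] if index > 0 else "<start>",
    }


def to_training_pairs(sentence: List[Tuple[str, str, str]]) -> List[Tuple[Features, str]]:
    """Turn one IOB sentence into (features, label) pairs using the gold tags as history."""
    tokens = [(word, pos) for word, pos, _iob in sentence]
    labels = [iob for _word, _pos, iob in sentence]
    return [(ner_features(tokens, i, labels[:i]), labels[i])
            for i in range(len(tokens))]


if __name__ == "__main__":
    for features, label in to_training_pairs(EXAMPLE):
        print(label, features["word"], features["prev_iob"])
```

Lists of `(features, label)` pairs like these are the input shape that NLTK classifiers such as `nltk.NaiveBayesClassifier.train` accept, and the feature dictionaries alone can be vectorized with scikit-learn's `DictVectorizer`. Which classifier works best on your data is something you would have to check yourself.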

Finally, some really useful tutorials can be found here: NLTK tutorial. The author has a YouTube channel with many tutorials on various subjects (ML, NLP, Python...).

Hope it helps.


