Dataset for Named Entity Recognition on Informal Text



I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the documents in my dataset, I'm looking for out-of-domain data that's a bit more "informal" than the news articles and journal entries that many of today's state-of-the-art named entity recognition systems are trained on.

Any recommendations? So far I've only been able to locate 50k tokens from Twitter, published here.

Madison May

Posted 2014-06-30T21:02:05.053

Reputation: 1 959


Recommend asking on

– Air – 2014-06-30T21:28:41.943

@Madison May. Did you find a data set? I'm looking for something similar. Thanks. – ahoffer – 2014-07-31T22:49:26.447

I had to make do with the Twitter NER corpus from U. Washington (linked to in the original post). – Madison May – 2014-07-31T23:01:20.507

FYI Corpus of tagged text (English newspapers or any tagged text)

– Franck Dernoncourt – 2017-04-23T23:30:58.867

Did you get any good related annotated English corpus? – Achyuta nanda sahoo – 2018-06-25T07:51:42.833

@MadisonMay were you able to get a dataset or pre-trained classifier for informal text? I need to extract names and addresses. – Shan Khan – 2018-10-11T15:18:08.623



As I understand it, these are the properties that you're seeking in a sample dataset:

  1. Text data
  2. It should be informal, i.e. contain typos and slang, and generally not be professionally edited
  3. Something other than Twitter (I don't blame you; Twitter is a useful yet heavily overused example data source in text mining)

Here are some recommendations:

  1. Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
  2. microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
  3. Amazon Commerce reviews dataset from UCI
  4. Within the Bag of Words dataset from UCI, try using the Enron emails
  5. The Twenty Newsgroups dataset
  6. This nice collection of SMS spam
  7. You can always scrape (extract) your own text data from the Internet; I'm not sure which language or statistical package you're using, but XPath-based packages are available in both R (rvest, scrapeR, etc.) and Python to accomplish this
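
As a rough illustration of the scraping option in the last item, here is a minimal Python sketch. It is only an assumption about how this could look: the `SAMPLE_PAGE` markup, the `post` class name, and the `extract_posts` helper are all made up for the example. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup; real web pages usually need an HTML-tolerant parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML snippet standing in for a downloaded page.
# The structure and the "post" class name are assumptions for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="post"><p>omg just saw justin bieber at starbucks lol</p></div>
    <div class="ad"><p>Buy now!</p></div>
    <div class="post"><p>anyone going 2 the <b>lakers</b> game tmrw?</p></div>
  </body>
</html>
"""


def extract_posts(page: str) -> list[str]:
    """Return the text of every <p> directly inside a <div class="post">."""
    root = ET.fromstring(page.strip())
    posts = []
    # ElementTree understands a limited XPath subset, including
    # descendant search (.//) and attribute predicates ([@class='post']).
    for paragraph in root.findall(".//div[@class='post']/p"):
        # itertext() also collects text inside nested tags such as <b>.
        text = "".join(paragraph.itertext())
        # Collapse runs of whitespace into single spaces.
        posts.append(" ".join(text.split()))
    return posts


posts = extract_posts(SAMPLE_PAGE)
for post in posts:
    print(post)
```

Note that text collected this way is unlabeled, so it would still need to be annotated with entities before it could be used to train an NER model.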



Reputation: 1 829

Are any of these datasets annotated with named entities, though? I believe that's what OP was looking for. – Mr. Phil – 2017-02-28T19:36:41.033




Reputation: 1 810

Please update these links, as none of them work anymore. – Mr. Phil – 2017-02-28T19:31:06.377


Some of the sources that I have used:

I think these datasets will be of great help for your task.

Gyan Ranjan


Reputation: 681