How to train a model to extract custom and unknown entities



I'm trying to figure out how to extract specific text from an utterance by a user.

I need to extract "unknown" text from a short and simple text. In this case, the user wants to create a list. Everything inside the {} is unknown text, because it doesn't belong to a specific entity type such as food, athletes, or movies.

  • create a new {groceries} list
  • create a list {movies}
  • create a new list {movies}
  • create a list and call it {books}
  • create a new list and give it the name {stamps}
  • create a list with the title {red ketchup}
  • create another list called {rotten food}

The list above is only a small sample of the different ways a user can say they want to create a list.
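To make the annotation concrete, here is a minimal sketch of how brace-marked samples like these can be turned into plain text plus character offsets, which is the general shape NER training data takes. The label name LIST_TITLE and the helper braces_to_spans are made up for this illustration and are not part of any library.

```python
import re

# Hypothetical label name, chosen only for this illustration.
LABEL = "LIST_TITLE"

def braces_to_spans(template, label=LABEL):
    """Strip {...} markers and return (text, [(start, end, label), ...])."""
    parts = []
    spans = []
    pos = 0      # current position in the annotated template
    out_len = 0  # length of the cleaned text built so far
    for match in re.finditer(r"\{([^{}]*)\}", template):
        before = template[pos:match.start()]
        parts.append(before)
        out_len += len(before)
        title = match.group(1)
        # Offsets refer to the cleaned text, not the annotated template.
        spans.append((out_len, out_len + len(title), label))
        parts.append(title)
        out_len += len(title)
        pos = match.end()
    parts.append(template[pos:])
    return "".join(parts), spans

samples = [
    "create a new {groceries} list",
    "create a list with the title {red ketchup}",
]
for sample in samples:
    text, spans = braces_to_spans(sample)
    print(text, spans)
```

Each span can be checked by slicing the cleaned text with its start and end offsets, which should give back exactly the words that were inside the braces.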

Everything I have seen is based on existing entity types for NER. When someone says an entity is custom, I found it just means we have to train on a specific set of words and hope for the best. If I add one more word that wasn't in the training data, it fails to extract the data.

But in this case, the user can say anything such as "old shoes", "schools I want to go to", "Keanu Reeves movies". So I cannot see how I could possibly train it.

With spaCy, I followed this example and it mostly works in getting the proper titles. However, I have to train it on every different phrase for it to work.

For example, if a user says

create a beautiful new list and give it the name {stamps}

The word "beautiful" causes it to fail, and now I have to train for that as well. At this rate, we are looking at millions of phrases to train.
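To illustrate how quickly the variations multiply, here is a minimal sketch that expands a few phrase templates with optional modifiers such as "beautiful" and arbitrary titles, while tracking where the title lands in each sentence. Every word list, the LIST_TITLE label, and the expand helper are assumptions made up for this example, not real training data or a library API.

```python
from itertools import product

LABEL = "LIST_TITLE"  # hypothetical label name

# All lists below are made-up examples, not real training data.
VERBS = ["create", "make"]
MODIFIERS = ["", "new ", "beautiful new "]
# Each frame is (text before the title, text after the title).
FRAMES = [
    ("{verb} a {mod}list ", ""),
    ("{verb} a {mod}list and give it the name ", ""),
    ("{verb} a {mod}", " list"),
]
TITLES = ["stamps", "old shoes", "Keanu Reeves movies"]

def expand():
    """Yield (text, [(start, end, label)]) for every combination."""
    for verb, mod, (prefix, suffix), title in product(
        VERBS, MODIFIERS, FRAMES, TITLES
    ):
        head = prefix.format(verb=verb, mod=mod)
        text = head + title + suffix
        yield text, [(len(head), len(head) + len(title), LABEL)]

examples = list(expand())
print(len(examples))  # 2 verbs * 3 modifiers * 3 frames * 3 titles = 54
print(examples[0])
```

Even these tiny lists give 54 sentences, and each extra verb, modifier, or frame multiplies the total, which is why listing phrases by hand does not scale.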

Before spaCy, we tried Dialogflow and Rasa. In each case, it came down to training phrases, but the more we trained, the more one thing worked and another broke.

At this point, we have had good overall success with intent detection, but when it comes to extracting data like this, I'm starting to look like a deer in the headlights.

We are new to NLP, and while we've made a lot of good progress over the past few weeks, we cannot seem to find any articles written on this specific problem or on whether it can be solved. Dialogflow has the concept of an "any" entity, but they recommend avoiding it, and it works only 2 out of 3 times when things get complicated.

The goal is to detect which of these words is the title, based on training. Can it be done? If so, what's the approach?

Any code, hints or articles that might get us started would be appreciated.

Safwan Hak

Posted 2019-10-11T06:44:04.070

No answers