How do "intent recognisers" work?



Amazon's Alexa, Nuance's Mix and Facebook's all use a similar system to specify how to convert a text command into an intent - i.e. something a computer would understand. I'm not sure what the "official" name for this is but I call it "intent recognition". Basically a way to go from "please set my lights to 50% brightness" to lights.setBrightness(0.50).

Developers specify them by providing a list of "sample utterances", each associated with an intent and optionally tagged with the locations of "entities" (basically parameters).
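To make the format concrete, here is a minimal sketch of how one tagged sample utterance might be represented. The class names, field layout, and the "brightness" entity name are hypothetical and are not taken from Alexa, Mix, or any other service mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str   # hypothetical parameter name, e.g. "brightness"
    start: int  # character offset where the entity begins
    end: int    # character offset one past the entity's last character


@dataclass
class SampleUtterance:
    text: str
    intent: str
    entities: list = field(default_factory=list)

    def entity_values(self):
        # Slice the tagged spans out of the raw text
        return {e.name: self.text[e.start:e.end] for e in self.entities}


text = "please set my lights to 50% brightness"
start = text.index("50%")
sample = SampleUtterance(
    text=text,
    intent="lights.setBrightness",
    entities=[Entity("brightness", start, start + len("50%"))],
)
print(sample.intent, sample.entity_values())
```

A real service would collect many such utterances per intent; this only shows the shape of a single training example.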

My question is: how do these systems work? Since they are all very similar I assume there is some seminal work that they all use. Does anyone know what it is?

Interestingly Houndify uses a different system that is more like regexes: ["please"] . ("activate" | "enable" | "switch on" | "turn on") . [("the" | "my")] . ("lights" | "lighting") . ["please"]. I assume that is integrated into the beam search of their voice recognition system, whereas Alexa, and Mix seem to have separate Speech->Text and Text->Intent systems.
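For comparison, the pattern above can be approximated with an ordinary regular expression. This is only a rough Python translation, assuming that square brackets mean "optional" and the dot means "followed by"; it is not Houndify's actual syntax or matching engine, and the function name is made up.

```python
import re

# Rough translation of the pattern quoted above into Python's re syntax
LIGHTS_ON = re.compile(
    r"^(?:please\s+)?"                                # ["please"]
    r"(?:activate|enable|switch\s+on|turn\s+on)\s+"   # ("activate" | ...)
    r"(?:(?:the|my)\s+)?"                             # [("the" | "my")]
    r"(?:lights|lighting)"                            # ("lights" | "lighting")
    r"(?:\s+please)?$",                               # ["please"]
    re.IGNORECASE,
)


def is_lights_on(utterance: str) -> bool:
    """Return True if the utterance matches the hand-written pattern."""
    return LIGHTS_ON.match(utterance.strip()) is not None


print(is_lights_on("Please turn on the lights"))  # True
print(is_lights_on("turn the lights on"))         # False: not covered by the pattern
```

The second call shows the main cost of this style: any phrasing the author did not anticipate is rejected outright.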

Edit: I found a starting point - A Mechanism for Human-Robot Interaction through Informal Voice Commands. It uses something called Latent Semantic Analysis to compare utterances. I'm going to read up on that. At least it has given me a starting point in the citation network.

Edit 2: LSA is essentially comparing the words used (Bag of Words) in each paragraph of text. I don't see how it can work very well for this case as it totally loses the word order. Although maybe word order doesn't matter much for these kinds of commands.
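A quick sketch of the word-order problem, using only the bag-of-words step (full LSA also applies a singular value decomposition to the term-document matrix, which is not shown here). The two example sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    # Count each lowercased word; position in the sentence is discarded
    return Counter(text.lower().split())


a = bag_of_words("turn the kitchen lights on and the bedroom lights off")
b = bag_of_words("turn the bedroom lights on and the kitchen lights off")

# The two commands mean different things but produce identical bags
print(a == b)  # True
```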

Edit 3: Hidden Topic Markov Models look like they might be interesting.


Posted 2016-04-05T09:03:15.113

Reputation: 231

This post explains intent classification in detail:

– znat – 2017-11-28T21:33:35.107

This appears to use the "bag of words" method I mentioned in my question. Basically just add up the word vectors in the sentence. That can't be how it works though. Wit and Nuance's interfaces show that they recognise entities, which bag of words can't easily do. Also, bag of words loses all ordering, so something like "Set an alarm for 10 past 5" would be indistinguishable from "Set an alarm for 5 past 10". There must be something more going on. – Timmmm – 2017-11-29T11:40:36.837

Entity extraction is another problem where sequence matters. If you have a lot of data, an RNN will work; on smaller datasets, which are frequent in chatbots, conditional random fields work very well. – znat – 2017-11-29T15:30:25.403

Ok, so ... I'm looking for a more detailed answer than "an RNN will work". Most modern sequence learning systems use RNNs so that seems a given. – Timmmm – 2017-11-29T18:15:10.983

Intents are about the general meaning of the sentence (the average of its word vectors), and entities are about learning the context (surrounding words) in which they appear. RNNs and CRFs are just algorithms that can be used because they learn from sequences. If you want to learn in detail, look into the Rasa source code. – znat – 2017-11-29T18:24:12.113

I would include the parameters of a query as part of the intent. Anyway even though Rasa apparently uses a bag of word vectors, which I already understand and don't think is sufficient, it is probably a good place to start looking, thanks! – Timmmm – 2017-11-30T09:30:53.890



Although not directly answering your question, you may be interested in the field of automated question answering. To answer natural language text questions they must first be understood, which overlaps with your problem.

A good resource is the course by Jurafsky and Manning. The sections on semantics and question answering in particular may help with what you are looking for. There are accompanying lecture videos available on YouTube here.



Reputation: 171

I find the first part of your answer very funny yet informative. – Diego – 2016-11-05T00:01:40.023

Perhaps this would be better as a comment since, as you admit, it doesn’t answer the question. – kbrose – 2017-11-29T00:04:22.860


This post has an approach. Basically they use bag of words - they convert the words to sparse vectors and then add them up.

It appears to work fairly well but one major flaw is the answer is independent of word order, so you can't do queries like "How many kilos in a pound" unless you special case them.
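A minimal sketch of that idea, assuming a nearest-centroid classifier with cosine similarity. The intents, training sentences, and function names are invented for illustration, and the linked post may differ in its details.

```python
import math
from collections import Counter

# Hypothetical sample utterances per intent
TRAINING = {
    "lights.on": ["turn on the lights", "switch on my lights", "enable the lighting"],
    "lights.off": ["turn off the lights", "switch off my lights", "disable the lighting"],
}


def vectorise(text):
    # Summing one-hot word vectors is equivalent to counting words
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


# One summed vector per intent, built from all of its sample utterances
CENTROIDS = {
    intent: sum((vectorise(s) for s in samples), Counter())
    for intent, samples in TRAINING.items()
}


def classify(text):
    query = vectorise(text)
    return max(CENTROIDS, key=lambda intent: cosine(query, CENTROIDS[intent]))


print(classify("please turn the lights on"))  # lights.on
print(classify("on lights the turn please"))  # lights.on (same words, same answer)
```

Shuffling the words of the query cannot change the result, which is exactly the order-independence flaw described above.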

However, I did test with Alexa, and it is fairly insensitive to word order changes, so maybe they do use something similar.



Reputation: 231

Curious - what advantage do sparse vectors have over Naive Bayes? To me, both solve linearly separable problems under the naive bag-of-words assumption. – Angad – 2017-10-25T05:30:47.373