Amazon's Alexa, Nuance's Mix and Facebook's Wit.ai all use a similar system to specify how to convert a text command into an intent, i.e. something a computer can act on. I'm not sure what the official name for this is, but I call it "intent recognition". Basically it's a way to go from "please set my lights to 50% brightness" to something structured like a SetBrightness intent with a brightness entity of 50%.
They are specified by having the developer provide a list of "sample utterances", each associated with an intent and optionally tagged with the locations of "entities" (basically parameters). Here's an example from Wit.ai:
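Since the original example doesn't reproduce well as text, here's a made-up illustration of the same idea; the intent name, entity names, and span encoding below are mine, not Wit.ai's actual schema:

```python
# Hypothetical sample utterances for a "set_brightness" intent.
# Entities are (start, end, name) character spans into the utterance.
samples = [
    {
        "text": "please set my lights to 50% brightness",
        "intent": "set_brightness",
        "entities": [(24, 27, "brightness")],   # the span "50%"
    },
    {
        "text": "dim the lights to 20 percent",
        "intent": "set_brightness",
        "entities": [(18, 28, "brightness")],   # the span "20 percent"
    },
]

# Recover the tagged entity values from the spans
for s in samples:
    for start, end, name in s["entities"]:
        print(s["intent"], name, "=", s["text"][start:end])
# prints:
#   set_brightness brightness = 50%
#   set_brightness brightness = 20 percent
```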
My question is: how do these systems work? Since they are all very similar, I assume there is some seminal work that they all build on. Does anyone know what it is?
Interestingly, Houndify uses a different system that is more like regexes:

["please"] . ("activate" | "enable" | "switch on" | "turn on") . [("the" | "my")] . ("lights" | "lighting") . ["please"]

I assume that grammar is integrated into the beam search of their speech recognition system, whereas Alexa, Wit.ai and Mix seem to have separate speech-to-text and text-to-intent stages.
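A pattern like that can be compiled straight into an ordinary regex. A minimal sketch (the opt/alt helpers are mine, not Houndify's API, which is presumably fused with the speech decoder rather than run over finished text):

```python
import re

# Compile a Houndify-style pattern into a plain regex:
# [ x ]  -> optional token, ( a | b ) -> alternation, . -> concatenation
def opt(expr):
    return f"(?:{expr} )?"

def alt(*exprs):
    return "(?:" + "|".join(exprs) + ")"

pattern = re.compile(
    "^"
    + opt("please")
    + alt("activate", "enable", "switch on", "turn on") + " "
    + opt(alt("the", "my"))
    + alt("lights", "lighting")
    + "(?: please)?$"
)

print(bool(pattern.match("please turn on my lights")))  # True
print(bool(pattern.match("enable lighting please")))    # True
print(bool(pattern.match("turn off the lights")))       # False
```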
Edit: I found a starting point: "A Mechanism for Human-Robot Interaction through Informal Voice Commands". It uses something called Latent Semantic Analysis (LSA) to compare utterances. I'm going to read up on that; at least it has given me a starting point in the citation network.
Edit 2: LSA essentially compares the words used (bag of words) in each passage of text. I don't see how it can work very well for this case, since it completely discards word order. Although maybe word order doesn't matter much for these kinds of commands.
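For concreteness, here's a toy sketch of the LSA pipeline as I understand it (bag-of-words counts, truncated SVD, cosine similarity in the latent space); the utterances and the number of dimensions are made up:

```python
import numpy as np

# Toy LSA: term-document counts -> truncated SVD -> cosine similarity.
docs = [
    "turn on the lights",
    "switch on the lighting",
    "play some music",
    "play a song",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                      # keep 2 latent dimensions
Z = U[:, :k] * s[:k]       # each row = one utterance in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Note the weakness mentioned above: "lights the on turn" would get
# exactly the same vector as "turn on the lights".
print(round(cos(Z[0], Z[1]), 2))   # lighting commands: ~1.0
print(round(cos(Z[0], Z[2]), 2))   # lighting vs music: ~0.0
```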
Edit 3: Hidden Topic Markov Models look like they might be interesting.