I would like to solve the following classification problem: Given an input sequence and zero or more initial terms of the true output, predict the next term in the output sequence.
For example, my training set is a large corpus of question and answer pairs. Now I would like to predict the response to "How are you?" one word at a time.
Some reasonable responses might be "I'm fine.", "OK.", "Very well, thanks.", etc., so the predicted distribution over possible first words should assign probability to the first words of those examples ("I'm", "OK", "Very"), among others.
But now if I see the first word is "Very", I would like my prediction for the second word to reflect that "OK" is now less likely, because "Very OK" is not a common response.
And I would like to repeat this process, predicting each next word given the input sentence and all of the output words generated so far.
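To make the process concrete, here is a minimal greedy-decoding loop. `predict_next` is a placeholder of mine (an assumption, not an existing API): given the input sentence and the answer so far, it would return a dict mapping candidate next words to probabilities.

```python
# Greedy-decoding sketch of the process described above.
# `predict_next(question, answer_so_far)` is a hypothetical model call
# returning {word: probability} for the next word.
def decode(predict_next, question, end_token="<END>", max_len=20):
    answer = []
    for _ in range(max_len):
        probs = predict_next(question, answer)
        word = max(probs, key=probs.get)  # greedy: take the most likely word
        if word == end_token:             # model says the answer is complete
            break
        answer.append(word)
    return answer

# e.g. decode(my_model, ["How", "are", "you", "?"])
#   -> ["Very", "well", ",", "thanks", "."]
```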
One approach might be to train a stateful RNN on concatenated examples like `How are you<END> I am fine<END>`. But if I understand correctly, this would also learn to predict the words inside the "How are you" part, which I think might detract from the goal of modeling the answer. One workaround I can imagine is to keep the concatenated format but mask the loss so that only the answer tokens are penalized, as sketched below.
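A rough sketch of that masking idea in PyTorch (the shapes and names are my assumptions, not a reference implementation):

```python
import torch.nn.functional as F

def masked_nll(logits, targets, answer_mask):
    """Cross-entropy over the concatenated sequence, counting only answer tokens.

    logits:      (seq_len, vocab) next-word predictions at each position
    targets:     (seq_len,) the same token sequence shifted left by one
    answer_mask: (seq_len,) 1.0 where the target is an answer token,
                 0.0 where it belongs to the question
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * answer_mask).sum() / answer_mask.sum()
```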
Or maybe separate layers for the input and for the partial output, which are then merged into a decoder layer?
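What I have in mind is something like the standard encoder-decoder pattern. A minimal PyTorch sketch (the vocabulary size, dimensions, and shared embedding are placeholders of mine):

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Separate encoder for the question, decoder for the answer so far."""
    def __init__(self, vocab_size=10000, dim=256):  # placeholder sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # shared for simplicity
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, question, answer_prefix):
        # Encode the question; its final state initializes the decoder,
        # so every next-word prediction is conditioned on the full input.
        _, state = self.encoder(self.embed(question))
        dec_out, _ = self.decoder(self.embed(answer_prefix), state)
        return self.out(dec_out)  # next-word logits at each answer position
```

At training time the decoder input would be the true answer shifted right by one (teacher forcing), and at prediction time it would be the words generated so far, as in the decoding loop above.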