By now anybody even remotely familiar with RNNs has been exposed to the famous figure representing variants of such networks:
However, one variant is missing: one in which the input and the output partially overlap in time.
For example, in machine translation, a commonly presented solution is to encode the input sequence entirely and then decode it (this corresponds to the fourth variant above).
However, human translators typically do not wait until the end of a sentence to start translating: they start as soon as they judge that they have enough information to do so.
Is there any literature on this subject, and what would be the key idea for training such a network? One difficulty I see is that in the RNN world time is discrete, so the decision of when to start emitting output is non-differentiable, and optimizing over that starting time would be hard to do with SGD.
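To make the kind of overlap I mean concrete, here is a toy sketch of a fixed read/write schedule in the spirit of the "wait-k" policy from the simultaneous-translation literature (read k source tokens, then alternate one write per read). The "decoder" here is a hypothetical placeholder that just uppercases tokens; the point is only the interleaving of input and output in time, not translation itself:

```python
def wait_k_decode(source_tokens, k=2):
    """Toy wait-k schedule: read k source tokens, then emit one
    target token after each further read, flushing at the end.
    The 'translation' is a placeholder (uppercasing), since only
    the read/write interleaving matters for this illustration."""
    schedule = []  # (action, token) pairs, to visualize the overlap
    output = []
    read = 0
    for tok in source_tokens:
        read += 1
        schedule.append(("READ", tok))
        if read >= k:
            # A real system would run the decoder here; we fake it.
            out = source_tokens[read - k].upper()
            output.append(out)
            schedule.append(("WRITE", out))
    # Flush the remaining k-1 target tokens after the input ends.
    while len(output) < len(source_tokens):
        out = source_tokens[len(output)].upper()
        output.append(out)
        schedule.append(("WRITE", out))
    return output, schedule

out, sched = wait_k_decode(["the", "cat", "sat", "down"], k=2)
# The schedule alternates READ and WRITE after the initial k reads,
# i.e. input and output overlap in time.
```

Here k is fixed by hand; what I am asking about is precisely how one would *learn* when to start (or, more generally, the read/write policy), given that these decisions are discrete.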