Will I overfit my LSTM if I train it via the sliding-window approach? Why do people not seem to use it for LSTMs?
For a simplified example, assume that we have to predict the sequence of characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Is it bad (or better?) if I keep training my LSTM with the following minibatches:
A B C D E F G H I J K L M N, backprop, erase the cell
B C D E F G H I J K L M N O, backprop, erase the cell
... and so on, shifting by one every time?
Previously, I always trained it as:
A B C D E F G H I J K L M N, backprop, erase the cell
O P Q R S T U V W X Y Z, backprop, erase the cell
Instead of shifting by one, would it be better to slide the window by 2 entries at a time, etc.? What would that mean (in terms of precision/overfitting)?
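To make the two schemes concrete, here is a minimal sketch of a window generator (the function name and parameters are my own, for illustration): `stride=1` gives the fully overlapping sliding-window scheme, while a stride equal to the window size reproduces the non-overlapping batching.

```python
def sliding_windows(sequence, window_size, stride=1):
    """Yield fixed-length training windows from a sequence.

    stride=1          -> fully overlapping sliding windows
    stride=window_size -> non-overlapping consecutive batches
    """
    for start in range(0, len(sequence) - window_size + 1, stride):
        yield sequence[start:start + window_size]


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Shift by one: every adjacent character pair (including N -> O)
# appears *inside* some window, so no transition is lost at a boundary.
windows = list(sliding_windows(alphabet, 14, stride=1))
# windows[0] == "ABCDEFGHIJKLMN", windows[1] == "BCDEFGHIJKLMNO", ...
```

Note that with overlapping windows each character appears in up to `window_size` training examples, which is where the overfitting worry in the question comes from.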
Also, if I were to use the sliding-window approach in a feed-forward network, would it result in overfitting? I would assume yes, because the network is exposed to the same regions of the data for a very long time. For example, it is exposed to E F G H I J K for a long time.
Please remember that the cell state is erased between training batches, so the LSTM gets a "hammer to the head" at those boundaries: it cannot remember what came before O P Q R S T U V W X Y Z. This means the LSTM can never learn that "O" follows "N".
So I thought (hence my entire question): why not give it an intermediate (overlapping) batch in between? And in that case, why not use multiple overlapping minibatches? To me this would provide smoother training. Ultimately, that would amount to a sliding window for an LSTM.
Some useful info I've found after the answer was accepted:
The first word of the English translation is probably highly correlated with the first word of the source sentence. But that means the decoder has to consider information from 50 steps ago, and that information needs to be somehow encoded in the vector. Recurrent Neural Networks are known to have problems dealing with such long-range dependencies. In theory, architectures like LSTMs should be able to deal with this, but in practice long-range dependencies are still problematic.
For example, researchers have found that reversing the source sequence (feeding it backwards into the encoder) produces significantly better results because it shortens the path from the decoder to the relevant parts of the encoder. Similarly, feeding an input sequence twice also seems to help a network to better memorize things. For example, if one training example is "John went home", you would give "John went home John went home" to the network as one input.
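The two tricks described above are just simple transformations of the token sequence before it is fed to the encoder. A minimal sketch (function names are my own, for illustration):

```python
def reverse_source(tokens):
    # Feed the source sequence backwards into the encoder,
    # shortening the path from the decoder to the early source tokens.
    return list(reversed(tokens))


def double_input(tokens):
    # Repeat the example once, so the network sees it twice in one input.
    return tokens + tokens


example = ["John", "went", "home"]
reverse_source(example)  # ['home', 'went', 'John']
double_input(example)    # ['John', 'went', 'home', 'John', 'went', 'home']
```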
Edit after accepting the answer:
Several months on, I am more inclined to use the sliding-window approach, as it uses the data better. But in that case you probably don't want to train on BCDEFGHIJKLMNO right after ABCDEFGHIJKLMNO. Instead, shuffle your examples, to gradually and uniformly "brush" all of the information into your LSTM: give it HIJKLMNOPQRSTU after ABCDEFGHIJKLMNO, etc. This is directly related to Catastrophic Forgetting. As always, monitor the validation and test sets closely, and stop as soon as you see their errors steadily increasing.
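A sketch of that shuffling step, assuming windows are built up front and then presented in random order (names are my own; the seed is only there to make the example reproducible):

```python
import random


def shuffled_windows(sequence, window_size, stride=1, seed=0):
    """Build all (possibly overlapping) windows, then shuffle their order
    so consecutive training batches do not come from adjacent positions."""
    windows = [sequence[i:i + window_size]
               for i in range(0, len(sequence) - window_size + 1, stride)]
    random.Random(seed).shuffle(windows)
    return windows


batches = shuffled_windows("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 14)
# Same 13 windows as the ordered sliding-window scheme, in shuffled order.
```

The point is only the presentation order: every window still gets trained on, but neighbouring windows such as ABCDEFGHIJKLMN and BCDEFGHIJKLMNO no longer follow each other, which spreads out the repeated exposure to the same region.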
Also, the "hammer to the head" issue can be mitigated by using Synthetic Gradients (the linked answer discusses their benefit for long sequences): https://datascience.stackexchange.com/a/32425/43077