Advantages of stacking LSTMs?



I'm wondering in what situations it is advantageous to stack LSTMs?

Vadim Smolyakov

Posted 2017-08-29T16:48:40.890

Reputation: 586

might be helpful to get an idea? – i.n.n.m – 2017-08-29T16:58:08.167



From What are the advantages of stacking multiple LSTMs? (I'll only update the answer there):

From {1}:

While it is not theoretically clear what is the additional power gained by the deeper architecture, it was observed empirically that deep RNNs work better than shallower ones on some tasks. In particular, Sutskever et al (2014) report that a 4-layer deep architecture was crucial in achieving good machine-translation performance in an encoder-decoder framework. Irsoy and Cardie (2014) also report improved results from moving from a one-layer BI-RNN to an architecture with several layers. Many other works report results using layered RNN architectures, but do not explicitly compare to 1-layer RNNs.


Franck Dernoncourt

Posted 2017-08-29T16:48:40.890

Reputation: 4 975


One situation in which it's advantageous to stack LSTMs is when we want to learn a hierarchical representation of our time-series data. In a stacked LSTM, each LSTM layer outputs a sequence of vectors that is used as the input to the next LSTM layer. This hierarchy of hidden layers enables a more complex representation of our time-series data, capturing information at different scales.
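To make the wiring concrete, here is a minimal numpy-only sketch of stacking (random, untrained weights and hypothetical dimensions, not a real training setup): each layer runs over the full output sequence of the layer below it, so the second layer sees a sequence of hidden vectors rather than raw features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(x_seq, hidden, rng):
    """Run one randomly initialized LSTM layer over x_seq
    (shape [T, input_dim]) and return the full output sequence
    (shape [T, hidden])."""
    input_dim = x_seq.shape[1]
    # stacked weights for the input, forget, output gates and cell candidate
    W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
    U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
    b = np.zeros(4 * hidden)
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x in x_seq:
        i, f, o, g = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell state
        h = sigmoid(o) * np.tanh(c)                   # update hidden state
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))              # T=5 time steps, 3 input features
h1 = lstm_layer(x, hidden=8, rng=rng)    # layer 1 emits a sequence of 8-dim vectors
h2 = lstm_layer(h1, hidden=8, rng=rng)   # layer 2 reads layer 1's whole sequence
print(h1.shape, h2.shape)                # (5, 8) (5, 8)
```

The key point is that `h1` is a full sequence, not just a final state, which is what lets the second layer build its representation on top of the first.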

For example, stacked LSTMs can be used to improve accuracy in time-series classification, such as activity prediction, in which heart-rate, step-count, GPS and other signals are used to predict an activity such as walking, running, biking, climbing stairs or resting. For an example of time-series classification with stacked LSTMs using EEG data, have a look at the following IPython notebook.

Vadim Smolyakov

Posted 2017-08-29T16:48:40.890

Reputation: 586


In a sequence-to-sequence model: the encoder network's job is to read the input sequence to our Seq2Seq model and generate a fixed-dimensional context vector C for the sequence. To do so, the encoder uses a recurrent neural network cell – usually an LSTM – to read the input tokens one at a time; the final hidden state of the cell then becomes C. However, because it's so difficult to compress an arbitrary-length sequence into a single fixed-size vector (especially for hard tasks like translation), the encoder usually consists of stacked LSTMs: a series of LSTM "layers" where each layer's output sequence is the input sequence to the next layer. The final layer's last LSTM hidden state is used as the context vector C.
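The encoder described above can be sketched in a few lines of numpy (random, untrained weights and made-up dimensions, purely to show the data flow): each stacked layer re-reads the previous layer's output sequence, and the top layer's final hidden state serves as the context vector C.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, hidden, rng):
    """One randomly initialized LSTM layer; returns the output
    sequence and the final hidden state."""
    W = rng.normal(scale=0.1, size=(4 * hidden, x_seq.shape[1]))
    U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outs = []
    for x in x_seq:
        i, f, o, g = np.split(W @ x + U @ h, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outs.append(h)
    return np.stack(outs), h

rng = np.random.default_rng(1)
tokens = rng.normal(size=(7, 16))       # 7 input token embeddings, dim 16
seq = tokens
for _ in range(4):                      # 4 stacked encoder layers
    seq, h_final = lstm_forward(seq, hidden=32, rng=rng)
C = h_final                             # context vector: top layer's last state
print(C.shape)                          # (32,)
```

Whatever the input length (7 tokens here), C always has the fixed size of the top layer's hidden state, which is exactly the compression the paragraph above describes.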

Umer Rana

Posted 2017-08-29T16:48:40.890

Reputation: 11