When to use GRU over LSTM?

143

76

The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update gates) whereas an LSTM has three gates (namely input, output and forget gates).

Why do we use a GRU when the LSTM clearly gives us more control over the network (since it has three gates)? In which scenarios is a GRU preferred over an LSTM?

Sayali Sonawane

Posted 2016-10-17T11:47:45.340

Reputation: 1 651

1

A GRU is slightly less complex, but it is approximately as good as an LSTM performance-wise. An implementation in TensorFlow can be found here: https://www.data-blogger.com/2017/08/27/gru-implementation-tensorflow/.

– www.data-blogger.com – 2017-08-27T12:28:24.723

GRUs are generally used when you have long training sequences and want quick, decent accuracy, and perhaps in cases where infrastructure is a constraint. LSTMs are preferred when sequences are longer and good context matters; when trained on more data, LSTMs give better results than GRUs. – Subir Verma – 2020-10-18T08:03:00.510

Answers

96

The GRU is related to the LSTM in that both use a form of gating to mitigate the vanishing gradient problem. Here are some key points about GRU vs LSTM:

  • The GRU controls the flow of information like the LSTM unit, but without a separate memory cell. It simply exposes the full hidden state without any output control.
  • The GRU is relatively new, and from my perspective its performance is on par with the LSTM while being computationally more efficient (it has the simpler structure pointed out above; see the sketch below), so we are seeing it used more and more.
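
To make the "less complex structure" point concrete, here is a rough sketch (my addition, not from the answer) comparing parameter counts of same-sized GRU and LSTM layers in tf.keras; the exact numbers depend on implementation details, but the LSTM's extra gate shows up as roughly 4/3 the weights of the GRU:

```python
# Rough sketch: compare parameter counts of GRU vs LSTM layers
# with the same hidden size. Sizes here are illustrative assumptions.
import tensorflow as tf

units, input_dim = 128, 64
inputs = tf.keras.Input(shape=(None, input_dim))

gru = tf.keras.Model(inputs, tf.keras.layers.GRU(units)(inputs))
lstm = tf.keras.Model(inputs, tf.keras.layers.LSTM(units)(inputs))

print("GRU parameters: ", gru.count_params())   # 3 weight blocks (2 gates + candidate)
print("LSTM parameters:", lstm.count_params())  # 4 weight blocks (3 gates + candidate)
```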

For a detailed description, you can explore this research paper on arXiv.org; the paper explains all of this brilliantly.


Hope it helps!

Abhishek Jaiswal

Posted 2016-10-17T11:47:45.340

Reputation: 1 719

1

In addition to your answer, there is a nice paper evaluating the performance of GRU and LSTM and their various permutations: "An Empirical Exploration of Recurrent Network Architectures" by Google

– minerals – 2017-06-10T18:11:18.413

@abhishek I find your answer a bit counterintuitive: how can a GRU "without having a memory unit" have better performance? Perhaps I should read the other papers – user702846 – 2020-01-22T13:45:19.093

59

To complement the already great answers above:

  • In my experience, GRUs train faster and perform better than LSTMs on less training data when doing language modeling (I am not sure about other tasks).

  • GRUs are simpler and thus easier to modify, for example by adding new gates when the network gets additional inputs. It is just less code in general (see the sketch after this list).

  • LSTMs should in theory remember longer sequences than GRUs and outperform them in tasks requiring modeling long-distance relations.
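
As an illustration of the "less code, easier to modify" point, here is a minimal sketch (mine, not from the answer; the build_model helper is a hypothetical name) showing that in tf.keras, swapping a GRU for an LSTM in an otherwise identical toy language model is a one-line change:

```python
# Minimal sketch: the recurrent cell is the only thing that changes.
import tensorflow as tf

def build_model(cell_cls, vocab_size=10_000, embed_dim=64, units=128):
    # cell_cls is tf.keras.layers.GRU or tf.keras.layers.LSTM;
    # every other layer in the model stays identical.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        cell_cls(units),
        tf.keras.layers.Dense(vocab_size),
    ])

gru_model = build_model(tf.keras.layers.GRU)
lstm_model = build_model(tf.keras.layers.LSTM)
```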


minerals

Posted 2016-10-17T11:47:45.340

Reputation: 1 727

A good balance of specific, precise details while remaining concise. – StephenBoesch – 2021-02-20T16:57:19.397

18

Full GRU Unit

$ \tilde{c}_t = \tanh(W_c [G_r * c_{t-1}, x_t ] + b_c) $

$ G_u = \sigma(W_u [ c_{t-1}, x_t ] + b_u) $

$ G_r = \sigma(W_r [ c_{t-1}, x_t ] + b_r) $

$ c_t = G_u * \tilde{c}_t + (1 - G_u) * c_{t-1} $

$ a_t = c_t $

LSTM Unit

$ \tilde{c}_t = \tanh(W_c [ a_{t-1}, x_t ] + b_c) $

$ G_u = \sigma(W_u [ a_{t-1}, x_t ] + b_u) $

$ G_f = \sigma(W_f [ a_{t-1}, x_t ] + b_f) $

$ G_o = \sigma(W_o [ a_{t-1}, x_t ] + b_o) $

$ c_t = G_u * \tilde{c}_t + G_f * c_{t-1} $

$ a_t = G_o * \tanh(c_t) $

As can be seen from the equations, LSTMs have separate update and forget gates. This makes LSTMs more sophisticated, but at the same time more complex. There is no simple way to decide which to use for your particular use case; you always have to run trials and compare performance. However, because the GRU is simpler than the LSTM, GRUs take much less time to train and are more efficient (a direct translation of the equations into code is sketched below).
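
Here is a minimal NumPy sketch (my addition, not from the course slides) translating the two sets of equations above directly into code; weight shapes and initialization are left to the caller and are illustrative assumptions:

```python
# Direct NumPy translation of the GRU and LSTM step equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, Wc, bc, Wu, bu, Wr, br):
    """One full-GRU step: c_prev is the previous hidden state, x the input."""
    z = np.concatenate([c_prev, x])
    Gu = sigmoid(Wu @ z + bu)                                      # update gate
    Gr = sigmoid(Wr @ z + br)                                      # reset gate
    c_tilde = np.tanh(Wc @ np.concatenate([Gr * c_prev, x]) + bc)  # candidate
    c = Gu * c_tilde + (1 - Gu) * c_prev                           # blend old/new
    return c                                                       # a_t = c_t

def lstm_step(a_prev, c_prev, x, Wc, bc, Wu, bu, Wf, bf, Wo, bo):
    """One LSTM step: a_prev is the previous output, c_prev the cell state."""
    z = np.concatenate([a_prev, x])
    c_tilde = np.tanh(Wc @ z + bc)    # candidate cell state
    Gu = sigmoid(Wu @ z + bu)         # update (input) gate
    Gf = sigmoid(Wf @ z + bf)         # forget gate
    Go = sigmoid(Wo @ z + bo)         # output gate
    c = Gu * c_tilde + Gf * c_prev    # separate update and forget gates
    a = Go * np.tanh(c)               # gated output
    return a, c
```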

Credits: Andrew Ng

balboa

Posted 2016-10-17T11:47:45.340

Reputation: 191

On the next slide after the handwritten ones, the last equation is different: $a^{<t>} = \Gamma_o \odot \tanh\left(\tilde c^{<t>}\right)$. This formula is confirmed correct here.

– Tom Hale – 2019-03-25T08:04:31.660

12

The answer really depends on the dataset and the use case; it is hard to tell definitively which is better.

  • The GRU exposes its complete memory, unlike the LSTM, so applications where that acts as an advantage might benefit. Also, adding to why to use a GRU: it is computationally cheaper than an LSTM since it has only two gates, and if its performance is on par with the LSTM, then why not?
  • This paper demonstrates excellently, with graphs, the superiority of gated networks over a simple RNN, but clearly mentions that it cannot conclude which of the two is better. So, if you are unsure which to use as your model, I'd suggest training both and keeping the better one (a minimal sketch follows this list).
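
Here is a minimal sketch of that "train both and compare" advice (my addition, assuming tf.keras and that x_train/y_train/x_val/y_val already exist as appropriately shaped arrays):

```python
# Train a GRU model and an LSTM model on the same data, keep the better one.
import tensorflow as tf

def build(cell_cls, units=64):
    # The model builds lazily from the data shape on the first fit() call.
    return tf.keras.Sequential([
        cell_cls(units),
        tf.keras.layers.Dense(1),
    ])

results = {}
for name, cell in [("GRU", tf.keras.layers.GRU), ("LSTM", tf.keras.layers.LSTM)]:
    model = build(cell)
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(x_train, y_train,                      # assumed to exist
                        validation_data=(x_val, y_val),
                        epochs=5, verbose=0)
    results[name] = min(history.history["val_loss"])

best = min(results, key=results.get)
print(f"Keep the {best} model (val loss {results[best]:.4f})")
```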

Hima Varsha

Posted 2016-10-17T11:47:45.340

Reputation: 2 146

1

GRU is better than LSTM in that it is easy to modify and doesn't need memory units; it is therefore faster to train than LSTM and gives comparable performance.

Vivek Khetan

Posted 2016-10-17T11:47:45.340

Reputation: 353

24

Please support the performance claim with fair references – Kari – 2018-05-21T03:57:32.947

1

Actually, the key difference comes out to be more than that: long short-term memory (LSTM) perceptrons are made up using the momentum and gradient descent algorithms. When you reconcile LSTM perceptrons with their recursive counterpart RNNs, you come up with GRU, which is really just a generalized recurrent unit or gradient recurrent unit (depending on the context) that more closely integrates the momentum and gradient descent algorithms. Were I you, I'd do more research on Adam optimizers.

GRU is an outdated concept, by the way. However, I can understand researching it if you want moderately advanced, in-depth knowledge of TF.

Andre Patterson

Posted 2016-10-17T11:47:45.340

Reputation: 11

18

I'm curious. Could you explain why GRU is an outdated concept? – random_user – 2018-11-16T18:01:28.710

If you claim that the primary topic of the question is "an outdated concept", you really should back that up with references (as mentioned in the prior comment). – StephenBoesch – 2021-02-20T16:59:18.150