Any good implementations of Bi-LSTM Bahdanau attention in Keras?



For the past few weeks I've been trying to learn sequence-to-sequence machine translation modelling, but I couldn't find any good examples/tutorials with Bahdanau attention implemented. I did come across a ton of examples where people have implemented attention, but they mostly use a GRU, or Luong attention, or outdated code.

Has anyone come across any good implementations of the Bahdanau attention model in Keras that you have implemented or tried? I really want to learn the coding part, but I haven't found any good material on the implementation.

Please help.


Posted 2019-12-02T21:22:22.810

Reputation: 327



Here's a notebook that should help you understand it: Neural machine translation with attention.


Posted 2019-12-02T21:22:22.810

Reputation: 1 913


In tensorflow-tutorials-for-text they implement a Bahdanau attention layer that generates a context vector from the encoder outputs, the decoder hidden states and the decoder inputs.

The Encoder class simply passes the encoder inputs through an Embedding layer and then a GRU layer along with the encoder states, and returns encoder_outputs and encoder_states.
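That encoder can be sketched roughly as follows (a minimal sketch in the style of the TF tutorial; the vocabulary size, embedding dimension and unit count are placeholder values):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Embedding -> GRU, returning both the output sequence and the final state."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,  # per-timestep outputs for attention
                                       return_state=True)      # final hidden state

    def call(self, x, initial_state=None):
        x = self.embedding(x)                                  # (batch, time, embedding_dim)
        output, state = self.gru(x, initial_state=initial_state)
        return output, state                                   # (batch, time, units), (batch, units)
```

The `return_sequences=True` flag matters here: attention needs the full sequence of encoder outputs, not just the final state.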

If we use an LSTM instead of a GRU, the states would be state_h and state_c, each of shape (batch_size, units).
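For instance, an LSTM configured the same way returns two state tensors instead of one (sizes below are illustrative placeholders):

```python
import tensorflow as tf

batch_size, timesteps, features, units = 4, 10, 16, 32  # placeholder sizes

encoder_inputs = tf.random.normal((batch_size, timesteps, features))
lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)

# An LSTM returns the output sequence plus TWO states:
# the hidden state (state_h) and the cell state (state_c).
encoder_outputs, state_h, state_c = lstm(encoder_inputs)
# encoder_outputs: (batch_size, timesteps, units)
# state_h, state_c: (batch_size, units) each
```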

In the Decoder part, they pass decoder_inputs, encoder_outputs and states (use encoder_states to initialize; use decoder_states for the remaining steps).

Given these three inputs, the BahdanauAttention layer calculates the context_vector and attention weights from encoder_outputs and states. We can also use the AdditiveAttention layer, which is Bahdanau-style attention; there the query is our decoder states and the value is our encoder_outputs.
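The attention layer from that tutorial boils down to the additive-scoring scheme below (a sketch following the tutorial's structure; `units` is a free hyperparameter):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Bahdanau (additive) attention, as in the TF NMT tutorial."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state (query)
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs (values)
        self.V = tf.keras.layers.Dense(1)       # reduces each timestep to a scalar score

    def call(self, query, values):
        # query:  decoder hidden state, shape (batch_size, units)
        # values: encoder outputs, shape (batch_size, max_len, units)
        query_with_time_axis = tf.expand_dims(query, 1)          # (batch_size, 1, units)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))     # (batch_size, max_len, 1)
        attention_weights = tf.nn.softmax(score, axis=1)          # normalize over timesteps
        context_vector = tf.reduce_sum(
            attention_weights * values, axis=1)                   # (batch_size, units)
        return context_vector, attention_weights
```

The built-in `tf.keras.layers.AdditiveAttention` performs the same Bahdanau-style scoring; it is called as `AdditiveAttention()([query, value])` where both arguments are 3-D tensors of shape (batch, timesteps, dim).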

It is one of the nicer tutorials on attention in Keras with the TF backend that I have come across.

Khushali Vaghani

Posted 2019-12-02T21:22:22.810

Reputation: 11


Luong's attention came after Bahdanau's and is generally considered an improvement over it, even though it introduces several simplifications.

None of the pre-written layers I have seen implement Luong's or Bahdanau's attention in its entirety; they only implement key pieces. It has been shown that introducing attention in any basic form yields major performance gains. The specific implementation does not seem to matter much, though there appear to be strong benefits in passing the learnt attention weights on to subsequent timesteps.

Both Bahdanau and Luong attention, and the subsequent attention models that came out from 2014-2018, have now been replaced by self-attention in most cases. Self-attention was popularized by Google's Transformer paper in 2017 (though they were probably inspired by an earlier paper on intra-attention).

Since there are over a dozen flavors of attention, and each flavor could be implemented in several ways, this can be a source of confusion for many. The link below contains a simplified rendering of the evolution of attention and shows how to implement attention in 6 lines of code:
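To give a flavor of how compact a basic attention mechanism can be (this is a generic scaled dot-product sketch for illustration, not the code from the link above):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Q, K, V: (seq_len, d) query/key/value matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of values, (seq_len, d)
```

The core idea (score, normalize, weighted sum) is the same across the flavors discussed above; the flavors mainly differ in how the scores are computed.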


Posted 2019-12-02T21:22:22.810

Reputation: 351