I see two different ways of applying attention in seq2seq:
(a) the context vector (the weighted sum of the encoder hidden states) is fed into the output softmax, as shown in the diagram below. The diagram is from here.
(b) the context vector is fed into the decoder input, as shown in the diagram below. The diagram is from here.
What are the pros and cons of the two approaches? Is there any paper comparing them?
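To make the distinction concrete, here is a minimal sketch of how I understand the two variants, using simple dot-product attention and a GRU cell in PyTorch. All names (`attend`, `decode_step_a`, `decode_step_b`, etc.) are my own, and I've left out details like Luong's input feeding of the attentional vector; it's just meant to pin down where the context vector enters in each case.

```python
import torch
import torch.nn.functional as F

hidden_size, vocab_size, emb_size, src_len = 8, 10, 6, 5

def attend(dec_hidden, enc_states):
    # dec_hidden: (batch, hidden), enc_states: (batch, src_len, hidden)
    scores = torch.bmm(enc_states, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # (batch, hidden)
    return context

# (a) context feeds the output softmax: run the RNN first, attend with the
#     new state, and combine context + state just before the vocab projection.
rnn_a = torch.nn.GRUCell(emb_size, hidden_size)
out_a = torch.nn.Linear(2 * hidden_size, vocab_size)

def decode_step_a(y_emb, dec_hidden, enc_states):
    dec_hidden = rnn_a(y_emb, dec_hidden)
    context = attend(dec_hidden, enc_states)
    logits = out_a(torch.cat([context, dec_hidden], dim=1))
    return logits, dec_hidden

# (b) context feeds the decoder input: attend with the *previous* state,
#     concatenate the context with the input embedding, and feed that
#     into the RNN cell; only the new state goes to the vocab projection.
rnn_b = torch.nn.GRUCell(emb_size + hidden_size, hidden_size)
out_b = torch.nn.Linear(hidden_size, vocab_size)

def decode_step_b(y_emb, dec_hidden, enc_states):
    context = attend(dec_hidden, enc_states)
    dec_hidden = rnn_b(torch.cat([y_emb, context], dim=1), dec_hidden)
    logits = out_b(dec_hidden)
    return logits, dec_hidden

# tiny smoke test with random tensors
enc_states = torch.randn(1, src_len, hidden_size)
y_emb = torch.randn(1, emb_size)
h = torch.zeros(1, hidden_size)
print(decode_step_a(y_emb, h, enc_states)[0].shape)  # torch.Size([1, 10])
print(decode_step_b(y_emb, h, enc_states)[0].shape)  # torch.Size([1, 10])
```

If my sketch of either variant is off, corrections are welcome as well.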