## How do attention mechanisms in RNNs learn weights for a variable length input

Attention mechanisms in RNNs are commonly used in sequence-to-sequence models.

I understand that the decoder learns a weight vector $\alpha$ that is used to form a weighted sum of the output vectors from the encoder network, producing a new input vector for the decoder.

What I don't understand is this: the learned weight vector $\alpha$ must be of fixed size, since it is treated as learned weights, yet it is applied to a variable-length sequence.

If someone could help me understand this particular mechanism I'd appreciate it.

bump, because the linked answer still doesn't explain how the number of $\alpha$ weights can vary. Like the OP, I see a big limitation: the number of $\alpha$ weights does have to be constrained, meaning the LSTM must produce a fixed number of encoded timesteps, so we lose the benefit of the LSTM. An answer would be greatly appreciated.

– Kari – 2018-03-09T05:16:16.770

Attention weight $$\boldsymbol{\alpha}$$ is not, and need not be, constrained in size.

For a source sequence $$\boldsymbol{x} = x_1\cdots x_{T_x}$$ (where $$T_x$$ can vary from one source to another) and a target sequence $$\boldsymbol{y} = y_1\cdots y_{T_y}$$ (where $$T_y$$ can also vary from one target to another), the weight vector $$\boldsymbol{\alpha}_i = (\alpha_{i1},\cdots,\alpha_{iT_x})$$ is calculated for target element $$y_i$$ as follows: $$\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})},$$ where $$e_{ij}$$ is produced by a neural network $$a$$ that receives hidden state $$s_{i-1}$$ of the decoder (the decoder generates the target sequence element by element) and hidden state $$h_j$$ of the encoder (the encoder distills the source sequence into hidden states $$h_j$$, where $$h_j$$ is the concatenation of the $$j$$-th hidden states of the forward and backward RNNs): $$e_{ij} = a(s_{i-1}, h_j).$$ In other words, network $$a$$ receives a vector of size $$|s| + |h|$$ and outputs a single number. Because $$a$$ scores one (decoder state, encoder state) pair at a time, the attention matrix $$\boldsymbol{\alpha}$$ is free in size.
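As a concrete sketch of this step (the parameter names `W_s`, `W_h`, `v` and the additive form of $$a$$ are assumptions here, not something fixed by the explanation above), the scoring-plus-softmax computation might look like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

d_s, d_h = 4, 6                    # decoder / encoder hidden sizes (arbitrary)
W_s = rng.normal(size=(d_s, 8))    # projects the decoder state (assumed name)
W_h = rng.normal(size=(d_h, 8))    # projects the encoder state (assumed name)
v = rng.normal(size=8)             # reduces the combined score to a scalar

def energy(s_prev, h_j):
    # a(s_{i-1}, h_j): one scalar per encoder position
    return v @ np.tanh(s_prev @ W_s + h_j @ W_h)

def attention_weights(s_prev, H):
    # H has shape (T_x, d_h); T_x may differ between source sequences
    e = np.array([energy(s_prev, h_j) for h_j in H])
    e = e - e.max()                # subtract max for numerical stability
    w = np.exp(e)
    return w / w.sum()             # softmax over the T_x positions

s_prev = rng.normal(size=d_s)
for T_x in (5, 9):                 # two sources of different length
    H = rng.normal(size=(T_x, d_h))
    alpha = attention_weights(s_prev, H)
    print(T_x, alpha.shape, alpha.sum())   # length tracks T_x; weights sum to 1
```

The key point is that no parameter of the network depends on $$T_x$$: the same small network $$a$$ is applied at every encoder position, however many there are.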

The attention weights are then used to calculate the context vector $$c_i$$ for target $$y_i$$ as follows: $$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j,$$

and so on and so forth.

Consequently, for the next (source, target) pair with lengths ($$T_{x_2}$$, $$T_{y_2}$$), the size of $$\boldsymbol{\alpha}$$ would be $$T_{y_2} \times T_{x_2}$$, with no problem.
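To see the whole picture end to end, here is a minimal sketch of the context computation for two (source, target) pairs of different lengths. The random matrix `E` stands in for the scores $$e_{ij} = a(s_{i-1}, h_j)$$, so this only illustrates the shapes, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_rows(E):
    # Row-wise softmax: each row of alpha sums to 1 over the T_x positions
    E = E - E.max(axis=1, keepdims=True)
    W = np.exp(E)
    return W / W.sum(axis=1, keepdims=True)

d_h = 6
for T_x, T_y in [(7, 4), (11, 9)]:   # two pairs with different lengths
    H = rng.normal(size=(T_x, d_h))  # encoder hidden states h_1 .. h_{T_x}
    E = rng.normal(size=(T_y, T_x))  # stand-in for the scores e_ij
    alpha = softmax_rows(E)          # shape (T_y, T_x) -- no fixed size
    C = alpha @ H                    # row i is c_i = sum_j alpha_ij h_j
    print(alpha.shape, C.shape)      # one context vector per target step
```

Nothing here is a learned parameter of fixed size except what sits inside the scoring network; $$\boldsymbol{\alpha}$$ itself is just an intermediate activation whose shape follows the current pair.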

Here's a nice visualization of attention I came across: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39

– davidparks21 – 2019-05-21T16:02:27.133