How do attention mechanisms in RNNs learn weights for a variable-length input?



Attention mechanisms in RNNs are reasonably common in sequence-to-sequence models.

I understand that the decoder learns a weight vector $\alpha$ which is applied as a weighted sum of the output vectors from the encoder network. This is used to produce a new input vector.

What I don't understand is that the learned weight vector $\alpha$ must be a fixed-size vector, because it's treated as learned weights, yet it's applied to a variable-length sequence.

If someone could help me understand this particular mechanism I'd appreciate it.


Posted 2018-01-30T00:35:51.420

Reputation: 363


Bump, because this link still doesn't explain how the number of $\alpha$ weights can vary. Just like the OP, I see a big limitation: the number of $\alpha$ weights indeed has to be constrained, meaning the LSTM must produce a fixed number of encoded timesteps, and thus we lose the benefit of the LSTM. An answer would be greatly appreciated.

– Kari – 2018-03-09T05:16:16.770



Attention weight $\boldsymbol{\alpha}$ is not, and need not be, constrained in size.

For a source sequence $\boldsymbol{x} = x_1\cdots x_{T_x}$ (where $T_x$ can vary from one source to another) and a target sequence $\boldsymbol{y} = y_1\cdots y_{T_y}$ (where $T_y$ can also vary from one target to another), the weight vector $\boldsymbol{\alpha}_i = (\alpha_{i1},\cdots,\alpha_{iT_x})$ is calculated for target element $y_i$ as follows: $$\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$ where $e_{ij}$ is calculated by a neural network $a$ that receives the hidden state $s_{i-1}$ of the decoder (the decoder generates the target sequence element by element) and the hidden state $h_j$ of the encoder (the encoder distills the source sequence into hidden states $h_j$, where $h_j$ is the concatenation of the $j$-th hidden states of the forward and backward RNNs) and outputs $$e_{ij} = a(s_{i-1}, h_j).$$ In other words, network $a$ receives a vector of size $|s| + |h|$ and outputs a number. What is learned are the parameters of $a$, which are fixed in size; the weights $\alpha_{ij}$ are not learned directly but computed by applying $a$ once for each pair $(i, j)$. Because of this network, the attention matrix $\boldsymbol{\alpha}$ is free in size.
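To make this concrete, here is a minimal pure-Python sketch of computing $\boldsymbol{\alpha}_i$ for one decoder state. The answer only says that $a$ is a neural network; the additive form $v^\top \tanh(W_s s_{i-1} + W_h h_j)$ used below is one common choice and is an assumption here, as are the names `make_params`, `score`, `attention_weights` and all dimensions. The parameters are created once with fixed shapes and then reused for sources of different lengths.

```python
import math
import random


def make_params(s_dim, h_dim, attn_dim, seed=0):
    """Fixed-size parameters of the (assumed additive) scoring network a."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_s": rand_matrix(attn_dim, s_dim),  # acts on the decoder state
        "W_h": rand_matrix(attn_dim, h_dim),  # acts on one encoder state
        "v": [rng.uniform(-0.5, 0.5) for _ in range(attn_dim)],
    }


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def score(params, s_prev, h_j):
    """e_ij = a(s_{i-1}, h_j): two vectors in, one number out."""
    from_s = matvec(params["W_s"], s_prev)
    from_h = matvec(params["W_h"], h_j)
    hidden = [math.tanh(p + q) for p, q in zip(from_s, from_h)]
    return sum(v * u for v, u in zip(params["v"], hidden))


def attention_weights(params, s_prev, encoder_states):
    """Softmax over the scores of however many encoder states there are."""
    scores = [score(params, s_prev, h_j) for h_j in encoder_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


S_DIM, H_DIM, ATTN_DIM = 3, 4, 5  # arbitrary illustrative sizes
params = make_params(S_DIM, H_DIM, ATTN_DIM)
rng = random.Random(1)
s_prev = [rng.uniform(-1, 1) for _ in range(S_DIM)]

# The same parameters handle sources of different lengths T_x.
for T_x in (2, 7):
    encoder_states = [[rng.uniform(-1, 1) for _ in range(H_DIM)] for _ in range(T_x)]
    alpha = attention_weights(params, s_prev, encoder_states)
    print(T_x, len(alpha), round(sum(alpha), 6))
```

The number of weights returned always equals the number of encoder states passed in, while `params` never changes shape.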

Attention weights are then used to calculate context $c_i$ for target $y_i$ as follows $$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j,$$

which the decoder then uses to compute its next hidden state $s_i$, and so on for each target element.

Consequently, for the next (source, target) pair with lengths ($T_{x_2}$, $T_{y_2}$), the size of $\boldsymbol{\alpha}$ would simply be $T_{y_2} \times T_{x_2}$, with no problem.
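As a rough illustration of the whole attention matrix and the context vectors $c_i$, the sketch below processes two (source, target) pairs of different lengths with one fixed set of parameters. Several things are assumptions made for brevity: network $a$ is replaced by a hypothetical single linear layer on the concatenation of $s_{i-1}$ and $h_j$, the encoder and decoder states are random placeholders rather than outputs of real RNNs, and the names `attend` and `random_states` are invented for this example.

```python
import math
import random

S_DIM, H_DIM = 3, 4  # arbitrary illustrative sizes for |s| and |h|
rng = random.Random(0)

# Hypothetical stand-in for network a: one linear layer on [s_{i-1}; h_j].
# Its weight vector has |s| + |h| entries, independent of T_x and T_y.
w = [rng.uniform(-1, 1) for _ in range(S_DIM + H_DIM)]


def softmax(scores):
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attend(decoder_states, encoder_states):
    """Return (alpha, contexts) for one (source, target) pair.

    alpha has one row per target position and one column per source position;
    contexts[i] is c_i = sum_j alpha[i][j] * h_j.
    """
    alpha, contexts = [], []
    for s_prev in decoder_states:
        scores = [
            sum(w_k * x_k for w_k, x_k in zip(w, s_prev + h_j))
            for h_j in encoder_states
        ]
        row = softmax(scores)
        alpha.append(row)
        contexts.append([
            sum(a_ij * h_j[d] for a_ij, h_j in zip(row, encoder_states))
            for d in range(H_DIM)
        ])
    return alpha, contexts


def random_states(length, dim):
    """Placeholder hidden states; a real model would get these from its RNNs."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]


for T_x, T_y in [(5, 3), (8, 6)]:
    alpha, contexts = attend(random_states(T_y, S_DIM), random_states(T_x, H_DIM))
    print(f"alpha: {len(alpha)} x {len(alpha[0])}, context dim: {len(contexts[0])}")
```

Note that each context vector has the fixed size $|h|$ regardless of $T_x$, which is why the decoder can consume it at every step.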


Posted 2018-01-30T00:35:51.420

Reputation: 7 434


Here's a nice visualization of attention I came across:

– davidparks21 – 2019-05-21T16:02:27.133