Variable input/output length for Transformer

I was reading the paper "Attention is all you need" (https://arxiv.org/pdf/1706.03762.pdf ) and came across this site http://jalammar.github.io/illustrated-transformer/ which provided a great breakdown of the architecture of the Transformer.

Unfortunately, I was unable to find any explanation of why it works with input/output lengths that are not equal (e.g. input: “je suis étudiant” and expected output: “i am a student”).

My main confusion is this. From what I understand, when we are passing the output from the encoder to the decoder (say $3 \times 10$ in this case), we do so via a Multi-Head Attention layer, which takes in 3 inputs:

  1. A Query (from encoder), of dimension $3 \times k_1$
  2. A Key (from encoder), of dimension $3 \times k_1$
  3. A Value (from decoder), of dimension $L_0 \times k_1$, where $L_0$ refers to the number of words in the (masked) output sentence.

Given that the Multi-Head Attention should take in 3 matrices which have the same number of rows (or at least this is what I have understood from its architecture), how do we deal with the problem of varying output lengths?

Sean Lee

Posted 2019-02-13T03:43:48.647

Reputation: 221

Answers

Your understanding is not correct: in the encoder-decoder attention, the Keys and Values come from the encoder (i.e. they have the source sequence length), while the Query comes from the decoder itself (i.e. it has the target sequence length).

The Query is what determines the output sequence length, therefore we obtain a sequence of the correct length (i.e. target sequence length).
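To see this in terms of matrix shapes (writing $L_s$ for the source length, $L_t$ for the target length and $d$ for the model dimension; these symbols are mine, the formula is the scaled dot-product attention from the paper): $Q$ is $L_t \times d$ (from the decoder), while $K$ and $V$ are $L_s \times d$ (from the encoder), so

$$\operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V \quad\text{has shape}\quad (L_t \times L_s)\,(L_s \times d) = L_t \times d.$$

The source length $L_s$ only appears in the intermediate attention weights; the output always has one row per query, i.e. the target length.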


To understand how the attention block works, maybe this analogy helps: think of the attention block as a Python dictionary, e.g.

# hard lookup: each query retrieves exactly the value stored under its key
keys =   ['a', 'b', 'c']
values = [2, 7, 1]
attention = {keys[0]: values[0], keys[1]: values[1], keys[2]: values[2]}
queries = ['c', 'a']
result = [attention[queries[0]], attention[queries[1]]]

In the code above, result should have value [1, 2].

The attention in the transformer works in a similar way, but instead of hard matches it has soft matches: it gives you a combination of the values, weighted according to how similar each value's associated key is to the query.

While the number of values and keys has to match, the number of queries is independent.
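Here is a minimal NumPy sketch of that soft matching (a toy example of my own, not code from the paper): the queries come from a 2-token target side, the keys and values from a 3-token source side, and the result has one row per query.

import numpy as np

def soft_attention(queries, keys, values):
    # similarity of every query to every key -> shape (num_queries, num_keys)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # softmax over the keys: each query gets a weighting of all the values
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # weighted combination of the values -> shape (num_queries, d)
    return weights @ values

keys    = np.random.randn(3, 4)   # 3 source tokens (from the encoder)
values  = np.random.randn(3, 4)
queries = np.random.randn(2, 4)   # 2 target tokens (from the decoder)

print(soft_attention(queries, keys, values).shape)  # (2, 4): one row per query

Replacing the queries with a matrix of any other length only changes the number of output rows; the 3 keys/values from the encoder stay untouched.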

noe

Posted 2019-02-13T03:43:48.647

Reputation: 10 494

As far as I understand, we are not passing $3 \times 10$ but $\text{maximum\_sentence\_size} \times 10$. In that sense it is static: you cannot exceed this maximum sentence size. What happens if your sentence is shorter than this size? You just pad it with "padding vectors" and make sure that your model is not attending to those padding vectors.
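For illustration, here is a rough sketch of that padding and masking (my own toy example, assuming the common trick of pushing the scores of padding positions to a large negative value before the softmax, so they receive near-zero attention weight):

import numpy as np

max_len = 5
scores = np.random.randn(max_len)     # attention scores over one padded sentence
mask   = np.array([1, 1, 1, 0, 0])    # 1 = real token, 0 = padding vector

scores = np.where(mask == 1, scores, -1e9)       # padding positions get masked out
weights = np.exp(scores) / np.exp(scores).sum()  # softmax; padding weights are ~0
print(weights.round(3))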

iRestMyCaseYourHonor

Posted 2019-02-13T03:43:48.647

Reputation: 1