## what is the first input to the decoder in a transformer model?


[Image: transformer decoding, from Jay Alammar's blog post on transformers]

K_encdec and V_encdec are computed by matrix multiplication with the encoder outputs and fed to the encoder-decoder attention layer of each decoder layer.
The previous output is the input to the decoder from step 2 onward, but what is the input to the decoder at step 1? Just K_encdec and V_encdec, or is it also necessary to prompt the decoder with the vectorized output (from the encoder) for the first word?


At each decoding time step, the decoder receives 2 inputs:

• the encoder output: this is computed once and is fed to all layers of the decoder at each decoding time step as key ($$K_{encdec}$$) and value ($$V_{encdec}$$) for the encoder-decoder attention blocks.
• the target tokens decoded up to the current decoding step: at the first step, the matrix contains in its first position a special token, normally </s>. After each decoding step $$k$$, the output of the decoder at position $$k$$ is written into the target-token matrix at position $$k+1$$, and then the next decoding step takes place.
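The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration with a hypothetical toy decoder function (not any real model); the token ids are assumptions:

```python
# Minimal sketch of autoregressive decoding. The encoder output is computed
# once; the target-token buffer starts with the start token and grows by one
# position per decoding step.

EOS = 2  # assumed id of the </s> token in a toy vocabulary


def toy_decoder_step(encoder_out, target_tokens):
    """Stand-in for a real decoder: returns the id of the next token.

    A real decoder would attend over `encoder_out` (as K_encdec / V_encdec)
    and over `target_tokens` (via masked self-attention).
    """
    # Dummy rule: emit ids 10, 11, 12, ... then EOS after three tokens.
    step = len(target_tokens) - 1
    return 10 + step if step < 3 else EOS


def greedy_decode(encoder_out, max_len=10):
    target_tokens = [EOS]  # step 1: the buffer contains only the start token
    for _ in range(max_len):
        next_token = toy_decoder_step(encoder_out, target_tokens)
        target_tokens.append(next_token)  # written at position k+1
        if next_token == EOS:
            break
    return target_tokens[1:]  # drop the initial start token


print(greedy_decode(encoder_out=None))  # [10, 11, 12, 2]
```

So at step 1 the decoder's only "target-side" input is the start token; everything else it needs comes from the encoder output.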

For instance, in the fairseq implementation of decoding, you can see how the target-token matrix is created and filled with padding here, and then how an EOS token (</s>) is placed at the first position here.
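That buffer initialisation looks roughly like the following sketch (fairseq itself uses torch tensors and its dictionary's pad/eos ids; plain lists and assumed ids are used here for illustration):

```python
# Assumed ids: fairseq's default dictionary uses pad=1, eos=2.
PAD, EOS = 1, 2
beam_size, max_len = 2, 5

# One row per hypothesis, filled with padding, with EOS at the first position.
tokens = [[PAD] * (max_len + 1) for _ in range(beam_size)]
for row in tokens:
    row[0] = EOS

print(tokens)  # [[2, 1, 1, 1, 1, 1], [2, 1, 1, 1, 1, 1]]
```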

As you have tagged your question with the bert tag, you should know that what I described above only applies to using the Transformer for sequence-to-sequence transduction (i.e. machine translation), and this is not how BERT works. BERT is trained with a masked language model loss, which makes its use at inference time very different from the NMT Transformer's.
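To make the contrast concrete, here is a toy sketch of the masked-LM setup used for BERT pretraining. The ids are assumptions (e.g. [MASK] = 103 in the uncased WordPiece vocabulary), and this is an illustration of the objective, not BERT's actual masking code:

```python
import random

MASK = 103  # assumed [MASK] id


def mask_tokens(input_ids, mask_prob=0.15):
    """Replace ~15% of positions with [MASK]; record labels for those positions."""
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if random.random() < mask_prob:
            labels[i] = tok   # the model must predict the original token here
            masked[i] = MASK
    return masked, labels
```

At inference time (feature extraction or fine-tuned classification) the text is fed in unmasked; [MASK] only appears when you deliberately ask the model to fill in a blank. There is no step-by-step decoding loop at all.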

In hindsight, I asked my question prematurely, before really diving deeply into the annotated PyTorch Transformer and the BERT paper. However, since you mention it: are [MASK] tokens used at inference time for BERT? I thought that was just during pretraining, so at inference time I only had to worry about [CLS], [SEP] and possibly [UNK]. Also, I was not able to find much info on what vocabulary size to use. I thought "unique words in train_text plus 20 or 30 thousand of the most common words from Wiktionary's top 100,000" would suffice, but then I read "use vocab size V"; what is an ideal V? – mLstudent33 – 2019-05-12T02:44:20.067

Please create new questions so that we can answer them, as these are too unrelated (to the original question) to have in the comments. – noe – 2019-05-12T19:08:10.540

sounds good, I might post them later as I am gearing down to try to use the T2T model first after defining a new problem for JA to EN translation. – mLstudent33 – 2019-05-13T08:52:16.137

I'm not sure if this deserves its own question, but is there an annotated BERT? Something like the annotated Transformer from Harvard NLP: http://nlp.seas.harvard.edu/2018/04/03/attention.html – mLstudent33 – 2019-05-13T10:07:31.490


I'm not aware of any blog post explaining BERT at the source-code level. Maybe the illustrated BERT blog post is enough: http://jalammar.github.io/illustrated-bert/ – noe – 2019-05-13T11:14:57.277

Jay's blog has this quote: "The next step would be to look at the code in the BERT repo: The model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder." So I assumed that [MASK] was only applied during pretraining and that this info would already be embedded in the vectorized representation I obtain from BERT. – mLstudent33 – 2019-05-18T04:02:29.143