In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?



While reviewing the Transformer architecture, I noticed something I didn't expect:

  • the positional encoding is summed with the word embeddings
  • rather than concatenated to them.

[Figure: the positional encoding being summed with the word embedding]

Based on the graphs I have seen of what the encoding looks like, this means that:

  • the first few dimensions of the embedding are almost unusable by the network, because the positional encoding distorts them heavily,
  • while a large number of dimensions further along the embedding are only slightly affected by the positional encoding.

[Figure: the positional encoding affects the first dimensions a lot, and the last dimensions hardly at all]

So, why not instead use smaller word embeddings (reducing memory usage) together with a smaller positional encoding that retains only its most important dimensions, and concatenate the two rather than summing them?
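To make the observation above concrete, here is a minimal sketch (using NumPy, with an assumed sequence length of 128 and model dimension of 512) of the standard sinusoidal encoding, measuring how much each dimension varies across positions. The early dimensions oscillate quickly and therefore vary a lot, while the last dimensions are nearly constant:

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(num_positions)[:, None]          # shape (P, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # shape (d/2,)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_encoding(num_positions=128, d_model=512)

# Standard deviation across positions, per embedding dimension:
spread = pe.std(axis=0)
print(spread[:4])    # early dimensions: large spread (fast oscillation)
print(spread[-4:])   # late dimensions: almost constant across positions
```

The first dimensions have a spread close to that of a full sine wave, whereas the last ones barely move for short sequences, which matches the pattern visible in the usual plots.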


Posted 2019-07-18T08:34:46.710

Reputation: 203

I was also curious about this, have you figured it out? – Lee MJ – 2020-02-01T14:34:22.823

@LeeMJ: No, I did not. – FremyCompany – 2020-02-04T11:26:23.793

Have you figured it out now? – Marcos Pereira – 2020-05-25T18:16:28.547

Is anyone aware of any papers where they tried concatenation instead of adding? – Keith Johnson – 2021-02-24T16:29:12.507

@keith-johnson Not per se, but Google T5 uses a different approach, in which position is encoded separately. Since there is already a lot written about Google T5, you could also check this other paper, which builds on top of T5 and tweaks its positional encoding further:

– FremyCompany – 2021-02-24T19:48:16.197



When you concatenate, you have to define a priori the size of each vector to be concatenated. This means that, if we were to concatenate the token embedding and the positional embedding, we would have to define two dimensionalities, $d_t$ for the token and $d_p$ for the position, with the total dimensionality $d = d_t + d_p$, so $d>d_t$ and $d>d_p$. We would be decreasing the total size we devote to tokens in favor of positional information.

However, adding them together is potentially a superset of concatenation: imagine that there is an ideal split of $d$ into $d_t$ and $d_p$ in terms of minimizing the loss; then, training could converge to positional vectors that only occupy $d_p$ components, leaving the rest at zero, and to token embeddings that occupy the complementary $d_t$ components, likewise leaving the rest at zero. The sum of two such vectors is exactly their concatenation.

Therefore, by adding them, we leave the optimization of the use of the $d$ dimensions to the optimization process, instead of assuming there is an optimal partition of the vector components and setting a new hyperparameter to tune. Also, the use of the vector space is not restricted by a hard split in the vector components, but takes the whole representation space.
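This subsumption argument can be sketched in a few lines of NumPy (the dimensions $d_t = 6$ and $d_p = 2$ are arbitrary choices for illustration): if training were to drive the complementary components of each vector to zero, the sum would reproduce the concatenation exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_p = 6, 2   # illustrative split of d = d_t + d_p

tok = rng.normal(size=d_t)   # a token embedding
pos = rng.normal(size=d_p)   # a positional vector

# Concatenation: token and position occupy disjoint blocks of the d dims.
concat = np.concatenate([tok, pos])

# Addition can express the same thing if the complementary components
# of each padded vector are zero:
tok_padded = np.concatenate([tok, np.zeros(d_p)])
pos_padded = np.concatenate([np.zeros(d_t), pos])
added = tok_padded + pos_padded

assert np.allclose(concat, added)  # the sum equals the concatenation
```

Addition is strictly more general, of course: nothing forces the optimizer to keep the two signals in disjoint components, which is exactly the flexibility the answer describes.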



This would make sense for learned positional encoding. What about the sine/cosine encoding? Does it just rely on the fact that nothing much is happening in dimensions beyond the first few? – max – 2021-02-24T05:56:36.347

While the equivalence of concatenation and addition may only apply to learned positional encoding, I think that the general optimization of the representation space does apply to fixed encodings as well (although the optimization only happens in the token embeddings). I don't think the picture is correct (it has changed in the referenced tutorial)

– noe – 2021-02-24T08:24:11.763

Maybe a stupid question, but why doesn't this addition spoil the embedding? Like, we had the word king, add this pattern, and receive slave? – spiridon_the_sun_rotator – 2021-02-25T21:39:03.620

If such a thing would happen, the final loss would be bad. The training aims at improving the loss, and therefore prevents that situation from happening. – noe – 2021-02-25T21:45:30.553


So the question is about why positional embeddings are added directly to word embeddings instead of concatenated. This is a particularly interesting question. To answer it, I first need to distinguish sequential networks like RNNs from Transformers, which introduces the problem nicely.

In RNNs, we feed data (say, a sequence of words) into the model sequentially. This means that, when inputting a sequence of words, the model arguably does obtain the order of the tokens, since they are fed in one by one.

With Transformers, on the other hand, all of the words in the sequence are fed in at once. This means that, so far, Transformers have no notion of word ordering. Therefore, we need positional embeddings to tell the model where each word belongs in the sequence.
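This lack of ordering can be demonstrated directly: scaled dot-product self-attention is permutation-equivariant, so without positional information, shuffling the input tokens merely shuffles the output rows the same way. A minimal single-head sketch in NumPy (the weight matrices here are random stand-ins, not trained parameters):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention, no positional info.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8                       # 5 tokens, 8 dims (illustrative)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the input tokens just permutes the output rows:
assert np.allclose(out[perm], out_perm)
```

In other words, without positional embeddings, the model literally cannot tell "dog bites man" from "man bites dog".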

I believe the reason we add them to word embeddings is that we want to maintain an input to the model similar to an RNN's, which also takes word embeddings as input. I think your question is a very good one to ask; perhaps you should experiment with a more compressed word embedding concatenated with its positional embedding, compare it against the more "traditional" approach, and see what results you get. I'd be excited to see them.




It has been a while, but anyone ending up here might also be interested in reading the following paper:

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding (Yu-An Wang, Yun-Nung Chen)

I am not changing the accepted answer, as this paper does not specifically answer the question.

