What is the positional encoding in the transformer model?

I'm trying to read and understand the paper Attention Is All You Need, and in it there is a picture:

[figure from the paper]

I don't know what positional encoding is. From watching some YouTube videos I've found out that it is an embedding that carries both the meaning and the position of a word, and that it has something to do with $\sin(x)$ or $\cos(x)$,

but I couldn't understand what exactly it is or how exactly it does that, so I'm here for some help. Thanks in advance.

Peyman

Posted 2019-04-28T14:43:17.090

Reputation: 543

Answers

For example, for word $w$ at position $pos \in [0, L-1]$ in the input sequence $\boldsymbol{w}=(w_0,\cdots, w_{L-1})$, with 4-dimensional embedding $e_{w}$, and $d_{model}=4$, the operation would be $$\begin{align*}e_{w}' &= e_{w} + \left[\sin\left(\frac{pos}{10000^{0}}\right), \cos\left(\frac{pos}{10000^{0}}\right), \sin\left(\frac{pos}{10000^{2/4}}\right), \cos\left(\frac{pos}{10000^{2/4}}\right)\right]\\ &= e_{w} + \left[\sin\left(pos\right), \cos\left(pos\right), \sin\left(\frac{pos}{100}\right), \cos\left(\frac{pos}{100}\right)\right] \end{align*}$$

where the formula for positional encoding is as follows $$\text{PE}(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$ $$\text{PE}(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$ with $d_{model}=512$ (thus $i \in [0, 255]$) in the original paper.
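
To make this concrete, here is a minimal C sketch (the helper name pos_encoding is mine, not from the paper) that evaluates the encoding for the 4-dimensional example above:

#include <math.h>
#include <stdio.h>

/* PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) */
double pos_encoding(int pos, int dim, int d_model) {
   int i = dim / 2;                                   /* index of the (sin, cos) pair */
   double angle = pos / pow(10000.0, (2.0 * i) / d_model);
   return (dim % 2 == 0) ? sin(angle) : cos(angle);
}

int main(void) {
   /* d_model = 4, pos = 3: prints sin(3), cos(3), sin(3/100), cos(3/100) */
   for (int dim = 0; dim < 4; dim++)
      printf("PE(3, %d) = %f\n", dim, pos_encoding(3, dim, 4));
   return 0;
}

These four values form the vector that gets added element-wise to the word embedding $e_{w}$ above.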

This technique is used because there is no notion of word order (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position; in contrast, in an RNN architecture the $n$-th word is fed at step $n$, and in a ConvNet it is fed to a specific position. Therefore, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds the vital position information.

This article by Jay Alammar explains the paper with excellent visualizations. The example on positional encoding calculates $\text{PE}(\cdot)$ the same way, with the only difference that it puts $\sin$ in the first half of the embedding dimensions and $\cos$ in the second half. As pointed out by @ShaohuaLi, this difference does not matter, since vector operations would be invariant to a permutation of dimensions.

Also, this blog by Kazemnejad has an interesting take on positional encoding. It explains that the specific choice of the ($\sin$, $\cos$) pair helps the model in learning patterns like "when 'are' comes after 'they', ..." which depend only on the relative position of the words, i.e. "after", rather than their absolute positions $pos$ and $pos+1$.
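
As a quick numerical check of that relative-position property (my own sketch, not taken from the blog): for any single frequency used by the encoding, the (sin, cos) pair at position $pos+k$ is a rotation of the pair at position $pos$, and the rotation depends only on the offset $k$:

#include <math.h>
#include <stdio.h>

int main(void) {
   double omega = 1.0 / pow(10000.0, 2.0 / 512);   /* the frequency of pair i = 1 for d_model = 512 */
   int pos = 7, k = 5;                             /* arbitrary position and offset */

   double s = sin(omega * pos), c = cos(omega * pos);
   /* rotate (s, c) by the angle omega * k -- no reference to pos itself */
   double s_shifted = s * cos(omega * k) + c * sin(omega * k);
   double c_shifted = c * cos(omega * k) - s * sin(omega * k);

   printf("direct : %f %f\n", sin(omega * (pos + k)), cos(omega * (pos + k)));
   printf("rotated: %f %f\n", s_shifted, c_shifted);   /* prints the same two numbers */
   return 0;
}

This is why a pattern that depends only on "one position after" can be expressed as a fixed linear map, regardless of where in the sentence it occurs.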

Esmailian

Posted 2019-04-28T14:43:17.090

Reputation: 7 434

You also have this excellent article focused purely on the positional embedding: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ – Yohan Obadia – 2020-02-05T21:51:03.467

Is the 10000 in the denominator related to this comment by Jay Alammar in his post: "Let’s assume that our model knows 10,000 unique English words (our model’s 'output vocabulary')"? – tallamjr – 2020-12-03T15:53:33.533

Which half of the positional encoding is sin and cos doesn't matter. The dot product is the same after shuffling the embedding dimensions :-) – Shaohua Li – 2021-01-11T09:15:52.417

@ShaohuaLi Thanks, I believe you are correct. Updated. – Esmailian – 2021-01-11T09:55:56.867

Positional encoding is a re-representation of the values of a word together with its position in a sentence (given that being at the beginning is not the same as being at the end or in the middle).

But you have to take into account that sentences can be of any length, so saying "word X is the third in the sentence" does not mean the same thing across sentences of different lengths: being 3rd in a 3-word sentence is completely different from being 3rd in a 20-word sentence.

What a positional encoder does is use the cyclic nature of the $\sin(x)$ and $\cos(x)$ functions to return information about the position of a word in a sentence.

Juan Esteban de la Calle

Posted 2019-04-28T14:43:17.090

Reputation: 2 102

Thank you. Could you elaborate on how this positional encoder does this with $\sin$ and $\cos$? – Peyman – 2019-04-28T16:56:24.513

To add to the other answers, OpenAI's reference implementation calculates it in natural log-space (to improve precision, I think; I'm not sure whether a base-2 log would have worked as well). They did not come up with the encoding. Here is the PE lookup-table generation rewritten in C as a nested for loop:

#include <math.h>
#include <stdlib.h>

enum { d_model = 512, max_len = 5000 };

int main(void) {
   /* the ~20 MB table goes on the heap; it would overflow a typical stack */
   double (*pe)[d_model] = malloc(sizeof(double[max_len][d_model]));
   for (int pos = 0; pos < max_len; pos++) {
      for (int k = 0; k < d_model; k += 2) {
         /* exp(k * -log(10000) / d_model) == 1 / 10000^(k / d_model) */
         double div_term = exp(k * -log(10000.0) / d_model);
         pe[pos][k]     = sin(pos * div_term);   /* even dimensions: sin */
         pe[pos][k + 1] = cos(pos * div_term);   /* odd dimensions: cos  */
      }
   }
   free(pe);
   return 0;
}

Eris

Posted 2019-04-28T14:43:17.090

Reputation: 51

Here is an awesome recent YouTube video that covers position embeddings in great depth, with beautiful animations:

Visual Guide to Transformer Neural Networks - (Part 1) Position Embeddings

Taking excerpts from the video, let us try to understand the “sin” part of the formula used to compute the position embeddings:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

Here “pos” refers to the position of the word in the sequence. P0 refers to the position embedding of the first word. “d” is the size of the word/token embedding; in this example, d = 5. Finally, “i” refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).

While “d” is fixed, “pos” and “i” vary. Let us try to understand the latter two.

"pos"

[plot from the video: a sine curve with the word positions marked along the x-axis]

If we plot a sine curve and vary “pos” (on the x-axis), we end up with different values on the y-axis. Therefore, words at different positions will have different position embedding values.

There is a problem, though. Since the sine curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position embedding value, despite being at two very different positions. This is where the “i” part of the equation comes into play.

"i"

[plot from the video: sine curves of different frequencies for different values of i]

If you vary “i” in the equation above, you get a bunch of curves with varying frequencies. Reading off the position embedding values from curves of different frequencies gives different values at different embedding dimensions for P0 and P6, so the two positions can be told apart.
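
To see this numerically, here is a small C sketch (my own illustration, not from the video; the period-6 sine and d = 4 are arbitrary choices): a single sine whose period is 6 gives positions 0 and 6 the same value, while sine/cosine pairs at several frequencies, one pair per value of i, give the two positions clearly different vectors.

#include <math.h>
#include <stdio.h>

int main(void) {
   /* one sine with period 6 cannot tell position 0 from position 6 */
   double omega = 2.0 * acos(-1.0) / 6.0;
   printf("single frequency: P0 = %f, P6 = %f\n", sin(omega * 0), sin(omega * 6));

   /* several frequencies (one pair of dimensions per i) give distinct vectors */
   int d = 4;
   for (int pos = 0; pos <= 6; pos += 6) {
      printf("P%d = [", pos);
      for (int k = 0; k < d; k += 2) {
         double freq = 1.0 / pow(10000.0, (double)k / d);
         printf(" %f %f", sin(pos * freq), cos(pos * freq));
      }
      printf(" ]\n");
   }
   return 0;
}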

Batool

Posted 2019-04-28T14:43:17.090

Reputation: 101