For example, for word $w$ at position $pos \in [0, L-1]$ in the input sequence $\boldsymbol{w}=(w_0,\cdots, w_{L-1})$, with 4-dimensional embedding $e_{w}$, and $d_{model}=4$, the operation would be
$$\begin{align*}e_{w}' &= e_{w} + \left[\sin\left(\frac{pos}{10000^{0}}\right), \cos\left(\frac{pos}{10000^{0}}\right), \sin\left(\frac{pos}{10000^{2/4}}\right), \cos\left(\frac{pos}{10000^{2/4}}\right)\right]\\
&=e_{w} + \left[\sin\left(pos\right), \cos\left(pos\right), \sin\left(\frac{pos}{100}\right), \cos\left(\frac{pos}{100}\right)\right]
\end{align*}$$

where the formula for positional encoding is as follows
$$\text{PE}(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
$$\text{PE}(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
with $d_{model}=512$ (thus $i \in [0, 255]$) in the original paper.
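A minimal sketch of these formulas in plain Python (the function name is my own; only the math comes from the formulas above):

```python
import math

def positional_encoding(L, d_model):
    """PE matrix per the formulas above: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(L)]
    for pos in range(L):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

# the d_model = 4 worked example: row pos is [sin(pos), cos(pos), sin(pos/100), cos(pos/100)]
pe = positional_encoding(5, 4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```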

This technique is used because there is **no notion of word order** (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position; in contrast, in an RNN, the $n$-th word is fed at step $n$, and in a ConvNet, each word is fed to a specific input position. Therefore, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds vital position information.
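The addition step itself is just an element-wise sum, $e_w' = e_w + \text{PE}(pos)$. A toy sketch (the embedding values are made up for illustration):

```python
import math

def pe(pos, d_model=4):
    """Sinusoidal encoding for a single position."""
    out = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        out += [math.sin(angle), math.cos(angle)]
    return out

# hypothetical 4-dimensional embeddings for a 3-word input sequence
embeddings = [[0.10, 0.20, 0.30, 0.40],   # w_0
              [0.10, 0.20, 0.30, 0.40],   # w_1 (same word repeated)
              [0.50, 0.60, 0.70, 0.80]]   # w_2

# e'_w = e_w + PE(pos): identical word embeddings now differ by position
encoded = [[e + p for e, p in zip(emb, pe(pos))]
           for pos, emb in enumerate(embeddings)]

print(encoded[0] != encoded[1])  # True: same word, different positions
```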

This article by Jay Alammar explains the paper with excellent visualizations. Its example on positional encoding computes $\text{PE}(\cdot)$ the same way, with the only difference that it puts $\sin$ in the first half of the embedding dimensions and $\cos$ in the second half. As pointed out by @ShaohuaLi, this difference does not matter, since vector operations are invariant to a permutation of dimensions applied consistently to all vectors.
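A quick numerical check of that invariance claim, with toy vectors: applying the same shuffle of dimensions to both operands only reorders the terms of the dot-product sum.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.25, 2.0, 1.0]

perm = [2, 0, 3, 1]  # one fixed shuffle of the embedding dimensions
a_p = [a[j] for j in perm]
b_p = [b[j] for j in perm]

# same products, summed in a different order
print(dot(a, b), dot(a_p, b_p))  # 11.0 11.0
```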

Also, this blog post by Kazemnejad has an interesting take on positional encoding. It explains that the specific choice of ($\sin$, $\cos$) pairs helps the model learn patterns like "when '*are*' comes after '*they*', ...", which depend only on the relative position of the words, i.e. "after", rather than on their absolute positions $pos$ and $pos+1$.
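One way to see the relative-position point numerically: for each ($\sin$, $\cos$) pair, $\sin(\omega p)\sin(\omega(p+k)) + \cos(\omega p)\cos(\omega(p+k)) = \cos(\omega k)$, so the dot product between the encodings of positions $pos$ and $pos+k$ depends only on the offset $k$. A small check, assuming $d_{model}=8$ for brevity:

```python
import math

def pe(pos, d_model=8):
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        out += [math.sin(w * pos), math.cos(w * pos)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# the similarity between encodings a fixed distance k apart is the same
# (up to float error) no matter where the pair sits in the sequence
k = 5
print(abs(dot(pe(3), pe(3 + k)) - dot(pe(40), pe(40 + k))) < 1e-9)  # True
```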


You also have this excellent article purely focused on positional encoding: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ – Yohan Obadia – 2020-02-05T21:51:03.467

Is the 10000 in the denominator related to this comment by Jay Alammar in his post: "Let’s assume that our model knows 10,000 unique English words (our model’s 'output vocabulary')"? – tallamjr – 2020-12-03T15:53:33.533

Which half of the positional encoding is sin and cos doesn't matter. The dot product is the same after shuffling the embedding dimensions :-) – Shaohua Li – 2021-01-11T09:15:52.417

@ShaohuaLi Thanks, I believe you are correct. Updated. – Esmailian – 2021-01-11T09:55:56.867