In the papers "Convolutional Sequence to Sequence Learning" and "Attention Is All You Need", position embeddings are simply added to the input word embeddings to give the model a sense of the order of the input sequence. In "Attention Is All You Need", these position embeddings are generated from sinusoidal signals that depend on the absolute position of the word in the sequence and on the embedding dimension (the convolutional paper learns its position embeddings instead). In both cases, the position embeddings have the same dimension as the word embeddings, and the two are simply summed.
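To make the setup concrete, here is a minimal sketch of what I mean, in plain Python and following the sine/cosine formula from "Attention Is All You Need". The function names and the toy 4-dimensional embeddings are made up for illustration; real implementations are vectorized and use much larger dimensions.

```python
import math


def sinusoidal_position_embedding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    embedding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency.
        i = k // 2
        angle = position / (10000 ** (2 * i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        embedding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return embedding


def add_position_information(word_embeddings):
    """Sum each word embedding with the encoding of its position."""
    d_model = len(word_embeddings[0])
    return [
        [w + p for w, p in zip(word_vec, sinusoidal_position_embedding(pos, d_model))]
        for pos, word_vec in enumerate(word_embeddings)
    ]


# Hypothetical toy input: 3 tokens with 4-dimensional word embeddings.
toy_word_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
model_inputs = add_position_information(toy_word_embeddings)
```

After the sum, `model_inputs` has the same shape as `toy_word_embeddings`, so the word and position information share the same coordinates rather than living in separate ones.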
I can understand that this helps the model get a sense of the ordering of the input, but I'm quite troubled by the fact that adding the two might also erase some of the information contained in the word embeddings. Do you have an explanation of why this might work (or not)? Is there any literature about it?