While reviewing the Transformer architecture, I realized something I didn't expect: the positional encoding is added to the word embeddings rather than concatenated to them.
Based on the graphs I have seen of what the encoding looks like, that would mean:
- the first few dimensions of the embedding are completely unusable by the network, because the positional encoding oscillates rapidly there and distorts them heavily,
- while a large number of dimensions towards the end of the vector are only slightly affected by the positional encoding, since the encoding varies very slowly with position there (see the sketch below).
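
For reference, here is a minimal NumPy sketch (my own, not the paper's code) of the sinusoidal encoding defined in "Attention Is All You Need", with a quick check of how much each dimension actually varies across positions; `max_len` and `d_model` are just example values:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # low dims -> high frequency
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)

# In the Transformer this is simply summed with the word embeddings:
#   x = word_embeddings + pe[:seq_len]

# How much each dimension moves as the position changes:
print(pe.std(axis=0)[:4])    # first dims: large swings between neighbouring positions
print(pe.std(axis=0)[-4:])   # last dims: almost constant over these 128 positions
```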
So why not instead use smaller word embeddings (reducing memory usage) and a smaller positional encoding that retains only the most informative dimensions, and concatenate the positional encoding to the word embeddings instead of summing it?
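
To make the alternative concrete, here is roughly what I have in mind, reusing the `sinusoidal_positional_encoding` helper from the snippet above; the split 448 + 64 = 512 is a made-up example, and `pe_small` stands in for "only the most informative positional dimensions":

```python
import numpy as np

d_word, d_pos, seq_len = 448, 64, 128         # illustrative sizes, not a tested configuration

word_emb = np.random.randn(seq_len, d_word)    # smaller word embeddings
pe_small = sinusoidal_positional_encoding(seq_len, d_pos)   # smaller positional encoding

# Standard Transformer input: word_emb_512 + pe_512 (same dimensions, summed).
# Proposed variant: keep the two signals in disjoint dimensions instead of mixing them.
x = np.concatenate([word_emb, pe_small], axis=-1)   # (seq_len, d_word + d_pos) = (128, 512)
```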