In this podcast between Oriol Vinyals and Lex Fridman, at 29:29, Oriol Vinyals refers to a paper:

If you look at research in computer vision where it makes a lot of sense to treat images as two-dimensional arrays... There is actually a very nice paper from Facebook. I forgot who the authors are but I think [it's] part of Kaiming He's group. And what they do is they take an image, which is a 2D signal, and they actually take pixel by pixel, and scramble the image, as if it was just a list of pixels. Crucially, they encode the position of the pixels with the XY coordinates. And this is a new architecture which we incidentally also use in Starcraft 2 called the transformer, which is a very popular paper from last year which yielded very nice results in machine translation.

Do you know which paper he is referring to?

I'm guessing he may be talking about Non-local Neural Networks, but I'm probably guessing wrong.

Edit: after reviewing the recent publications of Kaiming He, maybe I'm guessing right. Any thoughts?

This paper is by Google, but it is very similar: they use 2D positional embeddings and perform multi-head attention (MHA) on the flattened image. Are you talking about Attention Augmented Convolutional Networks?
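
For anyone who wants to see the general idea concretely, here is a minimal sketch in plain Python. It is not the method of either paper: it assumes a single attention head with identity projections (no learned weights), and the helper names (`flatten_with_coords`, `embed`, `self_attention`, `attend`) and the choice of raw normalized XY coordinates as the positional encoding are all made up for illustration. It only shows that once each pixel carries its own coordinates, the order of the flattened list stops mattering.

```python
import math
import random

def flatten_with_coords(image):
    """Turn a 2D grid into a list of (x, y, value) tokens, one per pixel."""
    return [(x, y, v) for y, row in enumerate(image) for x, v in enumerate(row)]

def embed(token, width, height):
    """Hypothetical embedding: pixel value plus normalized x/y coordinates."""
    x, y, v = token
    return [v, x / max(width - 1, 1), y / max(height - 1, 1)]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * k[i] for w, k in zip(weights, vectors)) / total
            for i in range(len(q))
        ])
    return outputs

def attend(tokens, width, height):
    """Map each (x, y) position to its attention output."""
    vectors = [embed(t, width, height) for t in tokens]
    outputs = self_attention(vectors)
    return {(t[0], t[1]): o for t, o in zip(tokens, outputs)}

image = [[0.0, 0.5], [1.0, 0.25]]
tokens = flatten_with_coords(image)

# Scramble the pixel list; each token still carries its own coordinates.
shuffled = tokens[:]
random.Random(0).shuffle(shuffled)

a = attend(tokens, 2, 2)
b = attend(shuffled, 2, 2)

# Compare per-position outputs up to floating-point summation order.
same = all(
    math.isclose(u, v, abs_tol=1e-9)
    for pos in a
    for u, v in zip(a[pos], b[pos])
)
print(same)
```

If the coordinates were dropped from `embed`, two pixels with the same value would produce identical tokens and the spatial layout would be lost, which is presumably why the quote stresses encoding the XY position.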


