Name of paper for encoding/representing XY coordinates in deep learning



In this podcast between Oriol Vinyals and Lex Fridman, at 29:29, Oriol Vinyals refers to a paper:

If you look at research in computer vision where it makes a lot of sense to treat images as two dimensional arrays... There is actually a very nice paper from Facebook. I forgot who the authors are but I think [it's] part of Kaiming He's group. And what they do is they take an image, which is a 2D signal, and they actually take pixel by pixel, and scramble the image, as if it was just a list of pixels, crucially they encode the position of the pixels with the XY coordinates. And this is a new architecture which we incidentally also use in Starcraft 2 called the transformer, which is a very popular paper from last year which yielded very nice results in machine translation.

Do you know which paper he is referring to?

I'm guessing maybe he is talking about non-local neural networks, but I'm probably guessing wrong.

Edit: after reviewing the recent publications of Kaiming He, maybe I'm guessing right. Any thoughts?

Benjamin Crouzier

Posted 2019-05-01T16:29:02.923

Reputation: 299

If you verify please do update. I’d like to read that, too. Also, does this podcast have show notes? Sometimes they include links to referenced papers / blogs, etc. – Hanzy – 2019-05-01T16:47:20.437


This isn't the paper, but it sounds related: "An intriguing failing of convolutional neural networks and the CoordConv solution" from Uber AI Labs.

They also have a really informative video and a repo.

– Jaden Travnik – 2019-05-01T17:18:04.533



This paper is by Google rather than Facebook, but it is very similar: they use 2D positional embeddings and perform multi-head attention (MHA) on the flattened image. Are you talking about "Attention Augmented Convolutional Networks"?
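To make the idea from the quote concrete, here is a minimal sketch of treating an image as an unordered list of pixels that each carry their own XY coordinates. This is only an illustration, not code from either paper: the function names `image_to_tokens` and `tokens_to_image` and the choice to normalize coordinates to [0, 1] are my own assumptions. The point is that once every pixel carries its position, the order of the list no longer matters, which is what lets an order-agnostic model such as a transformer consume it.

```python
import random


def image_to_tokens(image):
    """Flatten a 2D grid into (x, y, value) tokens, one per pixel.

    Coordinates are normalized to [0, 1]; that scaling is an assumption
    made for this sketch, not something taken from a specific paper.
    """
    height, width = len(image), len(image[0])
    return [
        (x / max(width - 1, 1), y / max(height - 1, 1), image[y][x])
        for y in range(height)
        for x in range(width)
    ]


def tokens_to_image(tokens, height, width):
    """Rebuild the grid from tokens given in any order."""
    image = [[None] * width for _ in range(height)]
    for x, y, value in tokens:
        col = round(x * max(width - 1, 1))
        row = round(y * max(height - 1, 1))
        image[row][col] = value
    return image


image = [[1, 2, 3], [4, 5, 6]]
tokens = image_to_tokens(image)

# Scramble the pixel list; each token still knows where it came from.
scrambled = tokens[:]
random.Random(0).shuffle(scrambled)

restored = tokens_to_image(scrambled, height=2, width=3)
```

Here `restored` equals the original `image` even though `scrambled` is in a different order, because no spatial information is lost in the shuffle. In an attention model the raw `(x, y)` pair would typically be mapped to a learned or sinusoidal embedding and combined with the pixel features before attention is applied.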


Posted 2019-05-01T16:29:02.923

Reputation: 1 845

Looks like this is not "the" paper, but it's a good find, thanks – Benjamin Crouzier – 2019-05-13T19:55:12.823