What is the difference between asymmetric and depthwise separable convolution?



I have recently discovered asymmetric convolution layers in deep learning architectures, a concept which seems very similar to depthwise separable convolutions.

Are they really the same concept with different names? If not, what is the difference? To make it concrete, what would each one look like if applied to a 128x128 image with 3 input channels (say R, G, B) and 8 output channels?

NB: I cross-posted this from Stack Overflow, since this kind of theoretical question is maybe better suited here. Hoping it is OK...

Pierre Gramme

Posted 2019-08-02T12:57:49.087

Reputation: 143

Hi and welcome to this community! This is the type of question that is suited for this website and not for Stack Overflow, which is dedicated to programming issues. So, I would delete the question from SO, unless you received a good answer there. Btw, it might be useful to link to that SO question. – nbro – 2019-08-02T21:18:40.310

You're right: I just deleted the SO question (which got no answer) – Pierre Gramme – 2019-08-04T10:33:36.260



They are not the same thing.

Asymmetric convolutions work by treating the x and y axes of the image separately: for example, performing a convolution with an $(n \times 1)$ kernel followed by one with a $(1 \times n)$ kernel, instead of a single $(n \times n)$ kernel.
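As a rough illustration of that factorization, here is a minimal pure-Python sketch on a single channel. The helper `conv2d_valid`, the toy 5x5 image, and the kernel values are all made up for the example. It only covers the linear case: with no nonlinearity between the two steps, an $(n \times 1)$ convolution followed by a $(1 \times n)$ one matches a single $(n \times n)$ convolution whose kernel is the outer product of the two, using $2n$ weights instead of $n^2$.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy single-channel 5x5 image with integer values (so comparisons are exact).
image = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]

col = [[1], [2], [1]]   # (3 x 1) kernel, illustrative values
row = [[1, 0, -1]]      # (1 x 3) kernel, illustrative values

# Asymmetric pair: (n x 1) first, then (1 x n).
two_step = conv2d_valid(conv2d_valid(image, col), row)

# Equivalent single (n x n) kernel: outer product of the two 1D kernels.
full = [[c[0] * r for r in row[0]] for c in col]
one_step = conv2d_valid(image, full)

assert two_step == one_step

params_asymmetric = len(col) + len(row[0])    # 2n weights
params_full = len(full) * len(full[0])        # n^2 weights
```

Note that the equivalence only goes one way: the asymmetric pair can represent only those $(n \times n)$ kernels that factor as an outer product, and once an activation sits between the two layers the comparison is no longer exact.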

On the other hand, depthwise separable convolutions separate the spatial and channel components of a 2D convolution. They first perform an $(n \times n)$ convolution on each channel separately (the full kernel shape is $(n \times n \times 1)$ rather than $(n \times n \times k)$, where $k$ is the number of channels in the previous layer), then apply a $(1 \times 1)$ convolution to learn the relationship between the channels (the full kernel shape for that step being $(1 \times 1 \times k)$).
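To make this concrete for the case in the question (3 input channels, 8 output channels), the sketch below counts the weights of each variant. Several things here are assumptions for illustration and not part of the answer: the kernel size $n = 3$, the absence of bias terms, and an intermediate channel count `c_mid` for the asymmetric pair, set equal to the number of input channels. The function names are hypothetical as well. With "same" padding and stride 1, all three variants would map the 128x128x3 input to a 128x128x8 output, so the difference lies in how the weights are structured.

```python
def standard_conv_params(n, c_in, c_out):
    # c_out kernels, each of shape (n x n x c_in)
    return n * n * c_in * c_out

def depthwise_separable_params(n, c_in, c_out):
    depthwise = n * n * c_in      # one (n x n x 1) kernel per input channel
    pointwise = c_in * c_out      # c_out kernels of shape (1 x 1 x c_in)
    return depthwise + pointwise

def asymmetric_params(n, c_in, c_mid, c_out):
    first = n * c_in * c_mid      # c_mid kernels of shape (1 x n x c_in)
    second = n * c_mid * c_out    # c_out kernels of shape (n x 1 x c_mid)
    return first + second

# Assumed setup: 3x3 kernels, RGB input, 8 output channels, no biases.
n, c_in, c_out = 3, 3, 8
c_mid = c_in  # assumption: intermediate channels equal input channels

counts = {
    "standard": standard_conv_params(n, c_in, c_out),               # 216
    "depthwise_separable": depthwise_separable_params(n, c_in, c_out),  # 51
    "asymmetric": asymmetric_params(n, c_in, c_mid, c_out),         # 99
}
print(counts)
```

The asymmetric pair still mixes channels in both of its steps, while the depthwise separable version confines all channel mixing to the $(1 \times 1)$ step.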


Posted 2019-08-02T12:57:49.087

Reputation: 1 845

Thanks, I think it's already clearer now. Just two extra questions: 1. How does the number of channels (input & output) play a role in the asymmetric convolution? 2. For the depthwise separable convolution, is a $(1 \times 1 \times k)$ convolution just a linear combination of the channels (after the previous $(n \times n \times 1)$ convolution)? – Pierre Gramme – 2019-08-04T10:31:57.070


1) The kernels in the first convolution each have shape $(1 \times n \times k)$, and those in the second have shape $(n \times 1 \times m)$, where $m$ is the number of $(1 \times n)$ filters you used. The number of filters in the second convolution is the number of output channels that will exist. (Generally used with $k = m$.) 2) Yes. – mshlis – 2019-08-04T13:52:30.623