How is the convolution layer usually implemented in practice?


Following an earlier question, I'm interested in understanding the basics of Conv2d and especially how the kernel is applied, summed, and then propagated. I understand that:

  1. a kernel has size W x H, and more than one kernel may be applied (e.g., S x W x H, where S is the number of kernels).
  2. A stride or step is used when iterating over the network input.
  3. Padding may or may not be added to the network input.

What I would ideally like to see is either a description or a Python sample (PyTorch or TensorFlow) of how that is done, what the dimensionality of the output is, and any operation I may be missing (some YouTube videos say that the kernel is summed and then divided to yield one new unique value representing the feature activation?).


Posted 2020-04-15T15:09:28.690

Reputation: 175

It's not clear to me if you're asking for a general description of how the convolution layer works or just some particular detail, like how to calculate the shape of the output of a convolution layer. – nbro – 2020-04-15T16:18:33.357

@nbro I'd say I'm asking for the implementation, either in a math level, or a pseudo code level, or even better as a minified example? The basics of Conv2D and how that works on the input and how the output is produced. I understand there are other parts and many variants. – Ælex – 2020-04-15T16:20:19.930

Well, the implementation may be different from the actual general idea. So, I suggest you edit your post to clarify that you're specifically looking for "how the convolution layer is usually implemented in practice" – nbro – 2020-04-15T16:22:01.230

@nbro done :-) Only reason I'm asking for the implementation is because it tends to be more easily digestible than the theory more often than not. – Ælex – 2020-04-15T16:35:51.013



I don't think that to understand convolution you need to dig into the nested code of huge libraries, since the code quickly becomes really hard to understand and convoluted (ba dum tsss!). Joking apart, in PyTorch, Conv2d is a layer that calls another low-level function, conv2d, written in C++.

Luckily enough, the PyTorch developers describe the general idea of how convolution is implemented in the documentation:

[Image: excerpt from the PyTorch Conv2d documentation, describing the input/output shapes and the cross-correlation operation]

From this paragraph, we already have some important information, like the input and output dimensions. The number of channels should be easy to understand: if we have an RGB image, for example, there are 3 channels, one for each color, so they are just different matrices representing different features.
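To make those dimensions concrete, here's a small sketch that computes the output spatial size using the standard shape formula given in the Conv2d documentation (the function name `conv2d_output_shape` is my own, just for illustration):

```python
import math

def conv2d_output_shape(h_in, w_in, kernel, stride=1, padding=0, dilation=1):
    """Output height/width of a 2-D convolution, following the shape
    formula from the PyTorch Conv2d documentation."""
    h_out = math.floor((h_in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)
    w_out = math.floor((w_in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)
    return h_out, w_out

# A 32x32 image with a 3x3 kernel, stride 1, no padding:
print(conv2d_output_shape(32, 32, kernel=3))             # (30, 30)
# The same kernel with padding 1 keeps the spatial size:
print(conv2d_output_shape(32, 32, kernel=3, padding=1))  # (32, 32)
```

Note that the number of channels doesn't appear here: it only affects the depth of the output (one feature map per kernel), not the spatial size.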

The next important element is the reference to cross-correlation, the function applied to our input images through the kernel k. Why cross-correlation? Because it is almost identical to a convolution, as you can see by comparing their formulas:

[Image: the formulas for 2-D convolution and 2-D cross-correlation, side by side]
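For reference, the two operations for a 2-D input $I$ and kernel $K$ are commonly written (following Chapter 9 of Goodfellow et al.) as:

$$
S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m,\, j - n) \qquad \text{(convolution)}
$$

$$
S(i, j) = (I \star K)(i, j) = \sum_m \sum_n I(i + m,\, j + n)\, K(m, n) \qquad \text{(cross-correlation)}
$$

The minus signs in the convolution are what "flip" the kernel relative to the input.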

The only difference is in how the indexing of the width is implemented, which causes the operation to start from the bottom right of the input matrix for the convolution and from the top left for the cross-correlation (the circles in the squares in the previous picture). Since, in most programming languages, matrix indexing starts from the top left, cross-correlation is the most common choice in implementations.

But how do these formulas work in practice? Here's another picture taken from Chapter 9 of Deep Learning (Goodfellow, Bengio, Courville), which I strongly suggest you read.

[Image: illustration of 2-D convolution from Chapter 9 of Deep Learning, showing the kernel sliding over the input]

Basically, a submatrix with the same dimensions as the kernel is extracted from the input matrix; then the submatrix and the kernel are multiplied elementwise, and all the resulting products are summed together to produce a single output element, which becomes a 'pixel' of the resulting feature map (the output matrix).
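That description translates almost line for line into code. Here's a minimal, naive sketch in NumPy (real libraries use much faster algorithms such as im2col or FFT-based methods, so treat this purely as an illustration of the idea):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Naive 'valid' cross-correlation: slide the kernel over the image,
    multiply elementwise, and sum each window into one output pixel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1           # output ("valid") dimensions
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]   # submatrix under the filter
            out[i, j] = np.sum(window * kernel)  # elementwise product, summed
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
print(cross_correlate2d(image, kernel))
# [[ 5.  7.  9.]
#  [13. 15. 17.]
#  [21. 23. 25.]]
```

A stride would simply change the step of the two loops; a true convolution would flip the kernel (`kernel[::-1, ::-1]`) before the multiplication.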

Here's another example with made-up numbers. I hope the double notation for filter/kernel doesn't generate confusion; I've actually found that it is sometimes inconsistent (the chapter I linked doesn't use the term filter at all). In practice, they mean the same thing: I usually call kernel the actual matrix that is multiplied with the input, and by filter I refer to the sliding window over the input image (which, of course, must have the same dimensions as the kernel).

[Image: worked numerical example of a convolution, with an input matrix, a kernel, and the resulting feature map]
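You can reproduce one step of such a worked example by hand. With my own made-up numbers (not necessarily the ones in the picture), the top-left output pixel is computed like this:

```python
import numpy as np

# Made-up 3x3 input and 2x2 kernel for illustration.
I = np.array([[1, 2, 0],
              [3, 1, 2],
              [0, 1, 1]])
K = np.array([[1, 0],
              [2, 1]])

# Top-left output pixel: take the 2x2 window at position (0, 0),
# multiply it elementwise by the kernel, and sum everything:
window = I[0:2, 0:2]             # [[1, 2], [3, 1]]
value = int(np.sum(window * K))  # 1*1 + 2*0 + 3*2 + 1*1 = 8
print(value)  # 8
```

Sliding the window one step to the right and one step down gives the other three pixels of the 2x2 feature map in the same way.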

Lastly, when you apply padding, the filter can also move outside the 'edges' of the input matrix, in which case all the elements outside are considered to be zero. The computation is exactly the same, but since the filter can be applied at more positions, the output matrix will have bigger dimensions.
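A quick sketch of zero padding in NumPy: pad the input first, then run the same sliding-window computation on the padded matrix (again a naive illustration, not how libraries actually do it):

```python
import numpy as np

I = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.ones((2, 2))  # all-ones kernel: each output pixel is a window sum

# Zero-pad by one pixel on every side so the filter can slide past the edges.
P = np.pad(I, pad_width=1, mode="constant", constant_values=0.0)
# P is now 4x4, so the "valid" output grows from 1x1 to 3x3.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(P[i:i + 2, j:j + 2] * K)
print(out)
# [[ 1.  3.  2.]
#  [ 4. 10.  6.]
#  [ 3.  7.  4.]]
```

Notice how the corner and edge outputs are smaller: part of each window falls on the zero padding and contributes nothing to the sum.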

[Image: example of convolution with zero padding, showing the filter extending past the edges of the input]

Please note that with multiple input channels you can perform either 2D convolution or 3D convolution; the difference lies in the filter dimension: in 2D convolution it is a square, whereas in 3D convolution it is a cube. This means that for an RGB image, a 2D convolution would treat each color layer independently, mixing the information from the channels only in further computations like pooling (averaging the resulting feature maps of each color, or selecting the max value among the feature maps for each pixel, etc.), while a 3D convolution would mix the color layers together already during the convolution, thanks to the 3D kernel, which sums together elements from different layers.
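The multi-channel case where the kernel spans all channels (the way Conv2d handles an RGB input, producing one feature map per kernel) can be sketched like this; the function name `conv_multichannel` is mine, just for illustration:

```python
import numpy as np

def conv_multichannel(image, kernel):
    """Cross-correlation over a multi-channel input: the kernel has one
    slice per channel, and each output pixel sums over all channels as
    well as over the window, so e.g. an RGB image yields a single 2-D
    feature map per kernel."""
    c, ih, iw = image.shape
    kc, kh, kw = kernel.shape
    assert c == kc, "kernel must have one slice per input channel"
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product over channels, height and width, all summed.
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * kernel)
    return out

rgb = np.ones((3, 4, 4))          # toy 3-channel "image"
k = np.ones((3, 2, 2))            # one 2x2 slice per channel
print(conv_multichannel(rgb, k))  # every entry is 3 * 2 * 2 = 12
```

Applying S such kernels and stacking the S resulting feature maps gives the familiar (S, H_out, W_out) output of a convolution layer.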

Edoardo Guerriero

Posted 2020-04-15T15:09:28.690

Reputation: 1 098

This is absolutely epic and more than I had hoped for! Thank you very much for taking the time and making the effort to answer this in such a magnificent way! I'll read Chapter 9 as you suggested! – Ælex – 2020-04-16T13:21:41.140

Thanks! Glad that it helped – Edoardo Guerriero – 2020-04-16T15:16:44.517