In a convolutional neural network (CNN), a convolutional layer has several channels, each of which has one convolution kernel, often written down as a matrix. This convolution kernel is nothing more than a collection of weights used to compute a linear combination of elements of the input.

While both "traditional" dense layers and convolutional layers compute a linear combination of their inputs, convolutional layers have the added structure of preserving spatial information. Additionally, the output of a convolutional layer, called a **feature map**, can be understood as the result of sliding the kernel along the input.

For instance, take a simple kernel $K = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ and a small input "image" $I = \begin{bmatrix} 9 & 8 & 7 & 6 \\ 5 & 4 & 3 & 2 \\ 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix}$. We take an input with only one channel for convenience: if it's the first layer of the CNN, we can assume the values of $I$ represent a grayscale image. Let us look at how the convolution is performed if no padding is added to the image, in which case the output will be a 2 by 2 matrix.

Starting at the top left corner, we convolve $K = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ with $\begin{bmatrix} 9 & 8 & 7 \\ 5 & 4 & 3 \\ 1 & 2 & 3 \end{bmatrix}$. To do so, we multiply the two matrices element-wise, then sum up all the elements of the resulting matrix: $1 \cdot 9 + 2 \cdot 8 + 3 \cdot 7 + 4 \cdot 5 + 5 \cdot 4 + 6 \cdot 3 + 7 \cdot 1 + 8 \cdot 2 + 9 \cdot 3 = 154$.

Doing the same three more times yields the following feature map: $\begin{bmatrix} 154 & 157 \\ 200 & 233 \end{bmatrix}$.
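You can verify the computation with a short NumPy sketch (a minimal implementation; `convolve2d_valid` is just an illustrative helper name):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with no padding ("valid" mode):
    at each position, take the element-wise product with the patch
    underneath and sum it. Note this is technically cross-correlation,
    which is what CNN layers actually compute."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

K = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
I = np.array([[9, 8, 7, 6], [5, 4, 3, 2], [1, 2, 3, 4], [5, 6, 7, 8]])
print(convolve2d_valid(I, K))
# [[154. 157.]
#  [200. 233.]]
```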

Depending on the kernel of a layer, the feature map will look different, and the information extracted from the input will have a different "meaning". The idea behind a CNN is to make the coefficients of the kernel learnable parameters. Often, multiple "meanings" must be extracted from a single input, which is why a convolutional layer has multiple channels.
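To make the multi-channel idea concrete, here is a toy sketch (the `conv_layer` helper and the two example kernels are illustrative, not from any library): the layer holds a stack of kernels, and each kernel produces its own feature map.

```python
import numpy as np

def conv_layer(image, kernels):
    """Toy convolutional layer: each kernel in the stack is slid over
    the input and yields its own feature map ("valid" mode, no padding).
    In a real CNN, the kernel weights are the learnable parameters."""
    kh, kw = kernels.shape[1:]
    ih, iw = image.shape
    out = np.empty((len(kernels), ih - kh + 1, iw - kw + 1))
    for c, k in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
# Two channels with two different "meanings": a local average
# and a horizontal-difference (vertical-edge) detector.
kernels = np.array([np.full((3, 3), 1 / 9),
                    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]])
feature_maps = conv_layer(image, kernels)
print(feature_maps.shape)  # (2, 2, 2): one 2x2 feature map per channel
```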

However, CNNs didn't invent convolution; they merely made the kernels learnable. Indeed, such methods have been used in "traditional" computer vision (CV) for a long time. In traditional CV, the kernels are hand-crafted to fill specific roles and extract specific types of information.

For edge detection or line detection, you might use a Prewitt filter, which is a discrete approximation of the derivative operator, or the slightly more sophisticated Sobel filter.
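As a quick demonstration, here is the (standard) horizontal Prewitt kernel responding to a sharp vertical edge; `scipy.signal.correlate2d` performs the same sliding element-wise-product-and-sum described above:

```python
import numpy as np
from scipy.signal import correlate2d  # cross-correlation, as in CNN layers

# Hand-crafted Prewitt kernel for detecting vertical edges
# (horizontal intensity changes).
prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])

# A tiny image with a sharp vertical edge between columns 2 and 3.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

edges = correlate2d(image, prewitt_x, mode="valid")
print(edges)
# [[0. 3. 3.]
#  [0. 3. 3.]
#  [0. 3. 3.]]
```

The filter responds strongly (value 3) wherever its window straddles the edge, and not at all in the flat region — exactly the kind of response a CNN's first-layer kernels often learn on their own.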

In fact, it is not uncommon for a CNN to learn approximations of well-known hand-crafted convolution kernels from traditional CV in its first layer. Also, visualizing the feature maps of deeper layers as the training of a CNN progresses gives you good insight into what specific convolution channels are learning to detect.

Thanks for your nice answer! Really helpful for understanding convolutions in general. However, can you elaborate on your answer w.r.t. my question: can I say that the kernels measure the correlation between their input signals? Thanks a lot in advance! – user3352632 – 2021-02-16T19:49:31.293

I'm not certain what you mean by "correlation between input signals". If you're asking whether a convolutional layer measures the correlation between different, distant patches of the input, then no, although a dense layer close to the output layer might indirectly measure it. It does, however, measure the cross-correlation between one patch of the input and the convolution kernel. – David Cian – 2021-02-16T23:24:08.987

Thanks for the clarification. This means a CNN learns useful cross-correlations with its kernels' weights w.r.t. the training data. If the test data does not contain these cross-correlations, the network will not work as expected? – user3352632 – 2021-02-16T23:37:27.667

If, for instance, your CNN has been trained on human faces, it will probably have a convolution which has learned to respond strongly to a human eye, another one which has learned to respond strongly to a human mouth, etc. If you then feed it a test set of bicycles, those convolutions will not register strong activations, which might be good if you expect your network to discriminate between human faces and bicycles. If instead you expect your network to "recognize" monkey faces, and it's been trained on human faces, it probably won't perform very well. Depends on what you expect out of it. – David Cian – 2021-02-16T23:41:49.553