Why do we get a three-dimensional output after a convolutional layer?


In a convolutional neural network, when we apply a convolution to a $5 \times 5$ image with a $3 \times 3$ kernel and stride $1$ (with no padding), we should get only one $3 \times 3$ output. In most CNN tutorials, however, the output is $3 \times 3 \times m$. I don't understand how we get a three-dimensional output or how to calculate $m$. How is $m$ determined? Why do we get a three-dimensional output after a convolutional layer?

Prabu M

Posted 2019-08-16T05:47:03.300

Reputation: 23

Welcome to ai.se.... Questions very similar to this have already been asked many times on this site. I suggest you go through some of them and then come back if you still require clarification. – DuttaA – 2019-08-16T06:14:39.483

Hi Prabu! Please ask one question per post. I will edit your post to leave only the first question; otherwise, your post should be considered too broad, IMHO. Ask the other questions in separate posts. – nbro – 2019-08-16T08:12:20.120



If you have an $h_i \times w_i \times d_i$ input, where $h_i$, $w_i$ and $d_i$ respectively refer to the height, width and depth of the input, then we usually apply $m$ kernels (or filters) of size $h_k \times w_k \times d_i$ to this input (with the appropriate stride and padding), where $m$ is usually a hyper-parameter (that is, chosen by the designer rather than learned). Each kernel spans the full depth $d_i$ of the input, so applying one kernel produces a two-dimensional $h_o \times w_o \times 1$ result, a so-called feature map (also known as an activation map). After the application of $m$ kernels, you obtain $m$ feature maps, which are usually stacked along the depth dimension, hence your output will have a depth of $m$. For this reason, the output is usually referred to as the output volume.
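To make the shapes concrete, here is a minimal pure-Python sketch of the idea above. It is not how any particular library implements a convolutional layer; the function names `conv_output_size` and `conv_layer` are made up for this illustration, padding is assumed to be zero, and (as in most deep learning libraries) the kernel is not flipped, so strictly speaking this is a cross-correlation. For one axis the output size is $\lfloor (n + 2p - k)/s \rfloor + 1$, so a $5 \times 5$ input with a $3 \times 3$ kernel, stride $1$ and no padding gives $3 \times 3$ per kernel.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial size along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1


def conv_layer(image, kernels, stride=1):
    """Apply m kernels to an h_i x w_i x d_i input (nested lists, no padding).

    Each kernel has shape h_k x w_k x d_i. Returns an h_o x w_o x m volume.
    """
    h_i, w_i, d_i = len(image), len(image[0]), len(image[0][0])
    h_k, w_k = len(kernels[0]), len(kernels[0][0])
    h_o = conv_output_size(h_i, h_k, stride)
    w_o = conv_output_size(w_i, w_k, stride)
    output = [[[0.0] * len(kernels) for _ in range(w_o)] for _ in range(h_o)]
    for j, kernel in enumerate(kernels):          # one feature map per kernel
        for r in range(h_o):
            for c in range(w_o):
                total = 0.0
                for dr in range(h_k):
                    for dc in range(w_k):
                        for ch in range(d_i):     # sum over the full input depth
                            total += (image[r * stride + dr][c * stride + dc][ch]
                                      * kernel[dr][dc][ch])
                output[r][c][j] = total           # feature map j sits at depth index j
    return output


# 5 x 5 x 1 input of ones, and m = 4 kernels of size 3 x 3 x 1
# (kernel j is filled with the constant j, purely for illustration)
image = [[[1.0] for _ in range(5)] for _ in range(5)]
kernels = [[[[float(j)] for _ in range(3)] for _ in range(3)] for j in range(4)]

out = conv_layer(image, kernels)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (3, 3, 4): h_o x w_o x m
```

The last dimension of the result is simply the number of kernels you passed in, which is why $m$ is something you pick rather than something you compute from the input size.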

In the context of CNNs, the kernels are learned, so they are not constant during training (after training, they usually remain fixed, unless you perform continual lifelong learning). Each kernel typically ends up different from the others, so each one performs a different convolution with the input and is therefore responsible for filtering (or detecting) a specific feature of the input, which can, for example, be the initial image or the output of another convolutional layer.
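As a rough illustration of different kernels picking up different features, the sketch below applies two kernels to the same image. Note the assumptions: these kernels are written by hand (in a CNN they would be learned), the helper name `correlate2d` is invented for this example, and the input has a single channel so that the kernels can be plain 2-D lists.

```python
def correlate2d(image, kernel):
    """Valid 2-D cross-correlation with stride 1 and no padding."""
    h_k, w_k = len(kernel), len(kernel[0])
    h_o, w_o = len(image) - h_k + 1, len(image[0]) - w_k + 1
    return [[sum(image[r + dr][c + dc] * kernel[dr][dc]
                 for dr in range(h_k) for dc in range(w_k))
             for c in range(w_o)]
            for r in range(h_o)]


# 5 x 5 image: two dark columns on the left, three bright columns on the right,
# i.e. the image contains a vertical edge and no horizontal edge
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written (not learned) kernels, chosen only for illustration
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

vertical_map = correlate2d(image, vertical_edge)
horizontal_map = correlate2d(image, horizontal_edge)

print(vertical_map)    # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
print(horizontal_map)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The first kernel responds where the brightness changes from left to right, while the second stays at zero because nothing changes from top to bottom. Stacking the two maps along the depth dimension would give a $3 \times 3 \times 2$ output volume.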


Posted 2019-08-16T05:47:03.300

Reputation: 19 783

From my understanding of the above answer, in a single layer of a CNN we can have m kernels, and each kernel will perform a different action. If m is the hyper-parameter, then what is the learnable parameter in a CNN? If we can apply multiple filters in a single layer, then why do we use multi-layer CNNs? @nbro – Prabu M – 2019-08-16T10:01:27.580

@PrabuM The kernels are the learnable parameters in the CNN. We do not use MLPs because an equivalent MLP would require a lot more parameters to be learned. – nbro – 2019-08-16T11:08:01.623

Sorry, I still can't understand how the kernels become learnable parameters. – Prabu M – 2019-08-16T12:21:19.027

@PrabuM The kernels are initialized in some way before the training of the CNN. During the training process, the kernels are updated using gradient descent and back-propagation. If you want to know the details, you will have to study how gradient descent and back-propagation are used in the case of CNNs. – nbro – 2019-08-16T14:54:58.293

so the values in the kernels are updated, right? – Prabu M – 2019-08-16T15:45:56.993

@PrabuM Yes, during training, the values of the kernels are updated. – nbro – 2019-08-17T00:44:04.183
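The update discussed in these comments can be sketched on a toy problem. This is only an illustration under several assumptions that are not from the thread: a 1-D signal instead of an image, a single kernel of size 2, a squared-error loss, a hand-derived gradient instead of a framework's automatic back-propagation, and made-up names (`correlate1d`, `sgd_step`, `loss`, `true_kernel`). The point is just that the numbers inside the kernel start at some initial values and are changed step by step by gradient descent.

```python
def correlate1d(x, k):
    """Valid 1-D cross-correlation of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]


def loss(x, k, target):
    """Squared-error loss 0.5 * sum((y - target)^2)."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(correlate1d(x, k), target))


def sgd_step(x, k, target, lr):
    """One gradient-descent step on the kernel values."""
    y = correlate1d(x, k)
    err = [yi - ti for yi, ti in zip(y, target)]
    # dL/dk[j] = sum_i err[i] * x[i + j]; this is the gradient that
    # back-propagation would deliver to the kernel in this tiny setting
    grad = [sum(err[i] * x[i + j] for i in range(len(err)))
            for j in range(len(k))]
    return [kj - lr * gj for kj, gj in zip(k, grad)]


x = [1.0, 2.0, 0.0, -1.0, 3.0]
true_kernel = [1.0, -1.0]            # hypothetical kernel that generated the targets
target = correlate1d(x, true_kernel)

kernel = [0.0, 0.0]                  # initial kernel values
initial_loss = loss(x, kernel, target)
for _ in range(200):
    kernel = sgd_step(x, kernel, target, lr=0.05)
final_loss = loss(x, kernel, target)

print(initial_loss, final_loss)      # the loss shrinks as the kernel is updated
print(kernel)                        # close to true_kernel
```

In a real CNN the same thing happens for every value of every one of the $m$ kernels in every layer, with the gradients computed by back-propagation through the whole network.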


Why do we get a three-dimensional output after a convolutional layer?

During a search for an optimal convolution kernel via gradient descent or some other method, there must be at least one additional dimension to represent trials. Most often there is exactly one. If the input is in $\mathcal{R}^n$ space, then the output of the convolution operation is in $\mathcal{R}^{n+1}$ space. However, this is rarely (if ever) the output of the learning system using a convolutional layer, since the final layer in the conventional deep network designs used today is not the convolution layer.

In the case in this question, $m$ represents the number of discrete kernels tried, usually in rapid succession in most algorithms and hardware acceleration scenarios. The corrective mechanism needs the results in this $\mathcal{R}^{n+1}$ space to converge on an optimum efficiently.

By corrective mechanism, I mean whatever corrects the assumptions made for the next set of kernels to be tried. This mechanism often involves gradient descent and back-propagation in an artificial network design. It is the learning algorithm or, in the case of hardware acceleration, the learning circuit.

The value of $m$ is not arbitrary, but is largely problem-dependent and based on the hardware and execution environment. If $m$ is too large, then too much convolution work is performed before the correction is made. If $m$ is too small, then whatever efficiency is gained by grouping kernel tries together in time is lost. There may be a formula to find $m$, but it will be based on platform-dependent metrics; in practice, $m$ can be found by trying several values and choosing the one that gives the lowest convergence time.

The value of $m$ can also affect accuracy and reliability of convergence. Predicting this effect is not straightforward. Such prediction, which would allow the automated selection of hyper-parameters like $m$, is of interest to many researchers for obvious reasons. It has been and will probably continue to be an objective of AI development to remove the need for human intervention in AI system applications.

Douglas Daseeco

Posted 2019-08-16T05:47:03.300

Reputation: 7 174