In a CNN, does each new filter have different weights for each input channel, or are the same weights of each filter used across input channels?



My understanding is that the convolutional layer of a convolutional neural network has four dimensions: input_channels, filter_height, filter_width, number_of_filters. Furthermore, it is my understanding that each new filter just gets convolved over ALL of the input_channels (or feature/activation maps from the previous layer).

HOWEVER, the graphic below from CS231 shows each filter (in red) being applied to a SINGLE CHANNEL, rather than the same filter being used across channels. This seems to indicate that there is a separate filter for EACH channel (in this case I'm assuming they're the three color channels of an input image, but the same would apply for all input channels).

This is confusing - is there a different unique filter for each input channel?

Convolutional filters diagram


The above image seems contradictory to an excerpt from O'Reilly's "Fundamentals of Deep Learning":

"...filters don't just operate on a single feature map. They operate on the entire volume of feature maps that have been generated at a particular layer...As a result, feature maps must be able to operate over volumes, not just areas"

...Also, it is my understanding that these images below are indicating that THE SAME filter is just convolved over all three input channels (contradictory to what's shown in the CS231 graphic above):

Application of a volumetric convolutional filter to an RGB image

Convolutions on an RGB image

Ryan Chase

Posted 2018-03-22T02:36:20.950

Reputation: 543

chapter 2 – Martin Thoma – 2018-11-26T12:12:26.517

Great question and answers below. The thing that threw me was that you actually need as many filters as there are output channels. So a Conv2D on an RGB image that outputs an RGB image will require 3 filters of shape [3,W,H]. So the actual volume of weights is [3,3,W,H]. I wrote an implementation in pytorch here.

– Gal_M – 2020-07-02T07:17:31.303



The following picture that you used in your question very accurately describes what is happening. Remember that each element of the 3D filter (grey cube) holds a different value (3x3x3=27 values). So, three different 2D filters of size 3x3 can be concatenated to form this one 3D filter of size 3x3x3.


The 3x3x3 RGB chunk from the picture is multiplied element-wise by a 3D filter (shown in grey). In this case, the filter has 3x3x3=27 weights. When the 27 element-wise products are summed, they give one value.

So, is there a separate filter for each input channel?

YES, there are as many 2D filters as number of input channels in the image. However, it helps if you think that for input matrices with more than one channel, there is only one 3D filter (as shown in the image above).
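As a quick sanity check (a minimal PyTorch sketch, with arbitrary example sizes), a Conv2d layer's weight tensor makes this explicit: each filter stores one distinct 2D kernel per input channel, stacked into a single 3D filter:

```python
import torch
import torch.nn as nn

# One filter over a 3-channel input: the weight tensor holds a separate
# 3x3 kernel for each input channel, stacked into one 3x3x3 filter.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

print(conv.weight.shape)  # torch.Size([1, 3, 3, 3]): 1 filter x 3 channels x 3x3
# The per-channel 2D kernels are independent parameters, not shared copies:
print(torch.equal(conv.weight[0, 0], conv.weight[0, 1]))  # False (random init)
```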

Then why is this called 2D convolution (if filter is 3D and input matrix is 3D)?

This is 2D convolution because the stride of the filter is along the height and width dimensions only (NOT depth) and therefore, the output produced by this convolution is also a 2D matrix. The number of movement directions of the filter determines the dimensions of the convolution.
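To illustrate (a small PyTorch sketch with made-up sizes): one 3x3x3 filter sliding over a 3-channel 5x5 input produces a single 2D map, because the filter spans the full depth and moves along height and width only:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)  # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

y = conv(x)
# The filter covers the whole depth, so the depth dimension collapses:
print(y.shape)  # torch.Size([1, 1, 3, 3]): one 2D 3x3 output map
```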

Note: If you build up your understanding by visualizing a single 3D filter instead of multiple 2D filters (one for each layer), then you will have an easy time understanding advanced CNN architectures like Resnet, InceptionV3, etc.

Mohsin Bukhari

Posted 2018-03-22T02:36:20.950

Reputation: 694

this is a good explanation, but more specifically the question I'm trying to understand is whether the filters that operate on each input channel are copies of the same weights, or completely different weights. This isn't actually shown in the image, and in fact to me that image kind of suggests that it's the same weights applied to each channel (since they're the same color)... Per @Neil Slater's answer, it sounds like each filter actually has input_channels versions with different weights. If this is also your understanding, is there an "official" source confirming this? – Ryan Chase – 2018-03-22T23:25:44.220

Yes, indeed, that's also my understanding. For me, that was clear when I tried to think of that grey cube as being composed of 27 different weight values. This means that there are 3 different 2D filters rather than the same 2D filter applied to each input layer. – Mohsin Bukhari – 2018-03-23T10:59:44.657

I could not find any official source confirming this. However, when I was trying to wrap my head around this same concept, I created a dummy input and weight filter in Tensorflow and observed the output. I was content with that. If I find any official explanation, I will edit my answer above. – Mohsin Bukhari – 2018-03-23T11:01:17.133

If you follow the Tensorflow path, you can print your weight filter after showing your dummy CNN layer an input sample. – Mohsin Bukhari – 2018-03-23T11:03:00.913

@Mohsin Bukhari I will definitely try to explore the filters within TensorFlow. Would you be willing to share your code for how you went about exploring what's contained in the filters? Are you able to print the values of the filter at each step in the network, for example? – Ryan Chase – 2018-03-25T00:58:20.790


In a convolutional neural network, is there a unique filter for each input channel or are the same new filters used across all input channels?

The former. In fact there is a separate kernel defined for each input channel / output channel combination.

Typically for a CNN architecture, in a single filter as described by your number_of_filters parameter, there is one 2D kernel per input channel. There are input_channels * number_of_filters sets of weights, each of which describe a convolution kernel. So the diagrams showing one set of weights per input channel for each filter are correct. The first diagram also shows clearly that the results of applying those kernels are combined by summing them up and adding bias for each output channel.

This can also be viewed as using a 3D convolution for each output channel, that happens to have the same depth as the input. Which is what your second diagram is showing, and also what many libraries will do internally. Mathematically this is the same result (provided the depths match exactly), although the layer type is typically labelled as "Conv2D" or similar. Similarly if your input type is inherently 3D, such as voxels or a video, then you might use a "Conv3D" layer, but internally it could well be implemented as a 4D convolution.
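Both views can be checked directly in code (an illustrative PyTorch sketch; the channel counts are arbitrary): indexing a Conv2d weight as weight[o, i] picks out the 2D kernel connecting input channel i to output channel o, and there are input_channels * number_of_filters of them:

```python
import torch
import torch.nn as nn

# Arbitrary example: 3 input channels, 4 filters (output channels), 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)

out_ch, in_ch, kh, kw = conv.weight.shape
print((out_ch, in_ch, kh, kw))   # (4, 3, 3, 3)
print(out_ch * in_ch)            # 12 distinct 2D kernels in total
# conv.weight[o, i] is the 2D kernel linking input channel i to output channel o:
print(conv.weight[2, 1].shape)   # torch.Size([3, 3])
```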

Neil Slater

Posted 2018-03-22T02:36:20.950

Reputation: 14 632

I would like to note that, in that source, filters (weights or kernels) are volumes (i.e. 3-dimensional), and they have the same 3rd dimension as that of the input volume. Furthermore, as it is (at least) now stated in that source, the volumes have been sliced across the 3rd dimension in order to better visualize the application of the filter to the input volume. I don't think that, in general, "there is a separate kernel defined for each input channel / output channel combination." is correct.

– nbro – 2019-02-13T22:32:35.807

Note that the filters (or kernels) are the weights that need to be learned (i.e. they are not fixed, but they are actually the parameters of the CNN). It might be that they are (i.e. the slices of the filter), at the end, the same across the 3rd dimension. – nbro – 2019-02-13T22:39:06.623

@nbro: Yes you can implement a 2D convolution across multiple 2D slices as a single 3D convolution with the kernel depth same as number of channels. Mathematically this is identical to my description. You can also view it as a truncated fully-connected feed forward network with shared weights (many of which are zero). This answer focuses on what the view of 2D filters is, because the OP is asking about how the 2D filters are arranged. They may in fact be arranged into a larger 3D kernel, but they are still applied as 2D kernels using the "trick" that the 3D convolution is equivalent. – Neil Slater – 2019-02-14T07:46:09.403

thanks for this explanation. It sounds like each filter actually has input_channels versions with different weights. Do you have an "official" source that confirms this understanding? – Ryan Chase – 2018-03-22T23:28:50.120

@RyanChase: Yes that is correct. I would just point you at Andrew Ng's course on CNNs - starting here with how a colour image would be processed:

– Neil Slater – 2018-03-23T10:26:05.630


I'm following up on the answers above with a concrete example, in the hope of further clarifying how the convolution works with respect to the input and output channels and the weights, respectively:

Let the example be as follows (with respect to 1 convolutional layer):

  • the input tensor is 9x9x5, i.e. 5 input channels, so input_channels=5
  • the filter/kernel size is 4x4 and the stride is 1
  • the output tensor is 6x6x56, i.e. 56 output channels, so output_channels=56
  • the padding type is 'VALID' (i.e. no padding)

We note that:

  • since the input has 5 channels, the filter must be 3-dimensional with size 4x4x5 in order to convolve over the input of size 9x9x5; equivalently, it consists of 5 separate, unique 2D filters of size 4x4 (16 weights each)
  • therefore: for each input channel, there exists a distinct 2D filter with 16 different weights each. In other words, the number of 2D filters matches the number of input channels
  • since there are 56 output channels, there must be 56 3-dimensional filters W0, W1, ..., W55 of size 4x4x5 (cf. in the CS231 graphic there are 2 3-dimensional filters W0, W1 to account for the 2 output channels), where the 3rd dimension of size 5 represents the link to the 5 input channels (cf. in the CS231 graphic each 3D filter W0, W1 has the 3rd dimension 3, which matches the 3 input channels)
  • therefore: the number of 3D filters equals the number of output channels

That convolutional layer thus contains:

56 3-dimensional filters of size 4x4x5 (= 80 different weights each) to account for the 56 output channels, where each filter has a 3rd dimension of size 5 to match the 5 input channels. In total there are 56x5 = 280 2D filters of size 4x4 (i.e. 280x16 = 4480 different weights in total).
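The shapes in this worked example can be verified with a short PyTorch sketch (same sizes as above):

```python
import torch
import torch.nn as nn

# The layer from the example: 9x9x5 input, 4x4 kernel, stride 1, no padding.
x = torch.randn(1, 5, 9, 9)  # (batch, in_channels, height, width)
conv = nn.Conv2d(5, 56, kernel_size=4, stride=1, padding=0, bias=False)

print(conv.weight.shape)    # torch.Size([56, 5, 4, 4]): 56 3D filters of size 4x4x5
print(conv.weight.numel())  # 4480 weights = 280 2D filters x 16 weights each
print(conv(x).shape)        # torch.Size([1, 56, 6, 6]): the 6x6x56 output
```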

Lukas Z.

Posted 2018-03-22T02:36:20.950

Reputation: 71


Just to make two details absolutely clear:

Say you have $N$ 2D input channels going to $N$ 2D output channels. The total number of 2D $3\times3$ filters is actually $N^2$. But how is the 3D convolution affected? That is, if every input channel contributes one 2D layer to every output channel, then each output channel is composed initially of $N$ 2D layers; how are they combined?

This tends to be glossed over in almost every publication I've seen, but the key concept is that the $N^2$ 2D output layers are interleaved with each other to form the $N$ output channels, like shuffled card decks, before being summed together. This is all logical when you realize that along the channel dimension of a convolution (which is never illustrated), you actually have a fully connected layer! Every input 2D channel, multiplied by a unique $3\times 3$ filter, yields a 2D output layer contribution to a single output channel. Once combined, every output channel is a combination of every input channel $\times$ a unique filter. It's an all-to-all contribution.

The easiest way to convince yourself of this is to imagine what happens in other scenarios and see that the computation becomes degenerate - that is, if you don't interleave and recombine the results, then the different outputs wouldn't actually do anything - they'd have the same effect as a single output with combined weights.
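One way to see the "fully connected along channels" point concretely (an illustrative PyTorch sketch, with made-up sizes): a 1x1 convolution is exactly a linear layer applied independently at every pixel, with its weight matrix connecting every input channel to every output channel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 5, 5)                        # 4 input channels
conv = nn.Conv2d(4, 3, kernel_size=1, bias=False)  # 3 output channels

# Reuse the conv weights as a plain (3, 4) fully connected weight matrix.
linear = nn.Linear(4, 3, bias=False)
linear.weight.data = conv.weight.data.view(3, 4)

y_conv = conv(x)                                   # shape (1, 3, 5, 5)
y_lin = linear(x.permute(0, 2, 3, 1))              # channels-last, per-pixel matmul
print(torch.allclose(y_conv, y_lin.permute(0, 3, 1, 2)))  # True
```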


Posted 2018-03-22T02:36:20.950

Reputation: 1


For anyone trying to understand how convolutions are calculated, here is a useful code snippet in Pytorch:

import numpy as np
import torch
import torch.nn as nn

batch_size = 1
height = 3
width = 3
conv1_in_channels = 2
conv1_out_channels = 2
conv2_out_channels = 2
kernel_size = 2
# (N, C_in, H, W) is the shape of all tensors: (batch_size, channels, height, width)
input = torch.Tensor(np.arange(0, batch_size*height*width*conv1_in_channels).reshape(batch_size, conv1_in_channels, height, width))
conv1 = nn.Conv2d(conv1_in_channels, conv1_out_channels, kernel_size, bias=False) # no bias to make calculations easier
# set the weights of the convolutions to make the convolutions easier to follow
nn.init.constant_(conv1.weight[0][0], 0.25)
nn.init.constant_(conv1.weight[0][1], 0.5)
nn.init.constant_(conv1.weight[1][0], 1) 
nn.init.constant_(conv1.weight[1][1], 2) 
out1 = conv1(input) # compute the convolution

conv2 = nn.Conv2d(conv1_out_channels, conv2_out_channels, kernel_size, bias=False)
nn.init.constant_(conv2.weight[0][0], 0.25)
nn.init.constant_(conv2.weight[0][1], 0.5)
nn.init.constant_(conv2.weight[1][0], 1) 
nn.init.constant_(conv2.weight[1][1], 2) 
out2 = conv2(out1) # compute the convolution

for tensor, name in zip([input, conv1.weight, out1, conv2.weight, out2], ['input', 'conv1', 'out1', 'conv2', 'out2']):
    print('{}: {}'.format(name, tensor))
    print('{} shape: {}'.format(name, tensor.shape))

Running this gives the following output:

input: tensor([[[[ 0.,  1.,  2.],
          [ 3.,  4.,  5.],
          [ 6.,  7.,  8.]],

         [[ 9., 10., 11.],
          [12., 13., 14.],
          [15., 16., 17.]]]])
input shape: torch.Size([1, 2, 3, 3])
conv1: Parameter containing:
tensor([[[[0.2500, 0.2500],
          [0.2500, 0.2500]],

         [[0.5000, 0.5000],
          [0.5000, 0.5000]]],

        [[[1.0000, 1.0000],
          [1.0000, 1.0000]],

         [[2.0000, 2.0000],
          [2.0000, 2.0000]]]], requires_grad=True)
conv1 shape: torch.Size([2, 2, 2, 2])
out1: tensor([[[[ 24.,  27.],
          [ 33.,  36.]],

         [[ 96., 108.],
          [132., 144.]]]], grad_fn=<MkldnnConvolutionBackward>)
out1 shape: torch.Size([1, 2, 2, 2])
conv2: Parameter containing:
tensor([[[[0.2500, 0.2500],
          [0.2500, 0.2500]],

         [[0.5000, 0.5000],
          [0.5000, 0.5000]]],

        [[[1.0000, 1.0000],
          [1.0000, 1.0000]],

         [[2.0000, 2.0000],
          [2.0000, 2.0000]]]], requires_grad=True)
conv2 shape: torch.Size([2, 2, 2, 2])
out2: tensor([[[[ 270.]],

         [[1080.]]]], grad_fn=<MkldnnConvolutionBackward>)
out2 shape: torch.Size([1, 2, 1, 1])

Notice how each output channel of the convolution sums over the outputs from all input channels.

Simon Alford

Posted 2018-03-22T02:36:20.950

Reputation: 146


The restrictions apply only in 2D. Why?

Imagine a fully connected layer.

It'd be awfully huge: each neuron would be connected to maybe 1000x1000x3 input neurons. But we know that processing nearby pixels makes sense, therefore we limit ourselves to a small 2D neighborhood, so each neuron is connected only to a 3x3 neighborhood of neurons in 2D. We don't know such a thing about channels, so we connect to all channels.

Still, there would be too many weights. But because of the translation invariance, a filter working well in one area is most probably useful in a different area. So we use the same set of weights across 2D. Again, there's no such translation invariance between channels, so there's no such restriction there.
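A rough back-of-the-envelope version of this argument in Python (the 1000x1000 image size is a made-up example): 2D locality and 2D weight sharing shrink the weight count enormously, while the channel dimension stays fully connected:

```python
import torch.nn as nn

# Fully connected: one output map where every pixel sees every input value.
fc_weights = (1000 * 1000 * 3) * (1000 * 1000)  # 3 * 10**12 weights

# Convolutional: a 3x3 neighborhood, shared across all 2D positions,
# but still connected to all 3 channels (no sharing across channels).
conv = nn.Conv2d(3, 1, kernel_size=3, bias=False)
conv_weights = sum(p.numel() for p in conv.parameters())

print(fc_weights)    # 3000000000000
print(conv_weights)  # 27 = 3 channels * 3 * 3
```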


Posted 2018-03-22T02:36:20.950

Reputation: 459


Refer to the "Local Connectivity" section and slides 7-18.

The "Receptive Field" hyperparameter of a filter is defined by height & width only, as the depth is fixed by the preceding layer's depth.

NOTE that "The extent of the connectivity along the depth axis is always equal to the DEPTH of the input volume" -or- the DEPTH of the activation map (in the case of later layers).

Intuitively, this must be due to the fact that image channel data are interleaved, not planar. This way, applying a filter can be achieved simply by column-vector multiplication.

NOTE that the Convolutional Network learns all the filter parameters (including the depth dimension), for a total of h * w * input_layer_depth + 1 (bias) per filter.
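This per-filter parameter count is easy to confirm (a small PyTorch sketch; the depth-16, 5x5 sizes are arbitrary): one filter over a depth-16 input with a 5x5 receptive field learns 5 * 5 * 16 + 1 (bias) parameters:

```python
import torch.nn as nn

# One filter with a 5x5 receptive field over a depth-16 input volume.
conv = nn.Conv2d(in_channels=16, out_channels=1, kernel_size=5, bias=True)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 401 = 5 * 5 * 16 + 1 (bias)
```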


Posted 2018-03-22T02:36:20.950

Reputation: 1


I recommend chapter 2.2.1 of my master's thesis as an answer. To add to the other answers:

Keras is your friend to understand what happens:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, input_shape=(28, 28, 3),
          kernel_size=(5, 5),
          padding='same', use_bias=False))
model.add(Conv2D(17, (3, 3), padding='same', use_bias=False))
model.add(Conv2D(13, (3, 3), padding='same', use_bias=False))
model.add(Conv2D(7, (3, 3), padding='same', use_bias=False))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()



Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 28, 28, 32)        2400      
conv2d_2 (Conv2D)            (None, 28, 28, 17)        4896      
conv2d_3 (Conv2D)            (None, 28, 28, 13)        1989      
conv2d_4 (Conv2D)            (None, 28, 28, 7)         819       
Total params: 10,104

Try to formulate your options. What would it mean for the parameter counts if something else were the case?

Hint: $2400 = 32 \cdot (3 \cdot 5 \cdot 5)$

This approach also helps you with other layer types, not only convolutional layers.

Please also note that you are free to implement different solutions that might have other numbers of parameters.

Martin Thoma

Posted 2018-03-22T02:36:20.950

Reputation: 1 023