## Can you explain this CNN architecture to me?


I am starting to get my head around convolutional neural networks, and I have been working with the CIFAR-10 dataset and some research papers that use it. One of these papers describes its CNN architecture in a compact notation, and I am not sure how to interpret it: how many layers are there, and how many neurons are in each?

This is an image of their structure notation.

1. Can someone give me an explanation of what exactly this structure looks like?

2. In the CIFAR-10 dataset, each image is $$32 \times 32$$ pixels, represented by 3072 integers giving the red, green, and blue values for each pixel.

Does that not mean that my input layer has to be of size 3072? Or is there some way to group the inputs into matrices and then feed them into the network?

It would be better if you shared a link to the paper that you're referring to. – Aniket Velhankar – 2020-05-12T09:22:40.483


While it would certainly help to have the link to the paper, I will give it a shot based on what I understand from the picture.

1) For any convolutional layer, there are a few important things to configure: the kernel (or filter) size, the number of kernels, and the stride. Padding also matters, but it is generally taken to be zero unless mentioned otherwise. Let us consider the picture block by block.

The first block contains 3 convolutional layers: (i) 2 conv layers with 96 filters each, where each filter is $$3 \times 3$$ (and stride $$=1$$ by default, since it is not mentioned), and (ii) a third conv layer with the same configuration but with stride $$=2$$.
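The spatial size after each layer follows from the usual convolution arithmetic, $$\text{out} = \lfloor (\text{in} + 2p - k)/s \rfloor + 1$$. Here is a minimal sketch in plain Python; the helper name `conv_out` is my own, and I am assuming "same" padding ($$p=1$$) for the $$3 \times 3$$ layers, which the paper may or may not use:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv layer: floor((in + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# First block applied to a 32x32 CIFAR-10 image, assuming padding=1 ("same"):
s = 32
s = conv_out(s, kernel=3, stride=1, padding=1)  # 3x3 conv, stride 1 -> 32
s = conv_out(s, kernel=3, stride=1, padding=1)  # 3x3 conv, stride 1 -> 32
s = conv_out(s, kernel=3, stride=2, padding=1)  # 3x3 conv, stride 2 -> 16
print(s)  # 16
```

With no padding the sizes would instead shrink: 32 → 30 → 28, and then 13 after the stride-2 layer. Either way, the stride-2 layer is what halves the resolution, playing the role a pooling layer usually plays.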

The second block is much the same as the first, except that the number of filters is increased to 192 for each layer.

The only notable change in the third block is the introduction of $$1 \times 1$$ convolutional filters instead of $$3 \times 3$$ ones. A $$1 \times 1$$ convolution does not look at any spatial neighbourhood; it mixes the channels at each pixel location independently.
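To make that concrete, a $$1 \times 1$$ convolution is just a linear combination of the input channels applied at every pixel. A toy sketch in plain Python (the function name and list-of-lists layout are mine, not from the paper; real frameworks do this with tensors):

```python
def conv1x1(image, weights):
    """1x1 convolution: image is [C_in][H][W], weights is [C_out][C_in].

    Each output pixel is a weighted sum over the input channels at the
    same location; no spatial neighbourhood is involved.
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    return [[[sum(w_row[c] * image[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for w_row in weights]

# Two input channels, one output channel that simply sums them:
image = [[[1, 2], [3, 4]],   # channel 0
         [[5, 6], [7, 8]]]   # channel 1
print(conv1x1(image, [[1, 1]]))  # [[[6, 8], [10, 12]]]
```

This is why $$1 \times 1$$ layers are often described as acting like a small fully connected layer shared across all pixel positions.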

And finally, a global average pooling layer is used (instead of a fully connected layer): each feature map is collapsed to a single number, its mean, so the output size depends only on the number of channels, not on the spatial dimensions.
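Global average pooling turns a $$C \times H \times W$$ stack of feature maps into a length-$$C$$ vector, one value per map. A sketch, again with a hypothetical name and plain lists:

```python
def global_avg_pool(feature_maps):
    """Collapse [C][H][W] feature maps into a length-C vector of means."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

# One 2x2 feature map whose mean is 4.0:
print(global_avg_pool([[[1, 3], [5, 7]]]))  # [4.0]
```

If the last conv layer has one feature map per class, this vector can be fed straight into a softmax, which is the usual motivation for replacing the fully connected layer.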

2) Your analysis is exactly right for fully connected layers, where the number of units in the input layer must match the vectorized dimension of the input data. In a CNN, however, we give the image directly to the network as a channels $$\times$$ height $$\times$$ width array. The whole idea of a CNN is to exploit the spatial structure of the data by analyzing one patch of the image at a time (the patch size is what the filter size defines). This PyTorch tutorial should give you an idea of how exactly the input is fed to a CNN.
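If your data does arrive as a flat vector of 3072 values, you just reshape it back into a $$3 \times 32 \times 32$$ array before feeding it to the network. A sketch in plain Python, assuming the channel-major layout the CIFAR-10 distribution uses (first 1024 values red, then green, then blue, each in row-major order); the helper name `to_chw` is my own:

```python
def to_chw(flat, channels=3, height=32, width=32):
    """Reshape a flat channel-major pixel vector into [C][H][W]."""
    assert len(flat) == channels * height * width
    return [[[flat[(c * height + i) * width + j] for j in range(width)]
             for i in range(height)]
            for c in range(channels)]

image = to_chw(list(range(3072)))
print(len(image), len(image[0]), len(image[0][0]))  # 3 32 32
print(image[1][0][0])  # 1024 (the first green value)
```

In practice a framework call such as a reshape to `(3, 32, 32)` does exactly this in one line; the point is only that no information is lost going between the flat and the spatial view.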