What do the numbers in this CNN architecture stand for?


So I've got a neural net model (ResNet-18) and made a diagram according to the literature (https://arxiv.org/abs/1512.03385).

I think I understand most of the format of the convolutional layers: filter dimensions, conv, unknown number, stride (if applicable).

What does the number after 'conv' in the convolutional layers indicate? Is it the number of neurons in the layer?

ResNet-18 architecture

Bonus question: this is being used for unsupervised learning of images, i.e. the embedding that the network outputs for an image is used for clustering. Would this make it incorrect for my architecture to have an FC layer at the end (which would be used for classification)?


Posted 2019-08-07T09:59:10.067




This number is the number of kernels (filters) in the layer, which is also the number of output feature maps (channels) that the layer produces. So, for example, in the first convolutional layer, $64$ kernels of size $3 \times 3$ are convolved with the image.
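To make the count concrete, here is a minimal sketch in plain Python of what "64 kernels of size 3 x 3" implies for the weight tensor, the parameter count, and the output shape. The helper `conv2d_summary` is hypothetical, and the 3-channel 224 x 224 input, stride 1, and padding 1 are assumptions chosen for illustration, not values taken from the diagram.

```python
def conv2d_summary(in_channels, out_channels, kernel_size,
                   height, width, stride=1, padding=0, bias=True):
    """Describe a 2D convolution: weight shape, parameter count, output shape.

    Hypothetical helper for illustration; it performs no convolution.
    """
    # Each of the `out_channels` kernels spans all input channels.
    weight_shape = (out_channels, in_channels, kernel_size, kernel_size)
    params = out_channels * in_channels * kernel_size * kernel_size
    if bias:
        params += out_channels  # one bias term per kernel

    # Standard output-size formula for a convolution.
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1

    # One output feature map per kernel.
    return {
        "weight_shape": weight_shape,
        "params": params,
        "output_shape": (out_channels, out_h, out_w),
    }


# Assumed input: an RGB image of 224 x 224 pixels.
summary = conv2d_summary(in_channels=3, out_channels=64, kernel_size=3,
                         height=224, width=224, stride=1, padding=1)
print(summary)
```

Under these assumptions, the layer holds 64 distinct learned kernels and produces 64 feature maps, one per kernel. The number after 'conv' therefore fixes the channel dimension of the output, not the number of times a single filter is applied.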

The ResNet presented in Deep Residual Learning for Image Recognition is used for image classification. Furthermore, note that your diagram already contains a fully connected layer at the end.


Posted 2019-08-07T09:59:10.067


Are those all distinct kernels or is that the total number of convolutions that a particular 3x3 filter must perform when passed over all the pixels in an image? I added the FC layer at the end because I copied the format of their model, however mine isn't used for classification so should I remove the FC layer and possibly the average pooling? – thatsnotmyname71 – 2019-08-07T10:24:45.447

@thatsnotmyname71 The kernels in a CNN are learned, so they will likely all be distinct. What do you mean by the embedding output a network produces for an image? Maybe you could ask a separate question (in a separate post) with more details regarding your specific problem. – nbro – 2019-08-07T10:54:55.910

Ahh, thank you, I understand the kernels now. The network is used for unsupervised learning, so it outputs an embedding for each image in a triplet, which allows the use of a loss function that encourages embeddings that separate the 'anchor' image from its 'distant' image. The concept comes from Tile2Vec/Word2Vec (https://arxiv.org/abs/1805.02855).

– thatsnotmyname71 – 2019-08-08T08:29:47.037
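For readers unfamiliar with the idea in the comment above, a generic margin-based triplet loss can be sketched as follows. This is an illustrative sketch, not necessarily the exact formulation in the linked paper: the function names, the use of plain Euclidean distance, the third 'neighbor' embedding, and the margin value of 1.0 are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, neighbor, distant, margin=1.0):
    """Generic triplet margin loss (assumed form, for illustration).

    The loss is zero once the distant embedding is at least `margin`
    farther from the anchor than the neighbor embedding is.
    """
    return max(0.0,
               euclidean(anchor, neighbor)
               - euclidean(anchor, distant)
               + margin)


# Distant embedding is already far from the anchor: no penalty.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Distant embedding is barely farther than the neighbor: positive penalty.
poorly_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])

print(well_separated, poorly_separated)
```

A loss of this kind operates directly on the embedding vectors, so it needs no class labels.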

@thatsnotmyname71 I will have to read your linked paper before trying to give you an appropriate answer. – nbro – 2019-08-09T11:04:21.257