This question boils down to "How *exactly* do convolution layers work?"

Suppose I have an $n \times m$ greyscale image. So the image has one channel. In the first layer, I apply a $3\times 3$ convolution with $k_1$ filters and padding. Then I have another convolution layer with $5 \times 5$ convolutions and $k_2$ filters. How many feature maps do I have?

## Type 1 convolution

The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.

Then the second layer gets applied. Every single filter gets applied separately to **each of the feature maps**. This results in $k_2$ feature maps for each of the $k_1$ feature maps. So there are $k_1 \times k_2$ feature maps after the second layer. Every single pixel of each of the new feature maps is created by taking $5 \cdot 5 = 25$ "pixels" of the padded feature map from before.

The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.
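As a quick check of the counting under the Type 1 reading, here is a minimal sketch. The function name `type1_counts` and the example values $k_1 = 4$, $k_2 = 6$ are made up for illustration, and biases are ignored as in the formula above.

```python
def type1_counts(k1, k2):
    """Feature-map and parameter counts under the 'Type 1' reading.

    Hypothetical helper for illustration; biases are ignored.
    """
    # Layer 1: one n x m feature map per 3x3 filter.
    maps_after_layer1 = k1
    # Layer 2: each of the k2 filters is applied to each of the k1 maps separately.
    maps_after_layer2 = k1 * k2
    # One 3x3 kernel per first-layer filter, one 5x5 kernel per second-layer filter.
    params = k1 * 3 * 3 + k2 * 5 * 5
    return maps_after_layer1, maps_after_layer2, params


print(type1_counts(k1=4, k2=6))  # (4, 24, 186)
```

Note that under this reading the number of feature maps multiplies from layer to layer, while the parameter count only adds up.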

## Type 2.1 convolution

Like before: The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.

Unlike before: Then the second layer gets applied. Every single filter gets applied to the same region, but across **all feature maps** from before. This results in $k_2$ feature maps in total after the second layer has been executed. Every single pixel of each of the new feature maps is created by taking $k_1 \cdot 5 \cdot 5 = 25 \cdot k_1$ "pixels" of the padded feature maps from before.

The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.

## Type 2.2 convolution

Like above, but instead of having $5 \cdot 5 = 25$ parameters per filter, which are learned once and simply copied for the other input feature maps, each filter has its own $5 \times 5$ kernel for every input feature map. You have $k_1 \cdot 3 \cdot 3 + k_2 \cdot k_1 \cdot 5 \cdot 5$ parameters which have to be learned.
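To make the difference between the two Type 2 variants concrete, the following sketch evaluates both parameter formulas side by side. The function name `type2_param_counts` and the example values are hypothetical, and biases are ignored as in the formulas above.

```python
def type2_param_counts(k1, k2):
    """Parameter counts for the 'Type 2.1' and 'Type 2.2' readings.

    Hypothetical helper for illustration; biases are ignored.
    """
    layer1 = k1 * 3 * 3
    # Type 2.1: one 5x5 kernel per filter, reused for all k1 input maps.
    type_2_1 = layer1 + k2 * 5 * 5
    # Type 2.2: a separate 5x5 kernel for every (filter, input map) pair.
    type_2_2 = layer1 + k2 * k1 * 5 * 5
    return type_2_1, type_2_2


print(type2_param_counts(k1=4, k2=6))  # (186, 636)
```

In both variants the second layer produces $k_2$ feature maps; only the number of learned weights differs.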

## Question

- Is type 1 or type 2 typically used?
- Which type is used in AlexNet?
- Which type is used in GoogLeNet?
- If you say type 1: Why do $1 \times 1$ convolutions make any sense? Don't they only multiply the data with a constant?
- If you say type 2: Please explain the quadratic cost ("For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation")
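For the last bullet, one way to read the quoted sentence is sketched below. It assumes the Type 2.2 reading, where the second layer has $k_1 \cdot k_2 \cdot 5 \cdot 5$ weights. This is an assumption for illustration, not a confirmed answer, and `second_layer_weights` is a hypothetical helper.

```python
def second_layer_weights(k1, k2, kernel=5):
    """Weights in the second layer under the assumed Type 2.2 reading (no biases)."""
    return k1 * k2 * kernel * kernel


base = second_layer_weights(8, 16)      # 8 * 16 * 25 = 3200
doubled = second_layer_weights(16, 32)  # both filter counts doubled: 12800
print(doubled / base)  # 4.0
```

Under that assumption, scaling both filter counts by a factor $c$ scales the second layer's weights, and with them the multiply-adds per output position, by $c^2$. Under Type 1 or Type 2.1 the second layer's weight count does not depend on $k_1$ at all.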

For all answers, please give some evidence (papers, textbooks, documentation of frameworks) that your answer is correct.

## Bonus question 1

Is pooling always applied per feature map, or is it also done across multiple feature maps?

## Bonus question 2

I'm relatively sure that type 1 is correct and I got something wrong with the GoogLeNet paper. But there are 3D convolutions, too. Let's say you have 1337 feature maps of size $42 \times 314$ and you apply a $3 \times 4 \times 5$ filter. How do you slide the filter over the feature maps? (Left to right, top to bottom, first feature map to last feature map?) Does it matter as long as you do it consistently?
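To pin down what "sliding" means here, the sketch below counts the positions a 3D filter can take. It assumes stride 1, no padding, and that the filter extends 3 along the height (42), 4 along the width (314), and 5 along the feature-map axis (1337). That axis assignment is a guess, since the question does not fix it, and `valid_positions` is a hypothetical helper.

```python
from itertools import product


def valid_positions(input_shape, filter_shape):
    """Number of filter positions per axis for stride 1 and no padding."""
    return tuple(n - f + 1 for n, f in zip(input_shape, filter_shape))


# Assumed axis order: (height, width, feature maps).
out = valid_positions((42, 314, 1337), (3, 4, 5))
print(out)  # (40, 311, 1333)

# On a small example, two different visiting orders reach the same set of positions.
small = valid_positions((4, 5, 6), (3, 4, 5))  # (2, 2, 2)
order_a = list(product(range(small[0]), range(small[1]), range(small[2])))
order_b = [(i, j, k)
           for k in range(small[2])
           for j in range(small[1])
           for i in range(small[0])]
print(set(order_a) == set(order_b))  # True
```

Each output value depends only on the input window at its own position, so the order in which positions are visited does not change the result as long as every position is covered once.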

## My research

- I've read the two papers from above, but I'm still not sure what is used.
- I've read the lasagne documentation
- I've read the theano documentation
- I've read the answers on Understanding convolutional neural networks (without following all links)
- I've read Convolutional Neural Networks (LeNet). Especially figure 1 makes me relatively sure that Type 2.1 is the right one. This would also fit the "quadratic cost" comment in the GoogLeNet paper and some practical experience I had with Caffe.

A while later: Analysis and Optimization of Convolutional Neural Network Architectures, especially chapter 2 and Figure 2.2 and Figure 2.3.

– Martin Thoma – 2018-02-22T08:44:16.693