When is max pooling exactly applied in convolutional neural networks?



When using convolutional networks on images with multiple channels, do we max pool after we sum the feature maps from each channel, or do we max pool each feature map separately and then sum?

What's the intuition behind this, or is there a difference between the two?


Posted 2020-02-05T11:10:30.670

Reputation: 41



The pooling operation is applied to the output of the convolution layer. More precisely, it is applied separately to each of the input channels (or depth slices). So, if the pooling layer receives an input volume of $H_i \times W_i \times D$, it will produce an output volume $H_o \times W_o \times D$: the depth of the output volume is equal to the depth of the input volume. There is no sum involved in the pooling operation. For example, in the case of max pooling, you take the maximum value within a 2D window, and you do this for each input channel independently.
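The per-channel behaviour described above can be sketched in plain NumPy (a minimal illustration, not a production implementation; the function name and the 2x2/stride-2 defaults are just for this example):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling applied independently to each depth slice.

    x: array of shape (H, W, D). Returns an array of shape
    (H_out, W_out, D): only the spatial dims shrink, D is unchanged.
    """
    H, W, D = x.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.empty((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            h, w = i * stride, j * stride
            # Maximum over the spatial window only (axes 0 and 1);
            # the channel axis is kept, so channels never mix.
            out[i, j, :] = x[h:h + size, w:w + size, :].max(axis=(0, 1))
    return out

x = np.random.rand(4, 4, 3)   # 4x4 input volume with 3 channels
y = max_pool2d(x)
print(y.shape)                # (2, 2, 3): depth is preserved
```

Note that there is no summation anywhere: each output channel depends only on the corresponding input channel.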

In the course notes Convolutional Neural Networks: Architectures, Convolution / Pooling Layers, Andrej Karpathy writes

Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume.

The following diagram (a screenshot of one of the figures from the mentioned article) should provide some intuition behind the pooling operation.

[Figure from the cited notes: a max pooling layer downsampling each depth slice of the input volume independently]


Posted 2020-02-05T11:10:30.670

Reputation: 19 783


The pooling operation in a CNN is applied independently to each depth slice (feature map), and the resulting pooled maps remain disjoint. This is why, in most schematics of CNN architectures, a pooling layer leaves the number of feature maps unchanged: each map is downsampled on its own. (For a raw RGB input, this would mean pooling the red, green, and blue channels separately.)


Each channel in an image captures information that may not be present in the other channels, owing to its colour sensitivity. Hence, pooling each channel independently is intuitive: a group of pixels in the red channel may not provide the same features as the same group of pixels in the blue or green channel. The comparison of pixel values is therefore restricted to within a channel.

The following image (omitted here) visualizes each of the RGB channels converted to greyscale. Note how bright the red regions of the colour image appear in the red channel compared with the blue and green channels. [Image Source]

Finally, to feed the multi-dimensional outputs of the convolutional layers to a fully connected layer, the feature maps are stacked along the depth dimension and flattened into a vector. This allows the dense layer to classify images (using a softmax activation in the final dense layer) based on non-linear combinations of these high-level features.
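The flatten-then-classify step can be sketched as follows (a hedged NumPy illustration; the 7x7x64 shape, the 10 classes, and the random weights are arbitrary assumptions for the example, not taken from any specific network):

```python
import numpy as np

# Hypothetical pooled output volume: 7x7 spatial, 64 feature maps.
feature_maps = np.random.rand(7, 7, 64)

# Flatten the stacked maps into a single vector for the dense layer.
flat = feature_maps.reshape(-1)      # shape: (7*7*64,) = (3136,)
print(flat.shape)                    # (3136,)

# A dense layer is a weight matrix applied to this vector; a softmax
# over the resulting logits yields class probabilities (10 classes here).
W = np.random.rand(10, flat.size) * 0.01
logits = W @ flat
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # probabilities sum to 1
```

The bias terms and training procedure are omitted; the point is only that flattening turns the 3D volume into the 1D input a dense layer expects.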


Posted 2020-02-05T11:10:30.670

Reputation: 335