Why convolve if Max Pooling is just going to downsample the image anyway?



The idea of applying filters to do something like identify edges is a pretty cool idea.

For example, you can take an image of a 7. With some filters, you can end up with transformed images that emphasize different characteristics of the original image. The original 7:

[image: the original digit 7]

can be experienced by the network as:

[image: edge-filtered versions of the 7]

Notice how each image has extracted a different edge of the original 7.
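To make the filtering idea concrete, here is a minimal numpy sketch (the image, kernel values, and function name are all made up for illustration; note that, like most deep-learning libraries, it computes cross-correlation rather than flipped convolution):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding ('valid' mode).

    Like most deep-learning libraries, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny stand-in image: a single bright vertical stroke, like part of a 7
img = np.zeros((5, 5))
img[:, 2] = 1.0

# A Sobel-style vertical-edge kernel
vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [2.0, 0.0, -2.0],
                          [1.0, 0.0, -1.0]])

fmap = conv2d_valid(img, vertical_edge)
# The feature map responds strongly (with opposite signs) on the two
# sides of the stroke and is zero on the stroke itself
```

Each filter produces a different feature map like `fmap`, which is why one image of a 7 can turn into several images, each emphasizing a different kind of edge.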

This is all great, but then, say the next layer in your network is a Max Pooling layer.

My question is, generally, doesn't this seem a little bit like overkill? We were just very careful and deliberate about identifying edges using filters -- and now we no longer care about any of that, since we've blasted the hell out of the pixel values! Please correct me if I'm wrong, but we went from 25 x 25 to 2 x 2! Why not just go straight to Max Pooling then -- won't we end up with basically the same thing?

As an extension to my question, I can't help but wonder what would happen if, coincidentally, each of the 4 squares just happened to have a pixel with the same max value. Surely this isn't a rare case, right? Suddenly all your training images look exactly the same.

Monica Heddneck

Posted 2016-09-21T07:55:38.103

Reputation: 605



Max pooling doesn't down-sample the image. It down-samples the features (such as edges) that you have just extracted. This means you know only approximately where those edges or other features are. Often this is just what the network needs for generalisation: in order to classify, it doesn't need to know that there is a vertical edge running from (10,5) to (10,20), only that there is an approximately vertical edge about 1/3 of the way in from the left edge, at about 2/3 of the height of the image.

These rougher categories of features inherently cover more variations in the input image for very little cost, and the reduction in size of the feature map is a nice side effect too, making the network faster.
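That "approximately where" point can be sketched in a few lines of numpy (the function name is mine, not from any library): a feature activation that shifts by one pixel within a pool window produces exactly the same pooled output.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling (assumes even dimensions)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A feature map with a single strong activation...
a = np.zeros((4, 4)); a[1, 1] = 1.0
# ...and the same activation shifted one pixel, within the same pool window
b = np.zeros((4, 4)); b[0, 0] = 1.0

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

The pooled maps are identical even though the inputs differ, which is exactly the kind of tolerance to small shifts described above.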

For this to work well, you still need to extract features to start with, which max pooling does not do, so the convolutional layer is necessary. You should find that you can down-sample the original image (to 14x14) instead of using the first max-pooling layer and still get pretty reasonable accuracy. How much pooling to do, and where to add those layers, is yet another hyper-parameter problem when building a deep neural network.

Neil Slater

Posted 2016-09-21T07:55:38.103

Reputation: 24 613


We cannot go directly from the input layer to max pooling, because the convolution layer in between is what extracts the features; max pooling then down-samples those extracted features. If you think features would be lost by jumping straight from a large matrix to a max pooling layer, you can add more convolution layers in between until you are satisfied with the size, and then apply max pooling, so that it is not overkill.

Max pooling, which is a form of down-sampling, is used to identify the most important features, but average pooling and various other techniques can also be used. I normally work with text rather than images, and in my case the values are not usually all the same. But even if they were, it wouldn't make much difference, because max pooling just picks the largest value.
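On the tie case raised in the question: if every value in a pool window happens to be the same, max pooling simply returns that shared value, exactly as it would for any other window. A quick sketch (values invented) also shows how max and average pooling differ when the values are not tied:

```python
import numpy as np

tied = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
# All four candidates tie; the max is still well defined
print(tied.max())    # 0.5

mixed = np.array([[0.9, 0.1],
                  [0.2, 0.3]])
print(mixed.max())   # 0.9 -- max pooling keeps the strongest response
print(mixed.mean())  # average pooling smooths all four values instead
```

A tie inside one window therefore doesn't make training images "look the same"; it only means that one pooled cell's value was attainable from several positions.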

A very good explanation from Wikipedia: the intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. The function of the pooling layer is to progressively reduce the spatial size of the representation, in order to reduce the amount of parameters and computation in the network, and hence also to control overfitting. It is common to periodically insert a pooling layer between successive conv layers in a CNN architecture. The pooling operation provides a form of translation invariance.

Hima Varsha

Posted 2016-09-21T07:55:38.103

Reputation: 2 146

Can you explain the last sentence, "The pooling operation provides a form of translation invariance"? – SmallChess – 2016-09-21T09:40:56.413

@StudentT It means that the output of the max-pool will be about the same if the feature is detected anywhere in the image. Move the thing in the image that is activating the feature and a different input to the max-pool will be maximal, but the output of the max-pool should be the same. – mrmcgreg – 2016-09-23T13:12:42.083

@mrmcgreg I believe that is true for global pooling, not max pooling. Max pooling provides a kind of invariance to local translations within the pool region (e.g. 2x2). This allows for some jitter in the features. – geometrikal – 2017-08-28T07:49:59.430


Convolution is basically filtering the image with a smaller kernel to reduce the size of the representation without losing the relationships between pixels. Pooling also reduces the spatial size, by taking the max, average, or sum of the pixels within the filter window, but it may discard important information in the process -- something convolution avoids by not reducing the size as drastically.
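To see how differently the two operations shrink the representation, a little size bookkeeping helps (the 28x28 input is just an example size):

```python
def conv_out(n, k):
    """Output side length of a 'valid' (no padding, stride 1) k x k convolution."""
    return n - k + 1

def pool_out(n, p):
    """Output side length of non-overlapping p x p pooling."""
    return n // p

n = 28              # example input size
n = conv_out(n, 3)  # 3x3 conv: 28 -> 26, a gentle reduction
n = pool_out(n, 2)  # 2x2 max pool: 26 -> 13, halves each dimension
print(n)  # 13
```

The convolution shaves off only a border, while the pooling layer halves each dimension outright, which is why pooling is the aggressive, lossy step of the pair.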

siddharth parmar

Posted 2016-09-21T07:55:38.103

Reputation: 1