Actually I guess the question is a bit broad! Anyway.
Understanding Convolutional Nets
What Is Learned in ConvNets
ConvNets try to minimize the cost function so that the inputs are categorized correctly in classification tasks. All of the parameter updates and learned filters serve this goal.
Learned Features in Different Layers
They try to reduce the cost by learning low-level, sometimes seemingly meaningless, features like horizontal and vertical lines in their first layers, and then stacking them to make abstract shapes, which often have meaning, in their last layers. Fig. 1, taken from here, illustrates this. The input is the bus, and the grid shows the activations after passing the input through different filters in the first layer. As can be seen, the red frame, which is the activation of a filter whose parameters have been learned, is activated by relatively horizontal edges, while the blue frame is activated by relatively vertical edges. It is possible that ConvNets learn unknown filters that are useful and that we, as e.g. computer vision practitioners, have not yet discovered to be useful. The best part of these nets is that they try to find appropriate filters on their own and do not rely on our limited set of hand-discovered filters. They learn filters to reduce the value of the cost function, and, as mentioned, these filters are not necessarily known to us.
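As a minimal sketch of what such a first-layer filter computes, the following uses hand-written horizontal and vertical edge kernels as stand-ins for learned ones (an assumption: a trained net often ends up with similar-looking first-layer kernels, but a real ConvNet learns these weights itself):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlate a 2-D image with a 2-D kernel ('valid' padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A toy image: top half dark (0), bottom half bright (1) -> a horizontal edge.
image = np.zeros((6, 6))
image[3:, :] = 1.0

# Hand-crafted stand-ins for learned first-layer filters.
horizontal_edge = np.array([[-1., -1., -1.],
                            [ 0.,  0.,  0.],
                            [ 1.,  1.,  1.]])
vertical_edge = horizontal_edge.T

h_act = conv2d_valid(image, horizontal_edge)
v_act = conv2d_valid(image, vertical_edge)

# The horizontal-edge filter fires along the edge; the vertical one stays silent.
print(h_act.max(), v_act.max())
```

The activation maps produced here correspond to the frames in fig. 1: each filter produces one map, and a strong value means "my pattern is here".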
In deeper layers, the features learned in previous layers come together and make shapes which often have meaning. In this paper it has been discussed that these layers may have activations which are meaningful to us, or that the concepts which have meaning to us, as human beings, may be distributed among other activations. In fig. 2 the green frame shows the activations of a filter in the fifth layer of a ConvNet. This filter cares about faces. Suppose that the red one cares about hair. These have meaning. As can be seen, there are other activations that fire right at the positions of the typical faces in the input; the green frame is one of them, and the blue frame is another example. Accordingly, the abstraction of a shape can be learned by a single filter or by numerous filters. In other words, each concept, like a face and its components, can be distributed among the filters. In cases where the concepts are distributed among different filters, looking at any one of them in isolation may be confusing. The information is distributed among them, and to understand that information all of those filters and their activations have to be considered together, even though they may seem very complicated.
CNNs should not be considered black boxes at all. Zeiler et al., in this amazing paper, have discussed that the development of better models is reduced to trial and error if you don't have an understanding of what is done inside these nets. Their paper tries to visualize the feature maps in ConvNets.
Capability to Handle Different Transformations to Generalize
ConvNets use pooling layers not only to reduce the number of parameters but also to be insensitive to the exact position of each feature. The use of pooling also enables the layers to learn different kinds of features: first layers learn simple low-level features like edges or arcs, while deeper layers learn more complicated features like eyes or eyebrows.
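Both effects can be seen in a small sketch of 2×2 max pooling, written in plain NumPy rather than any particular framework: the output is a quarter of the size, and shifting a feature by one pixel can leave the pooled output unchanged:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 4x4 feature map with a single strong activation.
fm = np.zeros((4, 4))
fm[0, 0] = 1.0

# The same activation shifted by one pixel (still inside the same 2x2 window).
fm_shifted = np.zeros((4, 4))
fm_shifted[1, 1] = 1.0

print(max_pool_2x2(fm))  # 2x2 output: the feature survives, its exact position is gone
print(np.array_equal(max_pool_2x2(fm), max_pool_2x2(fm_shifted)))  # True
```

This is exactly the "does this feature exist in this region?" question: the pooled map answers yes or no per 2×2 window, but forgets where inside the window the feature was.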
Max pooling, for instance, investigates whether a particular feature exists in a particular region or not. The idea of pooling layers is very useful, but among all possible transformations it only handles translation. Although filters in different layers try to find different patterns, e.g. a rotated face is learned using different filters than an upright face, CNNs on their own do not have any layer to handle other transformations. To illustrate this, suppose that you want to learn simple faces, without any rotation, with a minimal net. In this case your model may do that perfectly. Now suppose that you are asked to learn all kinds of faces with arbitrary rotation. In this case your model has to be much bigger than the previously learned net, because there have to be filters to learn each of these rotations of the input. Unfortunately, these are not all the transformations: your input may also be distorted. These cases made Max Jaderberg et al. angry, and they composed this paper to deal with these problems in order to settle their anger, and ours.
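The point about rotation can be seen in a tiny NumPy experiment, with a hand-made vertical-edge filter standing in for a learned one: the filter fires on an upright edge but stays silent on the same edge rotated by 90°, so covering the rotated case requires an extra filter (or a module like a spatial transformer):

```python
import numpy as np

# A filter that detects vertical edges (stand-in for a learned filter).
vertical_edge = np.array([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])

# An upright pattern: dark on the left, bright rightmost column (a vertical edge).
upright = np.zeros((3, 3))
upright[:, 2] = 1.0

# The same pattern rotated by 90 degrees (now a horizontal edge).
rotated = np.rot90(upright)

resp_upright = np.sum(upright * vertical_edge)
resp_rotated = np.sum(rotated * vertical_edge)
print(resp_upright, resp_rotated)  # strong response vs. no response
```

A net that must recognize the pattern at many rotations ends up spending capacity on one filter per rotation, which is why the rotation-invariant model has to be so much bigger.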
Convolutional Neural Networks Do Work
Finally, after referring to these points: they work because they try to find patterns in the input data. They stack these patterns to make abstract concepts with their convolution layers. Then, in their dense layers, they try to find out whether the input data contains each of these concepts or not, in order to figure out which class the input data belongs to.
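The whole pipeline described above can be sketched end to end. Note the weights here are random, purely to show the shape of the computation (convolution → ReLU → pooling → dense → class probabilities), not a trained model; the four kernels and three classes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Cross-correlate a 2-D image with a 2-D kernel ('valid' padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool_2x2(x):
    """2x2 max pooling with stride 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image = rng.standard_normal((10, 10))

# Convolution layers: look for patterns (random kernels stand in for learned ones).
kernels = rng.standard_normal((4, 3, 3))
maps = [np.maximum(conv2d_valid(image, k), 0.0) for k in kernels]  # ReLU

# Pooling: keep "does this pattern occur here?", drop the exact position.
pooled = np.stack([max_pool_2x2(m) for m in maps])  # shape (4, 4, 4)

# Dense layer: decide which class the detected concepts point to.
features = pooled.reshape(-1)                # 64 features
W = rng.standard_normal((3, features.size))  # 3 hypothetical classes
probs = softmax(W @ features)
print(probs, probs.sum())  # a probability over the classes
```

In a real ConvNet every weight here would be learned by minimizing the cost function, which is precisely the point made at the start of this answer.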
I add some links that may be helpful: