Understanding the results of "Visualizing and Understanding Convolutional Networks"



I am trying to understand the results of the paper "Visualizing and Understanding Convolutional Networks", in particular the following image:

[image from the paper, not reproduced here]

What do these 3x3 blocks and their 9 cells represent?

From my understanding, each 3x3 block in layer i corresponds to a randomly chosen feature map of that layer (e.g. for layer 1 they randomly chose 9 feature maps, for layer 2 16 feature maps, etc.). On the left (the grayish images), the j-th 3x3 block shows 9 visualizations obtained by mapping each of the top-9 activations (single values) of that feature map back to "pixel space" using a deconvolutional network. On the right, the j-th block shows the 9 input-image patches corresponding to those top-9 activations (e.g. for a feature map in the first layer, each patch is the local region of the input image seen by the neuron that produced the activation). Is my understanding correct?
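To make the "local region seen by a neuron" part concrete, here is a minimal sketch of how I would compute that patch, assuming plain convolution/pooling layers without dilation. The function name `input_patch` and the (kernel, stride, padding) values in the usage below are my own illustration, not something taken from the paper.

```python
def input_patch(row, col, layers, input_size):
    """Return (row_lo, row_hi, col_lo, col_hi), inclusive, of the input region
    seen by the unit at (row, col) in the output of the last layer.

    layers: list of (kernel, stride, padding) tuples, ordered from the input
            up to the layer of interest (hypothetical values, no dilation).
    input_size: height/width of the (square) input image.
    """
    r_lo = r_hi = row
    c_lo = c_hi = col
    # Walk back from the layer of interest to the input. A unit at index i of a
    # layer with kernel k, stride s, padding p covers indices
    # [i*s - p, i*s - p + k - 1] of that layer's input.
    for k, s, p in reversed(layers):
        r_lo, r_hi = r_lo * s - p, r_hi * s - p + k - 1
        c_lo, c_hi = c_lo * s - p, c_hi * s - p + k - 1

    def clip(v):
        # Padding can push the region outside the image; clamp to valid pixels.
        return max(0, min(input_size - 1, v))

    return clip(r_lo), clip(r_hi), clip(c_lo), clip(c_hi)


# Illustrative only: one 7x7 layer with stride 2 and no padding on a 224x224 input.
print(input_patch(3, 5, [(7, 2, 0)], 224))
```

With these assumed numbers, the unit at (3, 5) would see rows 6 to 12 and columns 10 to 16 of the input, which is the kind of patch I believe is shown on the right-hand side.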

However, it's not entirely clear to me how the top-9 activations are chosen. It seems that, for each layer and each feature map, each activation is picked from a different input image (that's why we see e.g. different people in layer 3, row 1, col 1, and different cars in layer 3, row 2, col 2). So within each block, the top-9 activations appear to come from 9 different images of the entire dataset (though images of the same class), although in principle more than one activation could come from the same image.
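To spell out the two selection rules I can imagine, here is a small NumPy sketch. Both functions and the array layout are my own assumptions (`acts` holds the activations of one feature map over N images); I do not know which rule, if either, the authors actually used.

```python
import numpy as np


def top_k_per_image(acts, k=9):
    """Rule A (assumed): keep only the strongest activation of each image,
    then take the k images with the largest such value.

    acts: array of shape (N, H, W), one feature map evaluated on N images.
    Returns a list of (image_idx, row, col, value), at most one per image.
    """
    n, h, w = acts.shape
    flat = acts.reshape(n, -1)
    best_pos = flat.argmax(axis=1)            # strongest location in each image
    best_val = flat[np.arange(n), best_pos]   # its activation value
    order = np.argsort(-best_val)[:k]         # images ranked by that value
    return [(int(i), int(best_pos[i] // w), int(best_pos[i] % w), float(best_val[i]))
            for i in order]


def top_k_global(acts, k=9):
    """Rule B (assumed): take the k largest activations over all
    (image, position) pairs, so one image may appear more than once."""
    order = np.argsort(-acts.ravel())[:k]
    imgs, rows, cols = np.unravel_index(order, acts.shape)
    return [(int(i), int(r), int(c), float(acts[i, r, c]))
            for i, r, c in zip(imgs, rows, cols)]


# Toy data: image 0 holds the two strongest activations.
acts = np.zeros((3, 2, 2))
acts[0, 0, 0], acts[0, 1, 1], acts[1, 0, 1], acts[2, 1, 0] = 10, 9, 5, 7
print(top_k_per_image(acts, k=2))
print(top_k_global(acts, k=2))
```

On the toy data, rule A returns activations from images 0 and 2, while rule B returns two activations from image 0. Rule A would explain why every cell in a block shows a different image; rule B is the case I mention in parentheses above.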

Andreas K.

Posted 2020-04-11T07:22:50.233

No answers