I am trying to understand the results of the paper "Visualizing and Understanding Convolutional Networks", in particular the following image:
What do these 3x3 blocks and their 9 cells represent?
From my understanding, each 3x3 block shown for a given layer corresponds to a randomly chosen feature map in that layer (e.g. for layer 1 they randomly chose 9 feature maps, for layer 2 16 feature maps, etc.). On the left part (grayish images), each 3x3 block shows 9 visualizations obtained by mapping the top-9 activations (single values) of that particular feature map back to "pixel space" using a deconvolutional network. On the right part, the corresponding block shows the 9 input image patches that produced those top-9 activations (e.g. for a given feature map in the first layer, the k-th image patch is the local region of the input image seen by the neuron that produced the k-th of those activations). Is my understanding correct?
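To spell out what I mean by "the local region of the input image seen by a neuron", here is a minimal sketch that maps a neuron position back to its input patch. The function name `receptive_field` and the `(kernel, stride, padding)` layer description are my own illustrative assumptions, not the paper's architecture or code.

```python
def receptive_field(layers, row, col):
    """Return the inclusive input-pixel box (top, left, bottom, right)
    seen by the neuron at (row, col) in the last layer of `layers`.

    `layers` is a list of (kernel, stride, padding) tuples ordered from
    the input to the layer of interest (square kernels assumed).
    Coordinates can fall outside the image when padding is used.
    """
    top, bottom = row, row
    left, right = col, col
    # Walk backwards: an output index o covers inputs
    # o*stride - padding ... o*stride - padding + kernel - 1.
    for kernel, stride, padding in reversed(layers):
        top = top * stride - padding
        left = left * stride - padding
        bottom = bottom * stride - padding + kernel - 1
        right = right * stride - padding + kernel - 1
    return top, left, bottom, right


# Hypothetical example: a single 7x7 convolution with stride 2, no padding.
print(receptive_field([(7, 2, 0)], row=1, col=2))
```

Cropping the input image to that box would give the patch shown in the right part of the figure, if my reading is correct.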
However, it's not entirely clear to me how the top-9 activations are chosen. It seems that for each layer and each feature map, each activation is picked from a different input image (that's why we see e.g. different persons in layer 3, row 1, col 1, and different cars in layer 3, row 2, col 2). So within each block, the top-9 activations appear to come from 9 different images of the entire dataset (all of the same class), although in principle more than one activation could come from the same image.
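To make the question concrete, here is a minimal NumPy sketch of the two selection rules I can imagine. This is only my assumption about the procedure, not something I found confirmed in the paper; the function names and the `(num_images, H, W)` array layout are hypothetical.

```python
import numpy as np


def top_k_activations(feature_maps, k=9):
    """Top-k single activations of ONE feature map over the whole dataset.

    feature_maps: array of shape (num_images, H, W).
    Several of the returned activations may come from the same image.
    Returns a list of (image_index, row, col, value), strongest first.
    """
    flat = feature_maps.reshape(-1)
    order = np.argsort(flat)[::-1][:k]  # indices of the k largest values
    img, row, col = np.unravel_index(order, feature_maps.shape)
    return [(int(i), int(r), int(c), float(flat[o]))
            for i, r, c, o in zip(img, row, col, order)]


def top_k_one_per_image(feature_maps, k=9):
    """Variant: keep only the strongest activation of each image,
    then take the k images with the largest such activation."""
    n, h, w = feature_maps.shape
    per_image = feature_maps.reshape(n, -1)
    best_pos = per_image.argmax(axis=1)   # flat position of each image's max
    best_val = per_image.max(axis=1)
    order = np.argsort(best_val)[::-1][:k]
    return [(int(i), int(best_pos[i] // w), int(best_pos[i] % w),
             float(best_val[i])) for i in order]
```

With the first rule, two cells of a block could show patches from the same image; with the second, every cell necessarily comes from a different image. I don't know which of the two (if either) the figure uses.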