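The pooling operation in a CNN is applied independently to each channel (feature map) of its input, so the number of maps leaving a pooling layer equals the number entering it. This is why schematics of CNN architectures show the depth unchanged across a pooling layer: pooling an RGB image directly, for example, would yield three downsampled maps, one per color channel. A standard convolution, by contrast, sums over all of its input channels, so the number of maps it produces equals its number of filters rather than the number of input channels.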
Each channel in an image captures information that the other channels may not, because each one responds to a different color component. Pooling each channel independently is therefore intuitive: a group of pixels in the red channel need not show the same features as the same group of pixels in the blue or green channel, so values are compared only within a channel.
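The per-channel behaviour described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how any particular framework implements pooling: the helper name `max_pool2d`, the 2x2 non-overlapping window, and the random 4x4 RGB "image" are all assumptions made for the example.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling on an array of shape (H, W, C)."""
    h, w, c = x.shape
    h_out, w_out = h // size, w // size
    # Crop so that height and width are exact multiples of the window size.
    x = x[:h_out * size, :w_out * size, :]
    # Split each spatial axis into (block index, position within block).
    blocks = x.reshape(h_out, size, w_out, size, c)
    # Reduce only over the within-block axes; the channel axis is never mixed.
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3))  # hypothetical 4x4 RGB image
pooled = max_pool2d(image)

# Pooling the red channel on its own gives the same result as the
# red slice of the jointly pooled output.
red_only = max_pool2d(image[:, :, :1])[:, :, 0]
print(pooled.shape)
print(np.array_equal(pooled[:, :, 0], red_only))
```

The spatial size is halved while the channel count stays at three, and each output channel depends only on the matching input channel.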
The following image shows the R, G and B channels of a color image, each rendered in greyscale. For better intuition, note how bright the red regions of the color image appear in each of the three channels:
Finally, to feed the multi-dimensional output of the convolutional layers to a fully connected layer, the feature maps, stacked along the depth dimension, are flattened into a single vector. This lets the dense layers learn to classify images (using a softmax activation in the final dense layer) from non-linear combinations of these high-level features.
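As a rough sketch of this last step, the snippet below flattens a small stack of feature maps and passes the vector through a single dense layer with a softmax. The shapes, the number of classes, and the random weights are hypothetical placeholders; in a trained network the weights would be learned, and there would usually be further dense layers with non-linear activations before the final one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled output: 2x2 spatial size, 3 feature maps.
feature_maps = rng.standard_normal((2, 2, 3))
num_classes = 4  # assumed for illustration

# Flatten (height, width, depth) into one vector of length 2 * 2 * 3 = 12.
flat = feature_maps.reshape(-1)

# One dense layer: every flattened feature connects to every class score.
weights = rng.standard_normal((flat.size, num_classes)) * 0.1
bias = np.zeros(num_classes)
logits = flat @ weights + bias

# Softmax turns the scores into class probabilities; subtracting the
# maximum first keeps the exponentials numerically stable.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(flat.shape, probs.shape)
```

After flattening, the dense layer sees all feature maps at once, which is where information from the separately pooled channels is finally combined.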