Say I have a CNN with this structure:
- input = 1 image (say, 30x30 RGB pixels)
- first convolution layer = 10 5x5 convolution filters
- second convolution layer = 5 3x3 convolution filters
- one dense layer with 1 output
So a graph of the network will look like this:
Am I correct in thinking that the first convolution layer will create 10 new images, i.e. each filter creates a new intermediate 30x30 image (or 26x26 if I crop the border pixels that cannot be fully convolved)?
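To make the size arithmetic concrete, here is a minimal sketch of how I arrive at 26x26. It assumes a stride of 1 and no padding (a "valid" convolution); the function name `valid_output_size` is just something I made up for illustration.

```python
def valid_output_size(input_size: int, kernel_size: int) -> int:
    """Side length of the output of a convolution with stride 1 and no padding."""
    # The kernel fits at (input_size - kernel_size + 1) positions along each axis.
    return input_size - kernel_size + 1


# First layer: 30x30 input, 5x5 filters.
after_first = valid_output_size(30, 5)

# Second layer: 3x3 filters applied to the cropped output of the first layer.
after_second = valid_output_size(after_first, 3)

print(after_first, after_second)  # 26 24
```

If the border is padded so that the filter can be centered on every pixel, the output would instead stay at 30x30.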
Is the second convolution layer then supposed to apply each of its 5 filters to all 10 images from the previous layer? That would result in a total of 50 images after the second convolution layer.
Finally, the last fully connected (dense) layer would take all the data from these 50 images and somehow combine it into one output value (e.g. the probability that the original input image was a cat).
Or am I mistaken in how convolution layers are supposed to operate?
Also, how should I deal with channels, in this case RGB? Can I treat this entire operation as separate for the red, green and blue data? I.e. for one full RGB image, do I essentially run the entire network three times, once for each color channel? That would mean I also get 3 output values.
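To be concrete about what I mean by "separate for each channel", here is a rough sketch of the per-channel idea I am asking about. It only illustrates my assumption and is not meant as the correct way to do it. The image data, the averaging filter and the helper `conv2d_valid` are all made up for illustration; the helper computes a sliding-window sum of products without flipping the kernel, which is what CNN libraries usually call convolution.

```python
def conv2d_valid(image, kernel):
    """Naive single-channel convolution with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 30x30 RGB image stored as three separate 30x30 planes (dummy values).
rgb = {c: [[1.0] * 30 for _ in range(30)] for c in ("R", "G", "B")}

# One illustrative 5x5 averaging filter.
kernel = [[1.0 / 25] * 5 for _ in range(5)]

# My assumption: the same filter is run independently on each color channel,
# giving one 26x26 result per channel.
per_channel = {c: conv2d_valid(plane, kernel) for c, plane in rgb.items()}

for c, result in per_channel.items():
    print(c, len(result), len(result[0]))
```

Under this assumption, every filter yields three separate 26x26 images (one per color), and carrying that through the whole network is what leads me to expect 3 output values.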