I am starting to get my head around convolutional neural networks, and I have been working with the CIFAR-10 dataset and some research papers that use it. One of these papers describes its CNN architecture in a notation that I am not sure how to interpret, specifically how many layers there are and how many neurons each layer has.

This is an image of their structure notation.

Can someone explain what exactly this structure looks like?

In the CIFAR-10 dataset, each image is $32 \times 32$ pixels, represented by $32 \times 32 \times 3 = 3072$ integers giving the red, green, and blue values of each pixel.

Does that not mean that my input layer has to be of size 3072? Or is there some way to group the inputs into matrices and then feed them into the network?
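
To make the second option concrete, here is a minimal sketch of the grouping I have in mind. It assumes NumPy and the layout documented for the Python version of CIFAR-10 (1024 red values, then 1024 green, then 1024 blue, each in row-major order). The array `flat` is synthetic stand-in data, not a real image, and the variable names are my own.

```python
import numpy as np

# Synthetic stand-in for one CIFAR-10 row: 3072 values in 0..255 (not real data).
flat = (np.arange(3072) % 256).astype(np.uint8)

# Assumed layout: [1024 red | 1024 green | 1024 blue], each plane row-major 32x32.
# Reshaping gives a (channels, height, width) array.
chw = flat.reshape(3, 32, 32)

# Move the channel axis last to get (height, width, channels),
# which is the shape most image-plotting tools expect.
hwc = chw.transpose(1, 2, 0)

print(chw.shape, hwc.shape)
```

If this is right, the network would see a $32 \times 32$ grid with 3 channels rather than a flat vector of 3072 values.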

It would be better if you share the link to the paper that you're referring to.