**NOTE:** I did these calculations speculatively, so some errors might have crept in. Please point out any such errors so I can correct them.

In general, in any CNN, most of the training time is spent on back-propagation of errors in the fully connected layers (whose size depends on the input image size). They also occupy most of the memory. Here is a slide from Stanford about the VGG Net parameters:

You can clearly see that the fully connected layers contribute about 90% of the parameters, so they occupy most of the memory.
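As a sanity check on that 90% figure, here is a rough parameter count for the standard VGG-16 configuration (weights only, biases ignored), written out in plain Python:

```python
# Rough VGG-16 parameter count (weights only, biases ignored).
# Each conv layer uses 3x3 kernels: params = 3 * 3 * c_in * c_out.
conv_layers = [
    (3, 64), (64, 64),                    # block 1
    (64, 128), (128, 128),                # block 2
    (128, 256), (256, 256), (256, 256),   # block 3
    (256, 512), (512, 512), (512, 512),   # block 4
    (512, 512), (512, 512), (512, 512),   # block 5
]
conv_params = sum(3 * 3 * c_in * c_out for c_in, c_out in conv_layers)

# fc6 takes the flattened 7x7x512 feature map; fc7 and fc8 follow.
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

total = conv_params + fc_params
print(f"conv: {conv_params/1e6:.1f}M  fc: {fc_params/1e6:.1f}M  "
      f"fc share: {fc_params/total:.0%}")
```

This gives roughly 14.7M conv parameters versus 123.6M fully connected parameters, i.e. the FC layers hold about 89% of the weights, consistent with the slide.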

As far as training time goes, it largely depends on the size (height × width in pixels) of the image being used. In FC layers it is straightforward: the number of derivatives you have to calculate is equal to the number of parameters. For a convolutional layer, let's work through an example; take the case of the 2nd layer:
It has 64 filters of dimension $(3*3*3)$ to be updated. The error is propagated back from the 3rd layer: each channel in the 3rd layer propagates its error to its corresponding $(3*3*3)$ filter. Thus the $224*224$ output positions contribute about $224*224*(3*3*3)$ weight updates, and since there are $64$ such $224*224$ channels, the total number of calculations to be performed is $64*224*224*(3*3*3) \approx 87*10^6$.
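That arithmetic can be checked directly:

```python
# Weight-gradient calculations for this layer: 64 filters of 3x3x3,
# each weight receiving a contribution from every 224x224 output position.
ops = 64 * 224 * 224 * (3 * 3 * 3)
print(ops)  # 86704128, i.e. about 87 * 10^6
```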

Now let us take the last layer of the $56*56*256$ block. It passes its gradients to the previous layer. Each $56*56$ channel updates a $(3*3*256)$ filter, and since there are 256 such $56*56$ channels, the total number of calculations required is $256 * 56 * 56 * (3*3*256) \approx 1850 * 10^6$.

So the number of calculations in a convolutional layer really depends on the number of filters and the size of the image. In general, I have used the following formula to calculate the number of updates required for the filters in a layer; I have also assumed $stride = 1$, since that is the worst case:

$channels_{output} * (pixelOutput_{height} * pixelOutput_{width}) * (filter_{height} * filter_{width} * channels_{input})$
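The formula above is easy to encode as a small helper; applied to the two layers worked through earlier, it reproduces both totals:

```python
def filter_update_ops(c_out, h_out, w_out, f_h, f_w, c_in):
    """Weight-gradient calculations for one conv layer, stride 1 (worst case):
    c_out * (h_out * w_out) * (f_h * f_w * c_in)."""
    return c_out * (h_out * w_out) * (f_h * f_w * c_in)

print(filter_update_ops(64, 224, 224, 3, 3, 3))    # 86704128   ~ 87 * 10^6
print(filter_update_ops(256, 56, 56, 3, 3, 256))   # 1849688064 ~ 1850 * 10^6
```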

Thanks to fast GPUs we can easily handle these huge numbers of calculations. But in FC layers the entire weight matrix needs to be loaded, which causes memory problems that generally do not arise for convolutional layers, so training the convolutional layers is still easy. Note that all of this has to fit in the GPU's own memory, not the CPU's RAM.

Also here is the parameter chart of AlexNet:

And here is a performance comparison of various CNN architectures:

I suggest you check out Stanford's CS231n Lecture 9 for a better understanding of the nooks and crannies of CNN architectures.

Indeed, purely empirical experience tells me that fully connected layers train much faster, even with hugely more parameters. The operations in them are simple, as is the backprop. CNNs appear to use huge amounts of memory to store all the intermediate gradient terms. – Mastiff – 2020-05-19T22:54:56.113