Which layer in a CNN consumes more training time: convolution layers or fully connected layers?



In a convolutional neural network, which layer consumes more training time: convolution layers or fully connected layers?

We can take the AlexNet architecture to understand this. I want to see a breakdown of where the training time goes. A relative time comparison is enough, so assume any fixed GPU configuration.

Ruchit Dalwadi

Posted 2018-09-06T23:27:52.350

Reputation: 297



NOTE: I did these calculations speculatively, so some errors might have crept in. Please point out any such errors so I can correct them.

In general, in any CNN most of the training time is spent on back-propagating errors through the fully connected layers (the exact share depends on the image size). They also occupy most of the memory. Here is a slide from Stanford about the VGG Net parameters:

[Image: Stanford CS231n slides showing the per-layer parameter and memory breakdown of VGG Net]

You can clearly see that the fully connected layers contribute about 90% of the parameters, so they occupy most of the memory.
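To sanity-check that 90% figure, here is a minimal sketch that tallies the weight counts of VGG-16, assuming the standard configuration (224×224 input, all 3×3 convolutions, fc6/fc7/fc8 heads) and ignoring biases:

```python
# Approximate VGG-16 layer shapes (weights only, biases ignored) -- a sketch,
# not pulled from any framework, so double-check against the actual model.
conv_layers = [  # (kernel_h, kernel_w, in_channels, out_channels)
    (3, 3, 3, 64), (3, 3, 64, 64),
    (3, 3, 64, 128), (3, 3, 128, 128),
    (3, 3, 128, 256), (3, 3, 256, 256), (3, 3, 256, 256),
    (3, 3, 256, 512), (3, 3, 512, 512), (3, 3, 512, 512),
    (3, 3, 512, 512), (3, 3, 512, 512), (3, 3, 512, 512),
]
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(kh * kw * cin * cout for kh, kw, cin, cout in conv_layers)
fc_params = sum(n_in * n_out for n_in, n_out in fc_layers)
total = conv_params + fc_params

print(f"conv: {conv_params / 1e6:.1f}M weights")   # ~14.7M
print(f"fc:   {fc_params / 1e6:.1f}M weights")     # ~123.6M
print(f"fc share: {100 * fc_params / total:.0f}%")  # ~89%
```

So roughly 89-90% of the weights sit in just three fully connected layers, matching the slide.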

As far as training time goes, it depends in part on the size (pixels*pixels) of the image being used. In the FC layers it is straightforward: the number of derivatives you have to calculate equals the number of parameters. For a convolutional layer, let us take the 2nd layer as an example: it has 64 filters of dimension $(3*3*3)$ to be updated. The error is propagated back from the 3rd layer, and each channel in the 3rd layer propagates its error to its corresponding $(3*3*3)$ filter. Thus $224*224$ pixels contribute about $224*224*(3*3*3)$ weight updates, and since there are $64$ such $224*224$ channels, the total number of calculations to be performed is $64*224*224*(3*3*3) \approx 87*10^6$.

Now let us take the last layer of size $56*56*256$. It passes its gradients to the previous layer: each $56*56$ channel updates a $(3*3*256)$ filter, and since there are 256 such $56*56$ channels, the total number of calculations required is $256 * 56 * 56 * (3*3*256) \approx 1850 *10^6$.

So the number of calculations in a convolutional layer really depends on the number of filters and the size of the image. In general I have used the following formula to calculate the number of updates required for the filters in a layer, taking $stride = 1$ since it is the worst case:

$channels_{output} * (pixelOutput_{height} * pixelOutput_{width}) * (filter_{height} * filter_{width} * channels_{input})$
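The formula above can be written as a one-line helper; running it on the two worked examples from the text reproduces the $\approx 87*10^6$ and $\approx 1850*10^6$ figures:

```python
def conv_weight_updates(c_out, out_h, out_w, f_h, f_w, c_in):
    """Per-weight gradient contributions in one conv layer, assuming
    stride 1 (worst case), per the formula in the text."""
    return c_out * (out_h * out_w) * (f_h * f_w * c_in)

# 2nd layer example: 64 filters of (3*3*3), 224x224 output
print(conv_weight_updates(64, 224, 224, 3, 3, 3))    # ~87e6

# 56x56x256 layer example: 256 filters of (3*3*256)
print(conv_weight_updates(256, 56, 56, 3, 3, 256))   # ~1850e6
```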

Thanks to fast GPUs we can easily handle these huge numbers of calculations. But in FC layers the entire weight matrix needs to be loaded at once, which causes memory problems that generally do not arise for convolutional layers, so training convolutional layers remains comparatively easy. Note also that all of this has to fit in the GPU's own memory, not the CPU's RAM.
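The contrast between the two layer types comes down to weight reuse: a convolutional weight is re-applied at every output position, while an FC weight is used exactly once per example. A small sketch, using the $56*56*256$ conv layer from above and VGG's fc6 as illustrative assumptions:

```python
# How many multiply-accumulates does each stored weight perform
# in one forward pass? (Illustrative numbers, not measured.)
conv_weights = 3 * 3 * 256 * 256       # one 3x3 conv, 256 -> 256 channels
conv_macs = conv_weights * 56 * 56     # each weight reused at every 56x56 position
fc_weights = 7 * 7 * 512 * 4096        # VGG fc6
fc_macs = fc_weights                   # each weight used exactly once

print(conv_macs // conv_weights)  # 3136 uses per conv weight
print(fc_macs // fc_weights)      # 1 use per FC weight
```

So conv layers are compute-heavy per stored weight, while FC layers are memory-heavy: few operations, but a huge matrix to hold.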

Also here is the parameter chart of AlexNet:

[Image: AlexNet parameter chart]

And here is a performance comparison of various CNN architectures:

[Image: performance comparison of various CNN architectures]

I suggest you check out Stanford's CS231n Lecture 9 for a better understanding of the nooks and crannies of CNN architectures.


Posted 2018-09-06T23:27:52.350

Reputation: 4 881


A CNN contains the convolution operation, whereas a plain DNN (trained, e.g., with contrastive divergence in the case of deep belief networks) does not; in terms of Big-O notation, the CNN is more complex.

For reference:

  1. See Convolutional Neural Networks at Constrained Time Cost for more details about the time complexity of CNNs

  2. See What is the time complexity of the forward pass algorithm of a neural network? and What is the time complexity for training a neural network using back-propagation? for more details about the time complexity of the forward and backward passes of an MLP

ketul parikh

Posted 2018-09-06T23:27:52.350

Reputation: 51

Indeed, purely empirical experience tells me that fully connected layers train much faster, even with hugely more parameters. The operations in them are simple, as is the backprop. CNNs appear to use huge amounts of memory to store all the intermediate gradient terms. – Mastiff – 2020-05-19T22:54:56.113