Why do very deep non-ResNet architectures perform worse than shallower ones at the same iteration? Shouldn't they just train more slowly?



My understanding of the vanishing gradient problem in deep networks is that as backpropagation proceeds through the layers the gradients shrink, so training progresses more slowly. I'm having a hard time reconciling this with plots like the one below, where the losses for a deeper network are higher than for a shallower one. Shouldn't the deeper network simply take more iterations to train, but still reach the same accuracy, if not better?

[Figure: training loss curves where the deeper network's loss stays above the shallower network's.]

Intent Filters

Posted 2019-09-28T18:25:39.067

Reputation: 61

As a guess, it may be that when you have a large number of layers, the randomness of Xavier initialisation causes the activations to explode or vanish. Xavier init (and other schemes) tries to keep the per-layer variance at 1, but since the weights are random, the actual variance is more likely to be, say, 1.03 or 0.97. With many layers stacked, this slight offset can snowball and cause the numbers to explode or vanish. – Recessive – 2019-09-30T03:36:43.820
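The snowballing that this comment describes is easy to simulate: push a unit-variance vector through a stack of randomly initialised, Xavier-style linear layers (no nonlinearity, purely for illustration) and watch the sample variance drift away from 1 as the small per-layer offsets compound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)  # unit-variance input

variances = []
for _ in range(50):
    # Xavier/Glorot-style init: scale chosen so variance is *approximately*
    # preserved, but each random draw is slightly off
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    x = W @ x
    variances.append(x.var())

# The per-layer factors multiply, so the variance performs a random walk
# away from 1 as depth grows
print(variances[0], variances[-1])
```

With only a few layers the variance stays close to 1; with dozens of layers the multiplicative drift becomes noticeable, which is exactly the comment's point.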

Another guess: numbers in a computer are always approximations, so computing the gradient layer by layer is a lossy process. This might make deeper networks not just slower to learn, but might actually impose a lower ceiling on the achievable accuracy. – BlindKungFuMaster – 2019-10-02T11:11:49.900



Those graphs do not disprove your vanishing-gradient theory. The deeper network may eventually do better than the shallower one, but it might take much longer to get there.

Incidentally, the ReLU activation function was designed to mitigate the vanishing gradient problem.
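To see why ReLU helps: the sigmoid's derivative is at most 0.25, so even in the best case the gradient shrinks by a factor of at least 4 per layer, while ReLU's derivative is exactly 1 for positive inputs. A quick sketch of the compounding effect over 20 layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0  # the sigmoid's steepest point, i.e. the best case
sig_grad = sigmoid(z) * (1 - sigmoid(z))  # = 0.25, its maximum
relu_grad = 1.0                           # ReLU derivative for any positive input

depth = 20
print(sig_grad ** depth)   # ~9.1e-13: the gradient signal has vanished
print(relu_grad ** depth)  # 1.0: the gradient is preserved
```

In a real network the gradient also passes through weight matrices, but this best-case bound already shows why stacking many sigmoid layers starves the early layers of gradient.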


Posted 2019-09-28T18:25:39.067

Reputation: 605


In theory, deeper architectures can encode more information than shallower ones because they can apply more transformations to the input, which can yield better results at the output. Training is slower because backpropagation is quite expensive: as you increase the depth, you increase the number of parameters and gradients that must be computed.

Another issue that you need to take into account is the effect of the activation function. Saturating functions like the sigmoid and hyperbolic tangent have very small gradients at their extremes, and other activation functions are flat in places, e.g. ReLU is flat on the negatives. In either case there is little error to propagate, because the gradient is either very small (as with saturating functions) or exactly zero. Batch normalisation greatly helps here because it pulls values back into ranges where the gradients aren't close to zero.
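The saturation effect and the batch-norm remedy described above can be sketched numerically: deep in the sigmoid's tail the gradient is effectively zero, and a batch-norm-style standardisation (shown here without the learned scale and shift parameters, purely for illustration) pulls the activations back into the usable region.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# Deep in the saturated tail the gradient is effectively zero
print(sigmoid_grad(10.0))  # ~4.5e-5

# Standardising the batch (the core of batch norm) recentres the
# pre-activations around 0, where the sigmoid gradient peaks at 0.25
batch = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
normalised = (batch - batch.mean()) / batch.std()
print(sigmoid_grad(normalised))  # gradients back near the 0.25 maximum
```

Before normalisation every input in the batch sits in the saturated zone; after it, the gradients are orders of magnitude larger.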

Dimitris Monroe

Posted 2019-09-28T18:25:39.067

Reputation: 113