My understanding of the vanishing gradient problem in deep networks is that, as backprop proceeds backwards through the layers, the gradients become progressively smaller, so the earlier layers train more slowly. I'm having a hard time reconciling this with plots such as the one below, where the losses for a deeper network are higher than for a shallower one. Shouldn't the deeper network simply take longer to train, but still reach the same accuracy, if not higher?
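As a rough illustration of the premise (not from the post itself), here is a hypothetical numpy sketch of the backward pass through a stack of sigmoid layers: each backward step multiplies the gradient by a weight matrix and by the sigmoid derivative, which is at most 0.25, so the gradient norm shrinks roughly geometrically with depth. The function name and layer widths are made up for the example.

```python
import numpy as np

def backprop_norm(depth, width=64, seed=0):
    """Norm of a gradient pushed backwards through `depth` sigmoid layers."""
    rng = np.random.default_rng(seed)
    grad = rng.normal(size=width)
    for _ in range(depth):
        # Variance-preserving random weights, a stand-in for a trained layer
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        grad = 0.25 * (W.T @ grad)  # 0.25 = max of sigmoid'(x)
    return np.linalg.norm(grad)

for d in (2, 10, 30):
    print(d, backprop_norm(d))
```

By 30 layers the gradient norm is vanishingly small, which is why the early layers of a very deep sigmoid network receive almost no learning signal per step, not merely a delayed one.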

As a guess, it may be that with a large number of layers, the randomness of Xavier initialisation causes the activations to explode or vanish. Xavier init (and similar schemes) tries to keep the per-layer variance at 1, but since the weights are random, the actual variance at any given layer is more likely to come out at, say, 1.03 or 0.97 than at exactly 1. With many layers stacked, these slight offsets compound multiplicatively and can snowball into exploding or vanishing activations. – Recessive – 2019-09-30T03:36:43.820
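The compounding effect described in this comment can be sketched numerically. In the hypothetical snippet below (function name and sizes are invented for illustration), a signal is pushed forward through linear layers whose weights are drawn with a variance-preserving (Xavier-style) scale of 1/sqrt(n); the variance is preserved only in expectation, so the realised variance after many layers drifts well away from 1 and differs from seed to seed.

```python
import numpy as np

def variance_after(depth, width=64, seed=0):
    """Activation variance after `depth` Xavier-style linear layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)
    for _ in range(depth):
        # Scale 1/sqrt(width) keeps the variance at 1 *in expectation* only
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        x = W @ x
    return x.var()

# Each layer's variance ratio is slightly off 1; over 50 layers the
# per-layer offsets multiply, so different random draws land far apart.
vals = [variance_after(50, seed=s) for s in range(5)]
print(vals)
```

Each entry of `vals` is the product of 50 per-layer variance ratios, so small random offsets per layer accumulate into a wide spread across seeds, which is the snowballing the comment describes.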

Another guess: numbers in a computer are always just approximations, so computing the gradient layer by layer is a lossy process. This might make deeper layers not just slower to learn, but could actually impose a lower ceiling on the achievable accuracy. – BlindKungFuMaster – 2019-10-02T11:11:49.900
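The accumulation of rounding error with depth can be sketched as follows (a hypothetical example, not from the comment): run the same chain of matrix products in float32 and in float64 with identical random draws, and compare the results. The per-step rounding error of the lower-precision run grows with the number of layers.

```python
import numpy as np

def chained(depth, dtype, width=64, seed=2):
    """Apply `depth` random linear layers in the given floating-point dtype."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width).astype(dtype)
    for _ in range(depth):
        # Same random draws for both dtypes; only the rounding differs
        W = (rng.normal(size=(width, width)) / np.sqrt(width)).astype(dtype)
        x = W @ x
    return x

a = chained(40, np.float32)
b = chained(40, np.float64)
rel_err = np.linalg.norm(a - b) / np.linalg.norm(b)
print(rel_err)
```

The relative discrepancy after 40 layers is many times the float32 machine epsilon, illustrating that a long layer-by-layer computation is indeed lossy, though whether this meaningfully caps accuracy in practice is a separate question.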