It has been shown that "plain" neural networks tend to exhibit higher training error, and correspondingly higher test error, as more layers are added. I am not quite certain why this occurs. In the original ResNet paper, the authors hypothesize and verify that this degradation is not due to vanishing gradients.
From what I understand, it is difficult for a plain network to approximate the identity map between layers, and when the identity map is optimal, the solver may instead drive those layers toward the zero function. If this is the case, why does it happen? And why does this degradation not appear in shallower networks, in which the identity map may also be optimal?
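To make my premise concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of why the identity is "free" for a residual block but must be learned exactly by a plain block. The function names are my own:

```python
import numpy as np

def plain_layer(x, W, b):
    # A plain (linear) layer: to represent the identity, the solver
    # must learn W = I and b = 0 exactly.
    return W @ x + b

def residual_layer(x, W, b):
    # A residual block computes x + F(x). Driving the residual F
    # to zero (W = 0, b = 0) yields the identity with no effort.
    return x + (W @ x + b)

x = np.array([1.0, -2.0, 3.0])
W0 = np.zeros((3, 3))
b0 = np.zeros(3)

# Residual block with all-zero weights is exactly the identity.
print(residual_layer(x, W0, b0))  # [ 1. -2.  3.]

# Plain layer with all-zero weights collapses to zero, not identity.
print(plain_layer(x, W0, b0))  # [0. 0. 0.]
```

As I understand it, the point is that "do nothing" is the easy default for a residual block, whereas a plain layer has no such shortcut.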
I am new to this topic, so I apologize if I do not fully understand the problem. Thank you.