Here the answer refers to the vanishing and exploding gradients that occur with sigmoid-like activation functions, but I guess ReLU has a disadvantage of its own: its expected value. There is no upper bound on the output of ReLU, so its expected value is not zero. I remember that, before ReLU became popular, tanh was preferred over sigmoid among machine learning practitioners. The reason was that the expected value of tanh is zero, which helped learning in the deeper layers of a neural net proceed more rapidly.
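To make the expected-value point concrete, here is a minimal NumPy sketch (my own check, assuming zero-mean Gaussian pre-activations) comparing the mean output of tanh and ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # zero-mean inputs

# tanh is odd, so its output mean stays near zero;
# ReLU clips the negative half, so its output mean is positive.
print("mean of tanh(x):", np.tanh(x).mean())        # ~0.0
print("mean of relu(x):", np.maximum(x, 0).mean())  # ~0.40 (= 1/sqrt(2*pi) for standard normal input)
```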
ReLU does not have this characteristic, so why does it work so well if we set its derivative advantage aside? Moreover, I guess the gradients may also be affected, because the activations (the outputs of ReLU) are involved in calculating the update rules.
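For instance, if I write out the weight gradient of a fully connected layer, $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^{\top}$, the previous layer's activations $a^{(l-1)}$ multiply the error term directly; with ReLU they are all non-negative, so for each neuron every component of that gradient shares the sign of $\delta^{(l)}$, which is the kind of effect I am guessing about.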