Why is ReLU better than other activation functions?

Here the answer refers to the vanishing and exploding gradients that affect sigmoid-like activation functions, but, I guess, ReLU has a disadvantage: its expected value. There is no upper bound on the output of ReLU, so its expected value is not zero. I remember the time before ReLU became popular, when tanh was preferred over sigmoid among machine learning experts. The reason was that the expected value of tanh was zero, which helped learning in the deeper layers of a neural net to proceed more rapidly. ReLU does not have this characteristic, so why does it work so well if we set its derivative advantage aside? Moreover, I guess the derivative may also be affected, because the activations (the outputs of ReLU) are involved in computing the update rules.
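
To see the point about the expected value concretely, here is a minimal NumPy sketch (my own illustration, assuming zero-mean Gaussian pre-activations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # zero-mean pre-activations

tanh_mean = np.tanh(x).mean()          # close to 0: tanh output is zero-centered
relu_mean = np.maximum(x, 0.0).mean()  # close to 0.4: ReLU output has a positive mean

print(f"mean(tanh(x)) = {tanh_mean:.4f}, mean(relu(x)) = {relu_mean:.4f}")
```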

Media

Posted 2017-10-03T14:17:09.163

Reputation: 12 077

It is common to have some sort of normalization (e.g. batch normalization, layer normalization) together with ReLU. This adjusts the output range. – noe – 2017-10-03T14:22:09.583

@ncasas But in a typical CNN, isn't normalizing the output of the ReLU uncommon? At least I have never seen that. – Media – 2017-10-03T14:23:17.740

You are right, in CNNs that are not very deep it is normal not to have batch normalization. Have you considered the role of the initial weight values? (e.g. He initialization) – noe – 2017-10-03T15:15:26.730

Yes, actually they are there to help prevent vanishing/exploding gradients; after some iterations the outputs get larger, I guess. – Media – 2017-10-03T15:26:39.653

Answers

The biggest advantage of ReLU is indeed the non-saturation of its gradient, which greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid / tanh functions (see the paper by Krizhevsky et al.).

But it's not the only advantage. Here is a discussion of the sparsity effects of ReLU activations and the regularization they induce. Another nice property is that, compared to tanh / sigmoid neurons that involve expensive operations (exponentials, etc.), ReLU can be implemented by simply thresholding a matrix of activations at zero.
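
To make that cost difference concrete, here is a minimal NumPy sketch (my own illustration, not code from the Krizhevsky et al. paper):

```python
import numpy as np

def relu(a):
    # One elementwise comparison: threshold the activation matrix at zero.
    return np.maximum(a, 0.0)

def sigmoid(a):
    # Needs an exponential per element, which is noticeably more expensive.
    return 1.0 / (1.0 + np.exp(-a))

a = np.random.randn(256, 1024)        # a batch of pre-activations
relu_grad = (a > 0).astype(a.dtype)   # ReLU's gradient is just a 0/1 mask, so it never saturates
print(relu(a).shape, sigmoid(a).shape, relu_grad.shape)
```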

But I'm not convinced that the great success of modern neural networks is due to ReLU alone. New initialization techniques (such as Xavier initialization), dropout, and (later) batchnorm also played a very important role. For example, the famous AlexNet used ReLU and dropout.

So to answer your question: ReLU has very nice properties, though it is not ideal. But it truly proves itself when combined with other great techniques, which, by the way, address the non-zero-centered problem you've mentioned.

UPD: ReLU output is indeed not zero-centered, and this does hurt NN performance. But this particular issue can be tackled by other regularization techniques, e.g. batchnorm, which normalizes the signal before the activation:

We add the BN transform immediately before the nonlinearity, by normalizing $x = Wu + b$. ... normalizing it is likely to produce activations with a stable distribution.
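
As a rough sketch of that ordering (my own illustration in PyTorch, assuming a simple fully-connected layer with arbitrary sizes, not code from the paper):

```python
import torch
import torch.nn as nn

# Linear -> BatchNorm -> ReLU: the pre-activation Wu + b is normalized,
# so the signal entering ReLU has roughly zero mean and unit variance.
layer = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # the bias is redundant here; BN's beta plays its role
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

x = torch.randn(32, 128)  # a batch of 32 inputs
y = layer(x)
print(y.shape)            # torch.Size([32, 64])
```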

Maxim

Posted 2017-10-03T14:17:09.163

Reputation: 770

I should've stressed this part: I was trying to say that ReLU alone doesn't solve this issue. You are right that ReLU output is not zero-centered and it does hurt NN performance, unless the weights are regularized. But saturated gradients hurt the NN even more, so the mass adoption of ReLU was a step forward despite its disadvantages. – Maxim – 2017-10-03T18:58:57.987

Would you please explain what you mean by "the weights are regularized", both in the answer and in the part you have emphasized? – Media – 2017-10-03T19:10:39.503

Updated my answer with some details about this particular issue – Maxim – 2017-10-03T19:24:57.087

What I find a bit confusing: why not just use the identity function? What's the advantage of 0 for negative values? – Alex – 2017-12-28T09:42:14.207

@Alex The identity is not a non-linearity. It's equivalent to having only linear layers in the NN. See this question - https://stackoverflow.com/q/46659525/712995 – Maxim – 2017-12-28T09:44:44.020
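
A quick numerical check of that equivalence (my own sketch, not from the linked question): two layers with identity activations collapse into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(64, 128))
W2 = rng.normal(size=(10, 64))
x = rng.normal(size=(128,))

two_layers = W2 @ (W1 @ x)   # two "identity-activated" layers
one_layer = (W2 @ W1) @ x    # exactly one layer with the merged weight matrix

print(np.allclose(two_layers, one_layer))  # True
```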

'flexibly squash the input through a function and pave the way recovering a higher-level abstraction structure', OK, but why not use a squared function for positive values? It would produce a far more interesting derivative, for example. – Alex – 2017-12-28T09:54:45.280

I'm not aware of anyone having tried an $x^2$ activation function in research. My guess is that it would suffer from gradient saturation as well - see the exploding gradients problem. ReLU is the first activation that managed to solve it. – Maxim – 2017-12-28T10:28:52.260

"suffer from gradient saturation"..."exploding gradients problem" - these are clearly mutually exclusive. Do you mean that the linear derivative will grow unrestrictred? – Alex – 2017-12-28T18:57:55.653