As ReLU is not differentiable where it touches the x-axis, doesn't that affect training?



When I read about activation functions, I learned that the reason we don't use the step function is that it is non-differentiable, which causes problems for gradient descent.
I am a beginner in deep learning. Since ReLU is almost a linear function and is also non-differentiable where it touches the x-axis, why does it perform so much better than tanh or sigmoid? And why is it so widely used in deep learning?
As it is non-differentiable, doesn't that affect training?


Posted 2020-09-16T05:30:40.640


Pragmatically, it doesn't; you just include zero in the left half (-inf, 0]. You can read about the subgradient, which was conceived exactly to solve such problems. – ptyshevs – 2020-09-16T06:05:47.067

For ReLU we use the subgradient method. – Media – 2020-09-16T17:26:17.993
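As the comments note, in practice one simply picks a subgradient at zero. A minimal sketch (the choice of 0 for the derivative at x = 0 is the common convention, not the only valid one):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: fold x == 0 into the flat left half (-inf, 0],
    # so the "derivative" there is defined to be 0.
    return np.where(x > 0, 1.0, 0.0)

print(relu_grad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
```

Any value in [0, 1] would be a valid subgradient at zero; since the input is exactly zero with probability essentially nil during training, the choice has no practical effect.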



A step function is discontinuous, and its first derivative is a Dirac delta function. The discontinuity causes the problem for gradient descent. Further, the zero slope everywhere else leads to issues when attempting to minimize the function: the function is essentially saturated for all values greater than and less than zero.

By contrast, ReLU is continuous, and only its first derivative is a discontinuous step function. Since the ReLU function is continuous and well defined, gradient descent is well behaved and leads to a well-behaved minimization. Further, ReLU does not saturate for large values greater than zero. This is in contrast to sigmoid or tanh, which tend to saturate for large values. ReLU maintains a constant slope as x moves toward infinity.

The issue with saturation is that gradient descent methods take a long time to find the minimum of a saturated function.
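The saturation contrast is easy to see numerically: the sigmoid gradient collapses toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input. A small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s).
    s = sigmoid(x)
    return s * (1.0 - s)

# As x grows, the sigmoid gradient vanishes (saturation);
# the ReLU gradient would remain exactly 1 for all of these inputs.
for x in [1.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
```

At x = 10 the sigmoid gradient is already below 5e-5, so weight updates flowing through a saturated sigmoid unit are essentially frozen.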


  • Step function: discontinuous and saturated at ± large values.
  • Tanh: continuous and well defined, but saturated at ± large values.
  • Sigmoid: continuous and well defined, but saturated at ± large values.
  • ReLU: continuous and well defined. Does not saturate at large positive values.

Hope this helps!




ReLU does saturate at large negative numbers, resulting in dead neurons. – ptyshevs – 2020-09-16T07:37:19.523

Thank you so much. Does saturation mean that the value does not change at higher values? Also, does the non-differentiability of a function not affect its training? – Shiv – 2020-09-16T10:20:12.433

Saturation is more about the gradient than about the function value. Basically, the gradient of sigmoid(10000) and sigmoid(100000) is almost the same, although the inputs are very different. Saturation can be an issue if the gradient tries to pull the function away from the saturation value: it will take a lot of update steps to do so. – ptyshevs – 2020-09-16T12:15:10.600

@ptyshevs, good point. This was a typo. Yes, you are absolutely correct that the slope for negative values is identically zero. – AN6U5 – 2020-09-16T16:58:24.290

@ptyshevs thank you so much – Shiv – 2020-09-18T01:46:01.773