A step function is discontinuous, and its first derivative is a Dirac delta function. The discontinuity is one problem for gradient descent. The other is that the slope is zero everywhere except at the jump, so the gradient gives no indication of which direction to move when minimizing the function. The function is essentially saturated for all values greater than and less than zero.
By contrast, ReLU is continuous; only its first derivative is a discontinuous step function. Because ReLU is continuous and well defined, gradient descent is well behaved and leads to a well-behaved minimization. Further, ReLU does not saturate for large values greater than zero. This is in contrast to sigmoid or tanh, which saturate for large values. ReLU maintains a constant linear slope as x moves toward infinity.
The issue with saturation is that the gradient of a saturated function is close to zero, so each update step is tiny and gradient descent methods take a long time to find the minimum.
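To make the saturation point concrete, here is a minimal sketch that evaluates the derivative of each activation at a few inputs. The helper names (d_step, d_sigmoid, d_tanh, d_relu) are my own for illustration and do not come from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_step(x):
    # Slope of a step function: zero everywhere away from the jump at x = 0
    # (the derivative is not defined at the jump itself).
    return 0.0

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Slope is 1 for positive inputs and 0 for negative inputs.
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 10.0):
    print(
        f"x={x:5.1f}  step'={d_step(x):.1e}  sigmoid'={d_sigmoid(x):.1e}  "
        f"tanh'={d_tanh(x):.1e}  relu'={d_relu(x):.1e}"
    )
```

At x = 10 the sigmoid and tanh derivatives are already very close to zero, while the ReLU derivative is still 1.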
- Step function: discontinuous and saturated at +/- large numbers.
- Tanh: Continuous and well defined, but saturated at +/- large numbers.
- Sigmoid: Continuous and well defined, but saturated at +/- large numbers.
- ReLU: Continuous and well defined. Does not saturate at large positive numbers.
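The summary above can be seen in a toy experiment. This is only a sketch under assumptions I picked for illustration: a single parameter w starting at 10, a squared-error loss (f(w) - target)^2 with target 0.5, a learning rate of 0.1, and 100 plain gradient descent steps. The function descend is hypothetical, not a library routine.

```python
import math

def step(w):
    return 1.0 if w > 0 else 0.0

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def relu(w):
    return max(0.0, w)

# Each entry pairs an activation with its derivative.
ACTIVATIONS = {
    "step": (step, lambda w: 0.0),  # zero slope away from the jump
    "tanh": (math.tanh, lambda w: 1.0 - math.tanh(w) ** 2),
    "sigmoid": (sigmoid, lambda w: sigmoid(w) * (1.0 - sigmoid(w))),
    "relu": (relu, lambda w: 1.0 if w > 0 else 0.0),
}

def descend(f, df, w=10.0, target=0.5, lr=0.1, steps=100):
    """Minimize (f(w) - target)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (f(w) - target) * df(w)  # chain rule
        w -= lr * grad
    return w

results = {name: descend(f, df) for name, (f, df) in ACTIVATIONS.items()}
for name, w in results.items():
    print(f"{name:8s} final w = {w:.6f}")
```

With these settings, the step function never moves w at all, tanh and sigmoid barely move it away from 10 because their gradients are nearly zero there, and ReLU converges to w close to 0.5.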
Hope this helps!