ReLU neurons output zero and have zero derivatives for all negative inputs. So, if the weights in your network always lead to negative inputs into a ReLU neuron, that neuron is effectively not contributing to the network's training. Mathematically, the gradient contribution to the weight updates coming from that neuron is always zero (see the Mathematical Appendix for some details).

What are the chances that your weights will end up producing negative numbers for *all* inputs into a given neuron? It's hard to answer this in general, but one way in which this can happen is when you make too large of an update to the weights. Recall that neural networks are typically trained by minimizing a loss function $L(W)$ with respect to the weights using gradient descent. That is, weights of a neural network are the "variables" of the function $L$ (the loss depends on the dataset, but only implicitly: it is typically the sum over each training example, and each example is effectively a constant). Since the gradient of any function always points in the direction of steepest increase, all we have to do is calculate the gradient of $L$ with respect to the weights $W$ and move in the opposite direction a little bit, then rinse and repeat. That way, we end up at a (local) minimum of $L$. Therefore, if your inputs are on roughly the same scale, a large step in the direction of the gradient can leave you with weights that give similar inputs which can end up being negative.

In general, what happens depends on how information flows through the network. You can imagine that as training goes on, the values neurons produce can drift around and make it possible for the weights to kill all data flow through some of them. (Sometimes, they may leave these unfavorable configurations due to weight updates earlier in the network, though!). I explored this idea in a blog post about weight initialization -- which can also contribute to this problem -- and its relation to data flow. I think my point here can be illustrated by a plot from that article:

The plot displays activations in a 5 layer Multi-Layer Perceptron with ReLU activations after one pass through the network with different initialization strategies. You can see that depending on the weight configuration, the outputs of your network can be choked off.

# Mathematical Appendix

Mathematically if $L$ is your network's loss function, $x_j^{(i)}$ is the output of the $j$-th neuron in the $i$-th layer, $f(s) = \max(0, s)$ is the ReLU neuron, and $s^{(i)}_j$ is the linear input into the $(i+1)$-st layer, then by the chain rule the derivative of the loss with respect to a weight connecting the $i$-th and $(i+1)$-st layers is

$$
\frac{\partial L}{\partial w_{jk}^{(i)}} = \frac{\partial L}{\partial x_k^{(i+1)}} \frac{\partial x_k^{(i+1)}}{\partial w_{jk}^{(i)}}\,.
$$

The first term on the right can be computed recursively. The second term on the right is the only place directly involving the weight $w_{jk}^{(i)}$ and can be broken down into

$$
\begin{align*}
\frac{\partial{x_k^{(i+1)}}}{\partial w_{jk}^{(i)}} &= \frac{\partial{f(s^{(i)}_j)}}{\partial s_j^{(i)}} \frac{\partial s_j^{(i)}}{\partial w_{jk}^{(i)}} \\
&=f'(s^{(i)}_j)\, x_j^{(i)}.
\end{align*}
$$

From this you can see that if the outputs are always negative, the weights leading into the neuron are not updated, and the neuron does not contribute to learning.

4Can someone find a reference to some scientific article about "dead neurons"? As this is the first result on google attempts, it would be great if this question was edited with a reference. – Marek Židek – 2017-07-16T23:54:50.067

2can we prevent the bias by regularization to solve this problem? – Len – 2017-09-11T02:35:04.167

4Dudes I've managed to revitalize dead relu neurons by giving new random (normal distributed) values at each epoch for weights <= 0. I use this method only together with freezing weights at different depths as the training continues to higher epochs (I'm not sure if this is what we call phase transition) Can now use higher learning rates, yields better overall accuracy (only tested at linear regression). It's really easy to implement. – boli – 2018-03-06T19:54:47.020

2@boli, can you share you implementation here? – Anu – 2019-03-29T00:54:30.750