## What is the "dying ReLU" problem in neural networks?


Referring to the Stanford course notes on Convolutional Neural Networks for Visual Recognition, a paragraph says:

"Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue."

What does dying of neurons here mean?

Could you please provide an intuitive explanation in simpler terms.

4Can someone find a reference to some scientific article about "dead neurons"? As this is the first result on google attempts, it would be great if this question was edited with a reference. – Marek Židek – 2017-07-16T23:54:50.067

2can we prevent the bias by regularization to solve this problem? – Len – 2017-09-11T02:35:04.167

4Dudes, I've managed to revitalize dead ReLU neurons by giving new random (normally distributed) values at each epoch for weights <= 0. I use this method only together with freezing weights at different depths as the training continues to higher epochs (I'm not sure if this is what we call phase transition). Can now use higher learning rates, yields better overall accuracy (only tested on linear regression). It's really easy to implement. – boli – 2018-03-06T19:54:47.020

2@boli, can you share your implementation here? – Anu – 2019-03-29T00:54:30.750

155

A "dead" ReLU always outputs the same value (zero as it happens, but that is not important) for any input. Probably this is arrived at by learning a large negative bias term for its weights.

In turn, that means that it takes no role in discriminating between inputs. For classification, you could visualise this as a decision plane outside of all possible input data.

Once a ReLU ends up in this state, it is unlikely to recover, because the function gradient at 0 is also 0, so gradient descent learning will not alter the weights. "Leaky" ReLUs with a small positive gradient for negative inputs (y=0.01x when x < 0 say) are one attempt to address this issue and give a chance to recover.
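To make the difference concrete, here is a minimal numeric sketch (plain NumPy; the 0.01 slope and the function names are just illustrative choices) of the two activations and their gradients:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of max(0, x), taking the subgradient at 0 to be 0.
    return (np.asarray(x) > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(np.asarray(x) > 0, x, alpha * np.asarray(x))

def leaky_relu_grad(x, alpha=0.01):
    return np.where(np.asarray(x) > 0, 1.0, alpha)

# For a negative pre-activation the ReLU gradient is exactly 0, so
# gradient descent cannot move the weights feeding this unit; the leaky
# variant keeps a small gradient and therefore a chance to recover.
print(relu_grad(-3.0))        # 0.0
print(leaky_relu_grad(-3.0))  # 0.01
```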

The sigmoid and tanh neurons can suffer from similar problems as their values saturate, but there is always at least a small gradient allowing them to recover in the long term.

7

Good comment, and it's also worth mentioning Exponential Linear Units (ELUs), which can help to address that issue in a better way: http://arxiv.org/abs/1511.07289

why not just get rid of bias terms? – Alex – 2017-08-13T06:16:42.403

18@alex: Because bias is very important for accuracy. Getting rid of bias is much the same as saying all the decision planes must pass through the origin - with a few exceptions this is a bad choice. In fact getting rid of bias terms in a neural network or related models (like linear regression or logistic regression) will usually mean that your model will suffer from bias! It's one of the few ways you can end up with a model that is both underfit and overfit at the same time... – Neil Slater – 2017-08-13T09:19:16.577

Thanks, then why not just set the starting value to some small constant, e.g. 0.1 or 0.2. Would this help? – Alex – 2017-08-13T11:23:20.950

@Alex: No, that would not help much; the same bias problem would apply, you have just moved the fixed point. The bias needs to be a learnable parameter, unless you already happen to know the correct value (which would be a very unusual situation). – Neil Slater – 2017-08-13T14:37:07.523

I never said it is not learnable. I said it is initialized at 0.1 rather than 0. Would this help to prevent it from blowing up to a large negative number? – Alex – 2017-08-17T02:13:53.020

1@Alex: I think it is common to add a small positive bias to ReLUs. I don't know if that helps with the "dying ReLU problem" - it likely would not change gradient values numerically very much (because the gradient is either 1 or 0 for the ReLU, and it is when it is 1 that it could overshoot; a small starting bias would appear to make very little difference). Mostly I think it is just a trick to add a small boost to initial learning - but that might help by getting a better start, and having generally lower gradients sooner. – Neil Slater – 2017-08-17T06:43:22.353

"Once a ReLU ends up in this state, it is unlikely to recover, because the function gradient at 0 is also 0, so gradient descent learning will not alter the weights": I think this is incorrect. A single ReLU yielding 0 will simply cut off a single path in the backpropagation. For any given weight, there are many paths along which the gradient can flow. So unless all paths are thus cut, the weight will still change, and so the ReLU is in no way "stuck".

The problem is correctly explained in @MohamedEzz answer, which explains how all paths can be cut. – max – 2018-12-31T05:01:12.653

Also, even if a ReLU is truly dead, it's unclear why this needs to be "fixed" with Leaky RELUs or anything else. The whole point of ReLU was to encourage sparsity in the activations, so having a large number (~50%) of ReLUs dead is the desired outcome. Of course, if ReLU doesn't work, one can try other non-linearities -- but in general dead ReLUs are WAI. – max – 2018-12-31T05:05:45.137

1@max: You are missing the "for any input" part of my answer. No gradient will flow to any weight associated with the "dead" neuron in a feed-forward network, because all paths to those weights are cut - there are no alternative paths for the gradient to flow to the subset of weights feeding that ReLU unit. You might view a ReLU in e.g. a CNN as having shared weights, in which case all locations in the feature map would need to be zero at once. However, I'd view that as another instance of "for any input". – Neil Slater – 2018-12-31T12:53:00.193

You're right, since you assume zeros for all feature map locations. Thanks for clarifying! I guess there's one way that a dead ReLU may revive: the weights in the earlier layers still receive gradients (the dead ReLU blocks only one of many backprop paths to them). This will change activations flowing into the dead ReLU, and at some point may cause it to become non-zero. However, this is a purely random mechanism, since there's no gradient force pushing earlier weights in this direction. I have no idea how often this actually happens, so just pointing it out for completeness. – max – 2018-12-31T20:27:19.620

@NeilSlater, Could you please help in understanding: "Probably this is arrived at by learning a large negative bias term for its weights." How have you arrived at a large negative bias? Any suggestions? – Anu – 2019-03-29T01:03:40.107

1@anu: By gradient descent. A large positive gradient, caused by a large error value, can in turn cause a single step of the bias term to be large enough that it "kills" the neuron, so that it reaches a state (for weights and bias) that future inputs to the ReLU function never rise above 0. – Neil Slater – 2019-03-29T08:32:22.177

@max 'the whole point of relus is to increase sparsity' - where is this written? Because as far as I heard it was introduced for its simplicity and similarity to biological neurons. – DuttaA – 2019-07-24T18:13:42.533

@DuttaA http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf section 2.2. "We show here that using a rectifying non-linearity gives rise to real zeros of activations and thus truly sparse representations. From a computational point of view, such representations are appealing for the following reasons: ..."

– max – 2019-07-28T00:59:44.987

124

Let's review what the ReLU (Rectified Linear Unit) looks like:

The input to the rectifier for some input $x_n$ is $$z_n=\sum_{i=0}^k w_i a^n_i$$ for weights $w_i$, and activations from the previous layer $a^n_i$ for that particular input $x_n$. The rectifier neuron function is $ReLU = max(0,z_n)$

Assuming a very simple error measure

$$error = ReLU - y$$

the rectifier has only 2 possible gradient values for the deltas of backpropagation algorithm: $$\frac{\partial error}{\partial z_n} = \delta_n = \left\{ \begin{array}{c l} 1 & z_n \geq 0\\ 0 & z_n < 0 \end{array}\right.$$ (if we use a proper error measure, then the 1 will become something else, but the 0 will stay the same) and so for a certain weight $w_j$ : $$\nabla error = \frac{\partial error}{\partial w_j}=\frac{\partial error}{\partial z_n} \times \frac{\partial z_n}{\partial w_j} = \delta_n \times a_j^n = \left\{ \begin{array}{c l} a_j^n & z_n \geq 0\\ 0 & z_n < 0 \end{array}\right.$$

One question that comes to mind is how the ReLU can learn "at all" when the gradient is $0$ on the left side. What if, for the input $x_n$, the current weights put the ReLU on the left flat side while it optimally should be on the right side for this particular input? The gradient is 0, so the weight will not be updated, not even a tiny bit; where is the "learning" in this case?

The essence of the answer lies in the fact that Stochastic Gradient Descent will not consider just a single input $x_n$, but many of them, and the hope is that not all inputs will put the ReLU on the flat side, so the gradient will be non-zero for some inputs (it may be positive or negative, though). If at least one input $x_*$ puts our ReLU on the steep side, then the ReLU is still alive because there's still learning going on and weights getting updated for this neuron. If all inputs put the ReLU on the flat side, there's no hope that the weights change at all, and the neuron is dead.

A ReLU may be alive then die due to the gradient step for some input batch driving the weights to smaller values, making $z_n < 0$ for all inputs. A large learning rate amplifies this problem.
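As a rough simulation of this failure mode (everything here is made up for illustration: a toy dataset, a constant upstream error signal of +1, and a deliberately huge learning rate), a single oversized gradient step can drive the bias so far negative that every input lands on the flat side:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy inputs on roughly the same scale
w = rng.normal(scale=0.1, size=5)    # small initial weights
b = 0.0

z = X @ w + b                        # pre-activations before the step
active = z > 0                       # inputs currently on the "steep" side

# One oversized update: assume the upstream error is +1 on every example
# and use an absurdly large learning rate to exaggerate the effect.
lr = 100.0
grad_w = (active[:, None] * X).mean(axis=0)
grad_b = active.mean()
w -= lr * grad_w
b -= lr * grad_b                     # bias is driven strongly negative

z_after = X @ w + b
print("fraction active before:", active.mean(),
      "after:", (z_after > 0).mean())
```

With the bias pushed to a large negative value, no reasonable input can bring the pre-activation above zero again, which is exactly the "knocked off the data manifold" situation quoted in the question.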

As @Neil Slater mentioned, a fix is to modify the flat side to have a small gradient, so that it becomes $ReLU=max(0.1x,x)$, which is called LeakyReLU.

Aren't you forgetting the bias term in the formula for input to the rectifier? – Tom Hale – 2018-03-14T10:14:26.370

2I think I followed the notation of some textbooks that assume $a_0=1$ for all layers, and $w_0$ is the bias. The bias is not important to the argument, so it's better to omit it anyway – MohamedEzz – 2018-03-20T12:46:08.180

@MohamedEzz, I didn't understand your point "What if, for the input $x_n$, the current weights put the ReLU on the left flat side while it optimally should be on the right side for this particular input?" If the input is negative, the gradient would be 0? What's optimal for this case? Could you please help in understanding it? – Anu – 2019-03-29T01:10:22.450

1By optimal I meant that, if for the network to do a better prediction for this input it needed to adjust the weights so that the ReLU gives a positive value, it wouldn't be able to do this adjustment due to the 0 gradient it has on the flat side. – MohamedEzz – 2019-03-31T11:21:40.037

Amazing answer. Thanks – Maverick Meerkat – 2019-08-17T10:50:08.153

19

ReLU neurons output zero and have zero derivatives for all negative inputs. So, if the weights in your network always lead to negative inputs into a ReLU neuron, that neuron is effectively not contributing to the network's training. Mathematically, the gradient contribution to the weight updates coming from that neuron is always zero (see the Mathematical Appendix for some details).

What are the chances that your weights will end up producing negative numbers for all inputs into a given neuron? It's hard to answer this in general, but one way in which this can happen is when you make too large of an update to the weights. Recall that neural networks are typically trained by minimizing a loss function $L(W)$ with respect to the weights using gradient descent. That is, the weights of a neural network are the "variables" of the function $L$ (the loss depends on the dataset, but only implicitly: it is typically the sum over each training example, and each example is effectively a constant). Since the gradient of any function always points in the direction of steepest increase, all we have to do is calculate the gradient of $L$ with respect to the weights $W$ and move in the opposite direction a little bit, then rinse and repeat. That way, we end up at a (local) minimum of $L$. Therefore, if your inputs are on roughly the same scale, a large step in the direction of the gradient can leave you with a set of weights that produce negative pre-activations for essentially all of those inputs.

In general, what happens depends on how information flows through the network. You can imagine that as training goes on, the values neurons produce can drift around and make it possible for the weights to kill all data flow through some of them. (Sometimes, they may leave these unfavorable configurations due to weight updates earlier in the network, though!). I explored this idea in a blog post about weight initialization -- which can also contribute to this problem -- and its relation to data flow. I think my point here can be illustrated by a plot from that article:

The plot displays activations in a 5-layer Multi-Layer Perceptron with ReLU activations after one pass through the network with different initialization strategies. You can see that depending on the weight configuration, the outputs of your network can be choked off.
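The effect can be reproduced in miniature (this is only a sketch, not the article's actual code: a made-up 5-layer ReLU MLP, comparing a tiny initialization scale against a He-style scale):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(256, 64))                 # one batch of fake data

def forward(X, scale, layers=5, seed=3):
    """Push X through a ReLU MLP whose weights are drawn at `scale`."""
    r = np.random.default_rng(seed)
    h = X
    for _ in range(layers):
        W = r.normal(scale=scale, size=(h.shape[1], h.shape[1]))
        h = np.maximum(0.0, h @ W)
    return h

for scale in (0.01, (2.0 / 64) ** 0.5):        # tiny init vs. He-style init
    h = forward(X, scale)
    print(f"scale={scale:.3f}: activation std at the last layer = {h.std():.2e}")
```

With the tiny scale, the activation magnitudes shrink geometrically layer by layer, while the He-style scale roughly preserves them; this is the "choked off" data flow the plot illustrates.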

# Mathematical Appendix

Mathematically if $L$ is your network's loss function, $x_j^{(i)}$ is the output of the $j$-th neuron in the $i$-th layer, $f(s) = \max(0, s)$ is the ReLU neuron, and $s^{(i)}_j$ is the linear input into the $(i+1)$-st layer, then by the chain rule the derivative of the loss with respect to a weight connecting the $i$-th and $(i+1)$-st layers is

$$\frac{\partial L}{\partial w_{jk}^{(i)}} = \frac{\partial L}{\partial x_k^{(i+1)}} \frac{\partial x_k^{(i+1)}}{\partial w_{jk}^{(i)}}\,.$$

The first term on the right can be computed recursively. The second term on the right is the only place directly involving the weight $w_{jk}^{(i)}$ and can be broken down into

\begin{align*} \frac{\partial{x_k^{(i+1)}}}{\partial w_{jk}^{(i)}} &= \frac{\partial{f(s^{(i)}_k)}}{\partial s_k^{(i)}} \frac{\partial s_k^{(i)}}{\partial w_{jk}^{(i)}} \\ &=f'(s^{(i)}_k)\, x_j^{(i)}. \end{align*}

From this you can see that if the pre-activation $s_k^{(i)}$ is always negative, then $f'$ is always zero, the weights leading into the neuron are not updated, and the neuron does not contribute to learning.
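A quick numeric check of this conclusion (the concrete numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

def f(s):                      # ReLU
    return np.maximum(0.0, s)

def f_prime(s):                # derivative, taking f'(0) = 0
    return float(s > 0)

# Chain rule from the appendix: dL/dw = (dL/dx_out) * f'(s) * x_in
dL_dx_out = 0.7                # upstream gradient (computed recursively)
x_in = 1.3                     # activation feeding the weight

for s in (2.0, -2.0):          # unit "on" vs. unit "off"
    grad = dL_dx_out * f_prime(s) * x_in
    print(f"s = {s:+}: dL/dw = {grad}")
```

When the pre-activation is negative, the product is exactly zero regardless of the upstream gradient or the input, so the weight never moves.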

great explanation!, could you please help me understanding Therefore, if your inputs are on roughly the same scale, a large step in the direction of the gradient can leave you with weights that give similar inputs which can end up being negative. How weights are getting negative if the inputs are normalized? – Anu – 2019-03-29T01:21:32.880

@anu The weight update is $w - \lambda \cdot \mathrm dw$, so if you take a large step, meaning in this case selecting a large $\lambda$, and if $\mathrm dw$ is positive, then you can see that $w$ may become negative. This is especially bad if we update the bias to be a large negative value. – Johnson – 2019-03-29T19:51:58.893

@JohnsonJia, great, I got it :), one more clarification needed, why it's especially bad in case of bias as compared to weight since the negative contribution could be on both weight & bias, correct me if I am wrong.? – Anu – 2019-03-29T20:00:08.743

Because bias is not modified by the input: $z = w \cdot x + b$, so if $b$ is very negative, $z$ may remain negative for all values of $x$. – Johnson – 2019-03-29T20:03:39.440

7

To put it more precisely: the local gradient of the ReLU (which is $1$ for positive inputs) multiplies the gradient flowing back through backpropagation, so the resulting update can be a large negative number (if the gradient flowing back is large and negative).

Such a large negative update produces a large negative $w_i$ when the learning rate is relatively big, and hence represses any updates that would happen in this neuron, since it is almost impossible for subsequent inputs to put up a large enough positive number to offset the large negative contribution brought by that "broken" $w_i$.
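A toy illustration of this (the weight values are invented, and the inputs are assumed non-negative, as if coming from earlier ReLU layers):

```python
import numpy as np

rng = np.random.default_rng(4)
w = np.array([0.5, -40.0, 0.3])          # one "broken" weight after a bad update
X = np.abs(rng.normal(size=(1000, 3)))   # non-negative inputs, as from prior ReLUs

z = X @ w                                # pre-activations of this neuron
print("fraction of inputs with z > 0:", (z > 0).mean())
```

The healthy positive weights almost never outweigh the single large negative one, so the neuron's pre-activation is below zero for essentially all inputs.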

5

The "Dying ReLU" refers to neuron which outputs 0 for your data in training set. This happens because sum of weight * inputs in a neuron (also called activation) becomes <= 0 for all input patterns. This causes ReLU to output 0. As derivative of ReLU is 0 in this case, no weight updates are made and neuron is stuck at outputting 0.

Things to note:

1. A dying ReLU doesn't mean that the neuron's output will remain zero at test time as well. Depending on distribution differences this may or may not be the case.
2. A dying ReLU is not permanently dead. If you add new training data or use a pre-trained model for new training, these neurons might kick back!
3. Technically a dying ReLU doesn't have to output 0 for ALL training data. It may happen that it outputs non-zero for some data, but the number of epochs is not enough to move the weights significantly.
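For completeness, a sketch of how one might count dead units in the strict sense (the single hidden layer and its deliberately negative biases are fabricated purely to provoke dead units):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))          # stand-in training set
W = rng.normal(size=(5, 32))            # 5 inputs -> 32 hidden ReLU units
b = rng.normal(size=32) - 8.0           # strongly negative biases, on purpose

H = np.maximum(0.0, X @ W + b)          # hidden activations over the whole set
dead = (H == 0).all(axis=0)             # units that never activate on ANY input
print(f"dead units: {dead.sum()} / {dead.size}")
```

Running a check like this over the training set is a cheap way to spot the "40% of your network can be dead" situation described in the Stanford notes.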