What is the difference between LeakyReLU and PReLU?

I thought both PReLU and Leaky ReLU are $$f(x) = \max(x, \alpha x) \qquad \text{with } \alpha \in (0, 1)$$

Keras, however, has both functions in the docs.

Leaky ReLU

Source of LeakyReLU:

return K.relu(inputs, alpha=self.alpha)

Hence (see relu code) $$f_1(x) = \max(0, x) - \alpha \max(0, -x)$$

PReLU

Source of PReLU:

def call(self, inputs, mask=None):
    pos = K.relu(inputs)
    if K.backend() == 'theano':
        neg = (K.pattern_broadcast(self.alpha, self.param_broadcast) *
               (inputs - K.abs(inputs)) * 0.5)
    else:
        neg = -self.alpha * K.relu(-inputs)
    return pos + neg

Hence $$f_2(x) = \max(0, x) - \alpha \max(0, -x)$$

Question

Did I get something wrong? Aren't $f_1$ and $f_2$ equivalent to $f$ (assuming $\alpha \in (0, 1)$)?
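
A quick numerical sanity check (plain NumPy, just re-typing the formulas above rather than calling either Keras layer) suggests they do coincide for a fixed $\alpha \in (0, 1)$:

import numpy as np

alpha = 0.3                      # any fixed value in (0, 1)
x = np.linspace(-5.0, 5.0, 101)

f = np.maximum(x, alpha * x)     # the textbook form max(x, alpha*x)

# f1: what K.relu(inputs, alpha=alpha) computes for LeakyReLU
f1 = np.maximum(0.0, x) - alpha * np.maximum(0.0, -x)

# f2: what PReLU's call() computes as pos + neg (with alpha held fixed)
pos = np.maximum(0.0, x)
neg = -alpha * np.maximum(0.0, -x)
f2 = pos + neg

print(np.allclose(f, f1), np.allclose(f, f2))   # True True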

Martin Thoma

Posted 2017-04-25T11:58:13.553

Reputation: 15 590

Answers


Straight from Wikipedia:


  • Leaky ReLUs allow a small, non-zero gradient when the unit is not active.

  • Parametric ReLUs take this idea further by making the coefficient of leakage into a parameter that is learned along with the other neural network parameters.
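
A minimal sketch of how this shows up in Keras (assuming a tf.keras 2.x API where LeakyReLU still takes an alpha= argument): the leakage coefficient of LeakyReLU is a fixed hyperparameter, while PReLU's coefficient is a trainable weight.

import tensorflow as tf

leaky = tf.keras.layers.LeakyReLU(alpha=0.1)   # alpha is fixed; the layer owns no weights
prelu = tf.keras.layers.PReLU()                # alpha is a weight, learned by backprop

x = tf.zeros((1, 8))   # dummy input so both layers get built
leaky(x)
prelu(x)

print(len(leaky.trainable_weights))   # 0
print(len(prelu.trainable_weights))   # 1 -- the learned alpha (one per feature by default)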

Thomas Wagenaar

Posted 2017-04-25T11:58:13.553

Reputation: 1 039

Ah, thanks, I always forget that Leaky ReLUs have $\alpha$ as a hyperparameter and Parametric ReLUs have $\alpha$ as a parameter. – Martin Thoma – 2017-04-25T15:42:35.087


For the Google-thing: That's ok. (Btw, for me this question is the third result now for "Leaky ReLU vs PReLU") – Martin Thoma – 2017-04-25T15:47:55.913

@MartinThoma true! No offense at all for that! The way I found the answer was pretty stupid as well; I didn't know what the 'P' in PReLU was, so I figured that out and then tried to figure out what PReLU was by just typing 'Parametric ReLU', which got me to the Wikipedia page. I learned something today because of your question ;) – Thomas Wagenaar – 2017-04-25T15:50:59.303


Nice. That's how it should be :-) In this case my little activation function overview might be interesting for you as well. The article is (partially) in German, but I guess for that part it shouldn't matter. – Martin Thoma – 2017-04-25T15:57:20.537


Pretty old question, but I will add one more detail in case someone else ends up here.

The motivation behind PReLU was to overcome the shortcomings of ReLU (the dying-ReLU problem) and of LeakyReLU (inconsistent predictions for negative input values). So the authors of the PReLU paper thought: why not let the a in ax for x < 0 (as in LeakyReLU) be learned?

And here is the catch: if all channels share the same learned a, it is called channel-shared PReLU; if each channel learns its own a, it is called channel-wise PReLU.
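
In Keras this distinction maps onto PReLU's shared_axes argument. A rough sketch, assuming NHWC convolution outputs (the parameter shapes in the comments are just what that sharing implies):

import tensorflow as tf

x = tf.zeros((1, 32, 32, 16))   # NHWC feature map with 16 channels

per_unit       = tf.keras.layers.PReLU()                       # one a per activation
channel_wise   = tf.keras.layers.PReLU(shared_axes=[1, 2])     # share over H, W: one a per channel
channel_shared = tf.keras.layers.PReLU(shared_axes=[1, 2, 3])  # share over H, W, C: a single a

for layer in (per_unit, channel_wise, channel_shared):
    layer(x)                     # build the layer so the alpha weight exists
    print(layer.alpha.shape)     # (32, 32, 16), then (1, 1, 16), then (1, 1, 1)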

So what if ReLU or LeakyReLU would have been better for the problem at hand? That is up to the model to learn (see the sketch after this list):

  1. if a is (or the a's are) learned to be 0 --> PReLU becomes ReLU
  2. if a is (or the a's are) learned to be a small number --> PReLU becomes LeakyReLU
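
To make that concrete: Keras' PReLU defaults to alpha_initializer='zeros', so a freshly built PReLU starts out behaving exactly like ReLU, and a small constant initializer makes it start out as a LeakyReLU; training then moves a wherever it helps. A minimal sketch:

import tensorflow as tf

# Default: a starts at 0, i.e. plain ReLU until training says otherwise.
starts_as_relu = tf.keras.layers.PReLU()

# Initialize a to 0.01: starts out as a LeakyReLU with slope 0.01 for x < 0.
starts_as_leaky = tf.keras.layers.PReLU(
    alpha_initializer=tf.keras.initializers.Constant(0.01))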

Rajesh Timilsina

Posted 2017-04-25T11:58:13.553

Reputation: 31