## Advantages of monotonic activation functions over non-monotonic functions in neural networks?


What are the advantages of using monotonic activation functions over non-monotonic functions in neural networks?

• Do they perform better than non-monotonic ones?
• Is this mathematically proven?
• Are there any papers/references that are related to this?


I don't know of any papers on this topic, but intuitively it makes a lot of sense to use monotonic activation functions. Suppose we have a non-monotonic activation function, say a Gaussian bump, symmetric around $x=0$ and tailing off towards $f(x)=0$ as $x$ moves away from 0 on either side. If a sample fed into our network performs poorly when a node's activation is high, we want to change that node's input to produce a lower activation. With a non-monotonic activation, whether we need to decrease or increase the input depends on whether the input was positive or negative, which in turn depends largely on our weight initialization.
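To make this concrete, here is a minimal sketch (the Gaussian bump $f(x) = e^{-x^2}$ is an illustrative choice, not from the original post) comparing its gradient to that of a monotonic sigmoid. The Gaussian's gradient changes sign at $x=0$, so the direction you must move the input to lower the activation depends on which side of the peak you start from; the sigmoid's gradient is positive everywhere.

```python
import math

def gaussian(x):
    # non-monotonic "bump" activation, symmetric around x = 0
    return math.exp(-x * x)

def gaussian_grad(x):
    # derivative -2x * exp(-x^2): its sign flips at x = 0
    return -2.0 * x * math.exp(-x * x)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    # derivative s * (1 - s) is strictly positive for all x
    return s * (1.0 - s)

# To lower the Gaussian activation, you must move the input in
# opposite directions depending on the side of the peak:
print(gaussian_grad(-1.0) > 0)  # left of peak: increasing x raises f
print(gaussian_grad(1.0) < 0)   # right of peak: increasing x lowers f

# For the sigmoid, the update direction is the same everywhere:
print(sigmoid_grad(-1.0) > 0 and sigmoid_grad(1.0) > 0)
```

So for the monotonic activation, the sign of the weight update needed to lower the output never depends on where the input currently sits.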

This makes learning more difficult: if another sample also needs the activation to be lower but lies on the other side of the peak, backpropagation will push its input in the opposite direction. Most of the time the best solution is to put all inputs on one side of the peak, effectively making the function monotonic again. Another way of looking at it is that monotonic functions are roughly one-to-one (not entirely, e.g. ReLU). This means two very different inputs never map to the same output unless everything in between also maps there.
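The one-to-one point can be checked directly; this small sketch (again using an illustrative Gaussian bump, plus ReLU as the monotonic-but-not-strictly-one-to-one case mentioned above) shows how the non-monotonic function collapses distant inputs onto the same output while skipping the points in between:

```python
import math

def gaussian(x):
    # non-monotonic: symmetric around x = 0
    return math.exp(-x * x)

def relu(x):
    # monotonic but not strictly one-to-one
    return max(0.0, x)

# Non-monotonic: two very different inputs share an output,
# yet a point between them maps somewhere else entirely.
print(gaussian(-2.0) == gaussian(2.0))   # same output for x = -2 and x = 2
print(gaussian(0.0) != gaussian(2.0))    # but x = 0 in between does not match

# Monotonic ReLU collapses inputs only over a contiguous interval:
# everything between -3 and 0 also maps to 0.
print(relu(-3.0) == relu(-0.5) == relu(0.0) == 0.0)
```

This is exactly the property the answer describes: under a monotonic activation, equal outputs can only come from a contiguous set of inputs.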

Here is a similar question with some links: (Why) do activation functions have to be monotonic?
