## Why do activation functions have to be monotonic?


I am currently preparing for an exam on neural networks. In several protocols from former exams I read that the activation functions of neurons (in multilayer perceptrons) have to be monotonic.

I understand that activation functions should be differentiable, have a derivative which is not 0 on most points, and be non-linear. I do not understand why being monotonic is important/helpful.

I know the following activation functions and that they are monotonic:

• ReLU
• Sigmoid
• Tanh
• Softmax: I'm not sure if the definition of monotonicity is applicable for functions $$f: \mathbb{R}^n \rightarrow \mathbb{R}^m$$ with $$n, m > 1$$
• Softplus
• (Identity)
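To make the list above concrete, here is a small numerical check (an illustrative sketch only; `is_monotonic` is a hypothetical helper, and a finite grid can only fail to find a counterexample, never prove monotonicity):

```python
import numpy as np

def is_monotonic(f, lo=-10.0, hi=10.0, n=10001):
    """Check (non-strict) monotone increase of f on a grid over [lo, hi]."""
    x = np.linspace(lo, hi, n)
    return np.all(np.diff(f(x)) >= 0)

relu = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softplus = lambda x: np.log1p(np.exp(x))

for name, f in [("relu", relu), ("sigmoid", sigmoid),
                ("tanh", np.tanh), ("softplus", softplus),
                ("x^2", lambda x: x**2)]:
    print(name, is_monotonic(f))  # all True except x^2
```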

However, I still can't see any reason why, for example, $$\varphi(x) = x^2$$ could not be used as an activation function.

Why do activation functions have to be monotonic?

(Related side question: is there any reason why the logarithm/exponential function is not used as an activation function?)

@MartinThoma Are you sure softmax is monotonic? – Media – 2018-02-21T07:07:19.670

Thanks @Media. To answer your question: I'm not sure what "monotonic" even means for functions $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m > 1$. For $m = 1$ softmax is constant and thus monotonic. But without defining $<$ for elements in $\mathbb{R}^n$ with $n > 1$ I don't think monotonic makes any sense. – Martin Thoma – 2018-02-21T19:50:18.063

@MartinThoma Thanks, actually it was also a question of mine. I didn't know, and still don't know, if there is an extension of monotonicity to functions with multiple outputs. Math stuff, you know! – Media – 2018-02-22T14:06:51.987


The monotonicity criterion helps the neural network converge more easily to an accurate classifier. See this stackexchange answer and the wikipedia article for further details and reasons.

However, the monotonicity criterion is not mandatory for an activation function: it is also possible to train neural nets with non-monotonic activation functions; it just gets harder to optimize the network. See Yoshua Bengio's answer.
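To illustrate that training still works without monotonicity, here is a minimal numpy sketch (my own toy setup, not from the cited answers): a one-hidden-layer net with the non-monotonic activation $\varphi(z) = z^2$, trained by plain gradient descent on a target it can represent exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 64).reshape(-1, 1)
y = 2 * x**2 - 3 * x + 1            # quadratic target, representable by this net

H, lr = 8, 0.05                     # hidden width and learning rate (arbitrary)
W1, b1 = 0.5 * rng.standard_normal((1, H)), np.zeros(H)
W2, b2 = 0.5 * rng.standard_normal((H, 1)), np.zeros(1)

def forward(x):
    z = x @ W1 + b1
    return z, z**2 @ W2 + b2        # phi(z) = z**2, non-monotonic

_, yhat0 = forward(x)
loss0 = np.mean((yhat0 - y) ** 2)   # loss before training

for _ in range(2000):
    z, yhat = forward(x)
    g = 2 * (yhat - y) / len(x)     # dLoss/dyhat for mean squared error
    dW2, db2 = (z**2).T @ g, g.sum(0)
    dz = (g @ W2.T) * 2 * z         # phi'(z) = 2z changes sign at z = 0
    dW1, db1 = x.T @ dz, dz.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

_, yhat = forward(x)
print(loss0, np.mean((yhat - y) ** 2))  # loss decreases despite non-monotonicity
```

The sign changes of $\varphi'(z) = 2z$ are exactly what makes the loss surface less well-behaved than with a monotone activation, but gradient descent still makes progress on this toy problem.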

I will provide a more mathematical reason for why having a monotone function helps!

Using http://mathonline.wikidot.com/lebesgue-s-theorem-for-the-differentiability-of-monotone-fun, if our activation function is monotone, it is differentiable almost everywhere on the real line. So the gradient of the activation function will not be an erratic function, and it will be easier to find the minima we are looking for (computationally inexpensive).

Exponential and logarithmic functions are beautiful functions but are not bounded (note that Exp and Log are themselves monotone and differentiable, so monotonicity does not buy us boundedness). So they fail when we want to classify our examples at the final stage. Sigmoid and tanh work really well because they have gradients that are easy to compute and their ranges are $(0, 1)$ and $(-1, 1)$ respectively.
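The boundedness point is easy to check numerically (a small illustration, nothing more): over a wide range of pre-activations, sigmoid and tanh outputs stay inside their claimed ranges, while exp grows without bound.

```python
import numpy as np

x = np.linspace(-10, 10, 1001)
sigmoid = 1.0 / (1.0 + np.exp(-x))

print(sigmoid.min(), sigmoid.max())        # stays strictly inside (0, 1)
print(np.tanh(x).min(), np.tanh(x).max())  # stays strictly inside (-1, 1)
print(np.exp(x).max())                     # already ~22026 at x = 10: unbounded
```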

There are infinitely many differentiable, but not monotone, functions. So why does having a monotone function help? – Martin Thoma – 2019-08-04T12:41:44.860