How does cost function change by choice of activation function (ReLU, Sigmoid, Softmax)?


I am new to ML, and as I take courses in deep learning I am wondering: depending on our choice of activation function for the last layer (sigmoid, ReLU or softmax), would the formula for calculating the cost function change?

I am grateful for every good reply I can get, have a nice day! :)


Posted 2019-07-06T21:33:09.713


I think it's a broad question. The formula would definitely change, and so would the corresponding result; the extent depends on the specific problem we are dealing with. It might be better to take a look at the specific formula of each of these functions. – Fatemeh Asgarinejad – 2019-07-06T22:07:32.607



You need to distinguish between two types of neural networks. If your output variable is continuous, you can use linear, ReLU, tanh, logistic-sigmoid, ... as activation functions, because these functions map continuous inputs to continuous outputs. If your output is discrete / categorical, you can use the signum (binary) or softmax (multiclass) function as the activation function for the output layer.

The cost function usually compares the real outputs $y_n$ with the predicted outputs $\hat{y}(x_n)$ for the inputs $x_n$, for all $n=1,...,N$. Let us introduce the comparison function $D(y_n,\hat{y}(x_n))$. It has a low value if the predicted output is almost equal to the real output and a high value if the outputs are not similar. Assuming all observations are equally important, we can sum the values of the comparison function over all observations and obtain the integrated loss

$$J = \sum_{n=1}^{N} D(y_n,\hat{y}(x_n))$$

for the whole data set.
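
As a minimal sketch of the integrated loss described above, the snippet below sums a comparison function over all observations. The squared error used for $D$, the function names, and the sample numbers are assumptions chosen for illustration; the answer itself does not fix a particular $D$.

```python
import numpy as np

def squared_error(y, y_hat):
    """One possible comparison function D (an assumption, not the only choice)."""
    return (y - y_hat) ** 2

def integrated_loss(y, y_hat, D=squared_error):
    """Sum D over all N observations, weighting each observation equally."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sum(D(y, y_hat)))

# Made-up targets and two sets of predictions.
y_true = np.array([1.0, 2.0, 3.0])
close_predictions = np.array([1.1, 1.9, 3.0])
far_predictions = np.array([3.0, 0.0, -1.0])

loss_close = integrated_loss(y_true, close_predictions)
loss_far = integrated_loss(y_true, far_predictions)
print(loss_close, loss_far)
```

Predictions that are close to the targets give a small sum, and predictions that are far off give a large one, which is the behaviour required of the comparison function.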

In order to see the influence of the activation function $g$ in the last layer, we summarize the transfer function from the input to the last layer as $f(x_n)$. Then the predicted output $\hat{y}(x_n)$ can be written as

$$\hat{y}(x_n) = g(f(x_n)).$$

Hence, the activation function at the output has an effect on the integrated loss $J$. For example, if you choose $\tanh$ as the output activation, you bound your outputs to the interval $(-1,1)$. This is a bad choice if your outputs can be anywhere in $\mathbb{R}$, and your cost function will probably keep a very high value during training. A better choice would be a linear activation function at the output layer.
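
The effect of the output activation $g$ on $J$ can be illustrated with a small sketch. It assumes a squared-error comparison function and hypothetical pre-activation values $f(x_n)$ that already equal the targets, so any remaining loss comes from $g$ alone.

```python
import numpy as np

def integrated_loss(y, y_hat):
    """Integrated loss J with squared error as comparison function (assumed)."""
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical values: the last layer's input f(x_n) matches the targets exactly.
f_x = np.array([-5.0, 0.5, 4.0])
y_true = np.array([-5.0, 0.5, 4.0])

# Same f(x_n), two different output activations g.
y_hat_linear = f_x           # g(z) = z
y_hat_tanh = np.tanh(f_x)    # g(z) = tanh(z), bounded in (-1, 1)

J_linear = integrated_loss(y_true, y_hat_linear)
J_tanh = integrated_loss(y_true, y_hat_tanh)
print(J_linear, J_tanh)
```

With the linear output the loss is zero, while the $\tanh$ output cannot reach targets such as $-5$ or $4$, so its loss stays large no matter how the earlier layers are trained.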



Reputation: 1 254

Thanks for the reply! Does the loss function change if I choose tanh instead of sigmoid in the last layer? :) – None – 2019-07-08T18:07:13.090

The activation function in the last layer is $g$ and is used in $\hat{y}(x_n)$, hence it will change. – MachineLearner – 2019-07-09T07:39:33.073


The cost function doesn't change the activation function, but it limits which activation functions you can use on the output layer. For example, for a classification problem you will want to output a probability, which lies between 0 and 1, so you will take a softmax as the output layer activation function. If you are looking at a regression problem, you will use a linear activation function, etc.
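
A short sketch of the classification case: a softmax output layer turns arbitrary scores into probabilities, which a cross-entropy cost can then compare with one-hot labels. The cross-entropy cost, the logits, and the labels are assumptions for illustration; the answer names only the softmax.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_onehot, p, eps=1e-12):
    """Summed cross-entropy between one-hot labels and predicted probabilities."""
    return float(-np.sum(y_onehot * np.log(p + eps)))

# Made-up scores for two samples and three classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])

probs = softmax(logits)
loss = cross_entropy(labels, probs)
print(probs.sum(axis=1), loss)
```

Each row of `probs` lies strictly between 0 and 1 and sums to 1, which is what makes it usable as a probability inside this kind of cost.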

Robin Nicole


Reputation: 479


Activation functions are just used to squeeze (not in numpy's sense) the output of a layer, and cost functions are a way to measure the discrepancy between the predicted output of the net and the original output. Cost functions just need to be differentiable and continuous and don't really depend on the activation function.

But in Keras, if your last activation function is sigmoid, you should be using the binary classification loss (`binary_crossentropy`), and you should likewise pick a suitable loss for regression problems; otherwise you are good to go.
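
Reading the "binary classification function" as a binary cross-entropy loss, the pairing with a sigmoid output can be sketched in plain NumPy. This is not Keras code; the function names and data are hypothetical, and averaging over samples is an assumption made here.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping real scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Made-up scores from the last layer and their binary labels.
logits = np.array([3.0, -2.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])

probs = sigmoid(logits)
loss_good = binary_cross_entropy(labels, probs)
loss_flipped = binary_cross_entropy(1.0 - labels, probs)
print(loss_good, loss_flipped)
```

The loss is lower when the labels agree with the sigmoid outputs than when they are flipped, which is why this loss is a natural partner for a sigmoid last layer.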



Reputation: 36