Derivative of activation function used in gradient descent algorithms


Why is it necessary to calculate the derivative of activation functions when updating model (regression or neural network) parameters? Why is the constant gradient of linear activation functions considered a disadvantage?

As far as I know, when we do stochastic gradient descent using the formula:

$$\text{weight} = \text{weight} + (\text{learning rate}\times (\text{actual output} - \text{predicted output}) * \text{input})$$

then the weights still get updated fine, so why is calculating the derivative considered so important?
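For context, the update rule above can be written as a small sketch (names here are illustrative, not from any particular library). Note that for a linear model with squared error, this rule is itself a derivative in disguise: the gradient of $\frac{1}{2}(\text{actual} - \text{predicted})^2$ with respect to the weight is $-(\text{actual} - \text{predicted}) \times \text{input}$.

```python
# Hypothetical one-feature linear model: predicted = weight * x.
# For squared error J = 0.5 * (actual - predicted)**2, the gradient is
# dJ/dweight = -(actual - predicted) * x, so gradient descent
# (weight -= lr * dJ/dweight) reproduces the update rule quoted above.
def sgd_step(weight, x, actual, lr=0.1):
    predicted = weight * x
    return weight + lr * (actual - predicted) * x

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=2.0, actual=4.0)  # the true weight here is 2.0
print(round(w, 3))  # → 2.0
```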


Posted 2019-07-13T13:19:06.150

Reputation: 11

Can you give a reference for that version of the weight update formula? – Ben Reiniger – 2019-07-13T21:25:50.900



As the name suggests, gradient descent (GD) optimization works on the principle of the gradient, which is the vector of all partial derivatives of a function. According to Wikipedia,

In vector calculus, the gradient is a multi-variable generalization of the derivative.

At its core, GD computes derivatives of a composite function (a neural network is itself a composite function) because of the gradient descent update rule, which is

$\Large \theta = \theta - \alpha \frac{\partial J}{\partial \theta}$

where $\theta$ is the parameter that needs to be optimized. In a neural network, this parameter could be a weight or a bias. $J$ is the objective function (the loss function in a NN) that needs to be minimized. To obtain $\frac{\partial J}{\partial \theta}$, we repeatedly apply the chain rule until we have the derivative of the loss function with respect to that parameter.
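This is where the derivative of the activation function enters: it is one factor in the chain-rule product. A minimal sketch for a single sigmoid neuron with squared-error loss (all names are illustrative) makes the factorization explicit:

```python
import math

# Chain rule for one sigmoid neuron with squared-error loss:
#   dJ/dw = dJ/da * da/dz * dz/dw
# The middle factor da/dz is exactly the derivative of the activation.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.5, 1.0          # input and target
w = 0.3                  # the parameter (theta) to optimize
z = w * x                # pre-activation
a = sigmoid(z)           # activation (predicted output)
J = 0.5 * (a - y) ** 2   # loss

dJ_da = a - y            # derivative of the loss w.r.t. the activation
da_dz = a * (1.0 - a)    # derivative of the sigmoid activation
dz_dw = x                # derivative of the pre-activation w.r.t. w
dJ_dw = dJ_da * da_dz * dz_dw   # chain rule

alpha = 0.5
w = w - alpha * dJ_dw    # gradient descent update
```

Without `da_dz`, the product is broken and the update no longer follows the gradient of the loss, which is why the activation's derivative is indispensable.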


When GD is far from the function's minimum (where it tends to end up), the magnitude of $\frac{\partial J}{\partial \theta}$ is larger, so the update step is larger; as GD approaches the minimum, the gradient shrinks and the steps become smaller. The step is scaled by the learning rate ($\alpha$), and the negative sign indicates that we move in the direction opposite to the gradient.
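This self-slowing behaviour can be seen on a toy objective (a sketch, not from the question's setup): minimizing $J(\theta) = \theta^2$, whose gradient $2\theta$ is large far from the minimum at $0$ and small near it.

```python
# Gradient descent on J(theta) = theta**2. The gradient 2*theta shrinks
# as theta approaches the minimum at 0, so each step is smaller than
# the last without ever changing the learning rate.
theta, alpha = 5.0, 0.1
steps = []
for _ in range(5):
    grad = 2.0 * theta
    step = alpha * grad
    theta = theta - step
    steps.append(step)
print(steps)  # the steps decrease monotonically
```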

Shubham Panchal

Posted 2019-07-13T13:19:06.150

Reputation: 1 792