## Derivative of activation function used in gradient descent algorithms


Why is it necessary to calculate the derivative of activation functions while updating model parameters (in regression or a neural network)? And why is the constant gradient of a linear function considered a disadvantage?

As far as I know, when we do stochastic gradient descent using the formula:

$$\text{weight} = \text{weight} + (\text{learning rate} \times (\text{actual output} - \text{predicted output}) \times \text{input})$$

then also, the weights get updated fine, so why is calculation of derivative considered so important?
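For concreteness, the update rule from the question can be written as a minimal sketch (the function name and values here are illustrative, not from any library):

```python
# The update rule from the question, for a single weight of a linear unit.
def update_weight(weight, learning_rate, actual, predicted, x):
    return weight + learning_rate * (actual - predicted) * x

w = 0.5
# One update: 0.5 + 0.1 * (1.0 - 0.8) * 2.0
w = update_weight(w, learning_rate=0.1, actual=1.0, predicted=0.8, x=2.0)
print(w)
```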

Can you give a reference for that version of the weight update formula? – Ben Reiniger – 2019-07-13T21:25:50.900


As the name suggests, gradient descent (GD) optimization works on the principle of gradients. A gradient is a vector of all partial derivatives of a particular function. According to Wikipedia,

In vector calculus, the gradient is a multi-variable generalization of the derivative.

At its core, GD computes derivatives of a composite function (and a neural network is itself a composite function) because they appear in the gradient descent update rule:

$$\Large \theta = \theta - \alpha \frac{\partial J}{\partial \theta}$$

where $$\theta$$ is the parameter to be optimized. In a neural network, this parameter could be a weight or a bias. $$J$$ is the objective function (the loss function in a NN) which needs to be minimized. So to compute $$\frac{\partial J}{\partial \theta}$$, we repeatedly apply the chain rule until we have the derivative of the loss function with respect to that parameter.
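A minimal sketch of this update rule for a one-weight "network" (all names and values here are illustrative): the prediction is $$\sigma(\theta x)$$, the loss is $$J = (\sigma(\theta x) - y)^2$$, and the chain rule gives the derivative link by link.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(theta, x, y):
    # Chain rule: dJ/dtheta = dJ/dp * dp/dz * dz/dtheta
    # with p = sigmoid(z), z = theta * x, J = (p - y)^2
    p = sigmoid(theta * x)
    return 2.0 * (p - y) * p * (1.0 - p) * x

theta, alpha, x, y = 0.0, 0.5, 1.0, 1.0
for _ in range(200):
    theta -= alpha * grad(theta, x, y)  # theta = theta - alpha * dJ/dtheta

print(sigmoid(theta * x))  # the prediction moves toward y = 1.0
```

Note that the activation's own derivative, $$p(1-p)$$ for the sigmoid, is one of the factors in the chain, which is why it must be computed during every update.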

Intuition:

When GD is far away from the function's minimum (which is where it tends to go), the magnitude of $$\frac{\partial J}{\partial \theta}$$ is larger, so the step taken toward the minimum is larger; as it approaches the minimum, the gradient shrinks and the steps become smaller. Each step is scaled by the learning rate ($$\alpha$$). The negative sign indicates that we are moving in the opposite direction of the gradient.
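This intuition can be sketched for the simplest case, $$J(\theta) = \theta^2$$ with its minimum at $$\theta = 0$$ (values here are illustrative):

```python
# Far from the minimum, dJ/dtheta = 2*theta is large and the steps are big;
# as theta approaches 0, the gradient shrinks and so does each step.
theta, alpha = 8.0, 0.1
prev_step = float("inf")
for _ in range(10):
    grad = 2.0 * theta               # dJ/dtheta for J = theta^2
    step = alpha * grad
    assert abs(step) < prev_step     # each step is smaller than the last
    prev_step = abs(step)
    theta -= step                    # theta = theta - alpha * dJ/dtheta
print(theta)  # closer to the minimum at 0
```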