Derivative of activation function used in gradient descent algorithms

1

Why is it necessary to calculate the derivative of activation functions while updating model (regression or NN) parameters? And why is the constant gradient of a linear activation function considered a disadvantage?

As far as I know, when we do stochastic gradient descent using the formula:

$$\text{weight} = \text{weight} + \text{learning rate} \times (\text{actual output} - \text{predicted output}) \times \text{input}$$

then the weights still get updated fine, so why is calculating the derivative considered so important?
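
For concreteness, here is roughly what I mean as a minimal sketch (a single linear unit updated one example at a time; the variable names and data below are just my own illustration):

```python
import numpy as np

# Illustrative toy data: inputs and target outputs for a single linear unit.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

learning_rate = 0.1
weights = np.zeros(X.shape[1])

# One pass of the update rule from the formula above:
# weight = weight + learning_rate * (actual output - predicted output) * input
for inputs, actual in zip(X, y):
    predicted = np.dot(weights, inputs)              # linear prediction
    error = actual - predicted                       # actual output - predicted output
    weights = weights + learning_rate * error * inputs

print(weights)
```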

rajarshi

Posted 2019-07-13T13:19:06.150

Reputation: 11

Can you give a reference for that version of the weight update formula? – Ben Reiniger – 2019-07-13T21:25:50.900

Answers

1

As the name suggests, Gradient Descent (GD) optimization works on the principle of gradients. A gradient is simply the vector of all partial derivatives of a function. According to Wikipedia,

In vector calculus, the gradient is a multi-variable generalization of the derivative.

At its core, GD computes the derivative of a composite function (a neural network is itself a composite function of its layers) because of the gradient descent update rule, which is,

$\Large \theta = \theta - \alpha \frac{\partial J}{\partial \theta}$

where $\theta$ is the parameter to be optimized. In a neural network, this parameter could be a weight or a bias. $J$ is the objective function (the loss function in a NN) that needs to be minimized. To obtain $\frac{\partial J}{\partial \theta}$, we repeatedly apply the chain rule until we have the derivative of the loss function with respect to that parameter; this is exactly where the derivative of each activation function appears, since every activation sits inside the composition.
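
As a rough sketch of that chain rule (a single sigmoid unit with a squared-error loss; the numbers below are only illustrative, not from the post), the derivative of the activation shows up as one factor of $\frac{\partial J}{\partial \theta}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-neuron "network": prediction = sigmoid(w * x + b), loss J = (prediction - target)^2
x, target = 2.0, 1.0
w, b = 0.1, 0.0
alpha = 0.5  # learning rate

for step in range(3):
    z = w * x + b
    pred = sigmoid(z)
    J = (pred - target) ** 2

    # Chain rule: dJ/dw = dJ/dpred * dpred/dz * dz/dw
    dJ_dpred = 2.0 * (pred - target)
    dpred_dz = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the activation function
    dz_dw = x
    dJ_dw = dJ_dpred * dpred_dz * dz_dw

    # Gradient descent update rule: theta = theta - alpha * dJ/dtheta
    w = w - alpha * dJ_dw
    print(step, round(J, 4), round(dJ_dw, 4), round(w, 4))
```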

Intuition:

[Figure: sketch of a loss curve, showing a large gradient far from the minimum and a small gradient near it.]

Sorry for the weird image. When GD is far away from the minimum of the function (which is where it tends to go), the magnitude of $\frac{\partial J}{\partial \theta}$ is larger, so the update applied to $\theta$ is larger; as GD approaches the minimum, the gradient shrinks and the steps become smaller. The step is scaled by the learning rate ($\alpha$), and the negative sign indicates that we move in the direction opposite to the gradient.
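
To see the same intuition numerically (again just a sketch, using the toy loss $J(\theta) = \theta^2$, which is not from the post), both the gradient and the step size shrink as $\theta$ approaches the minimum:

```python
# Gradient descent on the toy loss J(theta) = theta^2, whose gradient is 2 * theta.
theta = 10.0   # start far from the minimum at theta = 0
alpha = 0.1    # learning rate

for step in range(10):
    grad = 2.0 * theta           # dJ/dtheta, large when far from the minimum
    step_size = alpha * grad     # the actual change applied to theta
    theta = theta - step_size
    print(f"step {step}: gradient = {grad:.4f}, step size = {step_size:.4f}, theta = {theta:.4f}")
```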

Shubham Panchal

Posted 2019-07-13T13:19:06.150

Reputation: 1,792