How to understand backpropagation using derivatives



I was previously learning about gradient descent, and I now understand it. Now I have a problem with the backpropagation algorithm. I know the idea: minimize the error in a multilayer neural network using the chain rule. However, I don't understand the role of the derivative of the sigmoid function. This derivative appears in the description of the algorithm. What is the point of it? Can you explain this step by step in simple language?



Posted 2018-05-16T10:06:21.123

Reputation: 397

Question was closed 2018-05-16T19:14:06.553

@JahKnows as you say I tag you – lukassz – 2018-05-16T10:06:41.853

@lukassz, the similarity of the questions might be bothersome to the site, perhaps you can contact me directly. I'll update my page with some information. – JahKnows – 2018-05-16T10:31:38.287



This is a continuation of this answer. The first part of that answer covered gradient descent. In short, this algorithm finds a set of parameters that minimizes a function. From the description of the algorithm you can see where the derivative of the function is needed.

How this applies to a neural network

Let's first consider a single-neuron network. This is called the perceptron. We will use the sigmoid activation function. It has inputs $x$ and an associated output

$\hat{y} = \frac{1}{1+e^{-(w^Tx+b)}}$.

The model is parametrized by the weights $w$ and the bias $b$. These are the model parameters we need to tune. So this single neuron takes a vector $x$ and, using the weight vector and the bias, outputs some value $\hat{y}$.
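The forward pass of such a neuron can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `neuron_output` and the numeric values of `w`, `b` and `x` are made up for the example, and the sigmoid is assumed to be the standard logistic function $\frac{1}{1+e^{-z}}$.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, x, b):
    """Forward pass of one sigmoid neuron: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, bias and input, chosen only for illustration.
w = [0.5, -0.25]
b = 0.1
x = [1.0, 2.0]

# Here z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, so y_hat = sigmoid(0.1).
y_hat = neuron_output(w, x, b)
print(y_hat)
```

Whatever the weights are, the output always lies strictly between 0 and 1, which is why the sigmoid is convenient when the target is a value in that range.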

If the weights are randomly chosen then the output will also be random. This is of no use to us. We need to determine a way of calculating a measure of correctness or wrongness in order to know how we should change our weights. I will use a measure of how wrong we are and call this the cost

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

This takes the squared difference between the predicted output $\hat{y}_i$ and the ground truth $y_i$. We sum this wrongness over all the instances that we try to predict, hence the summation, and we divide by $N$ to normalize the error. The additional division by 2 is simply to make the derivative easier to compute, as you will see soon.
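As a small sketch of this cost, assuming the predictions and targets are plain Python lists of equal length (the function name `cost` and the sample numbers are invented for the example):

```python
def cost(predictions, targets):
    """Half mean squared error: (1 / (2N)) * sum((y_hat_i - y_i)^2)."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * n)

# Hypothetical predictions and ground-truth labels for three instances.
predictions = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 1.0]

# Squared errors are 0.01, 0.04 and 0.16; their sum 0.21 divided by 2*3 is 0.035.
c = cost(predictions, targets)
print(c)
```

A perfect set of predictions gives a cost of exactly 0, and the cost grows as the predictions move away from the targets.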

So we want to minimize the cost function using gradient descent. Thus we need to take the derivative of the cost function with respect to the parameters we can tune, these being the weights $w$. We must therefore compute $\frac{\partial C}{\partial w}$. Using the chain rule from calculus, we can break this derivative into simpler pieces:

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}$

Our gradient descent calculation thus becomes

$w^{new} = w^{old} - \nu \frac{\partial C}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}$

where $\nu$ is the learning rate.
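The update rule itself is a single line once the gradient is known. The sketch below assumes the gradient has already been computed and is passed in as a list; the name `gradient_step` and the numbers are hypothetical.

```python
def gradient_step(w, grad, learning_rate):
    """One gradient descent update: w_new = w_old - learning_rate * grad."""
    return [wi - learning_rate * gi for wi, gi in zip(w, grad)]

# Hypothetical current weights and gradient of the cost with respect to them.
w_old = [0.5, -0.25]
grad = [0.2, -0.4]
learning_rate = 0.1

# Each weight moves against the sign of its gradient component.
w_new = gradient_step(w_old, grad, learning_rate)
print(w_new)
```

A weight with a positive gradient is decreased and a weight with a negative gradient is increased, so each step moves downhill on the cost surface as long as the learning rate is small enough.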

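Let's explicitly spell out these derivatives. For a single training instance ($N = 1$), the first is

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$

and the second involves the derivative of the activation function we are using, which is the sigmoid function. Writing $z = w^Tx + b$ and $\sigma(z) = \frac{1}{1+\exp(-z)}$, the sigmoid has the derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and since $\frac{\partial z}{\partial w} = x$ we get

$\frac{\partial \hat{y}}{\partial w} = \frac{1}{1+\exp(-(w^Tx + b))} \left(1 - \frac{1}{1+\exp(-(w^Tx + b))}\right) x = \hat{y}(1 - \hat{y})\,x$

With more than one instance, these per-instance gradients are averaged over the $N$ instances.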
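To see the sigmoid derivative at work, here is a minimal sketch for one training instance with cost $\frac{1}{2}(\hat{y} - y)^2$. It assumes the standard logistic sigmoid, compares the chain-rule gradient with a finite-difference estimate, and then runs a few gradient descent steps. The helper names, the data and the learning rate are all invented for the example, and the bias update is an addition that the text above does not spell out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """y_hat = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradients(w, b, x, y):
    """Chain-rule gradient of C = 0.5 * (y_hat - y)^2 for one instance."""
    y_hat = forward(w, b, x)
    # dC/dy_hat = (y_hat - y); dy_hat/dz = y_hat * (1 - y_hat) is the sigmoid derivative.
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w = [delta * xi for xi in x]  # dz/dw = x
    grad_b = delta                     # dz/db = 1 (bias update is an assumption here)
    return grad_w, grad_b

def numerical_grad_w(w, b, x, y, eps=1e-6):
    """Central finite-difference estimate of dC/dw, used only as a sanity check."""
    grads = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += eps
        w_minus = list(w)
        w_minus[j] -= eps
        c_plus = 0.5 * (forward(w_plus, b, x) - y) ** 2
        c_minus = 0.5 * (forward(w_minus, b, x) - y) ** 2
        grads.append((c_plus - c_minus) / (2 * eps))
    return grads

# Hypothetical parameters and a single training instance.
w, b = [0.5, -0.25], 0.1
x, y = [1.0, 2.0], 1.0

analytic, _ = gradients(w, b, x, y)
numeric = numerical_grad_w(w, b, x, y)
print(analytic, numeric)  # the two estimates should agree closely

# A few gradient descent steps with a hypothetical learning rate.
learning_rate = 0.5
cost_before = 0.5 * (forward(w, b, x) - y) ** 2
for _ in range(100):
    grad_w, grad_b = gradients(w, b, x, y)
    w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
    b = b - learning_rate * grad_b
cost_after = 0.5 * (forward(w, b, x) - y) ** 2
print(cost_before, cost_after)
```

The factor `y_hat * (1.0 - y_hat)` is exactly the sigmoid derivative: it tells the update how strongly a change in the weighted sum $w^Tx + b$ changes the output, and therefore how much of the error $\hat{y} - y$ should be attributed to each weight.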


Posted 2018-05-16T10:06:21.123

Reputation: 7 863

Thank you for the explanation. Can you look at my next question? – lukassz – 2018-05-17T10:21:03.693