The role of the derivative of the sigmoid function in neural networks



I am trying to understand the role of the derivative of the sigmoid function in neural networks.

First I plotted the sigmoid function, and its derivative at every point computed numerically from the definition, using Python. What exactly is the role of this derivative?

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivative(x, step):
    # Central difference approximation; more accurate than a one-sided difference
    return (sigmoid(x + step) - sigmoid(x - step)) / (2 * step)

x = np.linspace(-10, 10, 1000)

y1 = sigmoid(x)
y2 = derivative(x, 1e-6)  # a step near machine epsilon only amplifies floating-point rounding noise

plt.plot(x, y1, label='sigmoid')
plt.plot(x, y2, label='derivative')
plt.legend(loc='upper left')
plt.show()


Posted 2018-04-23T09:38:48.060

Reputation: 397

If you have more questions don't hesitate to ask – JahKnows – 2018-04-23T10:14:44.973

As a side comment, on those topics I generally advise everyone to read "Efficient Backprop" by LeCun and others. – lcrmorin – 2020-03-28T07:21:12.283



The use of derivatives in neural networks is for the training process called backpropagation. This technique uses gradient descent to find an optimal set of model parameters that minimizes a loss function. In your example you must use the derivative of the sigmoid because that is the activation function your individual neurons use.

The loss function

The essence of machine learning is to optimize an objective, typically called the loss or cost function, which we usually want to minimize. The cost function, $C$, assigns a penalty based on the errors made when passing data through your model, as a function of the model parameters.

Let's look at the example where we try to label whether an image contains a cat or a dog. If we have a perfect model, we can give the model a picture and it will tell us if it is a cat or a dog. However, no model is perfect and it will make mistakes.

When we train our model to infer meaning from input data, we want to minimize the number of mistakes it makes. So we use a training set: data containing many pictures of dogs and cats, each with its ground-truth label. Each time we run a training iteration of the model we calculate its cost (the amount of error). We want to minimize this cost.

Many cost functions exist, each serving its own purpose. A common choice is the quadratic cost, which is defined as

$C = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

This is the squared difference between the predicted label and the ground-truth label, averaged over the $N$ images we trained on. We will want to minimize this in some way.
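As a minimal sketch, this cost takes only a few lines of NumPy (the label and prediction arrays below are invented for illustration):

```python
import numpy as np

def quadratic_cost(y_hat, y):
    """Mean squared error between predictions and ground-truth labels."""
    return np.mean((y_hat - y) ** 2)

# Illustrative labels (1 = dog, 0 = cat) and model outputs
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.4])

cost = quadratic_cost(y_pred, y_true)  # 0.1125
```

A perfect model would give a cost of exactly zero; the worse the predictions, the larger the penalty.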

Minimizing a loss function

Indeed, most of machine learning is simply a family of frameworks that fit a model by minimizing some cost function. So the question we can ask is: how can we minimize a function?

Let's minimize the following function

$y = x^2-4x+6$.

If we plot this we can see that there is a minimum at $x = 2$. To find it analytically we can take the derivative of this function and set it to zero:

$\frac{dy}{dx} = 2x - 4 = 0$

$x = 2$.

However, finding a global minimum analytically is often not feasible. So instead we use optimization techniques. Many exist, such as Newton–Raphson, grid search, etc. Among these is gradient descent, the technique used by neural networks.

Gradient Descent

Let's use a famous analogy to understand this. Imagine a 2D minimization problem. It is equivalent to being on a mountainous hike in the wilderness. You want to get back down to the village, which you know is at the lowest point, even if you do not know the cardinal direction of the village. All you need to do is continuously take the steepest way down, and you will eventually reach the village. So we descend the surface based on the steepness of the slope.

Let's take our function

$y = x^2-4x+6$

we will determine the $x$ for which $y$ is minimized. The gradient descent algorithm first picks a random value for $x$. Let us initialize at $x = 8$. Then the algorithm iterates the following until we reach convergence:

$x^{new} = x^{old} - \nu \frac{dy}{dx}$

where $\nu$ is the learning rate, which we can set to whatever value we like. However, there is a smart way to choose it: too big and we will never reach the minimum, too small and we will waste a great deal of time getting there. It is analogous to the size of the steps you take down the steep slope. Take tiny steps and you will never get down the mountain; take steps that are too large and you risk overshooting the village and ending up on the other side of the mountain. The derivative is the means by which we travel down this slope towards our minimum.

$\frac{dy}{dx} = 2x - 4$

$\nu = 0.1$

Iterating from $x = 8$:

$x^{new} = 8 - 0.1(2 * 8 - 4) = 6.8 $
$x^{new} = 6.8 - 0.1(2 * 6.8 - 4) = 5.84 $
$x^{new} = 5.84 - 0.1(2 * 5.84 - 4) = 5.07 $
$x^{new} = 5.07 - 0.1(2 * 5.07 - 4) = 4.45 $
$x^{new} = 4.45 - 0.1(2 * 4.45 - 4) = 3.96 $
$x^{new} = 3.96 - 0.1(2 * 3.96 - 4) = 3.57 $
$x^{new} = 3.57 - 0.1(2 * 3.57 - 4) = 3.25 $
$x^{new} = 3.25 - 0.1(2 * 3.25 - 4) = 3.00 $
$x^{new} = 3.00 - 0.1(2 * 3.00 - 4) = 2.80 $
$x^{new} = 2.80 - 0.1(2 * 2.80 - 4) = 2.64 $
$x^{new} = 2.64 - 0.1(2 * 2.64 - 4) = 2.51 $
$x^{new} = 2.51 - 0.1(2 * 2.51 - 4) = 2.41 $
$x^{new} = 2.41 - 0.1(2 * 2.41 - 4) = 2.32 $
$x^{new} = 2.32 - 0.1(2 * 2.32 - 4) = 2.26 $
$x^{new} = 2.26 - 0.1(2 * 2.26 - 4) = 2.21 $
$x^{new} = 2.21 - 0.1(2 * 2.21 - 4) = 2.16 $
$x^{new} = 2.16 - 0.1(2 * 2.16 - 4) = 2.13 $
$x^{new} = 2.13 - 0.1(2 * 2.13 - 4) = 2.10 $
$x^{new} = 2.10 - 0.1(2 * 2.10 - 4) = 2.08 $
$x^{new} = 2.08 - 0.1(2 * 2.08 - 4) = 2.06 $
$x^{new} = 2.06 - 0.1(2 * 2.06 - 4) = 2.05 $
$x^{new} = 2.05 - 0.1(2 * 2.05 - 4) = 2.04 $
$x^{new} = 2.04 - 0.1(2 * 2.04 - 4) = 2.03 $
$x^{new} = 2.03 - 0.1(2 * 2.03 - 4) = 2.02 $
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.02 $
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.01 $
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01 $
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01 $
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.00 $
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $

And we see that the algorithm converges at $x = 2$! We have found the minimum.
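The iterations above can be reproduced with a short loop (same function, same learning rate, same starting point):

```python
def gradient(x):
    return 2 * x - 4  # dy/dx of y = x^2 - 4x + 6

x = 8.0   # initial guess
nu = 0.1  # learning rate

# Repeatedly step downhill: x_new = x_old - nu * dy/dx
for _ in range(100):
    x = x - nu * gradient(x)

print(round(x, 2))  # 2.0 -- the minimum
```

Each step shrinks the distance to the minimum by a constant factor $(1 - 2\nu)$, which is why the printed values above crawl steadily toward 2.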

Applied to neural networks

The first neural networks had only a single neuron, which took in some inputs $x$ and produced an output $\hat{y}$. A commonly used activation is the sigmoid function

$\sigma(z) = \frac{1}{1+\exp(-z)}$

$\hat{y}(x) = \sigma(w^Tx + b) = \frac{1}{1+\exp(-(w^Tx + b))}$

where $w$ is the vector of weights, one for each input in $x$, and $b$ is a bias. We then want to minimize our cost function

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

How to train the neural network?

We will use gradient descent to train the weights based on the output of the sigmoid function and we will use some cost function $C$ and train on batches of data of size $N$.

$C = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

$\hat{y}$ is the predicted class obtained from the sigmoid function and $y$ is the ground-truth label. We will use gradient descent to minimize the cost function with respect to the weights $w$. To make life easier we split the derivative using the chain rule:

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$.

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$ (dropping the constant factor $\frac{1}{N}$, which can be absorbed into the learning rate)

and we have that $\hat{y} = \sigma(w^Tx + b)$. The derivative of the sigmoid function is $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$, and with $z = w^Tx + b$ we get $\frac{\partial z}{\partial w} = x$; thus

$\frac{\partial \hat{y}}{\partial w} = \frac{1}{1+\exp(-(w^Tx + b))} \left(1 - \frac{1}{1+\exp(-(w^Tx + b))}\right) x$.

So we can then update the weights through gradient descent as

$w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$

where $\eta$ is the learning rate.
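Putting these pieces together, here is a minimal sketch of this training loop for a single sigmoid neuron. The toy data, learning rate, and iteration count below are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Invented toy batch: N = 3 samples with 2 features each
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 0.5        # learning rate
N = len(y)

for _ in range(2000):
    y_hat = sigmoid(X @ w + b)
    # Chain rule: dC/dw = (1/N) * sum_i (y_hat_i - y_i) * sigma'(z_i) * x_i,
    # with sigma'(z) = sigma(z) * (1 - sigma(z)) reusing the forward value
    delta = (y_hat - y) * y_hat * (1 - y_hat)
    w -= eta * (X.T @ delta) / N
    b -= eta * delta.mean()
```

After training, the quadratic cost on this toy batch should be far below its initial value of 0.25 (all predictions start at 0.5 because the weights start at zero).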



Reputation: 7 863

Please tell me why this process is not so nicely described in books. Do you have a blog? What materials for learning neural networks do you recommend? I have test data and I want to train on it. Can I plot the function that I will minimize? I would like to visualize this process to better understand it. – lukassz – 2018-04-23T20:27:51.440

Can you explain backpropagation in this simple way? – lukassz – 2018-04-23T20:34:54.870

Yeah sure, you can post some questions and tag me and then I will try my best to answer the ones I can. If you have your data make a new question with that data and we can walk you through this process. – JahKnows – 2018-04-24T02:03:49.323

@lukassz, when it comes to how the models are actually trained it is important to know, and you can dive into it deeply if you want to better understand what you are doing. However, by using packages with pre-built functions most of this math is hidden away from you. – JahKnows – 2018-04-24T02:05:11.563

Amazing answer... (+1) – Aditya – 2018-04-24T03:50:12.607

Backprop is also similar to what JahKnows has explained above... It's just that the gradient is carried all the way back from the outputs to the inputs. A quick Google search will make this clear. The same goes for every other activation function too. – Aditya – 2018-04-24T03:51:40.670

Thanks for reply, do you recommend some learning materials? – lukassz – 2018-04-24T10:29:50.143

@JahKnows I have one more question. I read and in this example the cost function is layer_1_error = layer_1 - y otherwise cost function depending on problem? Other for neural network, other for linear regression?

– lukassz – 2018-04-25T13:58:48.017

@lukassz, notice that his equation is the same as the one I have for the weight update in the second-to-last equation. $\frac{\partial C}{\partial w} = (\hat{y} - y) * \text{derivative of sigmoid}$. He uses the same cost function as me; don't forget that you need to take the derivative of the loss function too, which becomes $\hat{y} - y$, where $\hat{y}$ are the predicted labels and $y$ are the ground-truth labels. – JahKnows – 2018-04-25T15:47:22.327

Ok, thanks. So in single perceptron we do not need a cost function? – lukassz – 2018-04-25T15:55:58.043

@lukassz, you do need a cost function! Consider that the update rule for gradient descent is $w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$. This is always the case. Now we can choose a cost function, we choose $C = \frac{1}{2n}\sum(\hat{y} - y)^2$. And we know that the predicted classes depend on our weights, but this becomes complicated. We can simplify by using the chain rule to calculate $\frac{\partial C}{\partial w} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial w}$. That comes out to be what I wrote in the comment above. – JahKnows – 2018-04-25T16:02:08.310

@lukassz, $\frac{\partial C}{\partial y} = \hat{y} - y$ and $\frac{\partial y}{\partial w} = \text{derivative of sigmoid} = \frac{1}{1 + exp(w^Tx+b)}( 1-\frac{1}{1 + exp(w^Tx+b)})$. Now just put both of these together and you have what I wrote above and also what's in the code you linked to. – JahKnows – 2018-04-25T16:04:35.577

Ok. I think I understand this. In the Rosenblatt perceptron we have the simple learning rule target - output, but in more complicated neurons we have a cost function like you propose. Am I right? Is the learning rule a cost function too? Source:

– lukassz – 2018-04-25T17:34:33.590

@lukassz, no, $target - output$ is simply the derivative with respect to the outputs of the cost function I showed above. – JahKnows – 2018-04-26T02:06:56.513

@JahKnows can you do this with my question about backpropagation ?

– lukassz – 2018-05-16T10:07:43.330

@JahKnows - Thanks for this; however one part seems missing here in the chain rule: dCost/dWeight = dCost/dYhat * dYhat/dSum * dSum/dWeight, where Sum = layer*weight, which makes it complete. Can you please add this to your explanation?

– Alex Punnen – 2019-10-31T05:21:39.360


During the phase where the neural network generates its prediction, it feeds the input forward through the network. For each layer, the layer's input $X$ goes first through an affine transformation $W \cdot X + b$ and then is passed through the sigmoid function $σ(W \cdot X + b)$.

In order to train the network, the output $\hat y$ is then compared to the expected output (or label) $y$ through a cost function $L(y, \hat y)=L\left(y, σ(W \cdot X + b)\right)$. The goal of the whole training procedure is to minimize that cost function. In order to do that, a technique called gradient descent is performed which calculates how we should change $W$ and $b$ so that the cost reduces.

Gradient descent requires calculating the derivative of the cost function w.r.t. $W$ and $b$. To do that we must apply the chain rule, because the function we need to differentiate is a composition of functions; and as the chain rule dictates, one of the factors is the derivative of the sigmoid function.

One of the reasons the sigmoid function is popular in neural networks is that its derivative is easy to compute.
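A minimal illustration of that point: the derivative reuses the value already computed in the forward pass, so no extra exponential is needed.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)) -- reuses the forward value
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_prime(0.0))  # 0.25, the maximum of the derivative
```

This closed form matches a numerical central-difference estimate to high precision.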





In simple words:

The derivative shows the neuron's ability to learn from a particular input.

For example, if the input is 0, 1, or -2, the derivative (the "learning ability") is high, and back-propagation will improve the neuron's weights for this sample dramatically.

On the other hand, if the input is 20, the derivative will be very close to 0. This means that back-propagation on this sample will not "teach" this neuron to produce a better result.

The things above are valid for a single sample.

Let's look at the bigger picture, for all samples in the training set. Here we have several situations:

  • If the derivative is 0 for all samples in your training set AND the neuron always produces wrong results, the neuron is saturated ("dumb") and will not improve.
  • If the derivative is 0 for all samples in your training set AND the neuron always produces correct results, the neuron has learned really well and is already as smart as it can be (side note: this case is good, but it may indicate potential overfitting, which is not good)

  • If the derivative is 0 on some samples and non-0 on others AND the neuron produces mixed results, this neuron is doing some good work and may potentially improve with further training (though not necessarily, as it depends on the other neurons and the training data you have)

So, when you look at the derivative plot, you can see how prepared the neuron is to learn and absorb new knowledge, given a particular input.
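A quick numerical check of this "learning ability", sketched with the closed-form sigmoid derivative:

```python
import numpy as np

def learning_ability(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

for z in [0.0, 1.0, -2.0, 20.0]:
    print(z, learning_ability(z))
# z = 0, 1, -2: the derivative is sizeable -- the neuron can still learn
# z = 20: the derivative is ~2e-9 -- back-propagation barely moves the weights
```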



Reputation: 111


The derivative you see here is important in neural networks. It is also the reason why people generally prefer something else, such as the rectified linear unit (ReLU).

Do you see how the derivative drops towards the two ends? What if your network is at the very left but needs to move to the right? Imagine you are at -10.0 but you want 10.0. The gradient will be too small for your network to converge quickly. We do not want to wait; we want faster convergence. ReLU does not saturate for positive inputs, so it does not have this problem there.

We call this problem "neural network saturation" (it is closely related to the vanishing gradient problem).
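A small comparison of gradient magnitudes illustrates the point (a sketch; ReLU's gradient is taken as 1 for positive inputs and 0 otherwise):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Deep in the sigmoid's saturated region the gradient has all but vanished,
# while ReLU still passes a full-strength gradient for positive inputs.
print(sigmoid_grad(10.0))  # ~4.5e-05
print(relu_grad(10.0))     # 1.0
```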




Reputation: 3 050