Derivatives are used in neural networks during training, in a procedure called **backpropagation**. Backpropagation computes the gradients that **gradient descent** then uses to find a set of model parameters that minimizes a loss function. In your example you must use the **derivative of a sigmoid** because that is the activation function your individual neurons are using.

# The loss function

The essence of machine learning is to optimize some target function, either minimizing or maximizing it. This is typically called the loss or cost function, and we typically want to minimize it. The cost function, $C$, assigns a penalty to the errors the model makes when data is passed through it, expressed as a function of the model parameters.

Let's look at the example where we try to label whether an image contains a cat or a dog. If we have a perfect model, we can give the model a picture and it will tell us if it is a cat or a dog. However, no model is perfect and it will make mistakes.

When we train our model to infer meaning from input data, we want to minimize the number of mistakes it makes. So we use a training set: this data contains a lot of pictures of dogs and cats, each with its ground truth label. Each time we run a training iteration of the model we calculate the cost (a measure of the mistakes) of the model. We want to minimize this cost.

Many cost functions exist, each serving its own purpose. A common one is the quadratic cost, which is defined as

$C = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

This is the average squared difference between the predicted label $\hat{y}_i$ and the ground truth label $y_i$ over the $N$ images that we trained on. We will want to minimize this in some way.
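As a minimal sketch of the quadratic cost, assuming labels are encoded as numbers (here, hypothetically, 1.0 for "dog" and 0.0 for "cat") and the predictions are made-up values for illustration:

```python
def quadratic_cost(predictions, labels):
    """Average squared difference between predictions and ground truth."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    n = len(labels)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(predictions, labels)) / n


# Hypothetical data: 1.0 means "dog", 0.0 means "cat".
labels = [1.0, 0.0, 1.0, 0.0]
predictions = [0.9, 0.2, 0.6, 0.1]

cost = quadratic_cost(predictions, labels)
print(cost)  # small when predictions are close to the labels
```

A model that predicts every label exactly gets a cost of zero, and the cost grows as predictions drift away from the labels.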

# Minimizing a loss function

Indeed, most of machine learning is simply a family of frameworks that are capable of determining a distribution by minimizing some cost function. The question we can ask is: how can we minimize a function?

Let's minimize the following function

$y = x^2-4x+6$.

If we plot this we can see that there is a minimum at $x = 2$. To find it analytically we take the derivative of the function and set it to zero:

$\frac{dy}{dx} = 2x - 4 = 0$

$x = 2$.

However, finding a global minimum analytically is often not feasible, so instead we use numerical optimization techniques. Many exist, such as Newton-Raphson, grid search, etc. Among these is **gradient descent**, which is the technique used by neural networks.

## Gradient Descent

Let's use a well-known analogy to understand this. Imagine a 2D minimization problem. This is equivalent to being on a mountainous hike in the wilderness. You want to get back down to the village, which you know is at the lowest point. Even if you do not know in which direction the village lies, all you need to do is continuously take the steepest way down, and you will eventually get to the village. So we descend the surface based on the steepness of the slope.

Let's take our function

$y = x^2-4x+6$

We will determine the $x$ for which $y$ is minimized. The gradient descent algorithm first picks a random starting value for $x$; let us initialize at $x=8$. Then the algorithm applies the following update iteratively until it converges:

$x^{new} = x^{old} - \nu \frac{dy}{dx}$

where $\nu$ is the learning rate. We can set this to whatever value we like, but it should be chosen with care: too big and we will never reach our minimum value, too small and we will waste a lot of time before we get there. It is analogous to the size of the steps you take down the steep slope. Take steps that are too small and you will die on the mountain before you ever get down. Take steps that are too large and you risk overshooting the village and ending up on the other side of the mountain. **The derivative is the means by which we travel down this slope towards our minimum.**
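To see the effect of the step size concretely, here is a small sketch on the same function $y = x^2-4x+6$. The three learning rates and the step count of 50 are values picked for illustration only, not recommendations.

```python
def dy_dx(x):
    """Derivative of y = x^2 - 4x + 6."""
    return 2 * x - 4


def descend(x, learning_rate, steps):
    """Apply the gradient descent update a fixed number of times."""
    for _ in range(steps):
        x = x - learning_rate * dy_dx(x)
    return x


# Illustrative learning rates (assumed values), all starting from x = 8.
x_small = descend(8.0, 0.001, 50)  # tiny steps: barely moved after 50 steps
x_good = descend(8.0, 0.1, 50)     # moderate steps: very close to 2
x_big = descend(8.0, 1.1, 50)      # oversized steps: overshoots further each time

print(x_small, x_good, x_big)
```

With the oversized rate every update lands further from $x = 2$ than the previous one, which is the "other side of the mountain" situation.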

$\frac{dy}{dx} = 2x - 4$

$\nu = 0.1$

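Iterations, starting from $x = 8$ (each value is carried forward at full precision and shown truncated to two decimals):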

$x^{new} = 8 - 0.1(2 * 8 - 4) = 6.8 $

$x^{new} = 6.8 - 0.1(2 * 6.8 - 4) = 5.84 $

$x^{new} = 5.84 - 0.1(2 * 5.84 - 4) = 5.07 $

$x^{new} = 5.07 - 0.1(2 * 5.07 - 4) = 4.45 $

$x^{new} = 4.45 - 0.1(2 * 4.45 - 4) = 3.96 $

$x^{new} = 3.96 - 0.1(2 * 3.96 - 4) = 3.57 $

$x^{new} = 3.57 - 0.1(2 * 3.57 - 4) = 3.25 $

$x^{new} = 3.25 - 0.1(2 * 3.25 - 4) = 3.00 $

$x^{new} = 3.00 - 0.1(2 * 3.00 - 4) = 2.80 $

$x^{new} = 2.80 - 0.1(2 * 2.80 - 4) = 2.64 $

$x^{new} = 2.64 - 0.1(2 * 2.64 - 4) = 2.51 $

$x^{new} = 2.51 - 0.1(2 * 2.51 - 4) = 2.41 $

$x^{new} = 2.41 - 0.1(2 * 2.41 - 4) = 2.32 $

$x^{new} = 2.32 - 0.1(2 * 2.32 - 4) = 2.26 $

$x^{new} = 2.26 - 0.1(2 * 2.26 - 4) = 2.21 $

$x^{new} = 2.21 - 0.1(2 * 2.21 - 4) = 2.16 $

$x^{new} = 2.16 - 0.1(2 * 2.16 - 4) = 2.13 $

$x^{new} = 2.13 - 0.1(2 * 2.13 - 4) = 2.10 $

$x^{new} = 2.10 - 0.1(2 * 2.10 - 4) = 2.08 $

$x^{new} = 2.08 - 0.1(2 * 2.08 - 4) = 2.06 $

$x^{new} = 2.06 - 0.1(2 * 2.06 - 4) = 2.05 $

$x^{new} = 2.05 - 0.1(2 * 2.05 - 4) = 2.04 $

$x^{new} = 2.04 - 0.1(2 * 2.04 - 4) = 2.03 $

$x^{new} = 2.03 - 0.1(2 * 2.03 - 4) = 2.02 $

$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.02 $

$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.01 $

$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01 $

$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01 $

$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.00 $

$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $

$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $

$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $

$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00 $

And we see that the algorithm converges to $x = 2$! We have found the minimum.
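The iterations above can be reproduced with a short loop. This is a minimal sketch: the stopping rule (stop once the step is smaller than a `tolerance`) and the `max_steps` cap are assumptions added for illustration, since the text only says to iterate until convergence.

```python
def y(x):
    return x ** 2 - 4 * x + 6


def dy_dx(x):
    return 2 * x - 4


def gradient_descent(x_start, learning_rate=0.1, tolerance=1e-6, max_steps=1000):
    """Repeat x_new = x_old - learning_rate * dy/dx until the step is tiny."""
    x = x_start
    history = [x]
    for _ in range(max_steps):
        step = learning_rate * dy_dx(x)
        x = x - step
        history.append(x)
        if abs(step) < tolerance:  # assumed convergence criterion
            break
    return x, history


x_min, history = gradient_descent(8.0)
print(history[:4])      # first few values, matching the start of the list above
print(x_min, y(x_min))  # x approaches 2, where y is about 2
```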

# Applied to neural networks

The first neural networks had only a single neuron, which took in some inputs $x$ and provided an output $\hat{y}$. A common activation function is the sigmoid function

$\sigma(z) = \frac{1}{1+\exp(-z)}$

$\hat{y} = \sigma(w^Tx + b) = \frac{1}{1+\exp(-(w^Tx + b))}$

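where $w$ holds the weight associated with each input in $x$ and $b$ is a bias. We then want to minimize our cost function, the quadratic cost with an extra factor of $\frac{1}{2}$ that cancels when we differentiate:

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.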
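A minimal sketch of the single neuron's forward pass, assuming the standard logistic form $\sigma(z) = \frac{1}{1+\exp(-z)}$. The weights, bias, and inputs are made-up numbers for illustration.

```python
import math


def sigmoid(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Compute y_hat = sigmoid(w^T x + b) for one input vector."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical parameters and input.
weights = [0.5, -0.25]
bias = 0.1
inputs = [2.0, 4.0]

y_hat = neuron_output(weights, bias, inputs)
print(y_hat)  # a value strictly between 0 and 1
```

Because the output always lies between 0 and 1, it can be read as how strongly the neuron favours one of the two classes.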

## How to train the neural network?

We will use gradient descent to train the weights based on the output of the sigmoid function, using the cost function $C$ and training on batches of data of size $N$.

$C = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

$\hat{y}$ is the predicted class obtained from the sigmoid function and $y$ is the ground truth label. We will use gradient descent to minimize the cost function with respect to the weights $w$. To make life easier we split the derivative using the chain rule. For a single training example,

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$,

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$

(over a batch, these per-example terms are averaged, which accounts for the $\frac{1}{N}$). We have $\hat{y} = \sigma(z)$ with $z = w^Tx + b$, the derivative of the sigmoid function is $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$, and $\frac{\partial z}{\partial w} = x$, thus

$\frac{\partial \hat{y}}{\partial w} = \sigma(z)(1 - \sigma(z))\, x = \hat{y}(1 - \hat{y})\, x$.

Putting the pieces together for a batch,

$\frac{\partial C}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)\, x_i$.
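The identity $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$ can be sanity-checked numerically. The sketch below compares it against a central finite difference; the step `h` and the test points are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-5):
    """Numerical estimate of f'(z) using points on either side of z."""
    return (f(z + h) - f(z - h)) / (2 * h)


for z in (-2.0, 0.0, 0.5, 3.0):
    analytic = sigmoid_derivative(z)
    numeric = central_difference(sigmoid, z)
    print(z, analytic, numeric)  # the two estimates should agree closely
```

The derivative peaks at $z = 0$ and shrinks towards zero for large $|z|$, so a neuron whose output is already close to 0 or 1 receives only small weight updates.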

So we can then update the weights through gradient descent as

$w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$

where $\eta$ is the learning rate (the same role $\nu$ played in the one-dimensional example).
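Putting it all together, here is a minimal sketch of training a single sigmoid neuron with gradient descent on the quadratic cost. Several things here are assumptions for illustration rather than part of the derivation above: the tiny one-dimensional dataset, the zero initialization, the values of `eta` and `epochs`, and the bias update, which follows the same chain rule with $\frac{\partial z}{\partial b} = 1$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """y_hat = sigmoid(w^T x + b) for one input vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def cost(w, b, xs, ys):
    """Quadratic cost C = 1/(2N) * sum (y_hat - y)^2."""
    n = len(xs)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradients(w, b, xs, ys):
    """Batch-averaged dC/dw and dC/db from the chain rule."""
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        y_hat = predict(w, b, x)
        # (dC/dy_hat) * (dy_hat/dz) = (y_hat - y) * y_hat * (1 - y_hat)
        delta = (y_hat - y) * y_hat * (1.0 - y_hat)
        for j, xj in enumerate(x):
            grad_w[j] += delta * xj / n  # dz/dw_j = x_j
        grad_b += delta / n              # dz/db = 1 (assumed bias update)
    return grad_w, grad_b


def train(xs, ys, eta=0.5, epochs=2000):
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w, grad_b = gradients(w, b, xs, ys)
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b = b - eta * grad_b
    return w, b


# Hypothetical toy data: negative inputs are class 0, positive inputs class 1.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0.0, 0.0, 1.0, 1.0]

initial_cost = cost([0.0], 0.0, xs, ys)
w, b = train(xs, ys)
final_cost = cost(w, b, xs, ys)
print(initial_cost, final_cost)  # the cost should drop during training
```

Each epoch here uses the whole dataset as one batch. Real networks stack many such neurons, and backpropagation applies the same chain rule layer by layer to obtain the gradient for every weight.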

If you have more questions don't hesitate to ask – JahKnows – 2018-04-23T10:14:44.973

As a side comment, on those topics I generally advise everyone to read "Efficient Backprop" by LeCun and others. – lcrmorin – 2020-03-28T07:21:12.283