Gradients for bias terms in backpropagation



I was trying to implement a neural network from scratch to understand the maths behind it. My problem is entirely about backpropagation, specifically the derivative with respect to the bias. I derived all the equations used in backpropagation, and every one of them matches the code for the neural network except the derivative with respect to the biases.


#back prop


I looked up the code online, and I want to know why we sum the matrix and then subtract the result, db2=np.sum(dz2,axis=0,keepdims=True), from the original bias, rather than subtracting the matrix as a whole. Can anyone give me some intuition behind it? If I take the partial derivative of the loss with respect to the bias, I get only the upstream gradient, dz2, because the term involving h1 and theta differentiates to 0 and b2 differentiates to 1. So only the upstream term is left.



Posted 2017-07-03T17:03:24.397

Reputation: 345



The bias term is very simple, which is why you often don't see it calculated. In fact

db2 = dz2

So your update rules for bias on a single item are:

b2 += -alpha * dz2


b1 += -alpha * dz1

In terms of the maths, if your loss is $J$, and you know $\frac{\partial J}{\partial z_i}$ for a given neuron $i$ which has bias term $b_i$ . . .

$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial b_i}$$


$$\frac{\partial z_i}{\partial b_i} = 1$$

because $z_i = (\text{something unaffected by } b_i) + b_i$
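The identity above can be checked numerically. The following is only a minimal sketch: the layer sizes, the squared-error loss, and the names `h`, `W`, `b` and `target` are assumptions chosen for illustration, not taken from the question's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs feeding a layer of 3 neurons, one example.
h = rng.normal(size=(1, 4))
W = rng.normal(size=(4, 3))
b = rng.normal(size=(1, 3))
target = rng.normal(size=(1, 3))

def loss(b_vec):
    # z_i = (something unaffected by b_i) + b_i
    z = h @ W + b_vec
    return 0.5 * np.sum((z - target) ** 2)  # assumed loss, for illustration

# Analytic gradients: dJ/dz for this loss, and dJ/db = dJ/dz * 1.
dz = (h @ W + b) - target
db = dz

# Central finite differences on each bias entry.
eps = 1e-6
db_numeric = np.zeros_like(b)
for i in range(b.shape[1]):
    step = np.zeros_like(b)
    step[0, i] = eps
    db_numeric[0, i] = (loss(b + step) - loss(b - step)) / (2 * eps)

print(np.allclose(db, db_numeric, atol=1e-6))
```

Each bias entry gets its own gradient value here, equal to the corresponding entry of `dz`.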

It looks like the code you copied uses the form

db2=np.sum(dz2,axis=0,keepdims=True)

because the network is designed to process examples in (mini-)batches, so gradients are calculated for more than one example at a time. The sum collapses the per-example gradients into a single update for each bias. This would be easier to confirm if you also showed the update code for the weights.
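To make the batch case concrete, here is a small sketch with made-up numbers (the batch size of 5 and the 3 output neurons are assumptions). It accumulates the single-example rule one row at a time and compares the result with the vectorised sum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mini-batch: 5 examples, 3 output neurons.
# Row n of dz2 holds dJ/dz for example n.
dz2 = rng.normal(size=(5, 3))

# Apply the single-example rule db2 = dz2 once per example and accumulate.
db2_loop = np.zeros((1, 3))
for n in range(dz2.shape[0]):
    db2_loop += dz2[n:n + 1, :]

# The vectorised form from the question.
db2 = np.sum(dz2, axis=0, keepdims=True)

print(db2.shape)  # (1, 3): one entry per bias, not a scalar
print(np.allclose(db2, db2_loop))
```

Because the sum runs over `axis=0` (the example axis), the result keeps one entry per neuron, so each bias still receives its own update.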

Neil Slater

Posted 2017-07-03T17:03:24.397

Reputation: 24 613

Yes, exactly, that is what I thought, because that is how it looks mathematically. But in the code I saw, they summed up the matrix and then added it to b1. – user34042 – 2017-07-03T19:30:10.507

theta1=theta1-alpha*dw1 theta2=theta2-alpha*dw2 I still don't get it. That way the same term will be added to every entry of the 'b' vector, which would otherwise have had a different update for each entry. That would make a significant difference to whether the neural network reaches a minimum. – user34042 – 2017-07-03T19:30:33.627

@user34042: Something doesn't seem right to me - could you link the source you got that code from? I wonder if the source got it wrong because it has mixed and matched mini-batch code with simple online gradient descent. – Neil Slater – 2017-07-03T19:35:00.403

here it is. – user34042 – 2017-07-03T19:36:35.437

I think the source has it wrong. The NN will still kind of work with all bias values the same, so they may not have noticed. And as I mentioned, you might actually use that code in a batch-based scenario, so it could just be a cut&paste error. – Neil Slater – 2017-07-03T19:39:37.807

No, there was no error actually. I wrote my own code and was just comparing it with someone else's to see if I got the maths right. I was worried about the mathematics and wondering where I had gone wrong, that is why. Thanks for confirming. – user34042 – 2017-07-03T19:42:05.033


I would like to explain the meaning of db2=np.sum(dz2,axis=0,keepdims=True) as it also confused me once and it didn't get answered.

The derivative of L (loss) w.r.t. b is the upstream derivative multiplied by the local derivative: $$ \frac{ \partial L}{\partial \mathbf{b}} = \frac{ \partial L}{\partial Z} \frac{ \partial Z}{\partial \mathbf{b}} $$

If we have multiple samples, Z and L are both matrices. b is still a vector.

The local derivative is simply a vector of ones: $$ \frac{ \partial Z}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}} \left( W \times X + \mathbf{b} \right) = \mathbf{1} $$

That means our complete derivative is a matrix multiplication, that looks as follows (e.g. 2 samples with 3 outputs): $$ \frac{\partial L}{\partial Z} \times \mathbf{1} = \begin{bmatrix} . &. &. \\ . &. &. \end{bmatrix} \begin{bmatrix} 1\\ 1\\ 1\\ \end{bmatrix} $$

Note that this is the sum of the rows.

And that's where db2=np.sum(dz2, axis=0, keepdims=True) comes from. It is simply an abbreviation for the matrix multiplication of the local and the upstream derivatives.


Posted 2017-07-03T17:03:24.397

Reputation: 432

This answer does not seem to be correct as pointed out already. axis=0 means we're summing the columns, so the result will be a 1x3 matrix. – Snowball – 2020-10-22T22:37:00.907

I think it should be (1,1,1) * dL/dz. The ones need to be shaped (1,3) and not (3,1) – deekay42 – 2021-01-29T03:29:41.163
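The shape question raised in these comments can be checked directly. The sketch below uses made-up numbers for a 2-sample, 3-output gradient matrix; it suggests that the product matching `np.sum(..., axis=0, keepdims=True)` is a row of ones (one entry per sample) multiplied from the left, while multiplying by a column of ones from the right sums within each sample instead.

```python
import numpy as np

# Hypothetical upstream gradient: 2 samples (rows), 3 outputs (columns).
dL_dZ = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Row of ones with one entry per sample, multiplied from the left.
ones_row = np.ones((1, dL_dZ.shape[0]))            # shape (1, 2)
db_matmul = ones_row @ dL_dZ                       # shape (1, 3)

# The NumPy shorthand used in the question.
db_sum = np.sum(dL_dZ, axis=0, keepdims=True)      # shape (1, 3)

# Column of ones multiplied from the right: sums across outputs per sample.
per_sample = dL_dZ @ np.ones((dL_dZ.shape[1], 1))  # shape (2, 1)

print(db_matmul)   # [[5. 7. 9.]]
print(db_sum)      # [[5. 7. 9.]]
print(per_sample.shape)  # (2, 1)
```

Only the first two results have one entry per bias, which is the shape needed for the bias update.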


First, you must correct your formula for the gradient of the sigmoid function.

The first derivative of sigmoid function is: (1−σ(x))σ(x)

Your formula for dz2 will become: dz2 = (1-h2)*h2 * dh2

For σ(x) you must use the output of the sigmoid function, not the gradient.
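As a quick sanity check of this formula, the sketch below compares (1−σ(x))σ(x), computed from the cached sigmoid output, against a finite-difference estimate. The sample points and the helper name `sigmoid` are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
h = sigmoid(x)  # forward-pass output, the value to reuse in backprop

# Derivative written in terms of the sigmoid output.
grad_from_output = (1.0 - h) * h

# Central finite-difference estimate of d(sigmoid)/dx.
eps = 1e-6
grad_numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(grad_from_output, grad_numeric, atol=1e-8))
```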

You must sum the gradient for the bias because it comes from many individual inputs (the number of inputs = batch size), so we accumulate them to update the biases of layer 2. The gradients arriving at layer 1 are a different case: each layer 1 node receives gradient from many nodes of layer 2, so you also have to sum over those nodes before updating the biases and weights in layer 1. That sum is distinct from the batch sum used for the layer 2 biases.

My implementation of two fully-connected layers with sigmoid activation functions:

import numpy as np

# y, W_1, b_1, W_2, b_2 and get_loss are assumed to be defined elsewhere.
# X is an assumed name for the input batch (one example per row).
lr = 1e-3
f = lambda x: 1.0/(1.0 + np.exp(-x))

# pass through layer 1
out_l1 = np.dot(X, W_1) + b_1
out_s1 = f(out_l1)

# pass through layer 2
out_l2 = np.dot(out_s1, W_2) + b_2
out_s2 = f(out_l2)

loss = get_loss(out_s2, y)

grad = out_s2 - y

d_h2 = (1 - out_s2) * out_s2 * grad

# Accumulate the gradients coming from all examples
d_W2 = np.dot(out_s1.T, d_h2)
d_b2 = np.sum(d_h2, axis=0, keepdims=True)

# Sum of the gradients arriving from the layer 2 nodes
grad_1 = np.dot(d_h2, W_2.T)
# Use the sigmoid output out_s1 here, not the pre-activation out_l1
d_h1 = (1 - out_s1) * out_s1 * grad_1

d_W1 = np.dot(X.T, d_h1)
d_b1 = np.sum(d_h1, axis=0, keepdims=True)

W_1 -= d_W1 * lr
b_1 -= d_b1 * lr

W_2 -= d_W2 * lr
b_2 -= d_b2 * lr
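One way to test bias gradients for a two-layer sigmoid network like this is a finite-difference gradient check. The sketch below is self-contained and rests on several assumptions that are not in the answer: a squared-error loss of the form 0.5 * sum((out − y)²) (chosen so that the output gradient is `out_s2 - y`), random data, and the helper names `forward`, `loss_fn`, `bias_grads` and `numeric_grad`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W_1, b_1, W_2, b_2):
    out_s1 = sigmoid(X @ W_1 + b_1)
    out_s2 = sigmoid(out_s1 @ W_2 + b_2)
    return out_s1, out_s2

def loss_fn(X, y, W_1, b_1, W_2, b_2):
    # Assumed loss: half the summed squared error over the batch.
    _, out_s2 = forward(X, W_1, b_1, W_2, b_2)
    return 0.5 * np.sum((out_s2 - y) ** 2)

def bias_grads(X, y, W_1, b_1, W_2, b_2):
    out_s1, out_s2 = forward(X, W_1, b_1, W_2, b_2)
    d_h2 = (1 - out_s2) * out_s2 * (out_s2 - y)
    # Gradient reaching layer 1: summed over layer 2 nodes via the dot product.
    d_h1 = (1 - out_s1) * out_s1 * (d_h2 @ W_2.T)
    # Bias gradients: summed over the batch axis.
    d_b1 = np.sum(d_h1, axis=0, keepdims=True)
    d_b2 = np.sum(d_h2, axis=0, keepdims=True)
    return d_b1, d_b2

def numeric_grad(f, b, eps=1e-6):
    # Central finite differences, one bias entry at a time.
    g = np.zeros_like(b)
    for i in range(b.shape[1]):
        step = np.zeros_like(b)
        step[0, i] = eps
        g[0, i] = (f(b + step) - f(b - step)) / (2 * eps)
    return g

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))       # 6 examples, 4 features
y = rng.uniform(size=(6, 2))      # 2 outputs
W_1 = rng.normal(size=(4, 5))
b_1 = rng.normal(size=(1, 5))
W_2 = rng.normal(size=(5, 2))
b_2 = rng.normal(size=(1, 2))

d_b1, d_b2 = bias_grads(X, y, W_1, b_1, W_2, b_2)
num_b1 = numeric_grad(lambda b: loss_fn(X, y, W_1, b, W_2, b_2), b_1)
num_b2 = numeric_grad(lambda b: loss_fn(X, y, W_1, b_1, W_2, b), b_2)

print(np.allclose(d_b1, num_b1, atol=1e-6))
print(np.allclose(d_b2, num_b2, atol=1e-6))
```

If either comparison fails after a change to the backward pass, the analytic gradient for that layer is the first place to look.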


Posted 2017-07-03T17:03:24.397

Reputation: 111



The implementation of the derivative of L (loss) w.r.t. b is very confusing. In particular, the argument axis=0 means that you are calculating the sum over columns, not rows. Correct me if I am wrong.

Phước Nguyện Bùi

Posted 2017-07-03T17:03:24.397

Reputation: 1