This is a small conceptual question that's been nagging me for a while: How can we back-propagate through a max-pooling layer in a neural network?

I came across max-pooling layers while going through this tutorial for Torch 7's nn library. The library abstracts the gradient calculation and forward passes for each layer of a deep network. I don't understand how the gradient calculation is done for a max-pooling layer.

I know that if you have an input ${z_i}^l$ going into neuron $i$ of layer $l$, then ${\delta_i}^l$ (defined as ${\delta_i}^l = \frac{\partial E}{\partial {z_i}^l}$) is given by: $$ {\delta_i}^l = \theta^{'}({z_i}^l) \sum_{j} {\delta_j}^{l+1} w_{i,j}^{l,l+1} $$

So, a max-pooling layer would receive the ${\delta_j}^{l+1}$'s of the next layer as usual; but since the activation function for the max-pooling neurons takes in a vector of values (over which it maxes) as input, ${\delta_i}^{l}$ isn't a single number anymore, but a vector ($\theta^{'}({z_j}^l)$ would have to be replaced by $\nabla \theta(\left\{{z_j}^l\right\})$). Furthermore, $\theta$, being the max function, isn't differentiable with respect to its inputs.

So, how should this work out, exactly?

Oh right, there is no point back-propagating through the non-maximum neurons - that was a crucial insight.

So if I now understand this correctly, back-propagating through the max-pooling layer simply selects the max neuron from the previous layer (the one over which the max was taken) and continues back-propagation only through that one. – shinvu – 2016-05-13T05:35:39.633
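That routing rule can be sketched in NumPy. This is purely an illustrative toy implementation (the function names `maxpool_forward` and `maxpool_backward` are made up, and real frameworks vectorize this); the forward pass remembers which input position won each window, and the backward pass sends the upstream gradient only to that position:

```python
import numpy as np

def maxpool_forward(x, pool=2):
    """2x2 max pooling, stride 2, on a 2D input (illustrative sketch)."""
    h, w = x.shape
    out = np.zeros((h // pool, w // pool))
    # Remember which input position won each window, for the backward pass.
    argmax = np.zeros((h // pool, w // pool, 2), dtype=int)
    for i in range(h // pool):
        for j in range(w // pool):
            window = x[i*pool:(i+1)*pool, j*pool:(j+1)*pool]
            out[i, j] = window.max()
            k = np.unravel_index(window.argmax(), window.shape)
            argmax[i, j] = (i*pool + k[0], j*pool + k[1])
    return out, argmax

def maxpool_backward(dout, argmax, input_shape):
    """Route each upstream gradient to the input position that won the max."""
    dx = np.zeros(input_shape)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            r, c = argmax[i, j]
            dx[r, c] += dout[i, j]  # local derivative is 1 for the max, 0 elsewhere
    return dx
```

Note that no activation derivative appears in the backward pass: within each window the max is locally the identity on the winning entry, so its derivative is 1 there and 0 everywhere else.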

But don't you need to multiply with the derivative of the activation function? – Jason – 2018-03-19T04:02:51.057

@Jason: The max function is locally linear for the activation that got the max, so its derivative is a constant 1. For the activations that didn't make it through, it's 0. That's conceptually very similar to differentiating the ReLU(x) = max(0, x) activation function. – Chrigi – 2019-02-05T12:48:19.173

What if the stride is less than the kernel width for max pooling? – Vatsal – 2019-03-04T05:52:15.460

Great answer! What about the edge case where multiple entries have the same max value (for example, two values are 0 from a ReLU and the other two are negative)? – DankMasterDan – 2019-04-23T17:11:42.293

@DankMasterDan After some experimentation, it looks like TensorFlow will pick the first entry with the max value. – Swier – 2019-12-04T20:42:49.587

@DankMasterDan It's also valid to just not give any gradient to these tied values (verified through gradient checking). The only thing you shouldn't do is pass gradients back to all values that match the max. – Recessive – 2020-01-28T06:05:40.773
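On the tie question, here is a minimal NumPy illustration of the "first entry wins" convention described above (a sketch only, not any framework's actual kernel): `np.argmax` returns the first index attaining the maximum, so routing the gradient through it sends the whole upstream gradient to one winner rather than splitting it among all tied entries.

```python
import numpy as np

# Two entries tie at the max value (0.0); argmax picks the FIRST of them,
# so the upstream gradient of 1 is routed only to that entry.
window = np.array([0.0, -1.0, 0.0, -2.0])
winner = window.argmax()          # index 0, the first tied maximum
grad = np.zeros_like(window)
grad[winner] = 1.0                # gradient goes to one winner, not both
```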