Why does the sigmoid activation function result in sub-optimal gradient descent?


I need some help understanding the second shortcoming of the sigmoid activation function as described in this video from Stanford. She says that because the output of sigmoid is always positive, any gradients flowing back from a neuron following a sigmoid will all share the same sign as the upstream gradient flowing into that neuron. She then says that a consequence of these weight updates sharing the same sign is a sub-optimal zigzag gradient descent path.

I understand this phenomenon when zoomed in on a single neuron. However, since upstream gradients flowing into a layer can be of different signs, it's still possible to get a healthy mixture of positive and negative weight updates in a layer. Therefore, I'm having trouble understanding how using sigmoid results in this zigzag descent path, except in the case where the upstream gradients are all of the same sign (which intuitively seems uncommon). It seems to me that if this suboptimal descent is important enough to be highlighted in the lecture, it must be more common than that.

I'm wondering if the issue is "reduced entropy" among the weight updates, rather than all weight updates in the network sharing the same sign. That is, zigzagging in a subset of the dimensions. For example, say a network using sigmoid has four weights in a layer with two neurons: w1, w2, w3, and w4. The updates to w1 and w2 could be positive, while the updates to w3 and w4 could be negative if the two upstream gradients differ in sign. However, it wouldn't be possible for w1 and w3 to be positive, and w2 and w4 to be negative. Is this the limitation of sigmoid that the Stanford lecture is referring to, assuming the second combination of weight updates was the optimal one?
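To make the example above concrete, here is a minimal numpy sketch with made-up numbers (the activations and error terms are hypothetical, chosen only for illustration). With two positive sigmoid activations feeding two neurons, the update matrix is an outer product of the per-neuron error terms and the activations, so each row has a single sign:

```python
import numpy as np

# Two sigmoid activations feeding a layer of two neurons (always positive)
a = np.array([0.7, 0.3])        # a1, a2 from the previous sigmoid layer
delta = np.array([1.5, -0.8])   # upstream error terms of opposite sign

# Updates laid out as [[dw1, dw2], [dw3, dw4]] = outer(delta, a)
updates = np.outer(delta, a)
print(np.sign(updates))
# Rows share one sign each: the pattern [[+, +], [-, -]] is reachable,
# but [[+, -], [-, +]] never is, because a1 and a2 are both positive.
```

Sign patterns that differ within a row would require a negative activation, which sigmoid can never produce.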


Posted 2020-10-30T17:16:53.030


Yes, the lecturer is referring to a single neuron. It's true that in the same layer we could have updates of different signs. However, by using the sigmoid function, all the weights connecting to a single neuron will be updated by either increasing their values or decreasing them (and not both at the same time) $\rightarrow$ zig-zagging in order to reach the optimal values of the weights. – Javier TG – 2020-10-31T08:26:57.380

Thanks Javier. To make sure I understand your response, are you saying it’s correct that with sigmoid we zigzag in a subset of dimensions? That is, constraining the gradients flowing back from a single node to share the same sign is enough to cause the behavior, regardless of what the rest of the network is doing? – Churchjm – 2020-10-31T15:57:36.753

Yes, I think we are on the same page. Concretely, this happens because the update of a weight that connects a neuron $j$ with a neuron $k$ is given by a quantity proportional to: $$\frac{\partial C}{\partial w^l_{kj}} = \delta^l_k \, a_j^{l-1}$$ where $C$ is the cost function, $a_j^{l-1}$ the activation of the neuron $j$, and $\delta^l_k$ the "error" term for the neuron $k$. $\delta^l_k$ is just a scalar; hence, if all $a_j^{l-1}$ are positive (which happens with sigmoid), then the updates of all the weights that connect to a neuron $k$ will have the same sign. – Javier TG – 2020-10-31T16:12:50.633
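The formula in the comment above can be checked numerically. This is a small numpy sketch (random hypothetical values, not taken from the lecture): the gradient with respect to the whole weight matrix is the outer product $\delta^l \, (a^{l-1})^\top$, and since every sigmoid activation $a_j^{l-1}$ is positive, each row of that gradient matrix inherits the sign of its neuron's $\delta^l_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sigmoid activations from the previous layer: always in (0, 1), so all positive
a_prev = 1 / (1 + np.exp(-rng.normal(size=4)))
# Per-neuron "error" terms delta_k for the current layer: mixed signs
delta = rng.normal(size=3)

# dC/dw_kj = delta_k * a_j  ->  the full gradient is an outer product
grad_W = np.outer(delta, a_prev)

# Every row k (all weights feeding neuron k) shares the sign of delta_k
for k in range(len(delta)):
    assert np.all(np.sign(grad_W[k]) == np.sign(delta[k]))
```

So mixed signs across rows (across neurons) are possible, but never within a row, which is exactly the "zigzag in a subset of dimensions" described in the question.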

Thanks for your comments Javier. It seems I can't mark a comment as an answer though. If you'd like to re-post this as an answer, I'll accept it. – Churchjm – 2020-11-01T18:01:31.177

No answers