How should the bias be initialized and regularized?



I've read a couple of papers about kernel initialization, and many of them mention using L2 regularization on the kernel (often with $\lambda = 0.0001$).

Does anybody do anything different from initializing the bias to a constant zero and leaving it unregularized?
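To make the default I'm describing concrete, here is a minimal numpy sketch of one dense layer set up this way (He-initialized kernel, zero bias, L2 penalty on the kernel only; the layer sizes and the $\lambda = 0.0001$ are just the values mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

fan_in, fan_out = 256, 128
lam = 1e-4  # the lambda = 0.0001 from the papers mentioned above

# He initialization for the kernel: small random values break symmetry
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# The common default: bias starts at exactly zero
b = np.zeros(fan_out)

# L2 penalty is applied to the kernel only, not the bias
l2_penalty = lam * np.sum(W ** 2)
```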

Kernel initialization papers

Martin Thoma

Posted 2017-03-30T04:40:43.763

Reputation: 15 590



From the Stanford CS231N Notes:

Initializing the biases. It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights. For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.
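The "ensures that all ReLU units fire" argument can be checked directly: a ReLU unit is active exactly when its pre-activation exceeds zero, so adding a small positive bias can only increase the fraction of active units at initialization. A quick numpy sketch (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 256))                          # a batch of inputs
W = rng.normal(0.0, np.sqrt(2.0 / 256), size=(256, 128))

def relu(z):
    return np.maximum(z, 0.0)

# Zero bias: roughly half the pre-activations are negative, so many units are silent
frac_active_zero = np.mean(relu(x @ W + 0.0) > 0)

# Small positive bias (0.01): every unit that fired before still fires,
# plus those whose pre-activation was between -0.01 and 0
frac_active_small = np.mean(relu(x @ W + 0.01) > 0)
```

Whether that extra early gradient flow helps final accuracy is the part the notes say is unclear.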

In LSTMs it's common to initialize the forget-gate bias to 1, so that the cell retains its state early in training; see, for example, Jozefowicz et al. (2015).
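A sketch of what that means in practice, assuming the common layout where the four gate biases are packed into one vector of length `4 * hidden` in (input, forget, cell, output) order (implementations differ on this ordering):

```python
import numpy as np

hidden = 64  # assumed hidden size

# One bias vector with a block per gate: input, forget, cell, output (assumed order)
b = np.zeros(4 * hidden)
b[hidden:2 * hidden] = 1.0  # forget-gate block set to 1

# sigmoid(1) is about 0.73 > 0.5, so the cell state is mostly
# retained at the start of training instead of being forgotten
forget_at_init = 1.0 / (1.0 + np.exp(-b[hidden:2 * hidden]))
```

Some frameworks expose this as a flag rather than a manual initializer (e.g. Keras's `LSTM(..., unit_forget_bias=True)`, which is on by default).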

Lukas Biewald

Posted 2017-03-30T04:40:43.763

Reputation: 426