Why is it okay to initialize the bias vector with zeros, but not the weight matrices?

2

We do not initialize weight matrices with zeros because symmetry is never broken during the backward pass, and consequently the parameters never diverge from one another during updates.

But it is considered safe to initialize the bias vector with zeros, and the biases get updated appropriately during training.

Why is it safe to do so, and not the opposite?

Why can’t we initialize bias vectors with random numbers and weight matrices with zeros?

My initial thought is that a vector has shape $(n, 1)$ where $n \in \mathbb{N}$, which is not true of a matrix, and thus symmetry does not really come into play in the case of vectors.

But that does not address the fact that each layer of a deep neural network has its own weight matrix, and there is no need for symmetry across different layers.

So, when we talk about symmetry are we talking about symmetry across different rows of the same matrix?

Column-wise symmetry should not matter much, as the columns correspond to different training examples (for the first hidden layer). Does column-wise symmetry disturb the training process much in the case of hidden layers other than the first one?

truth

Posted 2020-10-21T20:22:41.443

Reputation: 130

Answers

2

As per Efficient BackProp by LeCun et al. (§4.6), weights should be initialized in the linear region of the activation function. If they are too big, the activation function will saturate and provide only a small gradient step to change those weights. If they are too small, they won't really impact the gradient and will make learning too slow.

Yes, if you choose the same weights this will create an artificial symmetry that can be problematic. Here 'symmetry' is about neurons of the same layer having the same initial weights, and thus being redundant. I think it would be clearer to speak of redundancy rather than symmetry. This translates into redundant rows in the weight matrices. Of course, if all your weights are set to zero, all the rows of each weight matrix will be the same, and you'll have horizontal symmetries in all your weight matrices.
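You can see this redundancy directly in a toy example. Below is a minimal NumPy sketch (the network shapes, the constant 0.5, and the variable names are my own, not from the answer): a one-hidden-layer tanh network initialized with every weight equal to the same constant. Because all hidden units start identical, they receive identical gradients, so the rows of the first weight matrix stay identical no matter how many gradient steps you take:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))   # one training example with 3 features
y = np.array([[1.0]])         # scalar target

def step(W1, b1, W2, b2, lr=0.1):
    # Forward pass: 3 -> 4 (tanh) -> 1
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    y_hat = W2 @ a1 + b2
    # Backward pass for squared loss 0.5 * (y_hat - y)^2
    d = y_hat - y
    dW2 = d @ a1.T
    db2 = d
    dz1 = (W2.T @ d) * (1 - a1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = dz1 @ x.T
    db1 = dz1
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

# 'Symmetric' init: every weight is the same constant, biases are zero.
W1 = np.full((4, 3), 0.5); b1 = np.zeros((4, 1))
W2 = np.full((1, 4), 0.5); b2 = np.zeros((1, 1))
for _ in range(50):
    W1, b1, W2, b2 = step(W1, b1, W2, b2)

# Every row of W1 is still identical: the four hidden units
# received identical gradients at every step and never differentiated.
print(np.allclose(W1, W1[0]))  # True
```

Replacing the `np.full` initializations with small random values (and leaving the biases at zero) breaks this invariant after the first few updates.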

Naturally you want to avoid redundancies in your model, but this is not the main problem. Your main problem is solving an optimisation problem efficiently, i.e. having a gradient that is sufficiently large relative to your weights to allow fast convergence. That's why you set your weights to small, but not too small, values. The randomness helps to avoid redundancies.

Once you have set your weights to small random values, you have some minimal guarantee of being in the linear region of the activation function, so you don't really need random biases on top of that: zeros are fine.
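The 'linear region' point can be checked numerically. Here is a small sketch (the scales 0.1 and 10, and the layer sizes, are illustrative choices of mine): with small random weights the pre-activations land where tanh is roughly linear, so its derivative is close to 1; with large weights tanh saturates and the derivative collapses toward 0, starving the backward pass of gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 1))                  # a single 256-dim input

mean_slope = {}
for scale in (0.1, 10.0):
    W = rng.normal(scale=scale, size=(50, 256))
    z = W @ x / np.sqrt(256)                   # pre-activations, std roughly ~ scale
    mean_slope[scale] = float((1 - np.tanh(z) ** 2).mean())  # average tanh'(z)

# Small weights keep tanh'(z) near 1 (healthy gradients);
# large weights push it toward 0 (saturation).
print(mean_slope)
```

Adding a zero bias shifts nothing here, which is why zero-initialized biases don't hurt: they leave the pre-activations exactly where the small random weights put them.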

lcrmorin

Posted 2020-10-21T20:22:41.443

Reputation: 1 874