Ok, I know that this question has been asked before on ai.SE, on stats.SE, and also on SO. So I did my homework and checked before posting, but none of those threads has an answer that fully satisfies me.
Summary of my knowledge up to now
Suppose you have a fully connected layer, in which each neuron performs an operation like

a = g(w^T * x + b)

where a is the output of the neuron, x the input, g our generic activation function, and finally w and b our parameters. If both w and b are initialized with all elements equal to each other, then a is equal for each unit of that layer.
This means that we have symmetry: at each iteration of whichever algorithm we choose to update our parameters, they will all update in the same way, so there is no need for multiple units, since they all behave as a single one.
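A minimal NumPy sketch (my own illustration, not from any of the linked threads) of why identical initialization locks the units together: with every hidden weight and bias equal, every hidden unit receives exactly the same gradient.

```python
# Hypothetical two-layer network in plain NumPy, initialized with all
# parameters equal, to show that the hidden units get identical gradients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features
y = rng.normal(size=(4, 1))          # regression targets

W1 = np.full((3, 5), 0.5)            # all hidden weights equal
b1 = np.zeros((1, 5))                # all hidden biases equal
W2 = np.full((5, 1), 0.5)
b2 = np.zeros((1, 1))

h = np.tanh(x @ W1 + b1)             # hidden activations: identical columns
out = h @ W2 + b2
d_out = 2 * (out - y) / len(x)       # dL/d_out for mean squared error
d_h = (d_out @ W2.T) * (1 - h**2)    # backprop through tanh
dW1 = x.T @ d_h                      # gradient w.r.t. hidden weights

# Every hidden unit gets the same gradient column -> symmetry never breaks.
print(np.allclose(dW1, dW1[:, :1]))  # True
```

Since the gradient columns are identical, the units stay identical after every update, no matter how many iterations you run.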
In order to break the symmetry, we could randomly initialize the matrix w and initialize b to zero (this is the setup that I've seen most often). This way a is different for each unit, so that all neurons behave differently.
Of course, randomly initializing both w and b would also be okay, even if not necessary.
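For reference, the "random w, zero b" setup can be sketched like this (the Glorot-style scaling is just one common choice, picked here for illustration):

```python
# Hypothetical sketch of the standard setup: random weights, zero biases.
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 3, 5
limit = np.sqrt(6.0 / (fan_in + fan_out))        # Glorot uniform bound
W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
b = np.zeros((1, fan_out))

x = rng.normal(size=(4, fan_in))
a = np.tanh(x @ W + b)

# Each column (unit) now produces a different activation pattern.
print(np.allclose(a, a[:, :1]))  # False
```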
The actual question
Is randomly initializing w the only choice? Could we randomly initialize b instead of w in order to break the symmetry?
Is the answer dependent on the choice of the activation function and/or the cost function?
My thinking is that we could break the symmetry by randomly initializing b, since in this way a would be different for each unit and, since in the backward propagation the derivatives of both w and b depend on a (at least this should be true for all the activation functions that I have seen so far), each unit would behave differently. Obviously, this is only a thought, and I'm not sure that it is absolutely true.
Note that this is a theoretical question, not a practical one. This means that I'm not really interested in answers that involve performance, but you're welcome to include them as a corollary to the real answer.