Why are the initial weights in a neural network randomized?


This might sound silly to someone who has plenty of experience with neural networks but it bothers me...

I mean, randomizing the initial weights might happen to give you results that are somewhat closer to what the trained network should look like, but it might just as well give the exact opposite, while 0.5, or some other average over the range of reasonable weight values, would sound like a sensible default...

Why are the initial weights of the neurons randomized, rather than set to 0.5 for all of them?

Matas Vaitkevicius

Posted 2017-10-21T06:52:15.723

Reputation: 219

What was the problem with my edit? Do you think your question cannot be improved? – nbro – 2019-04-23T18:49:04.830

@nbro Your edit adds multiple questions, which makes it too broad... – Matas Vaitkevicius – 2019-04-23T18:49:46.377

What questions did I add that are not present in your post? I just reformulated as questions what you stated as hypotheses. – nbro – 2019-04-23T18:50:07.530

By the way, your wording is not even correct. The weights are not being randomised, but they are being randomly initialised. These are two different concepts and you meant the second one. My edit was meant to improve the wording too. – nbro – 2019-04-23T19:18:45.853

@nbro Hi, look I wasn't unappreciative, and certainly didn't want to offend you. I am bad at asking questions too, wording and everything. So I am sorry if I have offended you. – Matas Vaitkevicius – 2019-04-24T05:06:10.550



You shouldn't set them all to 0.5, because you'd run into the symmetry-breaking issue. As the Deep Learning book (Goodfellow, Bengio, and Courville) puts it:


Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way. Even if the model or training algorithm is capable of using stochasticity to compute different updates for different units (for example, if one trains with dropout), it is usually best to initialize each unit to compute a different function from all of the other units. This may help to make sure that no input patterns are lost in the null space of forward propagation and no gradient patterns are lost in the null space of back-propagation.
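The symmetry problem is easy to see numerically. Below is a minimal sketch (my own hypothetical 2-2-1 network in NumPy, not from the quoted text): when both hidden units start from identical weights, they receive identical gradient updates at every step, so they remain exact copies of each other forever and the network can never use its second hidden unit.

```python
import numpy as np

# Hypothetical tiny 2-2-1 regression network to illustrate the
# symmetry problem: both hidden units start with the SAME weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))   # 4 samples, 2 features
y = rng.normal(size=(4, 1))   # regression targets

W1 = np.full((2, 2), 0.5)     # hidden layer: both columns identical
W2 = np.full((2, 1), 0.5)     # output layer: both rows identical

for _ in range(50):
    # forward pass: tanh hidden layer, linear output
    h = np.tanh(X @ W1)
    err = h @ W2 - y
    # backward pass: mean-squared-error gradients
    gW2 = h.T @ err / len(X)
    gW1 = X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(X)
    W1 -= 0.01 * gW1
    W2 -= 0.01 * gW2

# The two hidden units' weights stay identical after every update:
print(np.allclose(W1[:, 0], W1[:, 1]))  # True
print(np.allclose(W2[0], W2[1]))        # True
```

Replacing the `np.full` lines with small random draws (e.g. `rng.normal(scale=0.1, size=...)`) makes the two units diverge immediately, which is exactly the symmetry breaking the quote describes.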


Posted 2017-10-21T06:52:15.723

Reputation: 1 375


The initial weights in a neural network are initialized randomly because the gradient-based methods commonly used to train neural networks do not work well when all of the weights are initialized to the same value. While not all methods of training neural networks are gradient-based, most are, and it has been shown in several cases that initializing all weights to the same value makes the network take much longer to converge to an optimal solution. Also, if you retrain your neural network because it got stuck in a local minimum, it will get stuck in the same local minimum again. For the above reasons, we do not set the initial weights to a constant value.
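In practice, "random" usually means a variance-scaled scheme rather than arbitrary noise. A minimal sketch of one common choice, Glorot/Xavier uniform initialization (the function name and layer sizes here are my own, not from the answer):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init: draws from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), scaling the variance so
    activations neither explode nor vanish across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
W = glorot_uniform(256, 128, rng)   # weights for a 256 -> 128 layer
print(W.shape)                      # (256, 128)
# every entry lies within the Glorot bound:
print(np.abs(W).max() <= np.sqrt(6.0 / (256 + 128)))  # True
```

Frameworks such as PyTorch and Keras apply schemes like this (or the related He initialization for ReLU networks) by default.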

References: Why doesn't backpropagation work when you initialize the weights the same value?

Aiden Grossman

Posted 2017-10-21T06:52:15.723

Reputation: 824

In fact, they break down if all weights are the same. – Quonux – 2019-04-23T18:19:34.300


That is a very deep question. There was a series of papers recently with proofs of convergence of gradient descent for overparameterized deep networks (for example, Gradient Descent Finds Global Minima of Deep Neural Networks, A Convergence Theory for Deep Learning via Over-Parameterization, or Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks). All of them condition the proof on a random Gaussian distribution of the weights. The importance of randomness for the proofs depends on two factors:

  1. Random weights make the ReLU a statistically compressive mapping (up to a linear transformation)

  2. Random weights preserve separation of inputs for any input distribution - that is, if input samples are distinguishable, propagation through the network will not make them indistinguishable

Those properties are very difficult to reproduce with deterministic matrices, and even if they were reproducible, the null space (the domain of adversarial examples) would likely make the method impractical; more importantly, preserving those properties during gradient descent would likely make the method impractical as well. Overall, it is very difficult but not impossible, and may warrant some research in that direction. In an analogous situation, there have been some results on the Restricted Isometry Property for deterministic matrices in compressed sensing.
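The second property above can be illustrated empirically (this is my own sketch, an illustration rather than a proof): a Gaussian-weight ReLU layer tends to keep distinct inputs distinct, while a constant-weight layer collapses every input onto a single direction, losing everything except the sum of its coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=64)              # two distinguishable inputs
x2 = rng.normal(size=64)

# Gaussian-weight ReLU layer: 64 -> 256, variance-scaled
W_rand = rng.normal(size=(64, 256)) / np.sqrt(64)
h1 = np.maximum(x1 @ W_rand, 0)
h2 = np.maximum(x2 @ W_rand, 0)
print(np.linalg.norm(h1 - h2) > 0)    # True: inputs stay separated

# Constant-weight layer: every output coordinate equals
# relu(0.5 * sum(x)), so all structure beyond sum(x) is lost.
W_const = np.full((64, 256), 0.5)
g1 = np.maximum(x1 @ W_const, 0)
print(np.allclose(g1, g1[0]))         # True: output collapsed
```

This is only a single sample, of course; the papers above prove separation statements over the input distribution, not for one pair of points.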


Posted 2017-10-21T06:52:15.723

Reputation: 547