How is the property in eq 15 obtained for Xavier initialization



I am new to this field, so please be gentle with terminology. In the original paper, "Understanding the difficulty of training deep feedforward neural networks", I don't understand how equation 15 is obtained. It states that, given eq 1:

$$ W_{ij} \sim U\left[-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}\right] $$

it gives rise to variance with the following property:

$$ n*Var[W]=1/3 $$

where $n$ is the size of the layer.

How is this last equation (15) obtained?


Laughing Man

Posted 2018-06-13T14:04:44.560

Reputation: 3



The constant $1/3$ is a result of their decision to use the uniform distribution together with bounds parameterised by the layer size, namely $\pm 1/\sqrt{n}$.

For a uniform distribution, denoted with lower and upper bounds $a$ and $b$:

$$ U(a, b) $$

the variance is:

$$ Var = \frac{1}{12} (b - a)^2 $$
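
For completeness, here is a short sketch of where the factor $\frac{1}{12}$ comes from. The symbol $X$ is my own notation for a variable drawn from $U(a, b)$, it is not taken from the paper. Using the density $\frac{1}{b - a}$ on $[a, b]$:

$$ E[X] = \int_a^b \frac{x}{b - a} \, dx = \frac{a + b}{2} $$

$$ E[X^2] = \int_a^b \frac{x^2}{b - a} \, dx = \frac{a^2 + ab + b^2}{3} $$

$$ Var = E[X^2] - E[X]^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{1}{12} (b - a)^2 $$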

So in the case of the authors, Glorot and Bengio, the two bounds are simply plus and minus the reciprocal of the square root of the number of neurons in the layer of interest (generally referring to the preceding layer, as they put it). This size is called $n$, and they set the bounds on the uniform distribution as:

$$ a = - \frac{1}{\sqrt{n}} $$ $$ b = \frac{1}{\sqrt{n}} $$

So if we plug these values into the variance formula above, we get:

$$ Var = \frac{1}{12}\left(\frac{1}{\sqrt{n}} - \left(-\frac{1}{\sqrt{n}}\right)\right)^2 $$

$$ Var = \frac{1}{12}\left(\frac{2}{\sqrt{n}}\right)^2 $$

$$ Var = \frac{1}{12} * \frac{4}{n} = \frac{1}{3n} $$

And so finally:

$$ n * Var = \frac{1}{3} $$
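
If you want to convince yourself numerically, here is a minimal sketch using only the Python standard library. The function name `scaled_variance`, the sample count and the seed are my own choices for illustration, nothing here comes from the paper. It draws weights from $U[-1/\sqrt{n}, 1/\sqrt{n}]$ and estimates $n * Var$:

```python
import random


def scaled_variance(n, samples=200_000, seed=0):
    """Estimate n * Var[W] for W ~ U[-1/sqrt(n), 1/sqrt(n)] by sampling."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    bound = 1.0 / n ** 0.5
    draws = [rng.uniform(-bound, bound) for _ in range(samples)]
    mean = sum(draws) / samples
    # Population variance of the sampled weights
    var = sum((w - mean) ** 2 for w in draws) / samples
    return n * var


def exact_scaled_variance(n):
    """Closed form: n * (b - a)^2 / 12 with a = -1/sqrt(n), b = 1/sqrt(n)."""
    bound = 1.0 / n ** 0.5
    return n * (2 * bound) ** 2 / 12


for n in (10, 100, 1000):
    print(n, scaled_variance(n), exact_scaled_variance(n))
```

Whatever layer size you pick, the sampled value should land close to $1/3$, and the closed form gives $1/3$ up to floating point rounding.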


Posted 2018-06-13T14:04:44.560

Reputation: 12 573

Thank you very much :D, if I may take advantage of your knowledge, can you also help me with eq 16? – Laughing Man – 2018-06-13T15:09:52.317

You're welcome! You can think of it as a way to take the average size of two adjacent layers into account, instead of just a single layer. They recognise that there is an effect between neighbouring layers, so the initialization distribution is parameterised using the size of the current layer and the size of the next layer, which is the ($n_j + n_{j+1}$) bit, while the $\sqrt{6}$ on top is a constant chosen so that the variance of the weights comes out as $2/(n_j + n_{j+1})$. – n1k31t4 – 2018-06-13T15:16:35.670

OK, so there is no relation between equations 15 and 16, as far as the 1/3 that was obtained is concerned? – Laughing Man – 2018-06-13T15:20:05.727

Equation 16 is more like an improved version of equation 1. If you work through the steps in my answer above using the bounds from equation 16 (instead of those from equation 1), you get $Var = \frac{1}{12} * \frac{24}{n_j + n_{j+1}} = \frac{2}{n_j + n_{j+1}}$, that is: $\frac{(n_{j} + n_{j+1})}{2} * Var = 1$. So the $\sqrt{6}$ in equation 16 makes the scaled variance equal to $1$, rather than the $\frac{1}{3}$ of equation 15. – n1k31t4 – 2018-06-13T15:23:29.600
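
The same kind of numerical check can be run for the bounds in equation 16, $\pm\sqrt{6}/\sqrt{n_j + n_{j+1}}$. This is only a sketch: the function name `normalized_init_variance` and the layer sizes 300 and 500 are made up for illustration.

```python
import random


def normalized_init_variance(n_in, n_out, samples=200_000, seed=0):
    """Estimate Var[W] for W ~ U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    bound = (6.0 / (n_in + n_out)) ** 0.5
    draws = [rng.uniform(-bound, bound) for _ in range(samples)]
    mean = sum(draws) / samples
    return sum((w - mean) ** 2 for w in draws) / samples


n_in, n_out = 300, 500  # hypothetical sizes of two adjacent layers
var = normalized_init_variance(n_in, n_out)
print("sampled Var:            ", var)
print("2 / (n_in + n_out):     ", 2 / (n_in + n_out))
print("(n_in + n_out)/2 * Var: ", (n_in + n_out) / 2 * var)
```

The last printed value should be close to $1$, since plugging the equation 16 bounds into $\frac{1}{12}(b - a)^2$ gives $\frac{2}{n_j + n_{j+1}}$.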