I am new to this field, so please be gentle with the terminology. In the original paper, "Understanding the difficulty of training deep feedforward neural networks", I don't understand how equation 15 is obtained. It states that, given equation 1:

$$ W_{ij} \sim U\left[-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}\right] $$

it gives rise to variance with the following property:

$$ n \cdot Var[W] = \frac{1}{3} $$

where $n$ is the size of the layer.

How is this last equation (15) obtained?

Thanks!!
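For anyone landing here, the step can be sanity-checked numerically. For $W \sim U[-a, a]$ the variance is $(2a)^2/12 = a^2/3$, and with $a = 1/\sqrt{n}$ that gives $Var[W] = 1/(3n)$, i.e. $n \cdot Var[W] = 1/3$. A quick check in Python (the layer size `n` below is an arbitrary example value):

```python
import random

n = 512                      # example layer size (any positive integer works)
a = 1.0 / n ** 0.5           # bound of U[-1/sqrt(n), 1/sqrt(n)]

# Closed form: Var[U[-a, a]] = (2a)^2 / 12 = a^2 / 3 = 1/(3n)
analytic = n * (a ** 2 / 3)  # equals 1/3 up to floating-point rounding

# Monte Carlo estimate of n * Var[W]
samples = [random.uniform(-a, a) for _ in range(200_000)]
mean = sum(samples) / len(samples)
empirical = n * sum((w - mean) ** 2 for w in samples) / len(samples)

print(analytic)   # ~0.3333
print(empirical)  # close to 1/3
```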

Thank you very much :D. If I may abuse your knowledge, can you help me also with equation 16? – Laughing Man – 2018-06-13T15:09:52.317

You're welcome! You can think of it as a way to take the average size of two adjacent layers into account, instead of just a single layer. They recognise that there is an interaction between neighbouring layers, so the initialization distribution is parameterised using the size of the current layer and the size of the next layer, which is the ($n_j + n_{j+1}$) bit, while the $\sqrt{6}$ on top is a constant factor that keeps the scale of the values in the range they wanted. – n1k31t4 – 2018-06-13T15:16:35.670

Ok, so there is no relation between equations 15 and 16? What about the $1/3$ that was obtained? – Laughing Man – 2018-06-13T15:20:05.727

Equation 16 is more like an improved version of equation 1. If you work through the steps in my answer above, using equation 16 (instead of equation 1), you will get $Var[W] = \frac{2}{n_j + n_{j+1}}$, i.e. $\frac{(n_{j} + n_{j+1})}{2} \cdot Var[W] = 1$. The $\sqrt{6}$ in equation 16 is exactly the constant that makes this hold. – n1k31t4 – 2018-06-13T15:23:29.600
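To make that concrete, here is the same variance computation applied to equation 16's bound $a = \sqrt{6}/\sqrt{n_j + n_{j+1}}$ (the two layer sizes below are arbitrary example values). It shows that the resulting variance is $2/(n_j + n_{j+1})$, which is the compromise between the forward- and backward-pass conditions that the paper aims for:

```python
n_j, n_j1 = 300, 500           # example sizes of two adjacent layers
a = (6 / (n_j + n_j1)) ** 0.5  # bound from equation 16: sqrt(6)/sqrt(n_j + n_{j+1})

# Var[U[-a, a]] = a^2 / 3 = 2 / (n_j + n_{j+1})
var = a ** 2 / 3
print((n_j + n_j1) / 2 * var)  # ~1.0, i.e. Var[W] = 2 / (n_j + n_{j+1})
```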