Why is sqrt(6) used to calculate epsilon for random initialisation of neural networks?

In the week 5 lecture notes for Andrew Ng’s Coursera Machine Learning Class, the following formula is given for calculating the value of \epsilon used to initialise \Theta with random values:

\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}

In the exercise, further clarification is given:

One effective strategy for choosing \epsilon_{init} is to base it on the number of units in the network. A good choice of \epsilon_{init} is \epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}, where L_{in} = s_l and L_{out} = s_{l+1} are the number of units in the layers adjacent to \Theta^{(l)}.
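
For concreteness, here is a minimal NumPy sketch of that strategy. The function name, the layer sizes, and the extra bias column are illustrative assumptions, not the course's actual code:

```python
import numpy as np

def random_init(L_in, L_out):
    """Sketch: initialise a weight matrix Theta of shape (L_out, L_in + 1),
    drawing each entry uniformly from [-epsilon_init, epsilon_init] with
    epsilon_init = sqrt(6) / sqrt(L_in + L_out)."""
    epsilon_init = np.sqrt(6) / np.sqrt(L_in + L_out)
    # The +1 column is for a bias unit (an assumed convention here)
    return np.random.uniform(-epsilon_init, epsilon_init, size=(L_out, L_in + 1))

# Illustrative layer sizes: 400 inputs feeding 25 hidden units
Theta1 = random_init(400, 25)
print(Theta1.shape)          # (25, 401)
print(np.abs(Theta1).max())  # bounded by epsilon_init = sqrt(6)/sqrt(425) ≈ 0.118
```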

Why is the constant \sqrt{6} used here? Why not \sqrt{5}, \sqrt{7}, or \sqrt{6.1}?

Answer

I believe this is Xavier normalized initialization (implemented in several deep learning frameworks, e.g. Keras, Caffe, …)
from Understanding the difficulty of training deep feedforward neural networks by Xavier Glorot & Yoshua Bengio.

See equations 12, 15 and 16 in the paper linked: they aim to satisfy equation 12:
\text{Var}[W_i] = \frac{2}{n_i + n_{i+1}}

and the variance of a uniform random variable on [-\epsilon, \epsilon] is \epsilon^2/3 (the mean is zero and the pdf is 1/(2\epsilon), so the variance is \int_{-\epsilon}^{\epsilon} x^2 \frac{1}{2\epsilon}\,dx = \epsilon^2/3). Setting \epsilon^2/3 = \frac{2}{n_i + n_{i+1}} gives \epsilon^2 = \frac{6}{n_i + n_{i+1}}, i.e. \epsilon = \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}, which is where the \sqrt{6} comes from.
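
A quick numerical sanity check of this, assuming NumPy and arbitrarily chosen layer sizes:

```python
import numpy as np

n_in, n_out = 400, 25            # illustrative layer sizes
target_var = 2 / (n_in + n_out)  # equation 12: Var[W_i] = 2 / (n_i + n_{i+1})

epsilon = np.sqrt(6) / np.sqrt(n_in + n_out)
analytic_var = epsilon**2 / 3    # variance of Uniform[-epsilon, epsilon]

# The analytic variance matches the target exactly; a large sample of
# uniform draws agrees up to Monte Carlo error.
W = np.random.uniform(-epsilon, epsilon, size=1_000_000)
print(target_var, analytic_var, W.var())
```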

Attribution
Source: Link, Question Author: Tom Hale, Answer Author: seanv507
