In the week 5 lecture notes for Andrew Ng’s Coursera Machine Learning class, the following formula is given for calculating the value of \epsilon used to initialise \Theta with random values:

\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}

In the exercise, further clarification is given:

One effective strategy for choosing \epsilon_{init} is to base it on the number of units in the network. A good choice of \epsilon_{init} is \epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}, where L_{in} = s_l and L_{out} = s_{l+1} are the number of units in the layers adjacent to \Theta^{(l)}.

Why is the constant \sqrt{6} used here? Why not \sqrt{5}, \sqrt{7} or \sqrt{6.1}?

**Answer**

I believe this is Xavier *normalized initialization* (implemented in several deep learning frameworks, e.g. Keras and Caffe), from *Understanding the difficulty of training deep feedforward neural networks* by Xavier Glorot and Yoshua Bengio.

See equations 12, 15 and 16 in the linked paper: the authors aim to satisfy equation 12,

\text{Var}[W_i] = \frac{2}{n_i + n_{i+1}},

and the variance of a uniform random variable on [-\epsilon, \epsilon] is \epsilon^2/3 (the mean is zero and the pdf is 1/(2\epsilon), so \text{Var} = \int_{-\epsilon}^{\epsilon} x^2 \frac{1}{2\epsilon}\,dx = \epsilon^2/3). Setting \epsilon^2/3 = \frac{2}{n_i + n_{i+1}} and solving for \epsilon gives \epsilon = \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}, which is exactly the formula above. The \sqrt{6} falls out of this algebra; it is not a tunable constant, which is why \sqrt{5}, \sqrt{7} or \sqrt{6.1} would not hit the target variance.
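To make the derivation concrete, here is a minimal sketch (the layer sizes `n_in` and `n_out` are arbitrary illustrative values, not from the question) that draws weights uniformly from [-\epsilon, \epsilon] with \epsilon = \sqrt{6}/\sqrt{n_{in} + n_{out}} and checks that their empirical variance matches the target 2/(n_{in} + n_{out}):

```python
import math
import random

# Illustrative layer sizes (arbitrary, not from the original post)
n_in, n_out = 400, 25

# Xavier/Glorot uniform limit: eps = sqrt(6) / sqrt(n_in + n_out)
eps = math.sqrt(6) / math.sqrt(n_in + n_out)

# Draw many weights uniformly from [-eps, eps]
rng = random.Random(0)
weights = [rng.uniform(-eps, eps) for _ in range(100_000)]

# Empirical variance of the sample
mean = sum(weights) / len(weights)
var = sum((w - mean) ** 2 for w in weights) / len(weights)

# Target variance from equation 12 of Glorot & Bengio
target = 2 / (n_in + n_out)

print(f"eps = {eps:.4f}")
print(f"empirical var = {var:.6f}, target = {target:.6f}")
```

With any other constant under the square root (say \sqrt{5} or \sqrt{7}), the empirical variance would come out below or above 2/(n_{in} + n_{out}) by the ratio of that constant to 6.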

**Attribution**

*Source: Link, Question Author: Tom Hale, Answer Author: seanv507*