# Why is sqrt(6) used to calculate epsilon for random initialisation of neural networks?

In the week 5 lecture notes for Andrew Ng’s Coursera Machine Learning class, the following formula is given for calculating the value of $\epsilon$ used to initialise $\Theta$ with random values:

$$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$$

In the exercise, further clarification is given:

> One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice of $\epsilon_{init}$ is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$, where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to $\Theta^{(l)}$.
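
For concreteness, here is a minimal NumPy sketch of this scheme (the function name `rand_initialize_weights`, the layer sizes, and the extra bias column are my own illustration, not taken from the course materials):

```python
import numpy as np

def rand_initialize_weights(l_in, l_out):
    """Draw a weight matrix Theta uniformly from [-eps, eps],
    with eps = sqrt(6) / sqrt(l_in + l_out)."""
    eps = np.sqrt(6) / np.sqrt(l_in + l_out)
    # l_in + 1 columns to account for the bias unit
    return np.random.uniform(-eps, eps, size=(l_out, l_in + 1))

Theta1 = rand_initialize_weights(400, 25)  # e.g. 400 inputs, 25 hidden units
```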

Why is the constant $\sqrt 6$ used here? Why not $\sqrt 5$, $\sqrt 7$ or $\sqrt {6.1}$?

The $\sqrt 6$ is not arbitrary: it comes from the "normalized initialization" of Glorot and Bengio (2010), *Understanding the difficulty of training deep feedforward neural networks*. To keep the scale of activations and gradients roughly constant across layers, they propose choosing the weight variance as

$$\mathrm{Var}(W) = \frac{2}{L_{in} + L_{out}},$$

and the variance of a uniform RV on $[-\epsilon,\epsilon]$ is $\epsilon^2/3$ (the mean is zero and the pdf is $1/(2\epsilon)$, so the variance is $\int_{-\epsilon}^{\epsilon}x^2 \frac{1}{2\epsilon}\,dx = \epsilon^2/3$). Equating the two and solving for $\epsilon$ gives

$$\frac{\epsilon^2}{3} = \frac{2}{L_{in} + L_{out}} \quad\Longrightarrow\quad \epsilon = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}},$$

which is where the $\sqrt 6$ comes from. A different target variance would give a different constant; $\sqrt 6$ is simply what falls out of $\mathrm{Var}(W) = 2/(L_{in}+L_{out})$ combined with the uniform distribution.
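
A quick numerical sanity check of that variance fact (a throwaway sketch; the value of `eps` and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
samples = rng.uniform(-eps, eps, size=1_000_000)

print(samples.var())  # empirical variance, close to 0.0833
print(eps**2 / 3)     # theoretical variance eps^2 / 3 = 0.0833...
```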