Xavier initialization seems to be used quite widely now to initialize connection weights in neural networks, especially deep ones (see What are good initial weights in a neural network?).
The original paper by Xavier Glorot and Yoshua Bengio suggests initializing weights using a uniform distribution between −r and +r with r = √(6 / (n_in + n_out)) (where n_in and n_out are the number of connections going in and out of the layer we are initializing), which gives the weights a variance of σ² = 2 / (n_in + n_out). This helps keep the variance of a layer's outputs roughly equal to the variance of its inputs, which mitigates the vanishing/exploding gradients problem.
Some libraries (such as Lasagne) seem to offer the option to use the Normal distribution instead, with 0 mean and the same variance.
Is there any reason to prefer the Uniform distribution over the Normal distribution (or the reverse)? Some examples in TensorFlow’s tutorials also use a truncated Normal distribution.
My guess is that the uniform distribution guarantees that no weight will be large (and so does the truncated Normal distribution). Or perhaps it barely makes any difference at all?
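To make the comparison concrete, here is a small NumPy sketch that draws weights all three ways with the same target variance σ² = 2 / (n_in + n_out). The fan-in/fan-out values are hypothetical, and the truncation rule (resample anything beyond 2σ) is just one common convention:

```python
import numpy as np

n_in, n_out = 300, 100  # hypothetical layer fan-in / fan-out
target_var = 2.0 / (n_in + n_out)

rng = np.random.default_rng(0)

# Uniform Xavier: U(-r, r) with r = sqrt(6 / (n_in + n_out)).
# The variance of U(-r, r) is r^2 / 3 = 2 / (n_in + n_out).
r = np.sqrt(6.0 / (n_in + n_out))
w_uniform = rng.uniform(-r, r, size=(n_in, n_out))

# Normal Xavier: N(0, sigma^2) with sigma^2 = 2 / (n_in + n_out).
sigma = np.sqrt(target_var)
w_normal = rng.normal(0.0, sigma, size=(n_in, n_out))

# Truncated normal: resample any draw beyond 2 sigma.
# Truncation shrinks the effective variance somewhat below sigma^2.
w_trunc = rng.normal(0.0, sigma, size=(n_in, n_out))
mask = np.abs(w_trunc) > 2 * sigma
while mask.any():
    w_trunc[mask] = rng.normal(0.0, sigma, size=mask.sum())
    mask = np.abs(w_trunc) > 2 * sigma

print(w_uniform.var(), w_normal.var(), w_trunc.var(), target_var)
```

The uniform and plain normal samples have (empirically) the same variance, which is why the choice may matter little in practice; the uniform and truncated variants additionally bound every weight, by r and 2σ respectively.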