# What is the benefit of the truncated normal distribution in initializing weights in a neural network?

When initializing connection weights in a feedforward neural network, it is important to initialize them randomly to avoid any symmetries that the learning algorithm would not be able to break.

The recommendation I have seen in various places (eg. in TensorFlow’s MNIST tutorial) is to use the truncated normal distribution using a standard deviation of $\dfrac{1}{\sqrt{N}}$, where $N$ is the number of inputs to the given neuron layer.

I believe that the standard deviation formula ensures that backpropagated gradients don’t dissolve or amplify too quickly. But I don’t know why we are using a truncated normal distribution as opposed to a regular normal distribution. Is it to avoid rare outlier weights?