What is the benefit of the truncated normal distribution in initializing weights in a neural network?

When initializing connection weights in a feedforward neural network, it is important to initialize them randomly to avoid any symmetries that the learning algorithm would not be able to break.

The recommendation I have seen in various places (e.g. in TensorFlow’s MNIST tutorial) is to use a truncated normal distribution with a standard deviation of 1/√N, where N is the number of inputs to the given neuron layer.

I believe that the standard deviation formula ensures that backpropagated gradients don’t vanish or explode too quickly. But I don’t know why we are using a truncated normal distribution as opposed to a regular normal distribution. Is it to avoid rare outlier weights?
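The scaling argument behind the 1/√N rule can be checked numerically: with unit-variance inputs and weights drawn with standard deviation 1/√N, a neuron’s pre-activation keeps roughly unit variance. A minimal sketch (the specific sizes and seed are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                       # fan-in N of the layer
x = rng.normal(0.0, 1.0, size=(10_000, n))     # unit-variance inputs
w = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)  # stddev 1/sqrt(N) weights
z = x @ w                                      # pre-activations of one neuron

# Var(z) = sum_i w_i^2, whose expectation is N * (1/N) = 1,
# so the signal scale is preserved from layer to layer.
print(round(float(z.var()), 2))
```

If the weights were instead drawn with a fixed standard deviation of 1, the variance of z would grow like N, which is the amplification the formula guards against.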


I think it’s about saturation of the neurons. Suppose you have an activation function like the sigmoid.


If a neuron’s input to the sigmoid lands beyond roughly 2 or below roughly −2, the sigmoid is nearly flat there, so its gradient is close to zero and the neuron will barely learn. If you truncate your normal distribution, you will not have this issue (at least at initialization), given your variance. I think that’s why it’s better to use a truncated normal in general.
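TensorFlow’s truncated normal initializer implements exactly this idea: samples falling more than two standard deviations from the mean are dropped and redrawn. A minimal NumPy sketch of that rule (the function name and parameters here are my own, not TensorFlow’s API):

```python
import numpy as np

def truncated_normal(shape, stddev, max_devs=2.0, rng=None):
    """Sample from N(0, stddev^2), redrawing any value whose magnitude
    exceeds max_devs standard deviations (TensorFlow's truncation rule)."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(0.0, stddev, size=shape)
    # Redraw outliers until every weight lies inside the truncation bounds.
    out = np.abs(w) > max_devs * stddev
    while out.any():
        w[out] = rng.normal(0.0, stddev, size=int(out.sum()))
        out = np.abs(w) > max_devs * stddev
    return w

# Example: a layer with N = 784 inputs, stddev = 1/sqrt(N).
n_inputs = 784
w = truncated_normal((n_inputs, 128), stddev=1.0 / np.sqrt(n_inputs))
```

Every initial weight is then guaranteed to lie within 2/√N of zero, so no neuron starts out saturated purely because of an unlucky draw.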

Source: Link, Question Author: MiniQuark, Answer Author: Güngör Basa