In some tutorials I found it stated that “Xavier” weight initialization (paper: Understanding the difficulty of training deep feedforward neural networks) is an efficient way to initialize the weights of neural networks.
For fully-connected layers there was a rule of thumb in those tutorials:

Var(W) = 2 / (n_in + n_out)

where Var(W) is the variance of the weights for a layer, initialized with a normal distribution, and n_in and n_out are the numbers of neurons in the previous and in the current layer.
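As a sanity check, here is a minimal NumPy sketch of that rule (the 256 → 128 layer size is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    """Draw an (n_in, n_out) weight matrix with Var(W) = 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_normal(256, 128)  # hypothetical 256 -> 128 fully-connected layer
print(W.std())               # ~ sqrt(2 / (256 + 128)) ≈ 0.072
```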
Are there similar rules of thumb for convolutional layers?
I am struggling to figure out what would be best to initialize the weights of a convolutional layer. E.g., in a layer where the shape of the weights is (5, 5, 3, 8), the kernel size is 5x5, filtering three input channels (RGB input) and creating 8 feature maps. Would 3 be considered the number of input neurons? Or rather 75 = 5*5*3, because the input is a 5x5 patch in each color channel?
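Just to make the two readings concrete (assuming the (height, width, in_channels, out_channels) shape convention from above):

```python
kernel_shape = (5, 5, 3, 8)     # (height, width, in_channels, out_channels)
kh, kw, c_in, c_out = kernel_shape

fan_in_channels = c_in          # reading 1: 3 input neurons per position
fan_in_patch = kh * kw * c_in   # reading 2: the whole 5x5x3 patch -> 75
print(fan_in_channels, fan_in_patch)   # 3 75
```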
I would accept either a specific answer clarifying the problem or a more “generic” answer explaining the general process of finding the right weight initialization, preferably with links to sources.
In this case the number of input neurons should be 5*5*3 = 75. I found Xavier initialization especially useful for convolutional layers. Often a uniform distribution over the interval [−c/√(n_in + n_out), c/√(n_in + n_out)] works as well; with c = √6 this gives the same variance as above.
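A minimal NumPy sketch of this, assuming the fan-out is computed from the receptive field in the same way (fan_out = 5*5*8, which is also how Keras computes it):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_conv(shape):
    """Xavier/Glorot init for a conv kernel shaped (kh, kw, c_in, c_out)."""
    kh, kw, c_in, c_out = shape
    fan_in = kh * kw * c_in       # 5*5*3 = 75 for the layer in question
    fan_out = kh * kw * c_out     # 5*5*8 = 200
    std = np.sqrt(2.0 / (fan_in + fan_out))    # normal variant
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # uniform variant, same variance
    return rng.normal(0.0, std, shape), rng.uniform(-limit, limit, shape)

W_normal, W_uniform = glorot_conv((5, 5, 3, 8))
print(W_normal.std(), W_uniform.std())  # both ≈ sqrt(2 / 275) ≈ 0.085
```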
It is implemented as an option in almost all neural network libraries. Here you can find the source code of Keras’s implementation of Xavier Glorot’s initialization.
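For example, with the TensorFlow/Keras API (the 32x32 input size is just a placeholder):

```python
import tensorflow as tf

# glorot_uniform is Keras's default kernel_initializer, but it can also be
# requested explicitly; glorot_normal is the normal-distribution variant.
layer = tf.keras.layers.Conv2D(8, (5, 5), kernel_initializer="glorot_uniform")
layer.build((None, 32, 32, 3))  # 3 input channels (RGB), as in the question
print(layer.kernel.shape)       # (5, 5, 3, 8)
```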