For deep neural networks using ReLU neurons, the recommended connection weight initialization strategy is to pick a random uniform number between −r and +r with:

r = √2 · √(6 / (fan-in + fan-out))

where fan-in and fan-out are the number of connections going into and out of the layer being initialized. This is called “He initialization” (paper).
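For concreteness, here is what that scheme could look like in NumPy (the function name and the fan-in/fan-out formula as written are my reading of it, not code from the paper):

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix uniformly from [-r, r]
    with r = sqrt(2) * sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.sqrt(2.0) * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = he_uniform(300, 100, rng=np.random.default_rng(42))
```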
My question is: what’s the recommended weights initialization strategy when using ELU neurons (paper)?
Since ELUs look a lot like ReLUs, I’m tempted to use the same logic, but I’m not sure it’s the optimal strategy.
There is a fairly similar question but this one is more specifically about the ELU activation function (which is not covered by the answers to the other question).
I think the initialization should be roughly √(1.55/n_in).
The He et al. 2015 formula was made for ReLU units. The key idea is that the variance of f(y), with y = W·x + b, should be roughly equal to the variance of y. Let’s first go over the case of the ReLU activation, and then see if we can amend it for ELU units.
In the paper they show that:

Var[y_l] = n_l · Var[w_l] · E[x_l²]
They express the last expectation E[x_l²] in terms of Var[y_{l−1}]. For ReLUs we have E[x_l²] = ½ · Var[y_{l−1}], simply because a ReLU sets half of the values of x to 0 on average. Thus we can write:

Var[y_l] = ½ · n_l · Var[w_l] · Var[y_{l−1}]
We apply this to all layers, taking the product over l all the way down to the first layer. This gives:

Var[y_L] = Var[y_1] · ∏_{l=2}^{L} ( ½ · n_l · Var[w_l] )
Now this is stable only when ½ · n_l · Var[w_l] is close to 1. So they set it to 1 and find Var[w_l] = 2/n_l.
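You can check this numerically: a quick sketch (my own, with arbitrary width and depth) pushing unit-variance data through a deep ReLU stack initialized with Var[w] = 2/n shows the pre-activation variance staying put instead of exploding or vanishing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x = rng.normal(size=(1000, n))  # unit-variance inputs

for _ in range(depth):
    # He initialization: Var[w_l] = 2/n_l
    W = rng.normal(scale=np.sqrt(2.0 / n), size=(n, n))
    y = x @ W                    # pre-activations
    x = np.maximum(y, 0.0)       # ReLU halves E[x^2]

print(y.var())  # stays close to 1 even after 20 layers
```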
Now for ELU units, the only thing we have to change is the expression of E[x_l²] in terms of Var[y_{l−1}]. Sadly, this is not as straightforward for ELU units as for ReLU units, since it involves computing E[(e^N − 1)²] over only the negative values of N. That is not a pretty formula, and I don’t even know if there is a good closed-form solution, so let’s sample to get an approximation. We want Var[y_l] to be roughly 1 (most inputs have unit variance, batch norm makes layer outputs unit variance, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with alpha = 1, square, and compute the mean. This gives ≈ 0.645. The inverse of this is ≈ 1.55.
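That sampling step is a one-liner Monte-Carlo estimate; this sketch (mine) reproduces it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000_000)           # N ~ N(0, 1)
e = np.where(x > 0, x, np.exp(x) - 1.0)   # ELU with alpha = 1
second_moment = np.mean(e ** 2)           # E[elu(N)^2]
print(second_moment)        # ≈ 0.645
print(1.0 / second_moment)  # ≈ 1.55
```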
Thus, following the same logic, we can set Var[w_l] = 1.55/n_l to get a variance that doesn’t grow in magnitude.
I reckon that would be the optimal value for the ELU function. It sits between the value for the ReLU function (1/2, which is lower than 0.645 because the values that a ReLU maps to 0 now get mapped to some negative value) and what you would have for a function that passes its mean-0 input through unchanged (which is just 1).
Take care that if Var[y_{l−1}] is different, the optimal constant is also different. When this variance tends to 0, the ELU behaves more and more like the identity function, so the constant tends to 1. When the variance becomes really big, the value tends towards the original ReLU value, 0.5.
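This dependence on the input variance is easy to see by sampling at a few scales (my own sketch; the ratio below is E[elu(x)²]/Var[x], i.e. the constant whose inverse you would use in the initialization):

```python
import numpy as np

def elu_ratio(sigma, n=5_000_000, seed=0):
    """Monte-Carlo estimate of E[elu(x)^2] / Var[x] for x ~ N(0, sigma^2)."""
    x = np.random.default_rng(seed).normal(scale=sigma, size=n)
    e = np.where(x > 0, x, np.exp(x) - 1.0)
    return np.mean(e ** 2) / sigma ** 2

for s in (0.1, 1.0, 10.0):
    print(s, elu_ratio(s))  # drifts from near 1 down towards 0.5
```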
Edit: I did the theoretical analysis of the second moment of ELU(x) when x is normally distributed (the quantity we sampled above). It involves some derivations around the log-normal distribution and not-so-pretty integrals. The eventual answer, for x ∼ N(0, σ²), is 0.5σ² (the part from the linear half) plus

e^{2σ²} · Φ(−2σ) − 2 · e^{σ²/2} · Φ(−σ) + 1/2

where Φ is the standard normal CDF.
This is not very solvable for σ, unfortunately. You can plug in a value for σ, however, and get the estimate I gave above, which is pretty cool.
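As a sanity check (my own, using SciPy for Φ), evaluating that expression at σ = 1 does recover the sampled estimate:

```python
import numpy as np
from scipy.stats import norm

def elu_second_moment(sigma):
    """E[elu(x)^2] for x ~ N(0, sigma^2), alpha = 1:
    0.5*sigma^2 from the linear half, plus the exponential half."""
    return (0.5 * sigma ** 2
            + np.exp(2.0 * sigma ** 2) * norm.cdf(-2.0 * sigma)
            - 2.0 * np.exp(sigma ** 2 / 2.0) * norm.cdf(-sigma)
            + 0.5)

print(elu_second_moment(1.0))  # ≈ 0.645
```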