What’s the recommended weight initialization strategy when using the ELU activation function?

For deep neural networks using ReLU neurons, the recommended connection weight initialization strategy is to pick each weight uniformly at random between $-r$ and $+r$, with:

$$r = \sqrt{\frac{12}{\text{fan-in} + \text{fan-out}}}$$

where fan-in and fan-out are the number of connections going into and out of the layer being initialized. This is called “He initialization” (paper).
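
For concreteness, here is a minimal sketch of that uniform rule, assuming NumPy; the function name `he_uniform` and the shape convention are illustrative, not from the paper:

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng=None):
    """Uniform init from the question: draw W in [-r, r] with
    r = sqrt(12 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    r = np.sqrt(12.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = he_uniform(256, 128)
print(W.min(), W.max())  # all values lie inside [-r, r] ~ [-0.177, 0.177]
```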

My question is: what’s the recommended weight initialization strategy when using ELU neurons (paper)?

Since ELUs look a lot like ReLUs, I’m tempted to use the same logic, but I’m not sure it’s the optimal strategy.

Note

There is a fairly similar question but this one is more specifically about the ELU activation function (which is not covered by the answers to the other question).

Answer

I think the initialization should be roughly $\sqrt{\frac{1.55}{n_{\text{in}}}}$, i.e. $\mathrm{Var}[w_l] \approx \frac{1.55}{n_{\text{in}}}$.

The He et al. 2015 formula was made for ReLU units. The key idea is that the variance of $f(y)$, with $y = Wx + b$, should be roughly equal to the variance of $y$. Let’s first go over the case of the ReLU activation, and see if we can amend it for ELU units.

In the paper they show that:
$$\mathrm{Var}[y_l] = n_l \,\mathrm{Var}[w_l]\, E[x_l^2]$$
They express the last expectation $E[x_l^2]$ in terms of $\mathrm{Var}[y_{l-1}]$. For ReLUs we have $E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]$, simply because ReLUs set half the values of $x$ to 0 on average. Thus we can write

$$\mathrm{Var}[y_l] = \frac{1}{2}\, n_l \,\mathrm{Var}[w_l]\,\mathrm{Var}[y_{l-1}]$$
We apply this to all layers, taking the product over $l$, all the way to the first layer. This gives:
$$\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \prod_{l=2}^{L} \frac{1}{2}\, n_l \,\mathrm{Var}[w_l]$$
Now this is stable only when $\frac{1}{2} n_l \,\mathrm{Var}[w_l]$ is close to 1. So they set it to 1 and find $\mathrm{Var}[w_l] = \frac{2}{n_l}$.
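
As a quick empirical check of this step, a small sketch (assuming NumPy) can push unit-variance pre-activations through one ReLU layer initialized with $\mathrm{Var}[w_l] = 2/n_l$ and confirm the output variance stays near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 512, 10_000

y_prev = rng.standard_normal((batch, n))             # Var[y_{l-1}] = 1
x = np.maximum(y_prev, 0.0)                          # ReLU activation
W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # He init: Var[w_l] = 2 / n_l
y = x @ W

print(np.mean(x**2))   # ~0.5, i.e. E[x_l^2] = 0.5 * Var[y_{l-1}]
print(y.var())         # ~1.0: the variance is preserved through the layer
```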

Now for ELU units, the only thing we have to change is the expression of $E[x_l^2]$ in terms of $\mathrm{Var}[y_{l-1}]$. Sadly, this is not as straightforward for ELU units as for ReLU units, because it involves calculating $E[(e^N - 1)^2]$ over only the negative values of $N$. This is not a pretty expression, and I don’t even know whether it has a nice closed-form solution, so let’s sample to get an approximation. We want $\mathrm{Var}[y_l]$ to be roughly equal to 1 (most inputs have variance 1, batch norm makes layer outputs have variance 1, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with $\alpha = 1$, square, and calculate the mean. This gives about 0.645. The inverse of this is about 1.55.
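
A minimal sketch of that sampling estimate, assuming NumPy and $\alpha = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    # ELU: x for x > 0, alpha * (exp(x) - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

y = rng.standard_normal(10_000_000)     # unit-variance pre-activations
second_moment = np.mean(elu(y) ** 2)    # E[ELU(y)^2]

print(second_moment)          # ~0.645
print(1.0 / second_moment)    # ~1.55
```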

Thus, following the same logic, we can set $\mathrm{Var}[w_l] = \frac{1.55}{n_l}$ to get a variance that doesn’t increase in magnitude.
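
As a sketch of what that could look like in practice (assuming a Gaussian draw and NumPy; the function name is illustrative):

```python
import numpy as np

def elu_he_like_init(fan_in, fan_out, rng=None):
    """Gaussian init with Var[w] = 1.55 / fan_in, the ELU analogue
    of He init suggested in this answer."""
    rng = rng or np.random.default_rng()
    std = np.sqrt(1.55 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = elu_he_like_init(512, 512)
print(W.var() * 512)   # ~1.55
```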

I reckon that would be the optimal value for the ELU function. It sits between the value for the ReLU function ($\frac{1}{2}$, which is lower than 0.645 because the values that ReLU maps to 0 are now mapped to small negative values instead) and the value you would get for an activation that passes a zero-mean, unit-variance input through unchanged (which is just 1).

Take care that if $\mathrm{Var}[y_{l-1}]$ is different, the optimal constant is also different. When this variance tends to 0, the ELU behaves more and more like the identity function, so the constant tends to 1. When the variance becomes very large, the value tends towards the original ReLU value of 0.5.
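
A small sketch (again assuming NumPy) illustrating how the constant $E[\mathrm{ELU}(y)^2] / \mathrm{Var}[y]$ moves between 1 and 0.5 as the input standard deviation changes:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(5_000_000)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

for s in (0.1, 0.5, 1.0, 2.0, 10.0):
    constant = np.mean(elu(s * z) ** 2) / s**2   # E[ELU(y)^2] / Var[y]
    print(f"input std {s:5.1f}: constant ~ {constant:.3f}")
# approaches 1 as the std -> 0 (ELU ~ identity near 0)
# approaches 0.5 as the std grows (the ReLU-like regime)
```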

Edit: I did the theoretical analysis of the variance of ELU(x) when $x$ is normally distributed. It involves some derivations around the log-normal distribution and not-so-pretty integrals. The eventual answer for the second moment $E[\mathrm{ELU}(x)^2]$ (the quantity sampled above) with $x \sim \mathcal{N}(0, \sigma^2)$ is $\frac{1}{2}\sigma^2$ (the part from the linear half) plus
$$b - 2a + \tfrac{1}{2}$$
where
$$a = \tfrac{1}{2}\, e^{\frac{\sigma^2}{2}} \operatorname{erfc}\!\left(\tfrac{\sigma}{\sqrt{2}}\right), \qquad b = \tfrac{1}{2}\, e^{2\sigma^2} \operatorname{erfc}\!\left(\sqrt{2}\,\sigma\right)$$
This is not really solvable for $\sigma$ analytically, unfortunately. You can plug in $\sigma = 1$, however, and recover the 0.645 estimate I gave above, which is pretty cool.
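
As a sanity check of this closed form (a sketch assuming SciPy’s `erfc`, reading the expression as the second moment $E[\mathrm{ELU}(x)^2]$), the formula can be compared against Monte Carlo samples:

```python
import numpy as np
from scipy.special import erfc

def elu_second_moment(sigma):
    """Closed-form E[ELU(x)^2] for x ~ N(0, sigma^2), alpha = 1 (formula above)."""
    a = 0.5 * np.exp(sigma**2 / 2.0) * erfc(sigma / np.sqrt(2.0))
    b = 0.5 * np.exp(2.0 * sigma**2) * erfc(np.sqrt(2.0) * sigma)
    return 0.5 * sigma**2 + b - 2.0 * a + 0.5

rng = np.random.default_rng(0)
z = rng.standard_normal(5_000_000)
for sigma in (0.5, 1.0, 2.0):
    y = sigma * z
    mc = np.mean(np.where(y > 0, y, np.exp(y) - 1.0) ** 2)
    print(sigma, round(elu_second_moment(sigma), 4), round(mc, 4))
# at sigma = 1 both give ~0.645, matching the sampled estimate above
```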

Attribution
Source: Link, Question Author: MiniQuark, Answer Author: 0-_-0