For deep neural networks using ReLU neurons, the recommended connection weight initialization strategy is to pick a random uniform number between -r and +r with:

$$r = \sqrt{\frac{12}{\text{fan-in} + \text{fan-out}}}$$

Where `fan-in` and `fan-out` are the number of connections going in and out of the layer being initialized. This is called “He initialization” (paper).

My question is: what’s the recommended weight initialization strategy when using ELU neurons (paper)?

Since ELUs look a lot like ReLUs, I’m tempted to use the same logic, but I’m not sure it’s the optimal strategy.

Note: There is a fairly similar question, but this one is more specifically about the ELU activation function (which is not covered by the answers to the other question).

**Answer**

I think the initialization should be roughly $\sqrt{\frac{1.55}{n_{in}}}$.

The He et al. 2015 formula was made for ReLU units. The key idea is that the variance of f(y) with y = W * x + b should be roughly equal to the variance of y. Let’s first go over the case of a ReLU activation, and see if we can amend it for ELU units.

In the paper they show that:

$$\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, E[x_l^2]$$

They express the last expectation $E[x_l^2]$ in terms of $\mathrm{Var}[y_{l-1}]$. For ReLUs we have that $E[x_l^2] = \frac{1}{2} \mathrm{Var}[y_{l-1}]$, simply because ReLUs set half the values in x to 0 on average. Thus we can write

$$\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, \frac{1}{2} \mathrm{Var}[y_{l-1}]$$

We apply this to all layers, taking the product over l, all the way to the first layer. This gives:

$$\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \prod_{l=2}^{L} \frac{1}{2} n_l \, \mathrm{Var}[w_l]$$

Now this is stable only when $\frac{1}{2} n_l \, \mathrm{Var}[w_l]$ is close to 1. So they set it to 1 and find $\mathrm{Var}[w_l] = \frac{2}{n_l}$.
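To make the stability argument concrete, here is a rough numerical sketch (not from the paper). It pushes random inputs through a stack of ReLU layers and records $\mathrm{Var}[y_l]$ at each layer. The function name `forward_variances`, the `gain` parameter, the layer width, the depth and the Gaussian inputs are all assumptions made for illustration.

```python
import numpy as np

def forward_variances(gain, n=1024, depth=10, batch=1000, seed=0):
    """Return Var[y_l] for each layer of a ReLU stack with Var[w_l] = gain / n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n))        # unit-variance inputs (assumption)
    variances = []
    for _ in range(depth):
        # zero-mean Gaussian weights with variance gain / n
        w = rng.standard_normal((n, n)) * np.sqrt(gain / n)
        y = x @ w                              # pre-activations, bias omitted
        variances.append(float(y.var()))
        x = np.maximum(y, 0.0)                 # ReLU
    return variances

he = forward_variances(gain=2.0)     # Var[w_l] = 2 / n_l
naive = forward_variances(gain=1.0)  # Var[w_l] = 1 / n_l, for contrast

print("gain 2:", he[0], he[-1])        # stays on the same order of magnitude
print("gain 1:", naive[0], naive[-1])  # shrinks by roughly 1/2 per layer
```

With `gain=2.0` each factor $\frac{1}{2} n_l \mathrm{Var}[w_l]$ is 1, so the variance should hover around its initial value. With `gain=1.0` each factor is 1/2 and the variance decays geometrically.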

Now for ELU units, the only thing we have to change is the expression of $E[x_l^2]$ in terms of $\mathrm{Var}[y_{l-1}]$. Sadly, this is not as straightforward for ELU units as for ReLU units, as it involves calculating $E[(e^{N})^2]$ for only the negative values of $N$. This is not a pretty formula, and I don’t even know if there’s a good closed-form solution, so let’s sample to get an approximation. We want $\mathrm{Var}[y_l]$ to be roughly equal to 1 (most inputs have variance 1, batch norm makes layers variance 1, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with alpha = 1, square, and calculate the mean. This gives ≈0.645. The inverse of this is ≈1.55.
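The sampling step can be sketched in a few lines of NumPy. This is a minimal version of the procedure described above; the helper names `elu` and `elu_second_moment`, the sample count and the seed are arbitrary choices, not part of the original answer.

```python
import numpy as np

def elu(y, alpha=1.0):
    # ELU: identity for y > 0, alpha * (exp(y) - 1) otherwise.
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return np.where(y > 0, y, alpha * np.expm1(np.minimum(y, 0.0)))

def elu_second_moment(sigma=1.0, n_samples=2_000_000, seed=0):
    """Monte Carlo estimate of E[ELU(y)^2] for y ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, sigma, size=n_samples)
    return float(np.mean(elu(y) ** 2))

m = elu_second_moment()
print(m, 1.0 / m)   # second moment and its inverse
```

With a couple of million samples the estimate should land close to the 0.645 quoted above, up to Monte Carlo noise in the third decimal.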

Thus, following the same logic, we can set $\mathrm{Var}[w_l]$ to $\frac{1.55}{n_l}$, i.e. use a standard deviation of $\sqrt{\frac{1.55}{n_l}}$, to get a variance that doesn’t increase in magnitude.
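As a sanity check on that constant, here is a small sketch that draws zero-mean Gaussian weights with variance $1.55 / n$ and tracks the pre-activation variance through a few ELU layers. `elu_normal_init` is a hypothetical helper written for this example (it is not an API of any library), and the width, depth and batch size are arbitrary.

```python
import numpy as np

def elu(y, alpha=1.0):
    # identity for y > 0, alpha * (exp(y) - 1) otherwise
    return np.where(y > 0, y, alpha * np.expm1(np.minimum(y, 0.0)))

def elu_normal_init(fan_in, fan_out, rng, gain=1.55):
    """Hypothetical helper: zero-mean Gaussian weights with Var[w] = gain / fan_in."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(gain / fan_in)

rng = np.random.default_rng(0)
n, batch, depth = 512, 2000, 5
y = rng.standard_normal((batch, n))     # start from unit-variance pre-activations
variances = []
for _ in range(depth):
    x = elu(y)                          # E[x^2] is about 0.645 when Var[y] is about 1
    y = x @ elu_normal_init(n, n, rng)  # bias omitted
    variances.append(float(y.var()))

print(variances)   # each entry should stay close to 1
```

This only checks the forward pass under the assumption that the pre-activations start at variance 1; it says nothing about gradients or about training behaviour.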

I reckon that would be the optimal value for the ELU function. The ELU constant 0.645 fits in between the value for the ReLU function (1/2, which is lower than 0.645 because the values that are mapped to 0 now get mapped to a negative value), and what you would have for any function with mean 0 (which is just 1).

Take care that if $\mathrm{Var}[y_{l-1}]$ is different, the optimal constant is also different. When this variance tends to 0, the function behaves more and more like the identity function, so the constant tends to 1. If the variance becomes really big, the value tends towards the original ReLU value, i.e. 0.5.
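The dependence on the input variance can be illustrated with the same sampling approach. The sketch below estimates the ratio $E[\mathrm{ELU}(y)^2] / \mathrm{Var}[y]$ for a few values of σ; the function name `elu_constant` and the chosen σ values are assumptions for the example.

```python
import numpy as np

def elu(y, alpha=1.0):
    # identity for y > 0, alpha * (exp(y) - 1) otherwise
    return np.where(y > 0, y, alpha * np.expm1(np.minimum(y, 0.0)))

def elu_constant(sigma, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of E[ELU(y)^2] / Var[y] for y ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, sigma, size=n_samples)
    return float(np.mean(elu(y) ** 2) / sigma**2)

for sigma in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(sigma, elu_constant(sigma))
```

The ratio should come out near 1 for tiny σ, around 0.645 at σ = 1, and approach 0.5 as σ grows, matching the limits described above.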

Edit: I did the theoretical analysis of the variance of ELU(x) when x is normally distributed. It involves some derivations with the log-normal distribution and some not-so-pretty integrals. The eventual answer for the variance is $0.5\sigma$ (the part of the linear function) +

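$$a - 2(b)^2 + (2b - 1)^2$$

where

$$a = \frac{1}{2} e^{\frac{\sigma^2}{2}} \left( \mathrm{erfc}\left(\frac{\sigma}{\sqrt{2}}\right) + \sqrt{\frac{1}{\sigma^2}}\,\sigma - 1 \right)$$

$$b = \frac{1}{2} e^{2\sigma^2} \left( \mathrm{erfc}\left(\sqrt{2}\,\sigma\right) + \sqrt{\frac{1}{\sigma^2}}\,\sigma - 1 \right)$$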

Unfortunately, this is not really solvable for σ. You can, however, plug in a value for σ and get the estimate I gave above, which is pretty cool.
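For readers who want to plug in numbers, here is an independent sketch that evaluates $E[\mathrm{ELU}(y)^2]$ in closed form for $y \sim N(0, \sigma^2)$ and alpha = 1. It is not a transcription of the expression in the edit. It assumes the standard truncated log-normal moments $E[e^{y}; y<0] = \frac{1}{2} e^{\sigma^2/2}\,\mathrm{erfc}(\sigma/\sqrt{2})$ and $E[e^{2y}; y<0] = \frac{1}{2} e^{2\sigma^2}\,\mathrm{erfc}(\sqrt{2}\,\sigma)$, expands $(e^{y}-1)^2$ on the negative half, and uses $\sigma^2/2$ for the linear half. The function name is made up for this example.

```python
import math

def elu_second_moment_closed_form(sigma):
    """E[ELU(y)^2] for y ~ N(0, sigma^2), alpha = 1 (assumed derivation, see text).

    Intended for moderate sigma: exp(2 * sigma**2) overflows for large values.
    """
    a = 0.5 * math.exp(sigma**2 / 2) * math.erfc(sigma / math.sqrt(2))  # E[e^y ; y < 0]
    b = 0.5 * math.exp(2 * sigma**2) * math.erfc(math.sqrt(2) * sigma)  # E[e^(2y) ; y < 0]
    linear_part = 0.5 * sigma**2     # E[y^2 ; y > 0]
    negative_part = b - 2 * a + 0.5  # E[(e^y - 1)^2 ; y < 0]
    return linear_part + negative_part

value = elu_second_moment_closed_form(1.0)
print(value, 1.0 / value)
```

At σ = 1 this should agree with the sampled estimate of about 0.645, and hence with the 1.55 constant.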

**Attribution**

*Source: Link, Question Author: MiniQuark, Answer Author: 0-_-0*