# What’s the recommended weight initialization strategy when using the ELU activation function?

For deep neural networks using ReLU neurons, the recommended connection weight initialization strategy is to draw each weight uniformly at random between $-r$ and $+r$ with:

$r = \sqrt{\dfrac{12}{\text{fan-in} + \text{fan-out}}}$

where fan-in and fan-out are the numbers of connections going into and out of the layer being initialized. This is called “He initialization” (paper).
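For concreteness, here is a minimal NumPy sketch of that uniform scheme (the helper name `he_uniform_init` is mine, not from the paper):

```python
import numpy as np

def he_uniform_init(fan_in, fan_out, rng=None):
    """Uniform He-style init: W ~ U(-r, r) with r = sqrt(12 / (fan_in + fan_out))."""
    rng = rng if rng is not None else np.random.default_rng()
    r = np.sqrt(12.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = he_uniform_init(256, 128)
print(W.var(), 2.0 / ((256 + 128) / 2))  # empirical variance vs. 2 / fan_avg, both ~0.0104
```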

My question is: what’s the recommended weight initialization strategy when using ELU neurons (paper)?

Since ELUs look a lot like ReLUs, I’m tempted to use the same logic, but I’m not sure it’s the optimal strategy.

**Note:** There is a fairly similar question, but this one is specifically about the ELU activation function (which is not covered by the answers to the other question).

I think the initialization should use a standard deviation of roughly $\sqrt{\frac{1.55}{n_{in}}}$.

The He et al. 2015 formula was made for ReLU units. The key idea is that the variance of the layer outputs $y = Wx + b$ should stay roughly constant from layer to layer. Let’s first go over the ReLU case and then see if we can amend it for ELU units.

In the paper they show that
$$Var[y_l] = n_l \, Var[w_l] \, \mathbb{E}[x_l^2].$$
They then express the expectation $\mathbb{E}[x_l^2]$ in terms of $Var[y_{l-1}]$. For ReLUs we have $\mathbb{E}[x_l^2] = \frac{1}{2} Var[y_{l-1}]$, simply because the ReLU zeroes out half of the (zero-mean, symmetric) values of $y_{l-1}$ on average. Thus we can write

$$Var[y_l] = n_l \, Var[w_l] \, \frac{1}{2} Var[y_{l-1}].$$
We apply this to all layers, taking the product over $l$ all the way down to the first layer. This gives:
$$Var[y_L] = Var[y_1] \prod_{l=2}^{L} \frac{1}{2} n_l \, Var[w_l].$$
Now this is stable only when $\frac{1}{2} n_l Var[w_l]$ is close to $1$. So they set it to $1$ and find $Var[w_l] = \frac{2}{n_l}$.
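To see the stability condition in action, here is a small simulation (my own sketch, not from the paper): with $Var[w_l] = \frac{2}{n_l}$ and ReLU activations, the pre-activation variance stays roughly constant with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 20
y = rng.standard_normal((n, 2_000))                # y_1 with unit variance
print(np.mean(np.maximum(y, 0.0) ** 2))            # E[x^2] ≈ 0.5 = Var[y_1] / 2
for l in range(2, depth + 1):
    x = np.maximum(y, 0.0)                         # ReLU activations x_l
    W = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))  # He init: Var[w_l] = 2 / n_l
    y = W @ x                                      # y_l (biases omitted)
    print(l, round(y.var(), 3))                    # hovers around 1.0
```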

Now for ELU units, the only thing we have to change is the expression of $\mathbb{E}[x_l^2]$ in terms of $Var[y_{l-1}]$. Sadly, this is not as straightforward for ELU units as for ReLU units, since it involves computing $\mathbb{E}[(e^{\mathcal{N}} - 1)^2]$ over only the negative values of $\mathcal{N}$. This is not a pretty formula, and I don’t even know if there’s a good closed-form solution, so let’s sample to get an approximation. We want $Var[y_l]$ to be roughly equal to $1$ (most inputs have variance 1, batch norm makes layer outputs have variance 1, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with $\alpha = 1$, square, and take the mean. This gives $\approx 0.645$. The inverse of this is $\approx 1.55$.
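Here is that sampling estimate as a quick NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000_000)
elu = np.where(z > 0, z, np.exp(z) - 1.0)  # ELU with alpha = 1
m2 = np.mean(elu ** 2)
print(m2, 1.0 / m2)                        # ≈ 0.645 and ≈ 1.55
```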

Thus, following the same logic, we can set $Var[w_l] = \frac{1.55}{n}$ (i.e., initialize the weights with standard deviation $\sqrt{\frac{1.55}{n}}$) to get a variance that doesn’t increase in magnitude.
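As a sketch of what that looks like as an initializer (the name `elu_normal_init` is mine, and I take $n$ to be the fan-in, as in He et al.):

```python
import numpy as np

def elu_normal_init(fan_in, fan_out, rng=None):
    """Gaussian init with Var[w] = 1.55 / fan_in, i.e. std = sqrt(1.55 / fan_in)."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.normal(0.0, np.sqrt(1.55 / fan_in), size=(fan_in, fan_out))
```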

I reckon that would be the optimal value for the ELU function. The constant $0.645$ fits in between the value for the ReLU function ($\frac{1}{2}$, which is lower because the values that ReLU maps to $0$ now get mapped to some negative value instead) and what you would have for a zero-mean, variance-preserving activation (which is just $1$).

Take care that if $Var[y_{l-1}]$ is different, the optimal constant is also different. When this variance tends to $0$, the ELU behaves more and more like the identity function, so the constant tends to $1$. If the variance becomes really big, the value tends towards the original ReLU value of $0.5$.
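A quick numerical check of that behaviour (my own sketch): the ratio $\mathbb{E}[\text{elu}(x)^2] / Var[x]$ moves from $1$ towards $0.5$ as the input standard deviation grows.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)
for sigma in (0.01, 1.0, 10.0):
    x = sigma * z
    elu = np.where(x > 0, x, np.exp(x) - 1.0)
    print(sigma, np.mean(elu ** 2) / sigma ** 2)  # ≈ 0.99, ≈ 0.645, ≈ 0.50
```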

Edit: I did the theoretical analysis of $\mathbb{E}[\text{elu}(x)^2]$ (the quantity estimated by sampling above) when $x$ is normally distributed with standard deviation $\sigma$. It involves some derivations around the log-normal distribution and not-so-pretty integrals. The eventual answer is
$$\mathbb{E}[\text{elu}(x)^2] = \frac{\sigma^2}{2} + b - 2a + \frac{1}{2},$$
where the first term is the contribution of the linear part, and
$$a = \frac{1}{2} e^{\frac{\sigma^2}{2}} \,\text{erfc}\!\left(\frac{\sigma}{\sqrt{2}}\right), \qquad b = \frac{1}{2} e^{2\sigma^2} \,\text{erfc}\!\left(\sqrt{2}\,\sigma\right).$$
This is unfortunately not easy to solve for $\sigma$. You can fill in $\sigma = 1$, however, and recover the estimate I gave above, which is pretty cool.
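To tie the closed form back to the Monte Carlo estimate, here is a small numerical check (my own sketch, using `scipy.special.erfc`):

```python
import numpy as np
from scipy.special import erfc

def elu_second_moment(sigma):
    """Closed form for E[elu(x)^2] with x ~ N(0, sigma^2) and alpha = 1."""
    a = 0.5 * np.exp(sigma ** 2 / 2.0) * erfc(sigma / np.sqrt(2.0))
    b = 0.5 * np.exp(2.0 * sigma ** 2) * erfc(np.sqrt(2.0) * sigma)
    return 0.5 * sigma ** 2 + b - 2.0 * a + 0.5

print(elu_second_moment(1.0), 1.0 / elu_second_moment(1.0))  # ≈ 0.645, ≈ 1.55
```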