Range of lambda in elastic net regression

$\def\l{|\!|}$
Given the elastic net regression

$$\min_b \frac{1}{2}\l y – Xb \l^2 + \alpha\lambda \l b\l_2^2 + (1 – \alpha) \lambda \l b\l_1$$

how can an appropriate range of $\lambda$ be chosen for cross-validation?

In the $\alpha=1$ case (ridge regression) the formula

$$\textrm{dof} = \sum_j \frac{s_j^2}{s_j^2+\lambda}$$

can be used to give an equivalent degrees of freedom for each lambda (where $s_j$ are the singular values of $X$), and degrees of freedom can be chosen in a sensible range.

In the $\alpha=0$ case (lasso) we know that

$$\lambda > \lambda_{\textrm{max}} = \max_j|\sum_t y_t X_{tj}|$$

will result in all $b_j$ being zero, and $\lambda$ can be chosen in some range $(0, \lambda_\textrm{max})$.

But how to handle the mixed case?

Answer

I think you should use a range of $0$ to

$$\lambda_\text{max}^\prime = \frac{1}{1-\alpha}\lambda_\text{max}$$

My reasoning comes from extending the lasso case, and a full derivation is below. The qualifier is it doesn’t capture the $\text{dof}$ constraint contributed by the $\ell_2$ regularization. If I work out how to fix that (and decide whether it actually needs fixing), I’ll come back and edit it in.


Define the objective

$$f(b) = \frac{1}{2} \|y – Xb\|^2 + \frac{1}{2} \gamma \|b\|^2 + \delta \|b\|_1$$

This is the objective you described, but with some parameters substituted to improve clarity.

Conventionally, $b=0$ can only be a solution to the optimization problem $\min f(b)$ if the gradient at $b = 0$ is zero. The term $\|b\|_1$ is non-smooth though, so the condition is actually that $0$ lies in the subgradient at $b = 0$.

The subgradient of $f$ is

$$\partial f = -X^T(y – Xb) + \gamma b + \delta \partial \|b\|_1$$

where $\partial$ denotes the subgradient with respect to $b$. At $b=0$, this becomes

$$\partial f|_{b=0} = -X^Ty + \delta[-1, 1]^d$$

where $d$ is the dimension of $b$, and a $[-1,1]^d$ is a $d$-dimensional cube. So for the optimization problem to have a solution of $b = 0$, it must be that

$$(X^Ty)_i \in \delta [-1, 1]$$

for each component $i$. This is equivalent to

$$\delta > \max_i \left|\sum_j y_j X_{ij} \right|$$

which is the defintion you gave for $\lambda_\text{max}$. If $\delta = (1-\alpha)\lambda$ is now swapped in, the formula from the top of the post falls out.

Attribution
Source : Link , Question Author : Chris Taylor , Answer Author : Andy Jones

Leave a Comment