# Lasso penalty only applied to subset of regressors

This question has been asked before but there were no responses, so I thought I might ask again.

I’m interested in applying a Lasso penalty to some subset of the regressors, i.e. with objective function

$E = ||\mathbf{y} - \mathbf{X}_1 \boldsymbol{\beta}_1 - \mathbf{X}_2 \boldsymbol{\beta}_2||^2 + \lambda ||\boldsymbol{\beta}_1||_1$

where the Lasso is only applied to $\boldsymbol{\beta}_1$ but $\boldsymbol{\beta}_2$ is involved in the reconstruction.

Is there any theory behind this? Secondly, is there any way to do this in sklearn?

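Let $H_2$ be the orthogonal projector onto the column space of $X_2$. Splitting the residual into its component in $\mathrm{col}(X_2)$ and its component orthogonal to $\mathrm{col}(X_2)$, and using $H_2 X_2 = X_2$, the Pythagorean theorem gives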
\begin{align*}
& \min_{\beta_1, \beta_2} \left\{ \|y - X_1\beta_1 - X_2\beta_2\|_2^2 + \lambda \|\beta_1\|_1 \right\} \\
= & \, \min_{\beta_1, \beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
= & \, \min_{\beta_1 | \beta_2} \min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\},
\end{align*}
where
\begin{align*}
\hat\beta_2
& = \arg\min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
& = \arg\min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 \right\}
\end{align*}
satisfies $X_2 \hat\beta_2 = H_2 (y - X_1 \beta_1)$ for all $\beta_1$, since $H_2 (y - X_1 \beta_1) \in \mathrm{col}(X_2)$ for all $\beta_1$. If $X_2$ has full column rank, we further have that $$\hat\beta_2 = (X_2^T X_2)^{-1} X_2^T (y - X_1 \beta_1),$$ since $H_2 = X_2 (X_2^T X_2)^{-1} X_2^T$ in this case.
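
If you want to convince yourself numerically, here is a minimal NumPy sketch of the two facts used so far: the least-squares $\hat\beta_2$ reproduces the projected partial residual, and the squared residual splits into the two orthogonal pieces. The data are random and the variable names are only illustrative; the sketch assumes $X_2$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 50, 4, 3
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))   # full column rank with probability one
y = rng.normal(size=n)
beta1 = rng.normal(size=p1)     # an arbitrary, fixed beta_1

# Orthogonal projector onto col(X2): H2 = X2 (X2^T X2)^{-1} X2^T
H2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)

# Least-squares beta_2 for the partial residual r = y - X1 beta1
r = y - X1 @ beta1
beta2_hat = np.linalg.solve(X2.T @ X2, X2.T @ r)

# X2 beta2_hat should equal the projected residual H2 r
fit_matches_projection = np.allclose(X2 @ beta2_hat, H2 @ r)

# Pythagorean split of the squared residual, for an arbitrary beta_2
beta2 = rng.normal(size=p2)
lhs = np.sum((r - X2 @ beta2) ** 2)
rhs = np.sum((H2 @ r - X2 @ beta2) ** 2) + np.sum((r - H2 @ r) ** 2)
pythagoras_holds = np.isclose(lhs, rhs)

print(fit_matches_projection, pythagoras_holds)
```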

Plugging this into the first optimization problem, we see that
\begin{align*}
\hat\beta_1
& = \arg\min_{\beta_1} \left\{ 0 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
& =\arg\min_{\beta_1} \left\{ \|\left(I-H_2\right)y - \left(I-H_2\right)X_1\beta_1 \|_2^2 + \lambda \|\beta_1 \|_1 \right\}, \tag{*}
\end{align*}
which can be evaluated through the usual lasso computational tools. As whuber suggests in his comment, this result is intuitive since the unrestricted coefficients $\beta_2$ can cover the span of $X_2$, so that only the part of space orthogonal to the span of $X_2$ is of concern when evaluating $\hat\beta_1$.
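
As for doing this in sklearn, $(*)$ suggests the following rough sketch: residualize $y$ and $X_1$ on $X_2$, run an ordinary lasso without an intercept on the residualized data, and then recover $\hat\beta_2$ by least squares. Note that `partial_lasso` is a hypothetical helper written for this answer, not an sklearn function, and the simulated data are made up. One thing to watch is scaling: sklearn's `Lasso` minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$, so the $\lambda$ above corresponds to $\alpha = \lambda / (2n)$.

```python
import numpy as np
from sklearn.linear_model import Lasso


def partial_lasso(X1, X2, y, lam):
    """Hypothetical helper: L1 penalty on the X1 coefficients only.

    Minimizes ||y - X1 b1 - X2 b2||_2^2 + lam * ||b1||_1 via equation (*).
    """
    n = X1.shape[0]
    # Apply (I - H2) to y and X1 by subtracting least-squares fits on X2.
    # lstsq also copes with a rank-deficient X2.
    y_res = y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
    X1_res = X1 - X2 @ np.linalg.lstsq(X2, X1, rcond=None)[0]
    # Ordinary lasso on the residualized data; alpha = lam / (2n) matches
    # sklearn's 1/(2n) scaling of the squared-error term.
    model = Lasso(alpha=lam / (2 * n), fit_intercept=False,
                  tol=1e-10, max_iter=100_000)
    model.fit(X1_res, y_res)
    beta1 = model.coef_
    # Unpenalized coefficients: least squares on the partial residual.
    beta2 = np.linalg.lstsq(X2, y - X1 @ beta1, rcond=None)[0]
    return beta1, beta2


# Simulated example: X2 holds an intercept column and two covariates.
rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=(n, 8))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_b1 = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
true_b2 = np.array([3.0, 1.0, -1.0])
y = X1 @ true_b1 + X2 @ true_b2 + 0.5 * rng.normal(size=n)

lam = 10.0
beta1_hat, beta2_hat = partial_lasso(X1, X2, y, lam)
print(beta1_hat)
print(beta2_hat)
```

Only `beta1_hat` is shrunk toward zero here; `beta2_hat` is a plain least-squares fit to whatever $X_1 \hat\beta_1$ leaves unexplained.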

Despite the notation being slightly more general, nearly anyone who has ever used lasso is familiar with this result. To see this, suppose that $X_2 = \mathbf{1}$ is the (length $n$) vector of ones, representing the intercept. Then, the projection matrix $H_2 = \mathbf{1} \left( \mathbf{1}^T \mathbf{1} \right)^{-1} \mathbf{1}^T = \frac{1}{n} \mathbf{1} \mathbf{1}^T$, and, for any vector $v$, the orthogonal projection $\left( I - H_2 \right) v = v - \bar{v} \mathbf{1}$ just demeans the vector. Considering equation $(*)$, this is exactly what people do when they compute the lasso coefficients! They demean the data so that the intercept doesn’t have to be considered.
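
The intercept case is easy to check numerically. The sketch below, on made-up data, first confirms that $(I - H_2)v$ is just $v$ with its mean removed, and then compares sklearn's `Lasso` fitted with an intercept on the raw data against `Lasso` fitted without an intercept on the demeaned data. The tolerance and iteration settings are only there to make the two fits comparable to many digits.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 80, 5
X = rng.normal(size=(n, p)) + 3.0   # deliberately not centered
y = 2.0 + X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)

# (I - H2) v with H2 = (1/n) 1 1^T is the same as subtracting the mean.
ones = np.ones((n, 1))
H2 = ones @ ones.T / n
v = rng.normal(size=n)
demeaning_ok = np.allclose((np.eye(n) - H2) @ v, v - v.mean())

alpha = 0.05
# Lasso with an unpenalized intercept on the raw data.
with_intercept = Lasso(alpha=alpha, fit_intercept=True,
                       tol=1e-10, max_iter=100_000).fit(X, y)

# Lasso without an intercept on the demeaned data, as in equation (*).
Xc = X - X.mean(axis=0)
yc = y - y.mean()
centered = Lasso(alpha=alpha, fit_intercept=False,
                 tol=1e-10, max_iter=100_000).fit(Xc, yc)

# Recover the intercept as the least-squares fit of 1 to y - X beta_1.
intercept = y.mean() - X.mean(axis=0) @ centered.coef_

print(demeaning_ok)
print(np.allclose(with_intercept.coef_, centered.coef_, atol=1e-6))
print(np.isclose(with_intercept.intercept_, intercept, atol=1e-6))
```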