Following up on this question…

In ordinary least squares, the predictions and residuals are orthogonal. $$\sum_{i=1}^n\hat{y}_i (y_i - \hat{y}_i) = 0$$

If we estimate the regression coefficients using some other method but the same model, such as using regularization, why, intuitively, should that wreck the orthogonality?

**Answer**

I wrote a comprehensive explanation of this question on my site, which might be useful for readers.

I’ll focus on ridge regularization here, because its solution can be derived from the same normal equations used to derive the OLS solution (see this answer).

The coefficients in ridge regression (with penalty weighting $\lambda$) are simply:

$$\beta = (X^TX+\lambda\mathbb I)^{-1}X^Ty$$

The OLS solution is the special case $\lambda = 0$.
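As a minimal sketch of the closed form above, the snippet below computes the ridge coefficients with NumPy and checks that $\lambda = 0$ reproduces the least-squares fit, whose predictions are orthogonal to its residuals. The helper name `ridge_coefficients` and the synthetic data are assumptions made purely for illustration; they are not part of the original answer.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (illustrative assumption, no intercept column).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# lam = 0 should reproduce the ordinary least-squares coefficients.
beta_ols = ridge_coefficients(X, y, lam=0.0)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_ols, beta_lstsq))

# OLS predictions are orthogonal to OLS residuals (up to rounding error).
y_hat = X @ beta_ols
print(abs(y_hat @ (y - y_hat)) < 1e-8)
```

Both checks should hold up to floating-point tolerance for a well-conditioned design matrix like this one.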

The ridge solution can also be recovered from the ordinary normal equations by applying them to an augmented version of $X$, obtained by concatenating $p$ new virtual samples formed by a scaled identity matrix, with zeros as their responses:

$$
X_\text{new}=\begin{bmatrix} X_\text{old} \\ \sqrt{\lambda}\,\mathbb I_{p\times p} \end{bmatrix}
\qquad
y_\text{new}=\begin{bmatrix} y_\text{old} \\ \mathbf 0_{p\times1} \end{bmatrix}
$$

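With this augmentation, $X_\text{new}^TX_\text{new} = X_\text{old}^TX_\text{old}+\lambda\mathbb I$ and $X_\text{new}^T y_\text{new} = X_\text{old}^T y_\text{old}$, so it follows that: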

$$\beta = (X_\text{old}^TX_\text{old}+\lambda\mathbb I)^{-1}X^T_\text{old} y_\text{old} = (X_\text{new}^TX_\text{new})^{-1}X_\text{new}^T y_\text{new}$$
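The equivalence can be checked numerically. The sketch below, assuming synthetic data and an arbitrary penalty `lam = 2.5` (both chosen only for illustration), fits ridge on the real samples via the closed form and plain least squares on the augmented samples, then compares the two coefficient vectors.

```python
import numpy as np

# Synthetic data and penalty (illustrative assumptions).
rng = np.random.default_rng(1)
n, p, lam = 40, 4, 2.5
X_old = rng.normal(size=(n, p))
y_old = rng.normal(size=n)

# Ridge closed form on the real samples only.
beta_ridge = np.linalg.solve(X_old.T @ X_old + lam * np.eye(p), X_old.T @ y_old)

# Augment with p virtual samples: rows sqrt(lam) * I, responses 0.
X_new = np.vstack([X_old, np.sqrt(lam) * np.eye(p)])
y_new = np.concatenate([y_old, np.zeros(p)])

# Ordinary least squares on the augmented data.
beta_aug = np.linalg.lstsq(X_new, y_new, rcond=None)[0]

print(np.allclose(beta_ridge, beta_aug))
```

The augmented design has $n + p$ rows, so the least-squares fit is well defined for any $\lambda > 0$ even when $X_\text{old}$ itself is rank deficient.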

Thus, since the ridge solution is just the normal-equations (OLS) solution of the augmented problem, the orthogonality of residuals and predictions is kept intact, as long as both are computed over the augmented data.

But notice that, now, predictions involve these virtual samples.

That’s why, when looking only at the real samples, this orthogonality is not guaranteed: you are missing part of the puzzle by not taking into account these “virtual” samples.
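To make the missing piece explicit: the augmented predictions are $X_\text{old}\beta$ stacked on $\sqrt{\lambda}\beta$, and the augmented residuals are $y_\text{old} - X_\text{old}\beta$ stacked on $-\sqrt{\lambda}\beta$. Writing $\hat{y} = X_\text{old}\beta$, orthogonality over the augmented data therefore reads

$$\hat{y}^T(y_\text{old} - \hat{y}) - \lambda\,\beta^T\beta = 0 \quad\Longrightarrow\quad \sum_{i=1}^n\hat{y}_i (y_i - \hat{y}_i) = \lambda\,\lVert\beta\rVert^2$$

so over the real samples alone the inner product is zero only when $\lambda = 0$ or $\beta = 0$. A small numerical sketch of this identity follows, with synthetic data and `lam = 5.0` as illustrative assumptions.

```python
import numpy as np

# Synthetic data and penalty (illustrative assumptions).
rng = np.random.default_rng(2)
n, p, lam = 60, 3, 5.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Ridge coefficients from the closed form.
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Inner product of predictions and residuals over the real samples.
y_hat = X @ beta
real_inner = y_hat @ (y - y_hat)

# Contribution of the p virtual samples: prediction sqrt(lam)*beta, target 0.
virtual_pred = np.sqrt(lam) * beta
virtual_inner = virtual_pred @ (0.0 - virtual_pred)

# Real part equals lam * ||beta||^2; real + virtual parts cancel.
print(np.isclose(real_inner, lam * (beta @ beta)))
print(abs(real_inner + virtual_inner) < 1e-8)
```

The real-sample term is strictly positive here, and the virtual samples contribute exactly the negative amount needed to restore orthogonality.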

**Attribution**

*Source: Link, Question Author: Dave, Answer Author: Firebug*