Consider linear regression with some regularization:

E.g. find $x$ that minimizes $\|Ax - b\|^2 + \lambda\|x\|_1$.

Usually, the columns of $A$ are standardized to have zero mean and unit norm, while $b$ is centered to have zero mean. I want to check whether my understanding of the reason for standardizing and centering is correct.

By making the means of the columns of $A$ and of $b$ zero, we don’t need an intercept term anymore. Otherwise, the objective would have been $\|Ax - x_0\mathbf{1} - b\|^2 + \lambda\|x\|_1$. By making the norms of the columns of $A$ equal to 1, we rule out the case where a column of $A$ gets a low coefficient in $x$ just because it has a very high norm, which might lead us to conclude incorrectly that that column of $A$ doesn’t “explain” $x$ well.

This reasoning is not exactly rigorous, but intuitively, is it the right way to think?

**Answer**

You are correct about zeroing the means of the columns of A and b.
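
As a quick numerical check of the centering claim, here is a minimal sketch using plain least squares (no penalty) on synthetic data. The data, seed, and variable names are all made up for illustration; the same argument carries over to the regularized problem as long as the intercept itself is not penalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Synthetic data with non-zero column means and a true intercept of 4.0
A = rng.normal(loc=5.0, scale=2.0, size=(n, p))
x_true = np.array([1.5, -2.0, 0.5])
b = A @ x_true + 4.0 + 0.1 * rng.normal(size=n)

# Fit 1: raw data with an explicit column of ones for the intercept
A_aug = np.column_stack([np.ones(n), A])
coef_aug, *_ = np.linalg.lstsq(A_aug, b, rcond=None)
intercept_raw, slopes_raw = coef_aug[0], coef_aug[1:]

# Fit 2: centered data, no intercept column at all
A_c = A - A.mean(axis=0)
b_c = b - b.mean()
slopes_centered, *_ = np.linalg.lstsq(A_c, b_c, rcond=None)

# The intercept can be recovered afterwards from the means
intercept_recovered = b.mean() - A.mean(axis=0) @ slopes_centered

print(np.allclose(slopes_raw, slopes_centered))
print(np.isclose(intercept_raw, intercept_recovered))
```

Both fits give the same slopes, and the intercept falls out of the column means after the fact, which is why it can be dropped from the optimization.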

However, as for adjusting the norms of the columns of $A$, consider what would happen if you started out with a normed $A$, and all the elements of $x$ were of roughly the same magnitude. Then let us multiply one column by, say, $10^{-6}$. The corresponding element of $x$ would, in an unregularized regression, be increased by a factor of $10^{6}$. See what would happen to the regularization term? The regularization would, for all practical purposes, apply only to that one coefficient.
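
The thought experiment above can be sketched numerically. This is only an illustration on synthetic data (the sizes, seed, and names are assumptions, not anything from the question): one column of a normed $A$ is multiplied by $10^{-6}$, and the unregularized coefficients are compared before and after.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4

# Build a centered A with unit-norm columns, and a centered b
A = rng.normal(size=(n, p))
A -= A.mean(axis=0)
A /= np.linalg.norm(A, axis=0)
b = A @ np.ones(p) + 0.01 * rng.normal(size=n)
b -= b.mean()

# Unregularized fit on the normed matrix: coefficients of similar size
x_normed, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shrink the first column by a factor of 1e-6 and refit
A_scaled = A.copy()
A_scaled[:, 0] *= 1e-6
x_scaled, *_ = np.linalg.lstsq(A_scaled, b, rcond=None)

# How much did the first coefficient grow, and how much of the
# L1 penalty ||x||_1 would it now account for?
ratio = x_scaled[0] / x_normed[0]
share = abs(x_scaled[0]) / np.abs(x_scaled).sum()

print("growth factor of first coefficient:", ratio)
print("share of ||x||_1 taken by first coefficient:", share)
```

The first coefficient grows by roughly the inverse of the scaling factor, so it accounts for almost all of $\|x\|_1$, and a penalty $\lambda\|x\|_1$ would act almost entirely on it.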

By norming the columns of $A$, we, writing intuitively, put them all on the same scale. Consequently, differences in the magnitudes of the elements of $x$ are directly related to the “wiggliness” of the explanatory function ($Ax$), which is, loosely speaking, what the regularization tries to control. Without norming, a coefficient value of, e.g., 0.1 vs. another of 10.0 would tell you, in the absence of knowledge about $A$, nothing about which coefficient was contributing the most to the “wiggliness” of $Ax$. (For a linear function, like $Ax$, “wiggliness” is related to deviation from 0.)
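
For concreteness, the preprocessing might look like the sketch below. The helper `standardize` is a hypothetical name introduced here, not a library function, and it assumes no column of $A$ is constant (a constant column would have zero norm after centering). The coefficient vector `x_std` is an arbitrary illustrative value, not the result of a fit.

```python
import numpy as np

def standardize(A, b):
    """Center the columns of A and scale them to unit norm; center b.

    Hypothetical helper for illustration. Assumes no column of A is
    constant, since such a column has zero norm after centering.
    """
    col_means = A.mean(axis=0)
    A_c = A - col_means
    col_norms = np.linalg.norm(A_c, axis=0)
    b_mean = b.mean()
    return A_c / col_norms, b - b_mean, col_means, col_norms, b_mean

rng = np.random.default_rng(2)
# Columns on wildly different scales and with different offsets
A = rng.normal(size=(30, 3)) * np.array([1.0, 100.0, 0.01]) + np.array([0.0, 50.0, -3.0])
b = rng.normal(size=30)

A_s, b_c, col_means, col_norms, b_mean = standardize(A, b)

# Coefficients on the standardized scale (arbitrary values for illustration)
x_std = np.array([0.5, -1.0, 2.0])

# Equivalent coefficients for the centered but unscaled columns
x_orig = x_std / col_norms

pred_std = A_s @ x_std
pred_orig = (A - col_means) @ x_orig
print(np.allclose(pred_std, pred_orig))
```

Dividing each standardized coefficient by its column norm maps it back to the original units, so the fitted function is the same either way; only the scale on which the penalty sees the coefficients changes.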

To return to your explanation, if one column of $A$ has a very high norm, and for some reason gets a low coefficient in $x$, we would not conclude that the column of $A$ doesn’t “explain” $x$ well. $A$ doesn’t “explain” $x$ at all; it explains $b$, and $x$ only holds the coefficients.

**Attribution**

*Source: Link, Question Author: rk2, Answer Author: jbowman*