For linear classifiers, do larger coefficients imply more important features?

I’m a software engineer working on machine learning. From my understanding, linear regression (such as OLS) and linear classification (such as logistic regression and SVM) make a prediction based on an inner product between trained coefficients $\vec{w}$ and feature variables $\vec{x}$:

$$
\hat{y} = f(\vec{w} \cdot \vec{x}) = f(\sum_{i} w_i x_i)
$$

My question is: After the model has been trained (that is, after the coefficients $w_i$ have been computed), will the coefficients be larger for the feature variables that matter more for making accurate predictions?

In other words, can the relative magnitudes of the coefficients be used for feature selection, simply by ordering the variables by coefficient magnitude and keeping the features with the largest coefficients? If this approach is valid, why is it not mentioned alongside the usual feature-selection techniques (wrapper methods, filter methods, etc.)?
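
For concreteness, here is a minimal sketch (using R's built-in iris data) of the kind of ranking I have in mind; it only illustrates the procedure, not something I am claiming is valid:

fit <- lm(Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = iris)
coefs <- coef(fit)[-1]               # drop the intercept
sort(abs(coefs), decreasing = TRUE)  # rank predictors by raw coefficient magnitude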

The reason I ask is that I came across a discussion of L1 vs. L2 regularization, which includes this blurb:

Built-in feature selection is frequently mentioned as a useful property of the L1-norm, which the L2-norm does not have. This is actually a result of the L1-norm, which tends to produce sparse coefficients (explained below). Suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that “the other 90 predictors are useless in predicting the target values”.

Reading between the lines, I would guess that if a coefficient is close to 0, then the feature variable with that coefficient must have little predictive power.
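
To make the sparsity claim concrete, here is a small sketch I put together with the glmnet package (assuming it is installed; the simulated data and penalty strength are only for illustration). The lasso penalty (alpha = 1) drives most coefficients exactly to zero, while the ridge penalty (alpha = 0) leaves them all non-zero:

library(glmnet)
set.seed(1)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), n, p)      # 20 candidate predictors
y <- x[, 1] + 2 * x[, 2] + rnorm(n)  # only the first two actually matter
lasso <- glmnet(x, y, alpha = 1, lambda = 0.1)  # L1 penalty
ridge <- glmnet(x, y, alpha = 0, lambda = 0.1)  # L2 penalty
sum(coef(lasso) != 0)  # typically ~3 non-zero entries (intercept + the 2 real predictors)
sum(coef(ridge) != 0)  # typically all 21 entries non-zero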

EDIT: I am also applying z-scaling to my numeric variables.

Answer

Not at all. The magnitude of the coefficients depends directly on the scales selected for the variables, which is a somewhat arbitrary modeling decision.

To see this, consider a linear regression model predicting the petal width of an iris (in centimeters) given its petal length (in centimeters):

summary(lm(Petal.Width~Petal.Length, data=iris))
# Call:
# lm(formula = Petal.Width ~ Petal.Length, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.56515 -0.12358 -0.01898  0.13288  0.64272 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
# Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2065 on 148 degrees of freedom
# Multiple R-squared:  0.9271,  Adjusted R-squared:  0.9266 
# F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Our model achieves an adjusted R^2 of 0.9266 and assigns a coefficient of 0.415755 to the Petal.Length variable.

However, the choice to define Petal.Length in centimeters was quite arbitrary, and we could have instead defined the variable in meters:

iris$Petal.Length.Meters <- iris$Petal.Length / 100
summary(lm(Petal.Width~Petal.Length.Meters, data=iris))
# Call:
# lm(formula = Petal.Width ~ Petal.Length.Meters, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.56515 -0.12358 -0.01898  0.13288  0.64272 
# 
# Coefficients:
#                     Estimate Std. Error t value Pr(>|t|)    
# (Intercept)         -0.36308    0.03976  -9.131  4.7e-16 ***
# Petal.Length.Meters 41.57554    0.95824  43.387  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2065 on 148 degrees of freedom
# Multiple R-squared:  0.9271,  Adjusted R-squared:  0.9266 
# F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Of course, this doesn’t really affect the fitted model in any way — we simply assigned a 100x larger coefficient to Petal.Length.Meters (41.57554) than we did to Petal.Length (0.415755). All other properties of the model (adjusted R^2, t-statistics, p-values, etc.) are identical.

Generally, when fitting regularized linear models, one first normalizes the variables (for instance, to mean 0 and unit variance) so that the penalty does not favor some variables over others purely because of the scales that happened to be chosen.
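
For instance, a minimal sketch of that normalization step on the iris example above, using R's scale() (the column name Petal.Length.Std is just mine for illustration):

iris$Petal.Length.Std <- as.numeric(scale(iris$Petal.Length))  # mean 0, sd 1
coef(lm(Petal.Width ~ Petal.Length.Std, data = iris))

The standardized slope comes out the same whether the raw lengths were recorded in centimeters or in meters, since scale() removes the units.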

Assuming Normalized Data

Even if the variables are on comparable scales, those with larger coefficients may still contribute less to predictions when they have little variance; a binary indicator that almost always takes the same value is the typical case. As an example, consider a dataset with dependent variable Z and binary independent variables X and Y:

set.seed(144)
dat <- data.frame(X=rep(c(0, 1), each=50000),
                  Y=rep(c(0, 1), c(1000, 99000)))
dat$Z <- dat$X + 2*dat$Y + rnorm(100000)

By construction, the coefficient for Y is roughly twice as large as the coefficient for X when both are used to predict Z via linear regression:

summary(lm(Z~X+Y, data=dat))
# Call:
# lm(formula = Z ~ X + Y, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.4991 -0.6749 -0.0056  0.6723  4.7342 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -0.094793   0.031598   -3.00   0.0027 ** 
# X            0.999435   0.006352  157.35   <2e-16 ***
# Y            2.099410   0.031919   65.77   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.9992 on 99997 degrees of freedom
# Multiple R-squared:  0.2394,  Adjusted R-squared:  0.2394 
# F-statistic: 1.574e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

Still, X explains more of the variance in Z than Y (the linear regression model predicting Z with X has R^2 value 0.2065, while the linear regression model predicting Z with Y has R^2 value 0.0511):

summary(lm(Z~X, data=dat))
# Call:
# lm(formula = Z ~ X, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.2587 -0.6759  0.0038  0.6842  4.7342 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 1.962629   0.004564   430.0   <2e-16 ***
# X           1.041424   0.006455   161.3   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.021 on 99998 degrees of freedom
# Multiple R-squared:  0.2065,  Adjusted R-squared:  0.2065 
# F-statistic: 2.603e+04 on 1 and 99998 DF,  p-value: < 2.2e-16

versus:

summary(lm(Z~Y, data=dat))
# Call:
# lm(formula = Z ~ Y, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.0038 -0.7638 -0.0007  0.7610  5.2288 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -0.09479    0.03529  -2.686  0.00724 ** 
# Y            2.60418    0.03547  73.416  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.116 on 99998 degrees of freedom
# Multiple R-squared:  0.05114, Adjusted R-squared:  0.05113 
# F-statistic:  5390 on 1 and 99998 DF,  p-value: < 2.2e-16

The Case of Multi-Collinearity

A third situation in which large coefficient values can be deceiving is significant multi-collinearity between variables. As an example, consider a dataset where X and Y are highly correlated, W is not highly correlated with either of them, and we are trying to predict Z:

set.seed(144)
dat <- data.frame(W=rnorm(100000),
                  X=rnorm(100000))
dat$Y <- dat$X + rnorm(100000, 0, 0.001)
dat$Z <- 2*dat$W+10*dat$X-11*dat$Y + rnorm(100000)
cor(dat)
#              W             X             Y          Z
# W 1.000000e+00  5.191809e-05  5.200434e-05  0.8161636
# X 5.191809e-05  1.000000e+00  9.999995e-01 -0.4079183
# Y 5.200434e-05  9.999995e-01  1.000000e+00 -0.4079246
# Z 8.161636e-01 -4.079183e-01 -4.079246e-01  1.0000000

All three predictors have essentially the same mean (0) and variance (~1), yet linear regression assigns much larger coefficients (in absolute value) to X (roughly 15) and Y (roughly -16) than to W (roughly 2):

summary(lm(Z~W+X+Y, data=dat))
# Call:
# lm(formula = Z ~ W + X + Y, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.1886 -0.6760  0.0026  0.6679  4.2232 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.831e-04  3.170e-03   0.058    0.954    
# W            2.001e+00  3.172e-03 630.811  < 2e-16 ***
# X            1.509e+01  3.177e+00   4.748 2.05e-06 ***
# Y           -1.609e+01  3.177e+00  -5.063 4.13e-07 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.002 on 99996 degrees of freedom
# Multiple R-squared:  0.8326,  Adjusted R-squared:  0.8326 
# F-statistic: 1.658e+05 on 3 and 99996 DF,  p-value: < 2.2e-16

Still, among the three variables in the model, W is the most important: if you remove W from the full model, the R^2 drops from 0.833 to 0.166, while if you drop X or Y the R^2 is virtually unchanged.
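
A quick sketch of that comparison, reusing the dat data frame from above (the commented values simply restate the numbers quoted in the previous paragraph):

summary(lm(Z ~ X + Y, data = dat))$r.squared  # W removed: R^2 falls to about 0.166
summary(lm(Z ~ W + Y, data = dat))$r.squared  # X removed: R^2 stays near 0.83
summary(lm(Z ~ W + X, data = dat))$r.squared  # Y removed: R^2 stays near 0.83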

Attribution
Source: Link, Question Author: stackoverflowuser2010, Answer Author: amoeba
