# Is using correlation matrix to select predictors for regression correct?

A few days ago, a psychologist-researcher of mine told me about his method to select variables to linear regression model. I guess it’s not good, but I need to ask someone else to make sure. The method is:

Look at correlation matrix between all variables (including Dependent
Variable Y) and choose those predictors Xs, that correlate most with Y.

He didn’t mention any criterion.
Q: Was he right?

[I think that this selection method is wrong, because of many things, like it’s the theory that says which predictors should be selected, or even omitted variable bias (OVB).]

If, for some reason, you are going to include only one variable in your model, then selecting the predictor which has the highest correlation with $y$ has several advantages. Out of the possible regression models with only one predictor, then this model is the one with the highest standardized regression coefficient and also (since $R^2$ is the square of $r$ in a simple linear regression) the highest coefficient of determination.

But it’s not clear why you would want to restrict your regression model to one predictor if you have data available for several. As mentioned in the comments, just looking at the correlations doesn’t work if your model might include several variables. For example, from this scatter matrix, you might think that the predictors for $y$ you should include in your model are $x_1$ (correlation 0.824) and $x_2$ (correlation 0.782) but that $x_3$ (correlation 0.134) is not a useful predictor.

But you’d be wrong – in fact in this example, $y$ depends on two independent variables, $x_1$ and $x_3$, but not directly on $x_2$. However $x_2$ is highly correlated with $x_1$, which leads to a correlation with $y$ also. Looking at the correlation between $y$ and $x_2$ in isolation, this might suggest $x_2$ is a good predictor of $y$. But once the effects of $x_1$ are partialled out by including $x_1$ in the model, no such relationship remains.

require(MASS) #for mvrnorm
set.seed(42) #so reproduces same result

Sigma <- matrix(c(1,0.95,0,0.95,1,0,0,0,1),3,3)
N <- 1e4
x <- mvrnorm(n=N, c(0,0,0), Sigma, empirical=TRUE)
data.df <- data.frame(x1=x[,1], x2=x[,2], x3=x[,3])
# y depends on x1 strongly and x3 weakly, but not directly on x2
data.df$y <- with(data.df, 5 + 3*x1 + 0.5*x3) + rnorm(N, sd=2) round(cor(data.df), 3) # x1 x2 x3 y # x1 1.000 0.950 0.000 0.824 # x2 0.950 1.000 0.000 0.782 # x3 0.000 0.000 1.000 0.134 # y 0.824 0.782 0.134 1.000 # Note: x1 and x2 are highly correlated # Since y is highly correlated with x1, it is with x2 too # y depended only weakly on x3, their correlation is much lower pairs(~y+x1+x2+x3,data=data.df, main="Scatterplot matrix") # produces scatter plot above model.lm <- lm(data=data.df, y ~ x1 + x2 + x3) summary(model.lm) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 4.99599 0.02018 247.631 <2e-16 *** # x1 3.03724 0.06462 47.005 <2e-16 *** # x2 -0.02436 0.06462 -0.377 0.706 # x3 0.49185 0.02018 24.378 <2e-16 ***  This sample size is sufficiently large to overcome multicollinearity issues in the estimation of coefficients for $x_1$ and $x_2$. The coefficient of $x_2$ is estimated near zero, and with non-significant p-value. The true coefficient is zero. The intercept and the slopes for $x_1$ and $x_3$ are estimated near their true values of 5, 3 and 0.5 respectively. Note that $x_3$ is correctly found to be a significant predictor, even though this is less than obvious from the scatter matrix. And here is an example which is even worse: Sigma <- matrix(c(1,0,0,0.5,0,1,0,0.5,0,0,1,0.5,0.5,0.5,0.5,1),4,4) N <- 1e4 x <- mvrnorm(n=N, c(0,0,0,0), Sigma, empirical=TRUE) data.df <- data.frame(x1=x[,1], x2=x[,2], x3=x[,3], x4=x[,4]) # y depends on x1, x2 and x3 but not directly on x4 data.df$y <- with(data.df, 5 + x1 + x2 + x3) + rnorm(N, sd=2)

round(cor(data.df), 3)
#       x1    x2    x3    x4     y
# x1 1.000 0.000 0.000 0.500 0.387
# x2 0.000 1.000 0.000 0.500 0.391
# x3 0.000 0.000 1.000 0.500 0.378
# x4 0.500 0.500 0.500 1.000 0.583
# y  0.387 0.391 0.378 0.583 1.000

pairs(~y+x1+x2+x3+x4,data=data.df, main="Scatterplot matrix")

model.lm <- lm(data=data.df, y ~ x1 + x2 + x3 +x4)
summary(model.lm)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  4.98117    0.01979 251.682   <2e-16 ***
# x1           0.99874    0.02799  35.681   <2e-16 ***
# x2           1.00812    0.02799  36.016   <2e-16 ***
# x3           0.97302    0.02799  34.762   <2e-16 ***
# x4           0.06002    0.03958   1.516    0.129


Here $y$ depends on the (uncorrelated) predictors $x_1$, $x_2$ and $x_3$ – in fact the true regression slope is one for each. It does not depend on a fourth variable, $x_4$, but because of the way that variable is correlated with each of $x_1$, $x_2$ and $x_3$, it would be $x_4$ that stands out in the scatterplot and correlation matrices (its correlation with $y$ is 0.583, while the others are below 0.4). So selecting the variable with the highest correlation with $y$ can actually find the variable that does not belong in the model at all.