A few days ago, a psychologist-researcher of mine told me about his method to select variables to linear regression model. I guess it’s not good, but I need to ask someone else to make sure. The method is:
Look at correlation matrix between all variables (including Dependent
Variable Y) and choose those predictors Xs, that correlate most with Y.
He didn’t mention any criterion.
Q: Was he right?
[I think that this selection method is wrong, because of many things, like it’s the theory that says which predictors should be selected, or even omitted variable bias (OVB).]
If, for some reason, you are going to include only one variable in your model, then selecting the predictor which has the highest correlation with y has several advantages. Out of the possible regression models with only one predictor, then this model is the one with the highest standardized regression coefficient and also (since R2 is the square of r in a simple linear regression) the highest coefficient of determination.
But it’s not clear why you would want to restrict your regression model to one predictor if you have data available for several. As mentioned in the comments, just looking at the correlations doesn’t work if your model might include several variables. For example, from this scatter matrix, you might think that the predictors for y you should include in your model are x1 (correlation 0.824) and x2 (correlation 0.782) but that x3 (correlation 0.134) is not a useful predictor.
But you’d be wrong – in fact in this example, y depends on two independent variables, x1 and x3, but not directly on x2. However x2 is highly correlated with x1, which leads to a correlation with y also. Looking at the correlation between y and x2 in isolation, this might suggest x2 is a good predictor of y. But once the effects of x1 are partialled out by including x1 in the model, no such relationship remains.
require(MASS) #for mvrnorm set.seed(42) #so reproduces same result Sigma <- matrix(c(1,0.95,0,0.95,1,0,0,0,1),3,3) N <- 1e4 x <- mvrnorm(n=N, c(0,0,0), Sigma, empirical=TRUE) data.df <- data.frame(x1=x[,1], x2=x[,2], x3=x[,3]) # y depends on x1 strongly and x3 weakly, but not directly on x2 data.df$y <- with(data.df, 5 + 3*x1 + 0.5*x3) + rnorm(N, sd=2) round(cor(data.df), 3) # x1 x2 x3 y # x1 1.000 0.950 0.000 0.824 # x2 0.950 1.000 0.000 0.782 # x3 0.000 0.000 1.000 0.134 # y 0.824 0.782 0.134 1.000 # Note: x1 and x2 are highly correlated # Since y is highly correlated with x1, it is with x2 too # y depended only weakly on x3, their correlation is much lower pairs(~y+x1+x2+x3,data=data.df, main="Scatterplot matrix") # produces scatter plot above model.lm <- lm(data=data.df, y ~ x1 + x2 + x3) summary(model.lm) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 4.99599 0.02018 247.631 <2e-16 *** # x1 3.03724 0.06462 47.005 <2e-16 *** # x2 -0.02436 0.06462 -0.377 0.706 # x3 0.49185 0.02018 24.378 <2e-16 ***
This sample size is sufficiently large to overcome multicollinearity issues in the estimation of coefficients for x1 and x2. The coefficient of x2 is estimated near zero, and with non-significant p-value. The true coefficient is zero. The intercept and the slopes for x1 and x3 are estimated near their true values of 5, 3 and 0.5 respectively. Note that x3 is correctly found to be a significant predictor, even though this is less than obvious from the scatter matrix.
And here is an example which is even worse:
Sigma <- matrix(c(1,0,0,0.5,0,1,0,0.5,0,0,1,0.5,0.5,0.5,0.5,1),4,4) N <- 1e4 x <- mvrnorm(n=N, c(0,0,0,0), Sigma, empirical=TRUE) data.df <- data.frame(x1=x[,1], x2=x[,2], x3=x[,3], x4=x[,4]) # y depends on x1, x2 and x3 but not directly on x4 data.df$y <- with(data.df, 5 + x1 + x2 + x3) + rnorm(N, sd=2) round(cor(data.df), 3) # x1 x2 x3 x4 y # x1 1.000 0.000 0.000 0.500 0.387 # x2 0.000 1.000 0.000 0.500 0.391 # x3 0.000 0.000 1.000 0.500 0.378 # x4 0.500 0.500 0.500 1.000 0.583 # y 0.387 0.391 0.378 0.583 1.000 pairs(~y+x1+x2+x3+x4,data=data.df, main="Scatterplot matrix") model.lm <- lm(data=data.df, y ~ x1 + x2 + x3 +x4) summary(model.lm) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 4.98117 0.01979 251.682 <2e-16 *** # x1 0.99874 0.02799 35.681 <2e-16 *** # x2 1.00812 0.02799 36.016 <2e-16 *** # x3 0.97302 0.02799 34.762 <2e-16 *** # x4 0.06002 0.03958 1.516 0.129
Here y depends on the (uncorrelated) predictors x1, x2 and x3 – in fact the true regression slope is one for each. It does not depend on a fourth variable, x4, but because of the way that variable is correlated with each of x1, x2 and x3, it would be x4 that stands out in the scatterplot and correlation matrices (its correlation with y is 0.583, while the others are below 0.4). So selecting the variable with the highest correlation with y can actually find the variable that does not belong in the model at all.