How is it possible to obtain a good linear regression model when there is no substantial correlation between the output and the predictors?

I have trained a linear regression model on a set of variables (features), and the model performs well. However, I have realized that none of the variables is strongly correlated with the predicted variable. How is this possible?

Answer

A pair of variables may show high partial correlation (the correlation accounting for the impact of other variables) but low – or even zero – marginal correlation (pairwise correlation).

This means that the pairwise correlation between a response y and some predictor x may be of little value in identifying which variables have (linear) “predictive” value when used alongside other variables.

Consider the following data:

   y  x
1  6  6
2 12 12
3 18 18
4 24 24
5  1 42
6  7 48
7 13 54
8 19 60

The correlation between y and x is 0. The least squares line of y on x is perfectly horizontal, so its R² is 0.
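To check the zero correlation numerically, here is a minimal sketch using NumPy (my choice of library, not something from the question). The variable names `r_xy`, `slope`, `intercept` and `r2_simple` are just illustrative.

```python
import numpy as np

# The eight observations from the table above
y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)

# Marginal (pairwise) Pearson correlation between x and y
r_xy = np.corrcoef(x, y)[0, 1]

# Simple least squares line y = intercept + slope * x
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 of the simple regression: 1 - SS_residual / SS_total
fitted = intercept + slope * x
r2_simple = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("correlation:", r_xy)
print("slope:", slope, "intercept:", intercept)
print("R^2 (x only):", r2_simple)
```

Up to floating-point rounding, the correlation, the slope and the R² should all come out as zero, with the intercept equal to the mean of y (12.5).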

But when you add a new variable g, which indicates which of two groups the observations came from, x becomes extremely informative:

   y  x g
1  6  6 0
2 12 12 0
3 18 18 0
4 24 24 0
5  1 42 1
6  7 48 1
7 13 54 1
8 19 60 1

The R² of a linear regression model containing both x and g is 1, because y = x − 41g holds exactly for every observation.
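The same check for the two-predictor model, again as a rough sketch with NumPy rather than any particular regression package. It also computes the partial correlation of y and x given g by correlating the residuals left after regressing each of them on g, which is one standard way to obtain it. The helper `residuals_on_g` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# The eight observations from the table above, now with the group indicator g
y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Design matrix with an intercept column, x and g
X = np.column_stack([np.ones_like(x), x, g])

# Ordinary least squares fit of y on (1, x, g)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

# R^2 of the full model: 1 - SS_residual / SS_total
r2_full = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)


def residuals_on_g(v):
    """Residuals of v after a least squares regression on an intercept and g."""
    G = np.column_stack([np.ones_like(g), g])
    beta, *_ = np.linalg.lstsq(G, v, rcond=None)
    return v - G @ beta


# Partial correlation of y and x given g, via correlation of residuals
partial_r = np.corrcoef(residuals_on_g(x), residuals_on_g(y))[0, 1]

print("coefficients (intercept, x, g):", coef)
print("R^2 (x and g):", r2_full)
print("partial correlation of y and x given g:", partial_r)
```

The fitted coefficients should be close to 0 for the intercept, 1 for x and −41 for g, and both the R² and the partial correlation should be 1 up to rounding, even though the marginal correlation between y and x is 0.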

[Figure: plot of y vs x, with color indicating the group. There is no pairwise linear relationship overall, but within each group the relationship is perfect.]

It’s possible for this sort of thing to happen with every one of the variables in the model: each has a small pairwise correlation with the response, yet the model containing all of them is very good at predicting the response.

Additional reading:

https://en.wikipedia.org/wiki/Omitted-variable_bias

https://en.wikipedia.org/wiki/Simpson%27s_paradox

Attribution
Source: Link, Question Author: Zaratruta, Answer Author: amoeba
