# How is it possible to obtain a good linear regression model when there is no substantial correlation between the output and the predictors?

I have trained a linear regression model using a set of variables/features, and the model performs well. However, I have realized that no single variable correlates well with the predicted variable. How is this possible?

A pair of variables may show high partial correlation (the correlation accounting for the impact of other variables) but low – or even zero – marginal correlation (pairwise correlation).
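For the simplest case of a single additional variable, the standard first-order partial correlation formula makes this concrete. The symbols here are my own notation, not the asker's: $r_{yx}$ is the marginal correlation between y and x, and z is the one other variable being accounted for.

$$
r_{yx \cdot z} = \frac{r_{yx} - r_{yz}\, r_{xz}}{\sqrt{\left(1 - r_{yz}^2\right)\left(1 - r_{xz}^2\right)}}
$$

Even when $r_{yx} = 0$, the numerator is $-r_{yz}\, r_{xz}$, which is nonzero whenever z is correlated with both y and x, so the partial correlation can be large.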

This means that the pairwise correlation between a response y and some predictor x may be of little value in identifying variables with (linear) “predictive” value among a collection of other variables.

Consider the following data:

       y  x
    1  6  6
    2 12 12
    3 18 18
    4 24 24
    5  1 42
    6  7 48
    7 13 54
    8 19 60


The correlation between y and x is $0$. If I draw the least squares line, it’s perfectly horizontal and the $R^2$ is naturally going to be $0$.
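As a quick check of that claim, here is a minimal sketch in Python with NumPy (my choice of tool, not something the question specifies) that computes the pairwise correlation and the simple least squares line for the eight observations:

```python
import numpy as np

# The eight observations from the table above.
y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)

# Pairwise (marginal) Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

# Simple least squares line y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, 1)

print(f"correlation = {r:.6f}")
print(f"slope = {slope:.6f}, intercept = {intercept:.6f}")
```

The slope comes out as zero up to floating-point error, so the fitted line sits at the mean of y (12.5) and explains none of its variation.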

But when you add a new variable g, which indicates which of two groups the observations came from, x becomes extremely informative:

       y  x g
    1  6  6 0
    2 12 12 0
    3 18 18 0
    4 24 24 0
    5  1 42 1
    6  7 48 1
    7 13 54 1
    8 19 60 1


The $R^2$ of a linear regression model with both the x and g variables in it will be $1$, because $y = x - 41g$ holds exactly for every observation.
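The two-predictor fit can be reproduced with ordinary least squares. This is a sketch using `numpy.linalg.lstsq` on a hand-built design matrix; the variable names are mine, and any regression routine should give the same result.

```python
import numpy as np

y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Design matrix with an intercept column, x, and the group indicator g.
X = np.column_stack([np.ones_like(x), x, g])

# Ordinary least squares coefficients: [intercept, coef_x, coef_g].
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 = 1 - residual sum of squares / total sum of squares.
fitted = X @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("coefficients:", np.round(beta, 6))
print("R^2:", round(r2, 6))
```

Within each group y rises one-for-one with x, and the coefficient on g absorbs the shift between the two groups, which is why the fit is exact.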

It’s possible for this sort of thing to happen with every one of the variables in the model: each can have a small pairwise correlation with the response, yet the model containing all of them together can be very good at predicting the response.
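The small example above already shows this in miniature. The sketch below (Python with NumPy, variable names my own) computes each predictor's marginal correlation with y. Neither is strong on its own, even though the model using both fits exactly.

```python
import numpy as np

y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Marginal (pairwise) correlation of each predictor with the response.
r_x = np.corrcoef(x, y)[0, 1]
r_g = np.corrcoef(g, y)[0, 1]

# Squared correlation = R^2 of a one-predictor regression on that variable.
print(f"x alone: r = {r_x:.4f}, R^2 = {r_x**2:.4f}")
print(f"g alone: r = {r_g:.4f}, R^2 = {r_g**2:.4f}")
```

Here x alone explains none of the variance and g alone explains only about 12% of it, so screening predictors one at a time by correlation would suggest that neither is worth much.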

https://en.wikipedia.org/wiki/Omitted-variable_bias