When can we speak of collinearity?

In linear models we need to check whether relationships exist among the explanatory variables. If they correlate too strongly, there is collinearity (i.e., the variables partly explain each other). At the moment I am only looking at the pairwise correlation between each pair of explanatory variables.

Question 1:
What counts as too much correlation? For example, is a Pearson correlation of 0.5 too much?

Question 2:
Can we fully determine whether there is collinearity between two variables based on the correlation coefficient, or does it depend on other factors?

Question 3:
Does a graphical check of the scatterplot of the two variables add anything to what the correlation coefficient indicates?

Answer

  1. There is no ‘bright line’ between not too much collinearity and too much collinearity (except in the trivial sense that $r = 1.0$ is definitely too much). Analysts would not typically think of $r = .50$ as too much collinearity between two variables. A common rule of thumb for multicollinearity is that you have too much when the VIF is greater than 10 (this is probably because we have 10 fingers, so take such rules of thumb for what they’re worth). Since, with two predictors, $\text{VIF} = 1/(1 - r^2)$, a VIF of 10 corresponds to $r^2 = .90$, i.e., $r \approx .95$. The implication would be that you have too much collinearity between two variables if $r \ge .95$ (see the first sketch after this list). You can read more about the VIF and multicollinearity in my answer here: What is the effect of having correlated predictors in a multiple regression model?

  2. This depends on what you mean by “fully determine”. If the correlation between two variables were $r \ge .95$, then most data analysts would say you had problematic collinearity. However, you can have multiple variables where no two variables have a pairwise correlation that high, and still have problematic collinearity hidden amongst the whole set of variables (the second sketch below constructs such a case). This is where other metrics, such as the VIFs and condition numbers, come in handy. You can read more on this topic at my question here: Is there a reason to prefer a specific measure of multicollinearity?

  3. It is always smart to look at your data, and not simply at numerical summaries / test results. The canonical reference here is Anscombe’s quartet (reproduced numerically in the last sketch below).
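
To make the VIF-to-$r$ conversion in point 1 concrete, here is a minimal sketch (the particular values of $r$ are just illustrative) that evaluates $\text{VIF} = 1/(1 - r^2)$ for a few correlations:

```python
# Minimal sketch: for two predictors, VIF = 1 / (1 - r^2), so the
# "VIF > 10" rule of thumb translates into a pairwise correlation
# of roughly r = .95.
for r in (0.50, 0.90, 0.95, 0.99):
    vif = 1.0 / (1.0 - r ** 2)
    print(f"r = {r:.2f}  ->  VIF = {vif:6.2f}")

# Output:
# r = 0.50  ->  VIF =   1.33   (nowhere near the cutoff)
# r = 0.90  ->  VIF =   5.26
# r = 0.95  ->  VIF =  10.26   (right at the rule-of-thumb cutoff of 10)
# r = 0.99  ->  VIF =  50.25
```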
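
The hidden-collinearity situation in point 2 can be simulated. The following sketch is my own illustration (simulated data, arbitrary variable names, and statsmodels' `variance_inflation_factor` for the VIFs): no pairwise correlation exceeds roughly $.6$, yet one predictor is nearly a linear combination of the others, so every VIF is large.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))            # three independent predictors
x4 = x1 + x2 + x3 + rng.normal(scale=0.2, size=n)  # hidden near-dependency

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})
print(X.corr().round(2))   # all pairwise |r| are around .58 or less

# VIFs are computed with an intercept in the model; skip the constant's column.
Xc = sm.add_constant(X)
vifs = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]
print(dict(zip(X.columns, np.round(vifs, 1))))
# All four VIFs come out large (roughly 25 for x1-x3 and 75+ for x4),
# even though no single pairwise correlation looks alarming.
```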
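
And for point 3, a short sketch using the classic Anscombe quartet values: all four $x$–$y$ pairs have essentially identical correlations (about $.816$), yet their scatterplots reveal completely different relationships, which the single number hides.

```python
import numpy as np

# The four data sets of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}
for name, y in ys.items():
    x = x4 if name == "IV" else x123
    r = np.corrcoef(x, y)[0, 1]
    print(f"set {name:>3}: r = {r:.3f}")   # ~0.816 for every set
```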

Attribution
Source: Link, Question Author: Stefan, Answer Author: Community
