I am currently assessing multicollinearity in my datasets.
What threshold values of VIF and condition index below/above suggest a problem?
I have heard that VIF ≥ 10 is a problem.
After removing two problem variables, VIF is ≤ 3.96 for each variable.
Do the variables need more treatment or does this VIF seem fine?
I have heard that a Condition Index (CI) of 30 or more is a problem.
My highest CI is 16.66. Is this a problem?
- Are there any other dos and don'ts that need to be considered?
- Are there any other things that I need to keep in mind?
The multicollinearity problem is well covered in most econometrics textbooks. Moreover, there is a good Wikipedia article that summarizes most of the key issues.
In practice, one starts to worry about multicollinearity when it causes visible signs of parameter instability (most of which stem from the non-invertibility or poor conditioning of the X^TX matrix):
- large changes in parameter estimates while performing rolling regressions or estimates on smaller sub-samples of the data
- parameter estimates that turn out individually insignificant (by t tests) even though the F test for the regression as a whole shows high joint significance
- The acceptable VIF depends on the tolerance level you require. For regressor j, the tolerance is 1 - R_j^2 and VIF_j = 1/(1 - R_j^2), where R_j^2 comes from the auxiliary regression of that regressor on all the others. Most practical suggestions flag a problem when the tolerance falls below 0.2 or 0.1, i.e. when the auxiliary R^2 exceeds 0.8 or 0.9, which corresponds to the rule-of-thumb VIF values of 5 and 10. In small samples (fewer than 50 points) the stricter cutoff of 5 is preferable; in larger samples you can allow larger values.
- The condition index is an alternative to VIF. In your case neither VIF nor CI indicates a remaining problem, so you may be satisfied with this result statistically, but…
probably not theoretically, since it may happen (and usually does) that you need all the variables to be present in the model. Excluding relevant variables (the omitted-variable problem) makes the parameter estimates biased and inconsistent anyway. On the other hand, you may be forced to include all the focus variables simply because your analysis is built on them. In a data-mining approach, though, you have more freedom to search technically for the best fit.
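For reference, the two diagnostics discussed above can be computed by hand. The following is only a minimal NumPy sketch on simulated data, not a definitive implementation: the function names `vif` and `condition_indices` are made up for illustration, and the condition indices assume one common convention (intercept included, columns scaled to unit length, no centering). Other software may scale differently and report somewhat different numbers.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2) for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Auxiliary regression: column j on an intercept and all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_indices(X):
    """Condition indices: largest singular value divided by each singular value.

    Assumed convention: add an intercept column and scale every column to
    unit length before taking the singular values.
    """
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([np.ones(X.shape[0]), X])
    Z = Z / np.linalg.norm(Z, axis=0)
    s = np.linalg.svd(Z, compute_uv=False)
    return s.max() / s

# Simulated example: x3 is almost an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
ci = condition_indices(X)
print("VIF per column:", np.round(vifs, 1))
print("largest condition index:", round(float(ci.max()), 1))
```

On data like this both diagnostics should land well above the usual cutoffs; on your own data you would compare the output with the 5/10 (VIF) and 30 (CI) rules of thumb.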
So keep in mind the alternatives (that I would use myself):
- obtain more data points (recall that the VIF requirements are less strict for larger data sets, and slowly varying explanatory variables may still change at some crucial points in time or in the cross-section)
- search for latent factors through principal components (these are orthogonal combinations, so not multicollinear by construction; moreover, they involve all the explanatory variables)
- ridge regression (it introduces a small bias into the parameter estimates, but makes them much more stable)
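The last two alternatives can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a recommended implementation: the helper names `standardize`, `ridge` and `pcr`, the simulated data, and the penalty value `lam=10.0` are all hypothetical choices, and in real work the penalty and the number of components would be picked by cross-validation or a similar criterion.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    """Ridge coefficients on standardized X and centered y.

    Solves (Z'Z + lam * I) b = Z'y; lam = 0 gives ordinary least squares.
    """
    Z = standardize(X)
    yc = y - y.mean()
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ yc)

def pcr(X, y, n_components):
    """Principal components regression on standardized X and centered y.

    Returns the coefficients mapped back to the standardized regressors
    and the component scores used in the regression.
    """
    Z = standardize(X)
    yc = y - y.mean()
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T
    scores = Z @ V  # mutually orthogonal columns
    # scores' scores is diagonal with entries s^2, so OLS is a simple ratio.
    gamma = (scores.T @ yc) / s[:n_components] ** 2
    return V @ gamma, scores

# Simulated collinear data: x3 is nearly x1 + x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(size=n)

b_ols = ridge(X, y, lam=0.0)       # unpenalized fit for comparison
b_ridge = ridge(X, y, lam=10.0)    # penalty value chosen arbitrarily
b_pcr, scores = pcr(X, y, n_components=2)

print("OLS:  ", np.round(b_ols, 3))
print("ridge:", np.round(b_ridge, 3))
print("PCR:  ", np.round(b_pcr, 3))
```

Dropping the smallest component in `pcr` removes exactly the near-degenerate direction, while `ridge` keeps every direction but shrinks the coefficients; both trade a little bias for stability.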
Some other tricks are in the wiki article noted above.