I am working on a predictive modeling project these days: trying to learn a model offline and then make real-time predictions from it.
I started using ridge regression recently, because I read that regularization can help to reduce the effect of multicollinearity.
However, I read this blog post today, and now I am totally confused. According to it, multicollinearity does NOT hurt a model’s predictive power that much.
So, in the end, is multicollinearity a problem or not?
It’s a problem for causal inference – or rather, it indicates difficulties in causal inference – but it’s not a particular problem for prediction/forecasting (unless it’s so extreme that it prevents model convergence or results in singular matrices, in which case you won’t get predictions at all). This, I think, is what that blog post means as well. It sounds like you may be insisting on a yes-or-no answer when the answer is that it depends. Here’s what it depends on, and why it can at least be said that (non-perfect) multicollinearity is never a reason to drop a variable from a model: any problems that multicollinearity indicates won’t go away just because you dropped a variable and stopped seeing the collinearity.
Predictors that are highly correlated with each other just don’t do as good a job of improving your predictions as they would if they were not collinear but still separately correlated with the outcome variable; neither one does much work beyond what the other is already doing and would do on its own anyway. Maybe they’re so strongly related to each other because they capture basically the same underlying construct, in which case neither one adds much on top of the other for good reason. It would then be impossible to separate them for predictive purposes anyway, since you could not manipulate the units of observation to take different values on each of the two predictors. But that doesn’t mean that including both of them in your model as-is is bad or wrong. Throw them both in, why not – any random measurement error involved will also be partly addressed by the fact that you basically have two separate measurements of the same thing, so you might get a marginal increase in predictive power for that good reason (and in this sense I think I disagree with Kjetil’s comment above).
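To make the "two measurements of the same thing" point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and not taken from the question: the latent construct, the noise level of 0.3, the true coefficient of 2.0, and the function name `simulate_test_mse` are all made up. It compares the held-out error of an OLS model that uses one noisy measurement with one that uses both.

```python
import numpy as np

def simulate_test_mse(n_train=200, n_test=5000, noise_sd=0.3, seed=0):
    """Compare held-out MSE of OLS with one vs. two collinear predictors.

    Hypothetical setup: x1 and x2 are noisy measurements of one latent
    construct, and the outcome depends only on that construct.
    """
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    latent = rng.normal(size=n)                   # shared underlying construct
    x1 = latent + noise_sd * rng.normal(size=n)   # noisy measurement 1
    x2 = latent + noise_sd * rng.normal(size=n)   # noisy measurement 2
    y = 2.0 * latent + rng.normal(size=n)         # outcome (assumed effect 2.0)

    def fit_and_score(columns):
        # Design matrix with an intercept column, split into train/test rows.
        X = np.column_stack([np.ones(n)] + columns)
        X_train, X_test = X[:n_train], X[n_train:]
        beta, *_ = np.linalg.lstsq(X_train, y[:n_train], rcond=None)
        resid = y[n_train:] - X_test @ beta
        return float(np.mean(resid ** 2))

    return {
        "corr_x1_x2": float(np.corrcoef(x1, x2)[0, 1]),
        "mse_x1_only": fit_and_score([x1]),
        "mse_both": fit_and_score([x1, x2]),
    }

result = simulate_test_mse()
print(result)
```

Under these assumptions the two predictors are strongly correlated, yet the model with both should predict at least as well out of sample as the model with one, because averaging two noisy readings recovers the latent construct a little better. How large the gain is depends entirely on the assumed noise level.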
When it comes to causal inference, it’s a problem simply because it prevents us from being able to tell, confidently at least, which of the collinear predictors is doing the predicting, and therefore the explaining and, presumably, the causing. With enough observations, you will eventually be able to identify the separate effects of even highly collinear (but never perfectly collinear) variables. This is why Rob Franzese at UMich likes to call multicollinearity “micronumerosity.” There’s always some collinearity between predictors. That’s one of the reasons why we generally just need lots of observations, sometimes an impossible number of them for our causal-inference needs. But the real problem is the complexity of the world and the unfortunate circumstances that prevent us from observing a wider variety of situations where different factors vary more in relation to each other. Multicollinearity is the symptom of that lack of useful data, and multivariate regression is the (imperfect) cure. Yet so many people seem to think of multicollinearity as something they’re doing wrong with their model, and as a reason to doubt what findings they do have.
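The "micronumerosity" point can be sketched the same way. The example below is only an illustration under assumed values: two predictors with correlation 0.95, made-up true coefficients of 1.0 and 0.5, and a hypothetical helper named `coef_standard_errors`. It fits OLS at several sample sizes and reports the classical standard errors, taken from the diagonal of sigma^2 (X'X)^-1.

```python
import numpy as np

def coef_standard_errors(n, rho=0.95, seed=0):
    """Return OLS slope estimates and their classical standard errors.

    Hypothetical setup: two predictors with correlation rho and assumed
    true slopes of 1.0 and 0.5, plus unit-variance noise.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X_raw = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = 1.0 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(size=n)

    X = np.column_stack([np.ones(n), X_raw])       # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])      # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # classical OLS covariance
    return beta[1:], np.sqrt(np.diag(cov_beta))[1:]

for n in (50, 500, 5000, 50000):
    beta, se = coef_standard_errors(n)
    print(n, np.round(beta, 3), np.round(se, 3))
```

With few observations the two slopes are poorly separated and their standard errors are large; as n grows, the standard errors shrink roughly in proportion to 1/sqrt(n) and the separate effects become identifiable, even though the correlation between the predictors never changes.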