Back in April, I attended a talk at the UMD Math Department Statistics group seminar series called “To Explain or To Predict?”. The talk was given by Prof. Galit Shmueli, who teaches at UMD’s Smith Business School. Her talk was based on research she did for a paper titled “Predictive vs. Explanatory Modeling in IS Research”, and a follow-up working paper titled “To Explain or To Predict?”.
Dr. Shmueli’s argument is that the terms predictive and explanatory in a statistical modeling context have become conflated, and that the statistical literature lacks a thorough discussion of the differences. In the paper, she contrasts the two and discusses their practical implications. I encourage you to read the papers.
The questions I’d like to pose to the practitioner community are:
- How do you define a predictive exercise vs an explanatory/descriptive one? It would be useful if you could talk about the specific
- Have you ever fallen into the trap of using one when meaning to use the other? I certainly have. How do you know which one to use?
In one sentence
Predictive modelling is all about “what is likely to happen?”, whereas explanatory modelling is all about “what can we do about it?”
In many sentences
I think the main difference is what is intended to be done with the analysis. I would suggest explanation is much more important for intervention than prediction. If you want to do something to alter an outcome, then you had best be looking to explain why it is the way it is. Explanatory modelling, if done well, will tell you how to intervene (which input should be adjusted). However, if you simply want to understand what the future will be like, without any intention (or ability) to intervene, then predictive modelling is more likely to be appropriate.
As an incredibly loose example, consider “cancer data”.
Predictive modelling using “cancer data” would be appropriate (or at least useful) if you were funding the cancer wards of different hospitals. You don’t really need to explain why people get cancer; you only need an accurate estimate of how many services will be required. Explanatory modelling probably wouldn’t help much here. For example, knowing that smoking leads to a higher risk of cancer doesn’t on its own tell you whether to give more funding to ward A or ward B.
Explanatory modelling of “cancer data” would be appropriate if you wanted to decrease the national cancer rate; predictive modelling would be of little use here. The ability to accurately predict cancer rates is hardly likely to help you decide how to reduce them. However, knowing that smoking leads to a higher risk of cancer is valuable information, because if you decrease smoking rates (e.g. by making cigarettes more expensive), more people end up at lower risk, which (hopefully) leads to a decrease in cancer rates.
Looking at the problem this way, I would think that explanatory modelling would mainly focus on variables which are under the control of the user, either directly or indirectly. There may be a need to collect other variables, but if you can’t change any of the variables in the analysis, then I doubt that explanatory modelling will be useful, except maybe to give you the desire to gain control or influence over those variables which are important. Predictive modelling, crudely, just looks for associations between variables, whether controlled by the user or not. You only need to know the inputs (features, independent variables, etc.) to make a prediction, but you need to be able to modify or influence those inputs in order to intervene and change an outcome.
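To make that last point a bit more concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the "stained fingers" variable, the probabilities, and the function names are my own assumptions, not anything from Dr. Shmueli's papers or from real cancer data. In this toy world, smoking raises cancer risk and also stains fingers, while stained fingers themselves do nothing to cancer risk.

```python
import random

def simulate(n, seed=0, force_smokes=None, force_stained=None):
    """Simulate n hypothetical people as (smokes, stained, cancer) tuples.

    force_smokes / force_stained mimic an intervention: when set, that
    variable is fixed for everyone instead of arising naturally.
    All probabilities below are invented for illustration only.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        # Always draw three numbers so runs with the same seed are comparable.
        u_smoke, u_stain, u_cancer = rng.random(), rng.random(), rng.random()
        smokes = (u_smoke < 0.30) if force_smokes is None else force_smokes
        # Stained fingers are caused by smoking ...
        natural_stain = u_stain < (0.80 if smokes else 0.05)
        stained = natural_stain if force_stained is None else force_stained
        # ... but cancer risk depends on smoking only, not on staining.
        cancer = u_cancer < (0.20 if smokes else 0.02)
        people.append((smokes, stained, cancer))
    return people

def cancer_rate(people):
    """Overall share of people with cancer."""
    return sum(cancer for _, _, cancer in people) / len(people)

def cancer_rate_given_stained(people, stained_value):
    """Share with cancer among people whose 'stained' flag equals stained_value."""
    group = [cancer for _, stained, cancer in people if stained == stained_value]
    return sum(group) / len(group)

observed = simulate(20000)                          # no intervention
washed = simulate(20000, force_stained=False)       # intervene on the proxy
no_smoking = simulate(20000, force_smokes=False)    # intervene on the cause

# Prediction: stained fingers are associated with cancer in the observed data.
print("cancer rate | stained:    ", cancer_rate_given_stained(observed, True))
print("cancer rate | not stained:", cancer_rate_given_stained(observed, False))

# Intervention: only changing the causal input moves the outcome.
print("overall rate, observed:   ", cancer_rate(observed))
print("overall rate, washed:     ", cancer_rate(washed))
print("overall rate, no smoking: ", cancer_rate(no_smoking))
```

In this toy setup, stained fingers are a perfectly reasonable input for a predictive model, since people with stained fingers have cancer far more often. But forcing everyone's fingers clean leaves the cancer rate exactly where it was, whereas removing smoking lowers it. That is the sense in which prediction only needs associations, while intervention needs an input that actually drives the outcome and that you can change.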