This question references Galit Shmueli’s paper “To Explain or to Predict”.
Specifically, in section 1.5, “Explaining and Prediction are Different”, Professor Shmueli writes:
In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory.
This has puzzled me each time I’ve read the paper. In what sense does minimizing the bias in estimates give the most accurate representation of the underlying theory?
I also watched Professor Shmueli’s talk, delivered at the JMP Discovery Summit 2017, in which she states:
…things that are like shrinkage models, ensembles, you will never see those. Because those models, by design, introduce bias in order to reduce the overall bias/variance. That’s why they won’t be there, it doesn’t make any theoretical sense to do that. Why would you make your model biased on purpose?
This doesn’t really shed light on my question; it simply restates the claim that I don’t understand.
If the theory has many parameters, and we have scant data to estimate them, the estimation error will be dominated by variance. Why would it be inappropriate to use a biased estimation procedure like ridge regression (resulting in biased estimates of lower variance) in this situation?
This is indeed a great question, and answering it requires a tour of how statistical models are used in econometric and social science research (from what I have seen, applied statisticians and data miners who do descriptive or predictive work typically don’t deal with bias of this form). The term “bias” that I used in the article refers to what econometricians and social scientists treat as a serious danger to inferring causality from empirical studies: the difference between your statistical model and the causal theoretical model that underlies it. A related term is “model specification”, a topic taught heavily in econometrics because of the importance of “correctly specifying your regression model” (with respect to the theory) when your goal is causal explanation. See the Wikipedia article on Specification for a brief description.

A major misspecification issue is under-specification, which leads to “Omitted Variable Bias” (OVB). Here you omit from the regression an explanatory variable that should have been there according to theory: a variable that correlates with the dependent variable and with at least one of the included explanatory variables. See this neat description, which explains the implications of this type of bias. From a theory point of view, OVB harms your ability to infer causality from the model.
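A small simulation can make the effect of OVB concrete. The sketch below is only an illustration and is not taken from the paper: the function name `simulate_ovb`, the coefficient values, and the sample size are all assumptions chosen for clarity. It generates data from a "true" model with two correlated regressors, then compares the coefficient on `x1` when `x2` is included and when it is omitted. In this linear setup the short regression's coefficient converges to `beta1 + beta2 * delta` rather than `beta1`.

```python
import numpy as np

def simulate_ovb(n=100_000, beta1=1.0, beta2=2.0, delta=0.8, seed=0):
    """Return the OLS coefficient on x1 from the full and the short regression.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = delta * x1 + rng.normal(size=n)      # x2 is correlated with x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    ones = np.ones(n)
    X_full = np.column_stack([ones, x1, x2])  # correctly specified model
    X_short = np.column_stack([ones, x1])     # x2 omitted (under-specified)

    b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
    return b_full[1], b_short[1]

full_b1, short_b1 = simulate_ovb()
print(f"coefficient on x1, full model : {full_b1:.3f}")   # close to beta1
print(f"coefficient on x1, short model: {short_b1:.3f}")  # close to beta1 + beta2 * delta
```

With these assumed values, the short model attributes to `x1` part of the effect that actually belongs to `x2`, which is exactly why its coefficient cannot be read as the causal effect of `x1`.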
In the appendix of my paper To Explain or To Predict? there’s an example showing how an underspecified (“wrong”) model can sometimes have higher predictive power. But hopefully you can now see why that conflicts with the goal of a “good causal explanatory model”.
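To see how an under-specified model can predict better, here is a minimal Monte Carlo sketch. It is not the derivation from the appendix; the function name `compare_prediction_error` and every parameter value are assumptions. It assumes one setting where this can happen: a small training sample, noisy data, highly correlated predictors, and a small true coefficient on the omitted variable. Under those assumptions, the variance saved by not estimating the extra coefficient outweighs the bias introduced by omitting it.

```python
import numpy as np

def compare_prediction_error(n_train=20, n_test=500, reps=2000,
                             beta1=1.0, beta2=0.2, rho=0.9, sigma=2.0, seed=0):
    """Average out-of-sample MSE of the full model vs. the model omitting x2.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    def draw(n):
        x1 = rng.normal(size=n)
        # x2 has unit variance and correlation rho with x1
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = beta1 * x1 + beta2 * x2 + sigma * rng.normal(size=n)
        return np.column_stack([np.ones(n), x1, x2]), y

    mse_full = mse_short = 0.0
    for _ in range(reps):
        X_tr, y_tr = draw(n_train)
        X_te, y_te = draw(n_test)
        # Full model uses intercept, x1, x2; short model drops the x2 column
        b_full, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
        b_short, *_ = np.linalg.lstsq(X_tr[:, :2], y_tr, rcond=None)
        mse_full += np.mean((y_te - X_te @ b_full) ** 2)
        mse_short += np.mean((y_te - X_te[:, :2] @ b_short) ** 2)
    return mse_full / reps, mse_short / reps

mse_full, mse_short = compare_prediction_error()
print(f"average test MSE, full model : {mse_full:.3f}")
print(f"average test MSE, short model: {mse_short:.3f}")
```

In this assumed setting the short model's average test error comes out lower, even though its coefficient on `x1` is biased as an estimate of the causal effect. Increasing `n_train` or `beta2` shifts the advantage back to the correctly specified model.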