Do we really need to include “all relevant predictors”?

A basic assumption of using regression models for inference is that “all relevant predictors” have been included in the prediction equation. The rationale is that failure to include an important real-world factor leads to biased coefficients and thus to inaccurate inferences (i.e., omitted-variable bias).

But in research practice, I have never seen anyone include anything resembling “all relevant predictors.” Many phenomena have a myriad of important causes, and it would be very difficult, if not impossible, to include them all. An off-the-cuff example is modeling depression as an outcome: no one has built anything close to a model that includes all relevant variables, e.g., parental history, personality traits, social support, income, their interactions, and so on.

Moreover, fitting such a complex model would yield highly unstable estimates unless the sample size were very large.

My question is very simple: Is the assumption/advice to “include all relevant predictors” just something that we “say” but never actually mean?
If not, then why do we give it as actual modeling advice?

And does this mean that most coefficients are probably misleading (e.g., a study of personality factors and depression that uses only a handful of predictors)? In other words, how big a problem is this for the conclusions of our sciences?


You are right: we are seldom realistic in saying “all relevant predictors”. In practice we can be satisfied with including predictors that explain the major sources of variation in $Y$. In the special case of drawing inference about a risk factor or treatment in an observational study, this is seldom good enough. For that, adjustment for confounding needs to be highly aggressive, including variables that might be related to the outcome and might be related to the treatment choice or to the risk factor you are trying to publicize.

It is interesting that with the normal linear model, omitted covariates, especially if orthogonal to the included covariates, can be thought of as merely enlarging the error term. In nonlinear models (logistic, Cox, and many others), omission of variables can bias the effects of all the variables included in the model (due, for example, to non-collapsibility of the odds ratio).
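This point can be checked empirically. The sketch below (my own illustration, not part of the original answer) simulates a logistic model with two *independent* covariates and fits it with and without the second one; despite the orthogonality, the coefficient of the retained covariate is attenuated toward zero when the other is omitted. It uses a plain Newton-Raphson fit so nothing beyond NumPy is assumed; the sample size and true coefficients are arbitrary choices.

```python
# Illustrative sketch of non-collapsibility of the odds ratio.
# x1 and x2 are independent, yet omitting x2 shrinks the estimated
# coefficient of x1 toward zero. All values here are assumptions
# chosen for the demo, not from the original post.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # independent of x1 by construction

# True conditional model: logit P(Y=1) = 1.0*x1 + 2.0*x2
p = 1 / (1 + np.exp(-(1.0 * x1 + 2.0 * x2)))
y = rng.binomial(1, p)

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        W = mu * (1 - mu)
        # Newton step: beta += (X' W X)^{-1} X' (y - mu)
        beta += np.linalg.solve((X.T * W) @ X, X.T @ (y - mu))
    return beta

X_full = np.column_stack([np.ones(n), x1, x2])
X_red = np.column_stack([np.ones(n), x1])

b_full = fit_logistic(X_full, y)   # should recover beta_x1 near 1.0
b_red = fit_logistic(X_red, y)     # beta_x1 attenuated toward 0

print(f"beta_x1, x2 included: {b_full[1]:.3f}")
print(f"beta_x1, x2 omitted:  {b_red[1]:.3f}")
```

Note that in an ordinary linear model the same omission would leave the x1 coefficient unbiased (only the residual variance would grow), which is exactly the contrast the answer draws.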

Source: Link, Question Author: ATJ, Answer Author: Frank Harrell