Why is controlling for too many variables considered harmful?

I am trying to understand the point of the second panel in the following xkcd comic:

[comic image not shown]
Specifically, how can one be misled by controlling for too many confounding variables in one's model?

Any pointers to what this criticism is called in the literature—so I can look into it further—will be appreciated.


There is no such thing as a "sweet spot" for the number of variables to control for in order to obtain an unbiased estimate of a causal effect. Since we are talking about confounding, we must have in mind the causal effect of one particular variable on another. The standard approach is to map out the causal relationships in a graphical tool called a DAG (directed acyclic graph) and then condition on a set of variables that identifies that causal effect.

Conditioning on a variable generally blocks the flow of association along paths through it, but conditioning on a collider (a common effect) does the opposite: it induces association between variables that are not causally related. The more variables you condition on, the more likely you are to condition on a collider and thereby induce association without causation; on the other hand, the more variables you condition on, the more backdoor paths you block. The reasoning should therefore revolve not around "how many variables?" but around "which variables?" to condition on.
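As a minimal sketch of the collider point above (the variable names X, Y, Z are hypothetical, not taken from the answer's figure): two causally unrelated variables become spuriously associated once you condition on their common effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# X and Y are causally unrelated; Z is their common effect (a collider).
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = X + Y + rng.normal(size=n)

# Marginally, X and Y are (nearly) uncorrelated:
print(np.corrcoef(X, Y)[0, 1])           # close to 0

# Conditioning on the collider (regressing Y on X *and* Z)
# induces a spurious negative association between X and Y:
design = np.column_stack([np.ones(n), X, Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coef[1])                           # clearly negative, despite no causal link
```

Intuitively: knowing Z, learning that X is large makes a large Y less necessary to explain Z, hence the negative partial association.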

Below is an example in which conditioning on nothing at all is exactly what you want in order to estimate the direct causal effect of A on B. Conditioning on the set {D} or {C, D}, by contrast, biases the estimate of the direct causal effect of A on B, because conditioning on the collider D opens backdoor path(s).

[Figure: DAG for the example, in which D is a collider]
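Since the figure is not reproduced here, the following is a hedged sketch using one simple DAG consistent with the description (an assumption, not necessarily the figure's exact graph): A → B is the direct effect of interest, and D is a collider, a common effect of A and B. Adjusting for D then ruins an otherwise unbiased regression estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed DAG (the original figure is not shown): A -> B is the direct
# effect of interest, and D is a collider, a common effect of A and B
# (A -> D <- B).
A = rng.normal(size=n)
B = 1.0 * A + rng.normal(size=n)   # true direct effect of A on B is 1.0
D = A + B + rng.normal(size=n)     # collider

def ols(y, *xs):
    """OLS coefficients (with intercept) for y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_unadjusted = ols(B, A)[1]        # close to the true 1.0
b_adjusted   = ols(B, A, D)[1]     # badly biased by conditioning on D

print(f"B ~ A     -> {b_unadjusted:.2f}")
print(f"B ~ A + D -> {b_adjusted:.2f}")
```

Here the unadjusted regression is the right analysis, and "controlling for more" actively destroys the estimate, which is exactly the comic's second-panel point.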

This post can serve as a good introduction to causal reasoning with DAGs.

Source: Link, Question Author: nsimplex, Answer Author: ColorStatistics
