# When is it inappropriate to control for a variable?

I can think of at least one naive example. Suppose I want to study the relationship between X and Z. I also suspect that Y influences Z, so I control for Y. However, as it turns out, unbeknownst to me, X causes Y, and Y causes Z. Therefore, by controlling for Y, I “cover up” the relationship between X and Z, since X is independent of Z given Y.

Now, in the previous example, it may be the case that the relationships I should be studying are the ones between X and Y, and between Y and Z. However, if I knew such things a priori, I wouldn’t be doing science in the first place. The study that I DID do now suggests that there is no relationship between X and Z, which is not the case: X and Z ARE related.
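To make the masking concrete, here is a minimal simulation sketch of the chain X → Y → Z (the coefficients and sample size are arbitrary choices, not from the question), using plain least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain X -> Y -> Z from the example; all coefficients are hypothetical.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # X causes Y
z = 3.0 * y + rng.normal(size=n)   # Y causes Z (so the total X -> Z effect is 6)

def ols_coefs(columns, target):
    """Ordinary least squares with an intercept; returns all coefficients."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    return np.linalg.lstsq(design, target, rcond=None)[0]

total = ols_coefs([x], z)[1]        # Z ~ X: recovers the total effect (~6)
adjusted = ols_coefs([x, y], z)[1]  # Z ~ X + Y: X coefficient driven to ~0

print(f"Z ~ X:     coefficient on X = {total:.2f}")
print(f"Z ~ X + Y: coefficient on X = {adjusted:.2f}")
```

Controlling for the mediator Y makes the entirely real X–Z relationship vanish from the regression, exactly the “cover up” described above.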

This is illustrated in the following dependence diagram. In the right scenario, Z depends on X and Y, and X and Y are independent; here we rightly control for Y to determine the relationship between X and Z. In the left scenario, Z depends on Y, which in turn depends on X. Since X and Z are independent given Y, the relationship between X and Z is “covered up” by controlling for Y.

My question is basically “When is it appropriate to control for variable Y, and when is it not?” It may be difficult or impossible to fully investigate the relationship between X and Y, but, for instance, holding Y fixed at a given level is an option. How do we decide before conducting our study, and what are common pitfalls of controlling for too much or too little?

Citations appreciated.

Conditioning (i.e. adjusting) the probabilities of some outcome given some predictor on third variables is widely practiced but, as you rightly point out, may actually introduce bias into the resulting estimate as a representation of a causal effect. This can happen even with “classical” definitions of a potential confounder, because both the confounder itself and the predictor of interest may each have further causes upstream. In the DAG below, for example, $L$ is a classic confounder of the causal effect of $E$ on $D$, because (1) it causes, and is therefore associated with, $E$, and (2) it is associated with $D$, since it is associated with $U_{2}$, which is associated with $D$. However, either conditioning or stratifying $P(D|E)$ on $L$ will produce a biased estimate of the causal effect of $E$ on $D$: $L$ is confounded with $D$ by the unmeasured variable $U_{2}$ and with $E$ by the unmeasured variable $U_{1}$, which makes $L$ a collider on the path $E \leftarrow U_{1} \rightarrow L \leftarrow U_{2} \rightarrow D$, so conditioning on it opens that path.
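A quick numerical sketch of this DAG (all coefficients are hypothetical, chosen only to make the bias visible) shows that both the crude and the $L$-adjusted estimates miss the true effect of $E$ on $D$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Structure mirroring the DAG in the text:
# U1 -> L, U1 -> E, U2 -> L, U2 -> D, L -> E, plus a true effect E -> D.
u1 = rng.normal(size=n)                 # unmeasured
u2 = rng.normal(size=n)                 # unmeasured
l = u1 + u2 + rng.normal(size=n)
e = u1 + l + rng.normal(size=n)
beta = 1.0                              # true causal effect of E on D (assumed)
d = beta * e + u2 + rng.normal(size=n)

def ols_coefs(columns, target):
    design = np.column_stack([np.ones(len(target))] + list(columns))
    return np.linalg.lstsq(design, target, rcond=None)[0]

crude = ols_coefs([e], d)[1]        # biased: open path E <- L <- U2 -> D
adjusted = ols_coefs([e, l], d)[1]  # also biased: conditioning on the collider
                                    # L opens E <- U1 -> L <- U2 -> D

print(f"true effect:         {beta}")
print(f"crude estimate:      {crude:.2f}")     # roughly 1.14 here
print(f"L-adjusted estimate: {adjusted:.2f}")  # roughly 0.80 here
```

With these coefficients the crude estimate is biased upward and the adjusted estimate downward; neither recovers the true effect, since no measured set of variables blocks every backdoor path in this graph.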

Understanding which variables to condition or stratify one’s analysis on to obtain an unbiased causal estimate requires careful consideration of the possible DAGs, using the criteria for causal effect identifiability described by Pearl, Robins, and others (notably the backdoor criterion: every backdoor path between exposure and outcome must be blocked, without conditioning on colliders that open new paths). There are no shortcuts. Learn common confounding patterns. Learn common selection bias patterns. Practice.
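For contrast, here is a sketch of the textbook case where adjustment IS the right move: Y is a genuine, measured confounder (Y → X, Y → Z, X → Z; the coefficients are again arbitrary), and conditioning on Y blocks the only backdoor path:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Classic confounding: Y -> X, Y -> Z, X -> Z (hypothetical coefficients).
y = rng.normal(size=n)
x = y + rng.normal(size=n)
z = 2.0 * x + 3.0 * y + rng.normal(size=n)  # true effect of X on Z is 2

def ols_coefs(columns, target):
    design = np.column_stack([np.ones(len(target))] + list(columns))
    return np.linalg.lstsq(design, target, rcond=None)[0]

crude = ols_coefs([x], z)[1]        # biased upward by the backdoor X <- Y -> Z
adjusted = ols_coefs([x, y], z)[1]  # conditioning on Y blocks the backdoor

print(f"crude:    {crude:.2f}")     # roughly 3.5 here
print(f"adjusted: {adjusted:.2f}")  # roughly 2.0 here
```

Whether conditioning on a variable helps or hurts is not a property of the variable itself but of its position in the graph, which is why drawing the DAG before the analysis is the essential step.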

References

Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10(1):37–48.

Hernán, M. A. and Robins, J. M. (2018). Causal Inference. Chapman & Hall/CRC, Boca Raton, FL.

Maldonado, G. and Greenland, S. (2002). Estimating causal effects. International Journal of Epidemiology, 31(2):422–438.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.