I am currently working on a problem in which we have a small dataset and are interested in the causal effect of a treatment on the outcome.

My advisor has instructed me to perform a univariate regression on each predictor, first with the outcome as the response, then with the treatment assignment as the response. I.e., I am being asked to fit a regression with one variable at a time and make a table of the results. I asked “why should we do this?”, and the answer was something to the effect of “we are interested in which predictors are associated with the treatment assignment and the outcome, as this would likely indicate a confounder”. My advisor is a trained statistician, not a scientist in a different field, so I’m inclined to trust them.

This makes sense, but it’s not clear how to use the results of the univariate analysis. Wouldn’t making model selection choices based on them lead to substantially biased estimates and overly narrow confidence intervals? Why should anyone do this? I’m confused, and my advisor was fairly opaque when I brought it up. Does anyone have resources on this technique?

(NB: my advisor has said we are NOT using p-values as a cut off, but that we want to consider “everything”.)

**Answer**

The causal context of your analysis is a key qualifier in your question. In forecasting, running univariate regressions before multiple regressions, in the spirit of the “purposeful selection method” suggested by Hosmer and Lemeshow, serves one goal: screening candidate predictors for the model. In your case, where you are building a causal model, running univariate regressions before running multiple regressions has a completely different goal. Let me expand on the latter.

You and your advisor must have a certain causal graph in mind. Causal graphs have testable implications. Your mission is to start with the dataset that you have and reason back to the causal model that might have generated it. The univariate regressions your advisor suggested most likely constitute the first step in testing the implications of the causal graph you have in mind. Suppose that you believe that your data were generated by the causal model depicted in the graph below, and that you are interested in the causal effect of D on E. The graph below suggests a host of testable implications, such as:

- E and D are likely *dependent*
- E and A are likely *dependent*
- E and C are likely *dependent*
- E and B are likely *dependent*
- E and N are likely *independent*
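The marginal checks in the list above can be sketched in a few lines of Python. This is only an illustration: the simulated graph (A → D → E, B → E, C → E, with N isolated), the coefficients, the sample size, and the helper names `simulate` and `univariate_ols` are all assumptions chosen to be consistent with the listed implications, not the actual graph or data from the question.

```python
import numpy as np

def simulate(n=2000, seed=0):
    """Draw data from a HYPOTHETICAL graph: A -> D -> E, B -> E, C -> E, N isolated."""
    rng = np.random.default_rng(seed)
    A, B, C, N = (rng.normal(size=n) for _ in range(4))
    D = 0.8 * A + rng.normal(size=n)                       # A causes D
    E = 1.0 * D + 0.7 * B + 0.7 * C + rng.normal(size=n)   # D, B, C cause E
    return {"D": D, "A": A, "C": C, "B": B, "N": N, "E": E}

def univariate_ols(y, x):
    """Return (slope, t-statistic) from regressing y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)           # classical OLS covariance
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

data = simulate()
# One univariate regression of E on each variable, collected into a small table.
results = {name: univariate_ols(data["E"], data[name])
           for name in ["D", "A", "C", "B", "N"]}
for name, (slope, t) in results.items():
    print(f"E ~ {name}: slope = {slope:+.3f}, t = {t:+.2f}")
```

Under these assumptions, the slopes on D, A, C, and B should be clearly different from zero, while the slope on N should be close to zero. Keep in mind that a linear regression only detects linear dependence, and that with a small dataset a slope near zero is weak evidence of independence.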

I mentioned that this is only the first step in the causal search process because the real fun starts once you start running multiple regressions, conditioning on different variables and testing whether the result of each regression is consistent with the implications of the graph. For example, the graph above suggests that E and A must be independent once you condition on D. In other words, if you regress E on D and A and find that the coefficient on A is not equal to zero, you’ll conclude that E depends on A even after you condition on D, and therefore that the causal graph must be wrong. The result will even give you hints as to how to alter your causal graph, because it suggests that there must be a path between A and E that is not d-separated by D. It will become important to know the testable dependence implications that chains, forks, and colliders have.
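As a rough sketch of that conditional check, the Python snippet below regresses E on D and A under two simulated scenarios: a pure chain A → D → E, where the graph's implication holds, and the same chain with an extra direct A → E edge, where it fails. The graph, the coefficients, and the helpers `simulate` and `ols` are hypothetical and only meant to illustrate the logic.

```python
import numpy as np

def simulate(n=2000, direct_a_to_e=0.0, seed=0):
    """HYPOTHETICAL chain A -> D -> E, with an optional direct A -> E edge."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=n)
    D = 0.8 * A + rng.normal(size=n)
    E = 1.0 * D + direct_a_to_e * A + rng.normal(size=n)
    return A, D, E

def ols(y, *regressors):
    """Return (coefficients, t-statistics) for y on an intercept plus regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

t_A = {}
for label, effect in [("chain only", 0.0), ("extra A -> E edge", 0.5)]:
    A, D, E = simulate(direct_a_to_e=effect)
    beta, t = ols(E, D, A)   # columns: intercept, D, A
    t_A[label] = t[2]        # t-statistic of A after conditioning on D
    print(f"{label}: coef on A = {beta[2]:+.3f}, t = {t[2]:+.2f}")
```

In the first scenario the coefficient on A should be close to zero once D is in the regression, which is consistent with the chain. In the second it should be clearly non-zero, which is the kind of result that would tell you the assumed graph is missing a path between A and E.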

**Attribution** *Source: Link, Question Author: Marcel, Answer Author: ColorStatistics*