# Why does propensity score matching work for causal inference?

Propensity score matching is used to make causal inferences in observational studies (see the Rosenbaum / Rubin paper). What’s the simple intuition behind why it works?

In other words, why is it that once we make sure the probability of receiving the treatment is equal for the two groups, the confounding effects disappear and we can use the result to draw causal conclusions about the treatment?

I’ll try to give you an intuitive understanding with minimal emphasis on the mathematics.

The main problem with observational data and analyses that stem from it is confounding. Confounding occurs when a variable affects not only the treatment assigned but also the outcomes. When a randomized experiment is performed, subjects are randomized to treatments so that, on average, the subjects assigned to each treatment should be similar with respect to the covariates (age, race, gender, etc.). As a result of this randomization, it’s unlikely (especially in large samples) that differences in the outcome are due to any covariates; they are far more plausibly due to the treatment applied, since, on average, the covariates in the treatment groups are similar.

On the other hand, with observational data there is no random mechanism that assigns subjects to treatments. Take for example a study to examine the survival rates of patients following a new heart surgery compared to a standard surgical procedure. Typically one cannot randomize patients to each procedure for ethical reasons. As a result, patients and doctors self-select into one of the treatments, often for reasons related to the patients’ covariates. For example, the new procedure might be somewhat riskier if you are older, and as a result doctors might recommend the new treatment more often to younger patients. If this happens and you look at survival rates, the new treatment might appear to be more effective, but this would be misleading, since younger patients were more likely to be assigned to this treatment and younger patients tend to live longer, all else being equal. This is where propensity scores come in handy.
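To make the surgery example concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the function name `simulate_patients`, the age range, and the two probability formulas are assumptions, not numbers from any real study. The new surgery is given no effect at all on survival, yet the raw comparison still favors it, because younger patients are both more likely to get it and more likely to survive.

```python
import random

def simulate_patients(n=20000, seed=0):
    """Return (age, treated, survived) tuples from an assumed toy model."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        age = rng.uniform(40, 80)
        # Assumption: doctors recommend the new surgery less often with age
        # (probability 0.9 at age 40, falling to 0.1 at age 80).
        p_new_surgery = 0.9 - 0.02 * (age - 40)
        # Assumption: survival depends on age only, NOT on the surgery
        # (probability 0.95 at age 40, falling to 0.55 at age 80).
        p_survive = 0.95 - 0.01 * (age - 40)
        treated = rng.random() < p_new_surgery
        survived = rng.random() < p_survive
        patients.append((age, treated, survived))
    return patients

def mean(values):
    return sum(values) / len(values)

patients = simulate_patients()
treated_age = mean([a for a, t, _ in patients if t])
control_age = mean([a for a, t, _ in patients if not t])
naive_difference = (mean([s for _, t, s in patients if t])
                    - mean([s for _, t, s in patients if not t]))

print(f"mean age, new surgery:      {treated_age:.1f}")
print(f"mean age, standard surgery: {control_age:.1f}")
print(f"naive survival difference:  {naive_difference:.3f}")
```

Under these assumed numbers the naive difference should come out clearly positive (roughly 0.1), even though the true effect in the simulation is exactly zero. The gap is produced entirely by the age imbalance between the two groups.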

Propensity scores help with a central problem of causal inference from observational data: subjects were not randomized to treatments, so confounding, rather than the intervention or treatment alone, may be the cause of the “effects” you are seeing. If you were able to somehow modify your analysis so that the covariates (say age, sex, health status) were “balanced” between the treatment groups, you would have strong evidence that the difference in outcomes is due to the intervention/treatment rather than these covariates. A propensity score is a subject’s probability of being assigned to the treatment given the set of observed covariates; in practice it is estimated from the data, often with a logistic regression. If you then match on these probabilities (propensity scores), then what you have done is taken subjects who were equally likely to be assigned to the treatment and compared them with one another, effectively comparing apples to apples.
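Here is a rough sketch of what "estimate the score, then match on it" can look like in code. It is only an illustration under stated assumptions: the data are simulated from a made-up model with two covariates (a rescaled age and a poor-health flag) and an assumed true treatment effect of +0.05 on survival probability; the logistic regression is fit with plain gradient ascent as a stand-in for a statistics library; and the matching is 1-nearest-neighbor on the estimated score, with replacement, which is one of several possible matching schemes. All function names are hypothetical.

```python
import bisect
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate(n=6000, seed=1):
    """Made-up observational data: (z_age, poor_health, treated, survived)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z_age = rng.uniform(-2.0, 2.0)  # (age - 60) / 10, i.e. ages 40 to 80
        poor_health = 1.0 if rng.random() < 0.3 else 0.0
        # Assumption: younger, healthier patients get the new surgery more often.
        treated = rng.random() < sigmoid(-z_age - poor_health)
        # Assumption: the true effect of treatment on survival is +0.05.
        p_survive = 0.74 - 0.08 * z_age - 0.15 * poor_health + 0.05 * treated
        survived = 1.0 if rng.random() < p_survive else 0.0
        rows.append((z_age, poor_health, treated, survived))
    return rows

def fit_propensity(rows, steps=200, lr=2.0):
    """Logistic regression of treatment on the covariates (gradient ascent)."""
    beta = [0.0, 0.0, 0.0]  # intercept, z_age, poor_health
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for z_age, poor_health, treated, _ in rows:
            p = sigmoid(beta[0] + beta[1] * z_age + beta[2] * poor_health)
            resid = treated - p
            grad[0] += resid
            grad[1] += resid * z_age
            grad[2] += resid * poor_health
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def naive_difference(rows):
    treated = [y for _, _, t, y in rows if t]
    control = [y for _, _, t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

def matched_difference(rows, beta):
    """Pair each treated unit with the control whose estimated score is nearest."""
    scored = [(sigmoid(beta[0] + beta[1] * z + beta[2] * h), t, y)
              for z, h, t, y in rows]
    controls = sorted((e, y) for e, t, y in scored if not t)
    control_scores = [e for e, _ in controls]
    diffs = []
    for e, t, y in scored:
        if not t:
            continue
        i = bisect.bisect_left(control_scores, e)
        # The nearest control score sits at position i - 1 or i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(control_scores[k] - e))
        diffs.append(y - controls[j][1])
    return sum(diffs) / len(diffs)

rows = simulate()
beta = fit_propensity(rows)
naive = naive_difference(rows)
matched = matched_difference(rows, beta)
print(f"naive difference:   {naive:.3f}")
print(f"matched difference: {matched:.3f}  (assumed true effect: 0.05)")
```

In this toy setup the naive difference should overstate the benefit, while the matched difference should land much closer to the assumed 0.05. Note that two covariates were collapsed into a single number per subject before matching, which is the practical appeal of the score.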

You may ask why not exactly match on the covariates (e.g. make sure you match 40-year-old men in good health in treatment 1 with 40-year-old men in good health in treatment 2)? This works fine for large samples and a few covariates, but it becomes nearly impossible to do when the sample size is small and the number of covariates is even moderately sized (see the curse of dimensionality on Cross Validated for why this is the case).
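A quick way to see the problem with exact matching is to count how often an exact match even exists as covariates are added. The sketch below is an illustration only: it assumes 500 treated and 500 control subjects, with up to six independent categorical covariates of five levels each, all drawn uniformly at random. The function name `exact_match_rates` and these settings are hypothetical.

```python
import random

def exact_match_rates(n_per_group=500, max_covariates=6, levels=5, seed=0):
    """Share of treated subjects with at least one exactly matching control,
    when matching on the first k covariates, for k = 1..max_covariates."""
    rng = random.Random(seed)

    def draw_profile():
        return tuple(rng.randrange(levels) for _ in range(max_covariates))

    treated = [draw_profile() for _ in range(n_per_group)]
    controls = [draw_profile() for _ in range(n_per_group)]

    rates = {}
    for k in range(1, max_covariates + 1):
        # Distinct control profiles on the first k covariates.
        control_profiles = {c[:k] for c in controls}
        matched = sum(t[:k] in control_profiles for t in treated)
        rates[k] = matched / n_per_group
    return rates

rates = exact_match_rates()
for k, rate in rates.items():
    print(f"{k} covariates: {5 ** k:>6} possible profiles, "
          f"{rate:.0%} of treated have an exact match")
```

With six five-level covariates there are 15,625 possible profiles but only 500 controls to fill them, so most treated subjects have no exact counterpart. Matching on a single score avoids having to fill every cell.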

Now, all this being said, the Achilles heel of propensity score methods is the assumption of no unobserved confounders. This assumption states that you have not failed to include any covariates in your adjustment that are potential confounders. Intuitively, the reason behind this is that if you haven’t included a confounder when creating your propensity score, how can you adjust for it? There are also additional assumptions such as the stable unit treatment value assumption, which states that the treatment assigned to one subject does not affect the potential outcomes of the other subjects.
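The effect of a missing confounder can also be sketched with a small simulation. This is a deliberately simplified illustration: it uses two made-up binary confounders (`older`, which is recorded, and `frail`, which is treated as unrecorded), a true treatment effect of exactly zero, and stratification on the covariate profile instead of matching. With discrete covariates, every subject in a profile shares the same propensity score, so comparing within profiles is a stand-in for comparing subjects with equal scores. All names and probabilities are assumptions.

```python
import random
from collections import defaultdict

def simulate(n=40000, seed=2):
    """Made-up data: (older, frail, treated, survived); treatment has no effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        older = rng.random() < 0.5
        frail = rng.random() < 0.5
        # Assumption: both confounders lower the chance of getting the treatment...
        treated = rng.random() < 0.7 - 0.3 * older - 0.3 * frail
        # ...and both lower survival; the treatment itself does nothing.
        survived = rng.random() < 0.9 - 0.2 * older - 0.2 * frail
        rows.append((older, frail, treated, survived))
    return rows

def adjusted_difference(rows, profile):
    """Size-weighted average of treated-minus-control survival within strata."""
    strata = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        treated, survived = row[2], row[3]
        strata[profile(row)][treated].append(survived)
    total = 0.0
    for groups in strata.values():
        n_stratum = len(groups[True]) + len(groups[False])
        diff = (sum(groups[True]) / len(groups[True])
                - sum(groups[False]) / len(groups[False]))
        total += n_stratum / len(rows) * diff
    return total

rows = simulate()
# Adjusting for both confounders (possible only if frailty had been recorded).
full = adjusted_difference(rows, lambda r: (r[0], r[1]))
# Adjusting for age alone, as if frailty were unobserved.
partial = adjusted_difference(rows, lambda r: r[0])
print(f"adjusted for older and frail: {full:.3f}")
print(f"adjusted for older only:      {partial:.3f}")
```

When both confounders are used, the estimate should sit near the true value of zero. When frailty is left out, a spurious positive "effect" should remain, and nothing in the observed data alone would reveal that.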