At a high level, Propensity Score Matching uses the following framework:
- Identify potential confounders among the covariates, i.e., all factors that could influence whether a subject ends up in the treatment group.
- Fit a model (e.g., logistic regression) of treatment membership on the covariates.
- Use it to calculate each subject's propensity score = Pr(subject receives treatment | covariates).
- Divide subjects into multiple strata based on their propensity scores. Within each stratum, the control and treatment groups should then be balanced, i.e., contain subjects with similar covariate values.
- Estimate the treatment effect by averaging the treated-vs-control differences in the dependent variable across strata.
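The steps above can be sketched in a few lines of Python. Everything here is illustrative: the data-generating process is made up, and I fit the propensity model with scikit-learn's `LogisticRegression` and stratify on quintiles, which are common but not mandatory choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated observational data (hypothetical): one confounder x influences
# both treatment assignment (self-selection) and the outcome y.
n = 2000
x = rng.normal(size=n)
treat = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
y = 2.0 * treat + 1.5 * x + rng.normal(size=n)  # true treatment effect = 2

# Fit a logistic model of treatment on the covariates, then compute
# each subject's propensity score Pr(treated | covariates).
ps = (LogisticRegression()
      .fit(x.reshape(-1, 1), treat)
      .predict_proba(x.reshape(-1, 1))[:, 1])

# Subclassify into strata -- here, propensity score quintiles.
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(ps, edges)

# Average the within-stratum treated-vs-control outcome differences.
diffs = []
for s in range(5):
    in_s = strata == s
    t, c = y[in_s & (treat == 1)], y[in_s & (treat == 0)]
    if len(t) and len(c):
        diffs.append(t.mean() - c.mean())
effect = np.mean(diffs)
```

With five strata, `effect` lands much closer to the true value of 2 than the naive difference in means, though some residual within-stratum confounding remains.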
My question is: why is this better than simply matching each treated subject with a similar non-treated subject, as measured by a distance on the covariate values, and then averaging the differences across all treated/non-treated pairs?
That seems simpler and would appear to address the self-selection bias that PSM is meant to reduce or eliminate.
What am I missing here?
The procedure you described is not propensity score matching but rather propensity score subclassification. In propensity score matching, pairs of units are selected based on the difference between their propensity scores, and unpaired units are dropped. Both methods are popular ways of using propensity scores to reduce imbalance that causes confounding bias in observational studies.
In propensity score matching, the distance between two units is the difference between their propensity scores, and propensity scores are computed from the covariates, so by propensity score matching, you are matching based on a distance measure and covariate values. There are other distance measures that don’t involve the propensity score that are frequently used in matching, like the Mahalanobis distance. Some studies show the Mahalanobis distance works better than the propensity score difference as a distance measure and some studies show it doesn’t. The relative performance of each depends on the unique characteristics of the dataset; there is no way to provide a single rule that is always true about which method is better. Both should be tried. You can also include the propensity score as a covariate in the Mahalanobis distance.
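To make this concrete, here is a minimal sketch of 1:1 nearest-neighbor matching under the two distance measures mentioned above. The data and propensity scores are made up, and `greedy_match` is a hypothetical helper, not a function from any package:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariates for 5 treated and 8 control units.
Xt = rng.normal(size=(5, 3))
Xc = rng.normal(size=(8, 3))

# Hypothetical propensity scores (in practice, fitted from a model).
ps_t = rng.random(5)
ps_c = rng.random(8)

# Distance 1: absolute difference in propensity scores.
d_ps = np.abs(ps_t[:, None] - ps_c[None, :])

# Distance 2: Mahalanobis distance on the raw covariates.
cov_inv = np.linalg.inv(np.cov(np.vstack([Xt, Xc]).T))
diff = Xt[:, None, :] - Xc[None, :, :]
d_mah = np.sqrt(np.einsum('ijk,kl,ijl->ij', diff, cov_inv, diff))

def greedy_match(d):
    """Simple greedy 1:1 matching: each treated unit takes its nearest
    still-available control; unmatched controls are dropped."""
    d = d.copy()
    pairs = {}
    for i in np.argsort(d.min(axis=1)):  # best-matched treated units first
        j = int(np.argmin(d[i]))
        pairs[i] = j
        d[:, j] = np.inf                 # each control is used at most once
    return pairs

pairs_ps = greedy_match(d_ps)    # matching on the propensity score
pairs_mah = greedy_match(d_mah)  # matching on the Mahalanobis distance
```

Both calls follow the same matching logic; only the distance matrix changes, which is the sense in which propensity score matching is already "matching on a distance measure computed from the covariates."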
If your question is more about why we would ever do propensity score subclassification when we could do propensity score matching, there are a few considerations:
- As before, you should always use whichever method yields the best balance in your sample. Subclassification may achieve better balance in some datasets and matching in others; there is no reason to unilaterally decide on one method over the other.
- Subclassification allows you to estimate the ATT or the ATE, whereas most matching methods only allow the ATT.
- Subclassification is closely related to propensity score weighting when used in certain ways, whereas matching typically doesn't assign nonuniform weights to individuals.
- With matching, you can customize the specification more (e.g., by using a caliper, or by changing the ratio of controls to treated), whereas with subclassification the opportunities for customization are more limited.
- The distinction between matching and subclassification is blurred in the face of full matching, a hybrid between the two that often performs better than either.

Some papers have compared the performance of the two methods, but as I mentioned before, it is important not to rely on general results and instead try both methods in your sample.
Check out the documentation for the MatchIt R package, which goes into detail on several matching methods and discusses some of their relative merits and methods of customization.