Why do we do matching for causal inference vs regressing on confounders?

I’m new to the area of causal inference. From what I understand, one of the main concerns causal inference tries to address is the effect of confounders.

For the sake of reference, let’s denote the feature we are interested in (a.k.a. the treatment or exposure) by A, the other features by X (let’s say some of them are confounders), and the outcome by Y.

I focus on the case where all confounders are observable, and I limit myself to the case where we want to estimate the average treatment effect.

We have the below simple DAG:

[Figure: a simple DAG in which X points to both A and Y, and A points to Y]

So there could be two cases:

  1. A is categorical
  2. A is continuous

As I understand it conceptually, the whole idea of matching is to make the distribution of covariates similar across treatment groups in observational data so that the ignorability assumption holds, which is conceptually like replicating a randomized trial.

I am wondering: isn’t this conceptually the same thing we would get from a multiple regression on all the variables? The interpretation of each coefficient would then be the marginal effect of that feature on the outcome.

What am I missing? What prevents us from using multiple regression to control for confounders and pushes us toward matching instead?

Answer

As I see it, there are two related reasons to consider matching instead of regression. The first is assumptions about functional form, and the second is about proving to your audience that functional form assumptions do not affect the resulting effect estimate. The first is a statistical matter and the second is epistemic. Consider the tale below that attempts to illustrate how the choice between matching and regression could play out.

We’ll assume you have measured a sufficient adjustment set to satisfy the backdoor criterion (i.e., all relevant confounders have been measured) with no measurement error or missing data, and that your goal is to estimate the marginal treatment effect of the treatment on an outcome. We’ll also assume the standard assumptions of positivity and SUTVA hold. We’ll consider a continuous outcome first, but much of the discussion extends to general outcomes.

Part 1: Regression

You decide to run a regression of the outcome on the treatment and confounders as a way to control for confounding by these variables, because that is what linear regression is supposed to do. However, the effect estimate is unbiased only under strict conditions: first, that the treatment effect is constant across levels of the confounders, and second, that the linear model correctly describes the conditional relationship between the outcome and the confounders. For the first, you might include an interaction between the treatment and each confounder, allowing for heterogeneous treatment effects while still estimating the marginal effect. This is equivalent to g-computation (1), which involves using the fitted regression model to generate predicted values under treatment and control for all units and using the difference in the means of these predicted values as the effect estimate.
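The g-computation recipe just described (model the outcome, predict under treatment and under control for every unit, difference the mean predictions) can be sketched in pure Python. The data-generating process and the saturated cell-means “model” below are illustrative assumptions made up for this sketch, not anything from the answer:

```python
import random
from statistics import mean

random.seed(0)

# Toy simulation: binary confounder X, treatment A (more likely when X = 1),
# outcome Y with a true average treatment effect of 2.
n = 100_000
data = []
for _ in range(n):
    x = random.random() < 0.5
    a = random.random() < (0.7 if x else 0.3)
    y = 2 * a + 3 * x + random.gauss(0, 1)
    data.append((x, a, y))

# The naive difference in means is confounded by X.
naive = mean(y for x, a, y in data if a) - mean(y for x, a, y in data if not a)

# G-computation with a saturated model: estimate E[Y | A=a, X=x] by the
# cell means, then average the predictions under A=1 and A=0 over
# everyone's observed X and take the difference.
cell = {}
for x, a, y in data:
    cell.setdefault((a, x), []).append(y)
m = {k: mean(v) for k, v in cell.items()}
ate = mean(m[(True, x)] - m[(False, x)] for x, _, _ in data)
```

Here `naive` lands near 3.2 while `ate` recovers the true effect of 2, because averaging model predictions over the full covariate distribution removes the confounding by X.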

That still assumes a linear model for the outcomes under treatment and control. Okay, we’ll use a flexible machine-learning method like random forests instead. Well, now we can’t claim our estimator is unbiased, only possibly consistent, and it still requires the specific machine learning model to approach the truth at a certain rate. Okay, we’ll use Superlearner (2), a stacking method that takes on the rate of convergence of the fastest of its included models. Well, now we don’t have a way to conduct inference, and the model might still be wrong. Okay, we’ll use a semiparametric efficient doubly-robust estimator like augmented inverse probability weighting (AIPW) (3) or targeted minimum loss-based estimation (TMLE) (4). Well, valid asymptotics for those estimators require the nuisance models to fall in a Donsker class. Okay, we’ll use cross-fitting with AIPW or TMLE to relax that requirement (5).
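As a rough illustration of what double robustness buys you, here is a stdlib-only AIPW sketch on the same kind of toy data. The outcome model is deliberately misspecified (a constant), but because the propensity model is correct, the inverse-probability-weighted residual correction still recovers the true effect; every detail of the simulation is an assumption for illustration only:

```python
import random
from statistics import mean

random.seed(1)

# Toy setup: binary confounder X, true average treatment effect = 2.
n = 100_000
data = []
for _ in range(n):
    x = random.random() < 0.5
    a = random.random() < (0.7 if x else 0.3)
    y = 2 * a + 3 * x + random.gauss(0, 1)
    data.append((x, a, y))

e = lambda x: 0.7 if x else 0.3       # correct propensity score model
ybar = mean(y for _, _, y in data)
m = lambda a, x: ybar                 # deliberately WRONG outcome model

# AIPW: outcome-model prediction plus an inverse-probability-weighted
# residual correction; consistent if EITHER nuisance model is correct.
def aipw_term(x, a, y):
    m1, m0 = m(1, x), m(0, x)
    ipw1 = a * (y - m1) / e(x)
    ipw0 = (1 - a) * (y - m0) / (1 - e(x))
    return m1 - m0 + ipw1 - ipw0

ate = mean(aipw_term(x, a, y) for x, a, y in data)
```

Despite the useless outcome model, `ate` comes out near 2; swapping in a correct outcome model and a wrong propensity model would work as well, which is the "doubly robust" property.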

Great. You’ve taken regression to its extreme, relaxing as many assumptions as possible and landing on a multiply-robust estimator (multiply-robust in the sense that if one of several models is correct, the estimator is consistent) with generally good inference properties (and the variance can be bootstrapped, so getting it exactly right isn’t a big problem). Have we solved causal inference?

You submit the results of your cross-fit TMLE estimate using Superlearner for the propensity score and potential outcome models with a full library including highly adaptive lasso and many other models, which, under weak assumptions, are all that are required for a truly consistent estimator that converges at a parametric rate.

A reviewer reads the paper and says, “I don’t believe the results of this model.”

“Why not?” you say. “I used the optimal estimator with the best properties; it is consistent and semiparametric efficient with few, if any, assumptions on the functional forms of the models.”

“Your estimator is consistent,” says the reviewer, “but not unbiased. That means I can only trust its results in general and as N goes to infinity. How do I know you have successfully eliminated bias in the effect estimate in this dataset?”

“…”

Part 2: Matching to the Rescue

You read about a hot new method called “propensity score matching” (6). It was big in 1983, and, even in 2021, you see it in almost every paper published in specialized medical journals. You come across King and Nielsen’s influential paper “Why Propensity Scores Should Not Be Used for Matching” (7) and Noah’s answer on CV describing the many drawbacks to using propensity score matching. Okay, you’ll use genetic matching instead (8), and minimize the energy distance between the samples (9), including a flexibly estimated propensity score as a covariate to match on. You find that balance can be improved by using substantive knowledge to incorporate exact matching and caliper constraints that prioritize balance on covariates known to be important to the outcome. You decide to use full matching to relax the requirement of 1:1 matching to include more units in the analysis (10).
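Setting aside the specific algorithms named above (genetic matching, energy-distance optimization, full matching), the basic mechanics of matching can be illustrated with a toy greedy 1:1 nearest-neighbor match with a caliper on a single covariate. This is a deliberately simplified sketch on made-up data, not what MatchIt or genetic matching actually do:

```python
import random

random.seed(2)

# Toy data: one covariate x; treated units tend to have larger x,
# so the raw groups are badly imbalanced.
treated = [random.gauss(1.0, 1.0) for _ in range(200)]
control = [random.gauss(0.0, 1.0) for _ in range(1000)]
raw_diff = sum(treated) / len(treated) - sum(control) / len(control)

# Greedy 1:1 nearest-neighbor matching with a caliper: each treated unit
# takes the closest unused control within the caliper, and is discarded
# (dropped from the analysis) if none is available.
caliper = 0.1
available = set(range(len(control)))
pairs = []
for i, xt in enumerate(treated):
    best = min(available, key=lambda j: abs(control[j] - xt), default=None)
    if best is not None and abs(control[best] - xt) <= caliper:
        pairs.append((i, best))
        available.remove(best)

# After matching, the treated and matched-control means are forced close:
# every pair differs by at most the caliper.
mt = sum(treated[i] for i, _ in pairs) / len(pairs)
mc = sum(control[j] for _, j in pairs) / len(pairs)
```

The caliper mechanically bounds the post-matching mean difference, but notice the cost the critics below will seize on: treated units without a close-enough control are simply thrown away, which changes both the sample size and the estimand.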

You estimate the treatment effect using a simple linear regression of the outcome on the treatment and the covariates, including the matching weights in the regression and using a cluster-robust standard error to account for pair membership (11). You resubmit the result of your full matching analysis using exact matching and calipers for prognostically important variables and a distance matrix estimated using genetic matching on the covariates and a flexibly estimated propensity score.

The reviewer reads your new manuscript. “Wow, you’ve learned a lot. But I still don’t believe you’ve removed bias in the effect estimate.”

“Look at the balance tables,” you say. “The covariate distributions are almost identical.”

“I see low standardized mean differences,” says the reviewer, “but imbalances could remain on other features of the covariate distribution.”

“Look at the balance tables in the appendix which contain balance statistics for pairwise interactions, polynomials up to the 5th power of each covariate, and Kolmogorov-Smirnov statistics to compare the full covariate distributions. There are no meaningful differences between the samples, and no differences at all on the most highly prognostic covariates because of the exact matching constraints and calipers.”

“I see…”

“Also, I used Branson’s randomization test (12) with the energy distance as the balance statistic to show that my sample is better balanced not only than a hypothetical randomized trial using the same data, but also a block randomized trial, and even a covariate balance-constrained randomized trial.”

“Wow, I guess I don’t have much to say…”

“My outcome regression estimator isn’t just consistent, it’s truly unbiased in this sample. Also, because I incorporated pair membership into the analysis, my standard errors are smaller and more accurate and the resulting estimate is less sensitive to unobserved confounding* (13).”

“I get it!”
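The balance diagnostics invoked in the exchange above (standardized mean differences and Kolmogorov–Smirnov statistics) are simple to compute. Here is a stdlib-only sketch on simulated covariate samples; the distributions are made up purely for illustration:

```python
import random
from statistics import mean, stdev

random.seed(3)

xt = [random.gauss(0.3, 1.0) for _ in range(500)]   # covariate in treated
xc = [random.gauss(0.0, 1.2) for _ in range(500)]   # covariate in control

# Standardized mean difference: difference in means divided by a pooled
# standard deviation, putting all covariates on a common scale.
def smd(t, c):
    pooled_sd = ((stdev(t) ** 2 + stdev(c) ** 2) / 2) ** 0.5
    return (mean(t) - mean(c)) / pooled_sd

# Kolmogorov-Smirnov statistic: the largest gap between the two empirical
# CDFs, sensitive to imbalance anywhere in the distribution, not just
# in the means -- which is exactly the reviewer's objection to SMDs alone.
def ks(t, c):
    grid = sorted(t + c)
    ecdf = lambda s, v: sum(x <= v for x in s) / len(s)
    return max(abs(ecdf(t, v) - ecdf(c, v)) for v in grid)
```

Two samples can have a near-zero SMD yet a large KS statistic (e.g., equal means but different variances), which is why the answer recommends reporting balance on many features of the joint distribution, not just the means.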

Part 3: The criticism

Frank Harrell bursts into the room. “Wait, by discarding so many units in matching, you have thrown away so much useful data and needlessly decimated your precision.” Mark van der Laan follows. “Wait, by using substantive ‘expertise’ you are not letting the analysis method find the true patterns in the data that might have eluded researchers, and your estimator does not converge at a known rate, let alone a parametric one! And there is no guarantee that your inference is valid!” I, your humble narrator, too, join in on the dogpile. “Wait, by using exact matching constraints and calipers, you have shifted your estimand away from the ATE or any a priori describable estimand (14)! Your effect estimate may be unbiased, but unbiased for what?”

You stand there, bewildered, defeated, feeling like you have come nowhere since you asked your simple question on CrossValidated what felt like years ago, no closer to understanding whether you should use matching or regression to estimate causal effects.

The curtains close.

Part 4: Epilogue

In the face of uncertainty and scarcity, we are left with tradeoffs. The choice between a regression-based method and matching to estimate a causal effect depends on how you and your audience choose to manage those tradeoffs and prioritize the advantages and drawbacks of each method.

Standard regression requires strong functional form assumptions, but with advanced methods those can be relaxed, at the cost of giving up on unbiasedness and settling for consistency and asymptotic inference. Many of these advanced methods work best in large samples, and they still require many choices along the way (e.g., which specific estimator to use, which machine learning methods to include in the Superlearner library, how many folds to use for cross-validation and cross-fitting, etc.). Although the multiply-robust methods may guarantee consistency and fast convergence rates in general, it is not immediately clear how to assess how well they eliminated bias in your particular dataset, potentially leaving one skeptical of their actual performance in your one instance.

Matching methods require few functional form assumptions because no models are required (e.g., when using a distance matrix that doesn’t depend solely on the propensity score, like that resulting from genetic matching). You can control confounding by adjusting the specification of the match, focusing more effort on hard-to-balance or prognostically important variables. You can come close to guaranteeing unbiasedness by ensuring you have achieved covariate balance, which can and should be measured extremely broadly with a skeptic in mind. You can use tools for analyzing randomized trials and trials with more powerful and robust designs. This comes at the cost of possibly decimating your precision by discarding huge amounts of data, changing your estimand so that your effect estimate doesn’t generalize to a meaningful population and isn’t replicable, and relying on ad hoc, “artisanal” methods with no clear path for valid inference.

The advantage matching has over regression, and the reason I think it is so valuable (and why I devoted my graduate training to understanding and improving matching and its use by applied researchers, as the author of the R packages cobalt, WeightIt, MatchIt, and others), is an epistemic one. With matching, you can more effectively convince a reader that what you have done is trustworthy, that you have accounted for all possible objections to the observed result, and you can at least point to specific assumptions and explain how their violation might affect the results. This all centers on covariate balance, the similarity between the covariate distributions across the treatment groups. By reporting balance broadly and submitting the resulting matched data to a battery of tests and balance measures, you can convince yourself and your readers that the resulting effect estimate is unbiased and therefore trustworthy (given the assumptions mentioned at the beginning, though these may be tenuous, and neither matching nor regression can solve that problem).

However, not everyone agrees that this advantage is so important, or more important than consistency and valid asymptotic inference. There can never be consensus on this matter, because consensus requires knowing the truth, and science (including statistics research) is about searching for an inherently unknowable truth (i.e., the true parameters that govern or describe our world). That is, if we knew the true causal effect, we could know the best method to estimate it; but we don’t, so we can’t. We can only do our best with the knowledge we have and try to manage the inherent constraints and tradeoffs as well as we can as we fumble around in the dark using the pinpoint of light the universe has shown us.


*Only when using a special method of inference for matched samples.

  1. Snowden JM, Rose S, Mortimer KM. Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique. Am J Epidemiol. 2011;173(7):731–738.
  2. van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Statistical Applications in Genetics and Molecular Biology [electronic article]. 2007;6(1). (https://www.degruyter.com/view/j/sagmb.2007.6.issue-1/sagmb.2007.6.1.1309/sagmb.2007.6.1.1309.xml). (Accessed October 8, 2019)
  3. Daniel RM. Double Robustness. In: Wiley StatsRef: Statistics Reference Online. American Cancer Society; 2018:1–14. (http://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08068). (Accessed November 9, 2018)
  4. Gruber S, van der Laan MJ. Targeted Maximum Likelihood Estimation: A Gentle Introduction. 2009;17.
  5. Zivich PN, Breskin A. Machine Learning for Causal Inference: On the Use of Cross-fit Estimators. Epidemiology. 2021;32(3):393–401.
  6. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
  7. King G, Nielsen R. Why Propensity Scores Should Not Be Used for Matching. Polit. Anal. 2019;1–20.
  8. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics. 2013;95(3):932–945.
  9. Huling JD, Mak S. Energy Balancing of Covariate Distributions. arXiv:2004.13962 [stat] [electronic article]. 2020;(http://arxiv.org/abs/2004.13962). (Accessed December 22, 2020)
  10. Stuart EA, Green KM. Using full matching to estimate causal effects in nonexperimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology. 2008;44(2):395–406.
  11. Abadie A, Spiess J. Robust Post-Matching Inference. Journal of the American Statistical Association. 2020;0(ja):1–37.
  12. Branson Z. Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Studies. 2021;7:44–80.
  13. Zubizarreta JR, Paredes RD, Rosenbaum PR. Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics. 2014;8(1):204–231.
  14. Greifer N, Stuart EA. Choosing the Estimand When Matching or Weighting in Observational Studies. arXiv:2106.10577 [stat] [electronic article]. 2021;(http://arxiv.org/abs/2106.10577). (Accessed September 17, 2021)

Attribution
Source : Link , Question Author : Ehsan Sh , Answer Author : Noah
