A Practitioner's Guide to Causal Estimation

You're a product leader at a health-tech company. Your team just ran an A/B test on a new onboarding flow: 10,000 users randomized, treatment gets the new flow, control gets the old one. The treatment group shows a 12% relative lift in 7-day retention. Your CEO wants to know: should we roll this out?

But here's the problem that keeps you up at night: the randomization was stratified by user acquisition channel, and a logging bug left one channel with a lopsided split (60% of its users in treatment instead of 50%). The simple difference-in-means estimator is biased whenever that channel's users retain differently from everyone else, and you know it. You need a method that can recover the true treatment effect despite this failure of perfect randomization, without throwing away data or making heroic assumptions.

This is where modern causal estimation methods—inverse probability weighting (IPW), augmented IPW (AIPW), and matching—enter. They're not academic curiosities. They're tools for exactly this situation.

The Core Problem: Why Simple Comparisons Fail

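When randomization breaks down, or was never possible, the naive difference-in-means estimator $\hat{\tau}_{naive}$ conflates the treatment effect with pre-existing differences between groups. Taking $\tau$ to be the average effect on the treated, $\mathbb{E}[Y(1) - Y(0) | D=1]$ (which coincides with the average treatment effect when effects are homogeneous), the bias is: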

$\mathbb{E}[\hat{\tau}_{naive}] - \tau = \mathbb{E}[Y(0) | D=1] - \mathbb{E}[Y(0) | D=0]$

This is the selection bias. It's zero only if treatment assignment is independent of potential outcomes, which randomization guarantees. When it fails, you need to adjust for confounders $X$: variables that affect both treatment assignment and outcomes.
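To make the decomposition concrete, here is a minimal simulation sketch. The data-generating process (a single confounder `x`, a logistic assignment rule, a constant effect of 1.0) is an assumption chosen for illustration and is not taken from the onboarding study. Because the simulation generates both potential outcomes, the selection-bias term can be computed directly, which is never possible with real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: x confounds treatment and outcome.
x = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-x))        # higher x -> more likely treated
d = rng.binomial(1, p_treat)
y0 = x + rng.normal(size=n)               # potential outcome without treatment
y1 = y0 + 1.0                             # constant true effect of 1.0
y = np.where(d == 1, y1, y0)              # only one outcome is ever observed

naive = y[d == 1].mean() - y[d == 0].mean()
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
att = (y1 - y0)[d == 1].mean()

print(f"naive estimate : {naive:.3f}")
print(f"effect on the treated : {att:.3f}")
print(f"selection bias : {selection_bias:.3f}")
```

In this setup the naive estimate equals the effect on the treated plus the selection-bias term, up to floating-point error, so the gap between the naive number and the truth comes entirely from treated users having higher baseline outcomes.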

IPW: Re-weighting the Sample

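Inverse probability weighting re-weights each observation by the inverse of its propensity score $e(X) = P(D=1|X)$, creating a pseudo-population where treatment is independent of covariates. The IPW estimator is:

$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{e}(X_i)} \right]$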

When it works: IPW is consistent if the propensity score model is correctly specified. Hirano, Imbens, and Ridder (2003) showed that with nonparametric propensity score estimation, IPW achieves the semiparametric efficiency bound—meaning it's as efficient as any estimator can be asymptotically.

When it fails: IPW is notoriously sensitive to propensity score misspecification. If your logistic regression for $e(X)$ is wrong, the bias can be worse than the naive estimator. More critically, IPW breaks down when propensity scores approach 0 or 1—the "positivity violation" problem. In our onboarding example, if the buggy channel has near-zero probability of receiving treatment, IPW assigns enormous weights to those few treated users, and the estimator's variance explodes.
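A minimal sketch of the IPW estimator follows, again on simulated data. The function name `ipw_ate`, the data-generating process, and the use of the true propensity score in place of an estimate are all assumptions made for illustration; in practice `e_hat` would come from a fitted model.

```python
import numpy as np

def ipw_ate(y, d, e_hat):
    """IPW estimate of the average treatment effect from propensity scores."""
    y, d, e_hat = map(np.asarray, (y, d, e_hat))
    return np.mean(d * y / e_hat - (1 - d) * y / (1 - e_hat))

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))              # true propensity, assumed known here
d = rng.binomial(1, e)
y = 1.0 * d + x + rng.normal(size=n)      # true effect is 1.0

naive = y[d == 1].mean() - y[d == 0].mean()
ipw = ipw_ate(y, d, e)

# Largest weight any single observation receives. Large values are the
# practical symptom of propensity scores close to 0 or 1.
max_weight = np.max(np.where(d == 1, 1 / e, 1 / (1 - e)))

print(f"naive: {naive:.3f}  IPW: {ipw:.3f}  max weight: {max_weight:.1f}")
```

Inspecting the largest weight is a cheap first diagnostic: when a handful of observations carry weights in the dozens, they dominate the estimate and its variance.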

Matching: Finding Counterfactual Twins

Matching constructs a synthetic control group by pairing each treated unit with control units that have similar covariates. The simplest version is 1:1 nearest-neighbor matching with replacement:

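$\hat{\tau}_{match} = \frac{1}{n_1} \sum_{i: D_i = 1} \left( Y_i - Y_{j(i)} \right)$

where $n_1$ is the number of treated units and $j(i)$ is the control unit closest to treated unit $i$ in covariate space. Because it averages over treated units only, this version estimates the average effect on the treated.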

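When it works: Abadie and Imbens (2006) showed that the bias from imperfect matches shrinks at a rate on the order of $n^{-1/k}$ when matching on $k$ continuous covariates, so simple matching estimators are not $\sqrt{n}$-consistent in general. The bias is asymptotically negligible only when the number of continuous covariates is very small (a single one, in their general result). With more covariates, the bias from imperfect matches accumulates.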

When it fails: The curse of dimensionality is the killer. With 10+ covariates, finding close matches becomes impossible. The bias from poor matches can be substantial, and standard matching doesn't correct for it. This is why matching is often paired with bias-correction (regression adjustment within matched pairs).
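The sketch below implements 1:1 nearest-neighbor matching with replacement for a small sample. The function name `nn_match_att`, the Euclidean distance, and the simulated data are illustrative assumptions, and no bias correction is applied. The pairwise distance matrix makes this suitable only for a few thousand units.

```python
import numpy as np

def nn_match_att(y, d, x):
    """1:1 nearest-neighbor matching with replacement; estimates the effect on the treated."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    x_t, y_t = x[d == 1], y[d == 1]
    x_c, y_c = x[d == 0], y[d == 0]
    # Squared Euclidean distance between every treated and every control unit.
    dist = ((x_t[:, None, :] - x_c[None, :, :]) ** 2).sum(axis=2)
    j = dist.argmin(axis=1)               # index of the closest control, j(i)
    return np.mean(y_t - y_c[j])

rng = np.random.default_rng(2)
n = 2_000
x = rng.normal(size=n)                    # a single continuous covariate
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.0 * d + x + 0.5 * rng.normal(size=n)   # true effect is 1.0

naive = y[d == 1].mean() - y[d == 0].mean()
att_hat = nn_match_att(y, d, x)
print(f"naive: {naive:.3f}  matched: {att_hat:.3f}")
```

With one covariate, close matches are easy to find and the matched estimate lands near the true effect. Adding columns to `x` while holding `n` fixed is a quick way to watch match quality degrade.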

AIPW: The Best of Both Worlds

Augmented inverse probability weighting combines outcome regression and propensity score weighting to achieve "double robustness": the estimator is consistent if either the outcome model or the propensity score model is correctly specified. The AIPW estimator is:

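$\hat{\tau}_{AIPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$

where $\hat{\mu}_d(X) = \hat{\mathbb{E}}[Y|D=d, X]$.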

The key theoretical guarantee: Chernozhukov et al. (2018) showed that AIPW with cross-fitting achieves $\sqrt{n}$-consistency and asymptotic normality under remarkably weak conditions. Specifically, the product of the estimation errors of the two nuisance estimators must shrink faster than $n^{-1/2}$. This means you can use flexible machine learning for both the outcome and propensity score models; it is sufficient, for example, that each converges faster than $n^{-1/4}$ in $L_2$ norm.

Why this matters for practice: You don't need to get either model perfectly right. You just need both to be moderately good. This is the "debiased machine learning" revolution.
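Double robustness is easy to see in a simulation where each nuisance model can be deliberately broken. In the sketch below, the function name `aipw_ate` and the data-generating process are assumptions for illustration; the "correct" nuisance values are the true functions from the simulation, and the "wrong" ones are crude constants.

```python
import numpy as np

def aipw_ate(y, d, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the average treatment effect from fitted nuisance values."""
    psi = (mu1_hat - mu0_hat
           + d * (y - mu1_hat) / e_hat
           - (1 - d) * (y - mu0_hat) / (1 - e_hat))
    return psi.mean()

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))
d = rng.binomial(1, e_true)
y = 1.0 * d + x + rng.normal(size=n)      # true effect is 1.0

# Correct nuisance values (known only because this is a simulation).
mu1_true, mu0_true = 1.0 + x, x
# Deliberately wrong versions: a constant propensity and a zero outcome model.
e_wrong = np.full(n, 0.5)
mu_wrong = np.zeros(n)

only_e_right = aipw_ate(y, d, e_true, mu_wrong, mu_wrong)
only_mu_right = aipw_ate(y, d, e_wrong, mu1_true, mu0_true)
both_wrong = aipw_ate(y, d, e_wrong, mu_wrong, mu_wrong)

print(f"propensity right, outcome wrong: {only_e_right:.3f}")
print(f"outcome right, propensity wrong: {only_mu_right:.3f}")
print(f"both wrong:                      {both_wrong:.3f}")
```

The first two estimates land near the true effect of 1.0, while the third inherits roughly the same bias as the naive comparison. One correct model is enough; zero is not.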

Worked Example: The Onboarding Flow

Let's walk through our health-tech example. We have 10,000 users, 5 acquisition channels, and a logging bug that caused channel 3 to have 60% treatment assignment instead of 50%. We observe baseline covariates: age, device type, and prior app usage.

Step 1: Estimate propensity scores. Fit a logistic regression with channel indicators and covariates. The estimated propensity for channel 3 users is ~0.6, for others ~0.5.

Step 2: Check overlap. All propensity scores are between 0.3 and 0.7—no positivity violation.

Step 3: Fit outcome models. Use gradient-boosted trees to predict retention under treatment and control, separately.

Step 4: Compute AIPW. The AIPW estimate gives a 9.8% relative lift (95% CI: 7.2% to 12.4%). The naive estimate was 12%; the 2.2-point gap in relative lift is the estimated selection bias from the channel imbalance.

Step 5: Compare with alternatives. IPW gives 10.1% but with 30% wider confidence intervals. Matching (with bias correction) gives 9.5% but is sensitive to the caliper choice. The AIPW estimate is the most stable and efficient.
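The first two steps can be sketched on synthetic data that mimics the setup: 10,000 users, 5 channels, and channel 3 assigned to treatment at 60%. Everything else here is a hypothetical choice made for illustration (the baseline retention rates, the 4-point absolute effect, and the omission of age, device type, and prior usage), so the output will not reproduce the figures quoted in the steps above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
channel = rng.integers(1, 6, size=n)                 # channels 1..5
p_assign = np.where(channel == 3, 0.6, 0.5)          # the bug hits channel 3
d = rng.binomial(1, p_assign)

# Hypothetical retention model: channel 3 users retain better at baseline,
# and the new flow adds 4 points of retention for everyone.
base = np.where(channel == 3, 0.60, 0.30)
true_effect = 0.04
y = rng.binomial(1, base + true_effect * d)

# Step 1: with only channel indicators as regressors, a saturated logistic
# regression reproduces the treated share within each channel.
e_hat = np.zeros(n)
for c in np.unique(channel):
    in_c = channel == c
    e_hat[in_c] = d[in_c].mean()

# Step 2: overlap check on the estimated propensity scores.
print(f"propensity range: {e_hat.min():.3f} to {e_hat.max():.3f}")

# Naive versus propensity-adjusted difference in 7-day retention.
naive = y[d == 1].mean() - y[d == 0].mean()
ipw = np.mean(d * y / e_hat - (1 - d) * y / (1 - e_hat))
print(f"naive: {naive:.4f}  IPW: {ipw:.4f}  truth: {true_effect:.4f}")
```

With channel as the only adjustment variable, the IPW estimate is the same as averaging the within-channel treatment-control differences weighted by channel size, which is a useful sanity check to explain to stakeholders.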

Common Failure Mode: Skipping Cross-Fitting

The most common mistake practitioners make is using AIPW without cross-fitting. If you estimate the propensity score and outcome models on the same data used to compute the treatment effect, you introduce "overfitting bias": the machine learning models adapt to idiosyncratic noise in the very observations they are then evaluated on, and the $\sqrt{n}$ inference guarantees of AIPW no longer hold.

Chernozhukov et al. (2018) proved that cross-fitting removes this bias: split the data into K folds, estimate the nuisance functions on K-1 folds, and compute the treatment effect on the held-out fold. Without it, the estimator's bias can be $O(n^{-1/2})$ rather than $o(n^{-1/2})$, invalidating confidence intervals.

The fix is simple: Always use cross-fitting with at least 2 folds (5 is standard). Verify that results are stable across different random splits.
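A compact sketch of cross-fitted AIPW is shown below. To stay dependency-free it uses ordinary least squares and a Newton-method logistic regression as nuisance learners; these stand in for the gradient-boosted models mentioned earlier and are an assumption of this example, as are all function names and the simulated data. The standard error is the sample standard deviation of the per-observation scores divided by $\sqrt{n}$.

```python
import numpy as np

def add_const(x):
    return np.column_stack([np.ones(len(x)), x])

def fit_ols(x, y):
    """Least-squares coefficients with an intercept."""
    beta, *_ = np.linalg.lstsq(add_const(x), y, rcond=None)
    return beta

def fit_logit(x, d, n_iter=25):
    """Unregularized logistic regression fitted by Newton's method."""
    xb = add_const(x)
    beta = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-xb @ beta))
        grad = xb.T @ (d - p)
        hess = (xb * (p * (1 - p))[:, None]).T @ xb
        beta += np.linalg.solve(hess, grad)
    return beta

def crossfit_aipw(y, d, x, n_folds=5, seed=0):
    """Fit nuisances on K-1 folds, evaluate the AIPW score on the held-out fold."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    psi = np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        xt = add_const(x[test])
        e = 1.0 / (1.0 + np.exp(-xt @ fit_logit(x[train], d[train])))
        t1, t0 = train & (d == 1), train & (d == 0)
        mu1 = xt @ fit_ols(x[t1], y[t1])
        mu0 = xt @ fit_ols(x[t0], y[t0])
        psi[test] = (mu1 - mu0
                     + d[test] * (y[test] - mu1) / e
                     - (1 - d[test]) * (y[test] - mu0) / (1 - e))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=(n, 2))
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * x[:, 0])))
y = 1.0 * d + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # true effect is 1.0

# Re-run with different random splits to check stability.
for seed in range(3):
    tau_hat, se = crossfit_aipw(y, d, x, n_folds=5, seed=seed)
    print(f"seed {seed}: estimate {tau_hat:.3f} (SE {se:.3f})")
```

The loop over seeds is the stability check in miniature: the fold assignment is the only thing that changes between runs, so any large movement in the estimate points to unstable nuisance fits rather than sampling noise.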

When to Choose What

Choose IPW when you're confident in your propensity score model and have strong overlap. It's simple, transparent, and easy to explain to stakeholders.
Choose matching when you have very few covariates (ideally only one or two continuous ones, since matching bias grows with each additional continuous dimension) and want a nonparametric, interpretable comparison. It's excellent for subgroup analysis where you can visualize matched pairs.
Choose AIPW when you have many covariates, are unsure about model specification, and need robust inference. It's the workhorse for modern causal inference with high-dimensional or complex data.
Avoid applying any of the three as-is when the overlap assumption is severely violated (propensity scores near 0 or 1 for many units). In that case, consider trimming the sample to the region of common support or using methods designed for limited overlap, like inverse probability weighting with trimmed weights or targeted maximum likelihood estimation.

Practical Checklist

Before deploying any of these methods in a real study, verify:

Overlap diagnostics: Plot the distribution of estimated propensity scores for treated and control groups. If the densities don't overlap substantially (e.g., the 5th percentile of treated scores is above the 95th percentile of control scores), you have a positivity problem. Report the range of estimated propensity scores and the number of units with extreme values (< 0.1 or > 0.9).
Cross-fitting implementation: Confirm you're using at least 2-fold cross-fitting (5-fold is standard). Verify that results are stable across different random splits—if the estimate changes by more than 0.1 standard errors across splits, increase the number of folds or check for data leakage.
Nuisance model diagnostics: Assess the quality of both the propensity score and outcome models. For the propensity score, check calibration (Hosmer-Lemeshow test or calibration curves). For the outcome model, report cross-validated R². If the propensity model is poorly calibrated or the outcome model has little predictive power (e.g., R² < 0.05), the double robustness property is your only protection, so verify that results are similar when using different model classes. A low propensity AUC is not by itself a warning sign: in a near-randomized experiment, an AUC close to 0.5 is what you should expect.
Sensitivity to model specification: Estimate the treatment effect using at least two different combinations of nuisance models (e.g., logistic regression + linear regression, and gradient boosting + random forest). If estimates differ by more than 0.2 standard errors, investigate further—this suggests at least one model is misspecified.
Sample splitting for inference: Ensure that standard errors account for the cross-fitting procedure. Naive standard errors that ignore cross-fitting are typically too small. Use the standard error formula from Chernozhukov et al. (2018, Theorem 3.1) or bootstrap the entire procedure (including cross-fitting).
Balance check on weighted sample: After applying IPW or AIPW weights, check covariate balance using standardized mean differences. A rule of thumb: all standardized differences should be below 0.1 after weighting. If not, the propensity score model may be misspecified, and you should consider more flexible models or covariate balancing propensity score methods.
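Two of the checklist items, the overlap report and the weighted balance check, reduce to a few lines of code. The helper names below are hypothetical, the 0.1 and 0.9 cutoffs are the ones quoted in the checklist, and the demo uses the true propensity score from a simulation in place of a fitted one. Pooling the unweighted group variances for the denominator is one common convention, assumed here so that the scale is the same before and after weighting.

```python
import numpy as np

def standardized_mean_diff(x, d, w=None):
    """Standardized mean difference of one covariate, optionally weighted."""
    w = np.ones(len(x)) if w is None else np.asarray(w, dtype=float)
    m1 = np.average(x[d == 1], weights=w[d == 1])
    m0 = np.average(x[d == 0], weights=w[d == 0])
    # Unweighted pooled SD keeps the denominator fixed across comparisons.
    pooled_sd = np.sqrt((x[d == 1].var(ddof=1) + x[d == 0].var(ddof=1)) / 2)
    return (m1 - m0) / pooled_sd

def overlap_report(e_hat, lo=0.1, hi=0.9):
    """Range of the propensity scores and the number of extreme values."""
    return {"min": float(e_hat.min()),
            "max": float(e_hat.max()),
            "n_extreme": int(((e_hat < lo) | (e_hat > hi)).sum())}

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))              # stands in for a fitted propensity
d = rng.binomial(1, e)
w = d / e + (1 - d) / (1 - e)             # inverse probability weights

smd_raw = standardized_mean_diff(x, d)
smd_weighted = standardized_mean_diff(x, d, w)
report = overlap_report(e)

print(f"SMD before weighting: {smd_raw:.3f}")
print(f"SMD after weighting:  {smd_weighted:.3f}")
print(report)
```

In this simulation the covariate is badly imbalanced before weighting and falls under the 0.1 rule of thumb afterwards, while the overlap report still flags a nontrivial number of units with extreme scores, which is the cue to look at trimming.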

Part of the DoOperator Research series on Causal Estimation. Browse the full paper corpus at dooperator.ai/research/causal_estimation.