A Practitioner's Guide to Heterogeneous Treatment Effects

You're the head of clinical analytics at a health system. Your team has just rolled out a new remote monitoring program for heart failure patients. The average treatment effect is positive—readmissions dropped 8% overall. Your CEO wants to expand it system-wide. But you have a nagging question: Is this program helping everyone, or is it helping some patients while harming others?

The average hides the distribution. An 8% reduction could mean uniform improvement across all patients, or it could mean a 20% reduction for one half of your patients and a 4% increase for the other half. If you expand the program without knowing which patients benefit and which don't, you're making a decision under avoidable uncertainty.

This is the problem heterogeneous treatment effect (HTE) estimation solves: moving from "does it work on average" to "for whom does it work, by how much, and with what confidence."

Why Naive Approaches Fail

The most common approach to HTE is subgroup analysis: split the data by a pre-specified variable (age, comorbidity count, baseline severity) and estimate the treatment effect within each bin. This fails for two reasons.

First, multiple testing. If you test 20 subgroups with no true heterogeneity at the 5% significance level, you should expect about one "significant" effect by chance alone, and (if the tests are independent) the probability of at least one false positive is roughly 64%. Second, data dredging. When you let the data suggest subgroups post-hoc, you capitalize on random patterns that won't replicate. The result is a literature full of subgroup claims that vanish in subsequent studies.
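The arithmetic behind the multiple-testing concern is quick to check. The sketch below assumes 20 independent tests with no true effects, each run at the 0.05 level. Independence is an illustrative assumption (real subgroups often overlap), and the Bonferroni threshold is shown only as one common correction, not as a recommendation from the source.

```python
# Family-wise error rate for k independent null tests at level alpha.
# Independence is an illustrative assumption; real subgroups often overlap.
alpha = 0.05
k = 20

expected_false_positives = k * alpha        # mean number of spurious "hits"
prob_at_least_one = 1 - (1 - alpha) ** k    # chance of one or more spurious hits
bonferroni_alpha = alpha / k                # per-test level that caps the FWER at alpha

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.4f}")
```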

A more sophisticated alternative is to fit two separate models—one for treated outcomes, one for control outcomes—and subtract them. This is called the T-learner (two models). But as Nie & Wager (2017) prove, the T-learner suffers from regularization bias: when you regularize each model separately, you shrink the main effects and the treatment effect simultaneously, and the difference inherits bias from both. Their Theorem 1 shows that the T-learner's error depends on the complexity of the main effects, not just the treatment effect function. If the main effects are complex (which they usually are), your CATE estimates will be noisy and biased.
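To make the T-learner concrete, here is a minimal sketch using closed-form ridge regression in NumPy. The simulated data, the penalty `lam`, and the helper names (`fit_ridge`, `predict_ridge`, `t_learner_cate`) are all hypothetical choices for illustration. The simulation uses a strong main effect and a constant true treatment effect of 1.0.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xc = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xc.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ y)

def predict_ridge(beta, X):
    """Predict from ridge coefficients (intercept first)."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def t_learner_cate(X, y, w, lam=1.0):
    """T-learner: fit one outcome model per arm, then subtract predictions."""
    beta_treated = fit_ridge(X[w == 1], y[w == 1], lam)
    beta_control = fit_ridge(X[w == 0], y[w == 0], lam)
    return predict_ridge(beta_treated, X) - predict_ridge(beta_control, X)

# Simulated data (hypothetical): strong main effect, constant true effect of 1.0.
rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
w = rng.binomial(1, 0.5, size=n)
y = 3.0 * X[:, 0] + 1.0 * w + rng.normal(size=n)

cate_hat = t_learner_cate(X, y, w, lam=1.0)
print("mean CATE estimate:", cate_hat.mean())
print("spread of CATE estimates (true spread is 0):", cate_hat.std())
```

Because the true effect is constant in this simulation, any spread in `cate_hat` reflects estimation error carried over from the two separately fitted outcome models, not real heterogeneity.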

The R-Learner: A Quasi-Oracle Solution

Nie & Wager (2017) propose the R-learner, which transforms HTE estimation into a single regression problem. The key insight comes from Robinson (1988): under unconfoundedness, the outcome can be decomposed as

$Y_i - m(X_i) = (W_i - e(X_i)) \tau(X_i) + \varepsilon_i$

where $Y_i$ is the outcome, $W_i \in \{0, 1\}$ is the treatment indicator, $m(x) = E[Y|X=x]$ is the conditional mean outcome, $e(x) = P(W=1|X=x)$ is the propensity score, $\tau(x)$ is the CATE, and $\varepsilon_i$ is an error term with $E[\varepsilon_i | X_i, W_i] = 0$. The left side is the "residualized outcome" (the outcome after removing its conditional mean). The right side multiplies the "residualized treatment" (the treatment after removing the propensity) by the CATE.
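For readers who want the intermediate step, here is a short derivation of the decomposition under the usual potential-outcomes setup. The baseline function $\mu_0(x) = E[Y(0)|X=x]$ is notation introduced here for illustration. Under unconfoundedness, the outcome can be written as

$Y_i = \mu_0(X_i) + W_i \tau(X_i) + \varepsilon_i, \quad E[\varepsilon_i | X_i, W_i] = 0$

Taking the expectation conditional on $X_i$ alone, and using $E[W_i | X_i] = e(X_i)$, gives

$m(X_i) = \mu_0(X_i) + e(X_i) \tau(X_i)$

Subtracting the second line from the first cancels $\mu_0(X_i)$ and leaves

$Y_i - m(X_i) = (W_i - e(X_i)) \tau(X_i) + \varepsilon_i$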

This implies a loss function:

$\hat{\tau}(\cdot) = \arg\min_{\tau} \frac{1}{n} \sum_{i=1}^n \left[ (Y_i - \hat{m}(X_i)) - (W_i - \hat{e}(X_i)) \tau(X_i) \right]^2 + \Lambda(\tau)$

where $\Lambda(\tau)$ is a regularizer. The R-learner achieves a quasi-oracle property: its error depends only on the complexity of the treatment effect function $\tau(x)$, not on the complexity of the nuisance functions $m(x)$ and $e(x)$. Theorem 2 of Nie & Wager (2017) shows that if the nuisance functions can be estimated at $o(n^{-1/4})$ rates, the R-learner performs as well as an oracle that knows the true nuisance functions.
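The unregularized part of this loss is easy to evaluate for any candidate CATE function, which makes it usable as a comparison criterion between candidates. The sketch below is a minimal illustration on simulated data where the nuisance functions are known exactly (the oracle case). The function name `r_loss` and the data-generating process are hypothetical.

```python
import numpy as np

def r_loss(y, w, m_hat, e_hat, tau_hat):
    """Unregularized R-loss: mean squared gap between the residualized outcome
    and the residualized treatment scaled by a candidate CATE."""
    y_res = y - m_hat
    w_res = w - e_hat
    return np.mean((y_res - w_res * tau_hat) ** 2)

# Simulated data (hypothetical) with known nuisance functions.
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(-1, 1, size=n)
e = np.full(n, 0.5)                  # randomized 50/50 assignment
tau = 1.0 + 2.0 * x                  # true CATE varies with x
w = rng.binomial(1, e)
mu0 = np.sin(3 * x)                  # baseline outcome
y = mu0 + w * tau + rng.normal(scale=0.5, size=n)
m = mu0 + e * tau                    # E[Y | X] under this design

loss_true = r_loss(y, w, m, e, tau)
loss_constant = r_loss(y, w, m, e, np.full(n, tau.mean()))
print("R-loss at the true CATE:   ", loss_true)
print("R-loss at a constant CATE: ", loss_constant)
```

In this setup the true CATE should score a lower loss than the constant one, since the constant candidate leaves the heterogeneity in the residual.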

Worked Example: The Remote Monitoring Program

Let's make this concrete. You have 5,000 heart failure patients. Half were assigned to remote monitoring, half to usual care. You want to know which patients benefit.

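Step 1: Estimate nuisance functions. Fit a model for the propensity score $e(x)$ (probability of receiving the program given covariates) and a model for the conditional mean outcome $m(x)$ (expected readmission rate given covariates, averaged over treatment status). Use cross-fitting: split the data into folds, fit the nuisance models on all folds but one, and predict on the held-out fold, so that no patient's prediction comes from a model trained on that patient's own data. This prevents overfitting in the nuisance models from leaking into the CATE estimate.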

Step 2: Residualize the outcome and the treatment. For each patient, compute:

Residualized outcome: $Y_i - \hat{m}(X_i)$
Residualized treatment: $W_i - \hat{e}(X_i)$

Step 3: Fit the CATE model. Minimize the R-loss above: regress the residualized outcome on the residualized treatment, letting the coefficient vary with covariates, using your choice of ML method (lasso with interactions, gradient boosting, neural network). That covariate-dependent coefficient, the function that multiplies the residualized treatment, is your CATE estimate.
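Steps 1 through 3 can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: the data are simulated stand-ins for the patient records (with a continuous outcome for simplicity), the choice of gradient boosting for the nuisance models and a linear CATE model is an assumption made here, and the clipping bounds are arbitrary. It relies on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Simulated stand-in for the patient data (hypothetical covariates and effects).
rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 4))
e_true = 1 / (1 + np.exp(-0.5 * X[:, 1]))      # mild confounding through X[:, 1]
W = rng.binomial(1, e_true)
tau_true = -0.5 - 1.0 * X[:, 0]                # effect varies with X[:, 0]
Y = np.sin(X[:, 1]) + X[:, 2] ** 2 + W * tau_true + rng.normal(size=n)

# Step 1: cross-fitted nuisance estimates (out-of-fold predictions, K = 5).
m_hat = cross_val_predict(GradientBoostingRegressor(random_state=0), X, Y, cv=5)
e_hat = cross_val_predict(
    GradientBoostingClassifier(random_state=0), X, W, cv=5, method="predict_proba"
)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)             # guard against extreme propensities

# Step 2: residualize the outcome and the treatment.
Y_res = Y - m_hat
W_res = W - e_hat

# Step 3: with a linear CATE tau(x) = a + b'x, minimizing the R-loss is
# ordinary least squares of Y_res on W_res * [1, x] with no intercept.
design = np.column_stack([np.ones(n), X])
cate_model = LinearRegression(fit_intercept=False).fit(W_res[:, None] * design, Y_res)
tau_hat = design @ cate_model.coef_

print("estimated CATE coefficients [a, b1..b4]:", np.round(cate_model.coef_, 2))
print("correlation with true CATE:", np.corrcoef(tau_hat, tau_true)[0, 1])
```

For a nonparametric CATE, the same loss can be minimized by regressing the pseudo-outcome `Y_res / W_res` on `X` with sample weights `W_res ** 2`, using any learner that accepts sample weights, because the squared R-loss term factors as `W_res ** 2 * (Y_res / W_res - tau) ** 2`.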

Step 4: Validate. The R-learner doesn't give you confidence intervals out of the box. Use the causal forest of Wager & Athey (2017) for inference. Causal forests are random forests built with "honest" trees: the data used to determine splits is separate from the data used to estimate effects within leaves. This honesty property, combined with subsampling, yields asymptotically normal CATE estimates with valid confidence intervals (Theorem 6, Wager & Athey 2017).
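The honesty idea can be illustrated without a full forest. The toy sketch below is not a causal forest: it grows a single split on one covariate in a simulated randomized experiment, choosing the cut point on one half of the data and estimating leaf effects (with standard errors) on the other half. The function names, the split criterion, and the simulated data are all hypothetical.

```python
import numpy as np

def diff_in_means(y, w):
    """Treated-minus-control mean difference and its standard error."""
    y1, y0 = y[w == 1], y[w == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return est, se

def honest_single_split(x, y, w, candidate_cuts, rng):
    """Choose a cut on one half of the data, estimate leaf effects on the other."""
    idx = rng.permutation(len(y))
    split_idx, est_idx = idx[: len(y) // 2], idx[len(y) // 2 :]
    xs, ys, ws = x[split_idx], y[split_idx], w[split_idx]

    # Splitting half: pick the cut that maximizes the gap between leaf effects.
    def effect_gap(cut):
        left, _ = diff_in_means(ys[xs <= cut], ws[xs <= cut])
        right, _ = diff_in_means(ys[xs > cut], ws[xs > cut])
        return abs(left - right)

    best_cut = max(candidate_cuts, key=effect_gap)

    # Estimation half: effects and standard errors use data the split never saw.
    xe, ye, we = x[est_idx], y[est_idx], w[est_idx]
    left = diff_in_means(ye[xe <= best_cut], we[xe <= best_cut])
    right = diff_in_means(ye[xe > best_cut], we[xe > best_cut])
    return best_cut, left, right

# Simulated randomized experiment (hypothetical): the effect changes sign at x = 0.5.
rng = np.random.default_rng(7)
n = 4000
x = rng.uniform(0, 1, size=n)
w = rng.binomial(1, 0.5, size=n)
tau = np.where(x <= 0.5, -1.0, 0.5)
y = x + w * tau + rng.normal(size=n)

cuts = np.linspace(0.2, 0.8, 7)
cut, (est_l, se_l), (est_r, se_r) = honest_single_split(x, y, w, cuts, rng)
print(f"chosen cut: {cut:.2f}")
print(f"left leaf effect:  {est_l:.2f} +/- {1.96 * se_l:.2f}")
print(f"right leaf effect: {est_r:.2f} +/- {1.96 * se_r:.2f}")
```

Because the leaf estimates come from data that played no role in choosing the cut, the reported intervals are not distorted by the search over cut points. A real causal forest repeats this idea over many subsampled trees and many covariates.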

In your analysis, you find that the program reduces readmissions by 15% for patients with ejection fraction below 35% and no prior hospitalization, but increases readmissions by 3% for patients with ejection fraction above 50% and three or more prior hospitalizations. The average effect of 8% was masking this heterogeneity.

When to Choose Causal Forests Over the R-Learner

The R-learner is flexible—you can use any ML method for the final regression. But it requires careful cross-fitting and doesn't naturally produce confidence intervals. Causal forests give you both point estimates and inference in a single procedure, but they assume the CATE function is well-approximated by a forest structure.

Choose causal forests when: You need valid pointwise confidence intervals for the CATE at specific covariate profiles, and you expect the CATE function to have discontinuities or interactions (forests excel at this).

Choose the R-learner when: You have strong prior knowledge about the CATE function's structure (e.g., it's sparse linear), or you want to use a specific ML method (e.g., neural networks for image-based covariates).

Both methods require unconfoundedness—the assumption that you've measured all confounders. This is untestable. If you suspect unmeasured confounding, consider the partial identification approach of Schweisthal et al. (2024), which uses multiple environments (e.g., different hospitals) to bound CATEs rather than point-identify them.

The Most Common Failure Mode: Ignoring Overlap

Practitioners obsess over unconfoundedness but neglect overlap (positivity). The R-learner and causal forests both require that, for every covariate profile, there is a non-negligible probability of receiving treatment and a non-negligible probability of receiving control. When overlap fails (e.g., all patients with ejection fraction below 25% receive the program), your CATE estimate for that region is an extrapolation, not an estimate.

Wager & Athey (2017) require $\eta < e(x) < 1 - \eta$ for some $\eta > 0$. In practice, check the distribution of estimated propensity scores. If you see mass near 0 or 1, you have an overlap problem. Solutions include trimming the sample (removing units with extreme propensities) or restricting your analysis to the region of common support.
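A basic overlap check and trimming step might look like the sketch below. The function name `overlap_report`, the choice of quantiles, and the simulated propensities are hypothetical. The threshold `eta=0.05` is a common convention, not a value prescribed by the cited paper.

```python
import numpy as np

def overlap_report(e_hat, w, eta=0.05):
    """Summarize estimated propensities by arm and flag units outside (eta, 1 - eta)."""
    e_hat = np.asarray(e_hat)
    w = np.asarray(w)
    keep = (e_hat > eta) & (e_hat < 1 - eta)   # boolean mask for the trimmed sample
    summary = {
        "share_trimmed": 1 - keep.mean(),
        "treated_quantiles": np.quantile(e_hat[w == 1], [0.01, 0.5, 0.99]),
        "control_quantiles": np.quantile(e_hat[w == 0], [0.01, 0.5, 0.99]),
    }
    return keep, summary

# Hypothetical estimated propensities with some mass near 0 and 1.
rng = np.random.default_rng(3)
e_hat = rng.beta(0.7, 0.7, size=5000)
w = rng.binomial(1, e_hat)

keep, summary = overlap_report(e_hat, w, eta=0.05)
print(f"share of units trimmed: {summary['share_trimmed']:.3f}")
print("treated propensity quantiles (1%, 50%, 99%):", np.round(summary["treated_quantiles"], 3))
print("control propensity quantiles (1%, 50%, 99%):", np.round(summary["control_quantiles"], 3))
```

Downstream estimation would then use only the rows where `keep` is true, with the caveat that the resulting CATE estimates describe the trimmed population rather than all patients.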

A second failure mode is ignoring model uncertainty in the nuisance functions. The R-learner's quasi-oracle property requires $o(n^{-1/4})$ rates for nuisance estimation. If your propensity model is badly misspecified, this condition fails. Always cross-validate your nuisance models and consider using flexible methods (gradient boosting, neural networks) rather than logistic regression for the propensity score.

Practical Checklist

Before deploying HTE methods in a real study, verify:

Unconfoundedness is plausible. List the confounders you've measured. Argue—with domain experts—why unmeasured confounders are unlikely to be strong enough to reverse your conclusions. If you can't make this case, consider sensitivity analysis or partial identification.
Overlap holds in the region of interest. Plot the distribution of estimated propensity scores by treatment group. If you see mass within 0.05 of 0 or 1, restrict your analysis to the region where $0.05 < \hat{e}(x) < 0.95$.
Nuisance models are cross-fitted. Never estimate nuisance functions and CATEs on the same data. Use K-fold cross-fitting (K=5 or 10) to break the dependence.
You have a plan for inference. Point estimates of CATEs are useless without uncertainty quantification. Use causal forests for built-in confidence intervals, or bootstrap the R-learner with proper accounting for the nuisance estimation step.
You've pre-specified your heterogeneity analysis. Decide in advance which covariates you'll examine for heterogeneity. If you must explore post-hoc, use a holdout set for validation and report both exploratory and confirmatory analyses transparently.
The sample size supports your ambitions. HTE estimation requires substantially more data than ATE estimation. A rule of thumb: you need at least 200 events (e.g., readmissions) in the smaller treatment group to detect moderate heterogeneity. If you're underpowered, consider restricting to a single pre-specified subgroup.

Part of the DoOperator Research series on Heterogeneous Treatment Effects. Browse the full paper corpus at dooperator.ai/research/heterogeneous_effects.