Beyond Average Treatment Effects: Finding Who Benefits From Your Interventions

Your A/B test just came back. The average treatment effect (ATE) is +2%. Statistically significant at p < 0.05. The team is ready to ship. But then you slice the data: power users show +8% lift. New users show -3%. The “average” is not just hiding the story—it’s actively misleading you. If you roll out the feature to everyone based on that +2%, you’ll lose new users while patting yourself on the back.

This is the fundamental problem with relying on the Average Treatment Effect (ATE) alone. Acting on a single average treats the effect as if it were the same for every unit. In reality, heterogeneous treatment effects (HTE) are the norm, not the exception. People, products, and contexts differ. The question isn’t whether effects vary. It’s how to find where they vary, and whether you can trust what you find.

Here are four key insights to move beyond the ATE and start finding who truly benefits from your interventions.

1. Causal Forests: When You Need to Estimate CATE Without Overfitting

The first step beyond ATE is estimating the Conditional Average Treatment Effect (CATE): the effect for a person with specific characteristics $X$. The naive approach is to split your data into subgroups and estimate ATEs within each. This fails because each subgroup estimate rests on a small slice of the data, so with many subgroups you end up fitting noise.
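
For readers who prefer symbols, the two quantities can be written in potential-outcomes notation. The symbols $Y(1)$, $Y(0)$ and $\tau$ are introduced here for illustration and do not appear elsewhere in this post:

$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)], \qquad \tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x], \qquad \text{ATE} = \mathbb{E}[\tau(X)]$$

The last identity is the reason an average can mislead: very different functions $\tau(x)$ can average out to the same ATE.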

Causal forests, developed by Wager & Athey (2018), solve this by adapting Breiman’s random forests to causal inference. Instead of predicting outcomes, each tree splits on $X$ to maximize heterogeneity (the difference in treatment effects across leaves), not outcome prediction accuracy. Trees are grown “honestly”: one subsample chooses the splits and a separate subsample estimates the effects within the leaves. The forest then averages across trees, giving you a stable CATE estimate for each individual.

When they work: Causal forests shine when you have moderate-to-large samples (think 5,000+ observations) and a rich set of covariates. They automatically handle non-linearities and interactions. The catch: they’re data-hungry. As a rough rule of thumb, with fewer than 1,000 observations the variance tends to overwhelm the signal. They also assume unconfoundedness: every confounder that affects both treatment assignment and the outcome must be measured. A properly randomized A/B test satisfies this by design; observational data does not.

Practical tip: Use causal forests as a screening tool, not a final decision. They’ll tell you which subgroups seem to have different effects, but you should validate those subgroups with a holdout set or a separate experiment.
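
To make the idea of sample-split estimation concrete, here is a toy sketch. It is not a causal forest: it is a single one-split "stump" on one covariate, written with NumPy only, where one half of the data picks the split and the other half estimates the leaf effects. The function names, the simulated data, and the true effects (-0.3 and +0.8) are all assumptions made up for illustration.

```python
import numpy as np

def diff_in_means(y, t):
    """Difference in mean outcome between treated (t == 1) and control (t == 0)."""
    return y[t == 1].mean() - y[t == 0].mean()

def honest_stump(x, y, t, min_leaf=200, seed=0):
    """Toy one-split 'causal tree' on a single covariate x.

    One half of the data picks the threshold that maximizes the gap in
    estimated effects between leaves. The other half re-estimates the
    leaf effects, so the reported numbers are not tuned to the same noise.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    split_idx, est_idx = idx[: len(x) // 2], idx[len(x) // 2 :]

    xs, ys, ts = x[split_idx], y[split_idx], t[split_idx]
    best_gap, best_thr = -np.inf, None
    for thr in np.quantile(xs, np.linspace(0.1, 0.9, 17)):
        left, right = xs <= thr, xs > thr
        # Skip thresholds that leave too few treated or control units in a leaf.
        if min(ts[left].sum(), (1 - ts[left]).sum(),
               ts[right].sum(), (1 - ts[right]).sum()) < min_leaf:
            continue
        gap = abs(diff_in_means(ys[left], ts[left])
                  - diff_in_means(ys[right], ts[right]))
        if gap > best_gap:
            best_gap, best_thr = gap, thr

    # Estimation half: plain difference in means inside each leaf.
    xe, ye, te = x[est_idx], y[est_idx], t[est_idx]
    left = xe <= best_thr
    return (best_thr,
            diff_in_means(ye[left], te[left]),
            diff_in_means(ye[~left], te[~left]))

# Simulated experiment (hypothetical): effect is -0.3 for x <= 0, +0.8 for x > 0.
rng = np.random.default_rng(42)
n = 8000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
tau = np.where(x > 0, 0.8, -0.3)
y = 0.5 * x + tau * t + rng.normal(size=n)

threshold, effect_left, effect_right = honest_stump(x, y, t)
print(f"split at x = {threshold:.2f}: "
      f"left effect {effect_left:.2f}, right effect {effect_right:.2f}")
```

A real causal forest grows many such trees on subsamples, splits on many covariates, and averages the results. In practice you would reach for an existing implementation instead of writing your own.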

2. Meta-Learners: The Practical Tradeoffs (T, S, X, R)

If causal forests feel like a black box, meta-learners offer a modular approach. You decompose the problem into sub-problems that standard machine learning models can solve.

T-learner (Two models): Train one model for the treated group, one for the control. CATE = $\hat{\mu}_1(x) - \hat{\mu}_0(x)$. Simple, but each model sees only its own arm’s data, which is inefficient when sample sizes are small or treatment groups are imbalanced.
S-learner (Single model): Train one model with treatment as a feature. CATE = $\hat{\mu}(x, T=1) - \hat{\mu}(x, T=0)$. Efficient with data, but the model may “ignore” the treatment feature if other features are more predictive of the outcome.
X-learner: Designed for imbalanced treatment groups (e.g., 90% control, 10% treated). It uses each group’s outcome model to impute individual treatment effects for the other group, fits a model to those imputed effects, and combines the two resulting CATE estimates with propensity-score weights. Best choice when treatment is rare: think a new feature shown to only 5% of users.
R-learner: Uses a “residualized” approach. It residualizes both the outcome and the treatment on the covariates, then estimates CATE by regressing the outcome residuals on the treatment residuals. Best when you have high-dimensional controls (e.g., 100+ features), because it isolates the variation in the outcome that treatment actually explains.
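
The R-learner’s residualization is easier to see as a formula. As it is usually stated, the CATE function is chosen to minimize a squared loss on residuals. The symbols here are assumptions of this sketch: $\hat{m}(x)$ is an estimate of $\mathbb{E}[Y \mid X = x]$, $\hat{e}(x)$ is an estimate of the treatment probability $P(T = 1 \mid X = x)$, and $\Lambda$ is a regularization penalty:

$$\hat{\tau} = \arg\min_{\tau} \sum_{i=1}^{n} \Big[ \big(Y_i - \hat{m}(X_i)\big) - \big(T_i - \hat{e}(X_i)\big)\,\tau(X_i) \Big]^2 + \Lambda(\tau)$$

In a randomized experiment with a fixed assignment rate, $\hat{e}(x)$ is simply that rate.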

Tradeoff summary: T-learner for balanced, large samples. X-learner when treatment is rare. R-learner when you have many controls. S-learner as a quick baseline, but don’t trust it blindly.
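
The T- and S-learner recipes above can be sketched in a few lines. This is a minimal illustration, assuming plain least-squares regression as the base model and simulated data with a known effect of 1 + 2 * x0. The helper names are hypothetical, and a real analysis would usually swap in a more flexible learner.

```python
import numpy as np

def fit_ols(features, y):
    """Least-squares coefficients for y ~ [1, features]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, features):
    """Predictions from coefficients returned by fit_ols."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

def t_learner_cate(X, y, t, X_new):
    """T-learner: separate outcome models for treated and control units."""
    mu1 = fit_ols(X[t == 1], y[t == 1])
    mu0 = fit_ols(X[t == 0], y[t == 0])
    return predict_ols(mu1, X_new) - predict_ols(mu0, X_new)

def s_learner_cate(X, y, t, X_new):
    """S-learner: one outcome model with treatment as just another feature."""
    coef = fit_ols(np.column_stack([X, t]), y)
    ones, zeros = np.ones(len(X_new)), np.zeros(len(X_new))
    return (predict_ols(coef, np.column_stack([X_new, ones]))
            - predict_ols(coef, np.column_stack([X_new, zeros])))

# Simulated experiment (hypothetical): the true effect is 1 + 2 * x0.
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))
t = rng.integers(0, 2, size=n)
true_cate = 1.0 + 2.0 * X[:, 0]
y = X @ np.array([1.0, -1.0, 0.5]) + true_cate * t + rng.normal(size=n)

cate_t = t_learner_cate(X, y, t, X)
cate_s = s_learner_cate(X, y, t, X)
print("T-learner correlation with true CATE:", np.corrcoef(cate_t, true_cate)[0, 1])
print("S-learner spread of CATE estimates:  ", cate_s.std())
```

With a linear base model and no interaction terms, the S-learner can only return one constant effect for everyone, so its spread is essentially zero. That is an extreme version of the “ignores the treatment feature” failure mode: the heterogeneity is in the data, but the model has no way to express it.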

3. GATES and BLP: Testing If Heterogeneity Is Real (and Summarizing It)

You’ve estimated CATEs. Now: is the heterogeneity real, or just noise? And how do you communicate it to stakeholders who don’t want to hear about “CATE distributions”?

Chernozhukov et al. (2018) gave us two tools:

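BLP (Best Linear Predictor): Regress the outcome on the treatment indicator and on its interaction with the (demeaned) predicted CATE. The coefficient on the interaction tells you whether your CATE model predicts actual differences in treatment effects. A coefficient significantly different from zero means your model found real heterogeneity, and a coefficient near 1 suggests the predictions are also roughly on the right scale. Simple, interpretable.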
GATES (Group Average Treatment Effects): Divide the sample into $K$ groups based on predicted CATE (e.g., quintiles). Estimate the ATE within each group. If the ATE in the top quintile is significantly different from the bottom, you have evidence of meaningful heterogeneity. This is what you show your VP: “Our model predicts an 8% lift for the top 20% of users, and a -1% lift for the bottom 20%.”

Why this matters: GATES and BLP protect you from over-interpreting noise. They rely on sample splitting: the CATE model is trained on one part of the data and evaluated on another, so the evaluation is not contaminated by overfitting. If your causal forest says the top decile has +12% lift but GATES says only +2% (not significant), the forest was likely overfitting.
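
A minimal sketch of both checks follows, in the spirit of the procedure described above and under simplifying assumptions: a randomized experiment with a known 50% assignment rate (so no weighting is needed), a single split into an auxiliary and a main sample instead of repeated splits, and a linear T-learner as the CATE proxy. The data is simulated with a true CATE of 0.5 + x, all names are hypothetical, and standard errors are omitted for brevity.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients for y ~ design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n, p_treat = 10000, 0.5
x = rng.normal(size=n)
t = rng.binomial(1, p_treat, size=n)
y = x + (0.5 + 1.0 * x) * t + rng.normal(size=n)   # true CATE: 0.5 + x

# Step 1: sample splitting. The auxiliary half trains the CATE proxy,
# the main half is used only for BLP and GATES.
idx = rng.permutation(n)
aux, main = idx[: n // 2], idx[n // 2 :]

# Step 2: proxy CATE from a linear T-learner fitted on the auxiliary half.
def fit_arm(arm):
    rows = aux[t[aux] == arm]
    return ols(np.column_stack([np.ones(len(rows)), x[rows]]), y[rows])

b1, b0 = fit_arm(1), fit_arm(0)
proxy = (b1[0] - b0[0]) + (b1[1] - b0[1]) * x[main]

# Step 3: BLP regression on the main half.
ym, tm = y[main], t[main]
t_centered = tm - p_treat
design = np.column_stack([
    np.ones(len(main)),                      # intercept
    proxy,                                   # control for the proxy itself
    t_centered,                              # coefficient: average effect
    t_centered * (proxy - proxy.mean()),     # coefficient: heterogeneity loading
])
blp = ols(design, ym)
print(f"BLP average effect: {blp[2]:.2f}, heterogeneity loading: {blp[3]:.2f}")

# Step 4: GATES. Quintiles of the proxy, difference in means within each.
edges = np.quantile(proxy, [0.2, 0.4, 0.6, 0.8])
group = np.searchsorted(edges, proxy)        # 0 = lowest quintile, 4 = highest
gates = np.array([
    ym[(group == g) & (tm == 1)].mean() - ym[(group == g) & (tm == 0)].mean()
    for g in range(5)
])
print("GATES by proxy quintile:", np.round(gates, 2))
```

Because the proxy is trained on one half and evaluated on the other, a proxy that merely memorized noise would show a heterogeneity loading near zero and flat GATES, no matter how dramatic its own predictions looked.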

4. The Multiple Testing Problem: Why 20 Subgroups Is Dangerous

This is where most practitioners get burned. You slice your data by age, region, device, subscription tier, referral source—20 subgroups. You find one with p = 0.03. You celebrate. But with 20 tests, you expect one “significant” result by chance alone (at α = 0.05).
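
The arithmetic is worth seeing once. The snippet below assumes the 20 tests are independent and that no subgroup has a true effect. Both are simplifications, since real subgroups overlap.

```python
alpha, n_tests = 0.05, 20

# Expected number of "significant" results when every null is true.
expected_false_positives = alpha * n_tests

# Chance of at least one false positive across independent tests.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_at_least_one:.2f}")
```

Under these assumptions the chance of at least one spurious “win” is about 64%, so finding a single p = 0.03 among 20 slices is closer to the expected outcome than to a discovery.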

The real danger isn’t just false positives—it’s that you’ll act on them. You’ll target the subgroup that appeared to benefit, but the effect was noise. Next experiment, it disappears.

How to handle it:

Pre-register your subgroups. Before you see the data, specify: “We will test heterogeneity by user tenure (new vs. returning) and device type (mobile vs. desktop).” Limit to 2–3 hypotheses.
Use a global test first. Before slicing, run an omnibus test for heterogeneity (e.g., a likelihood ratio test comparing a model with treatment-covariate interactions vs. one without). If it’s not significant, stop. You don’t have evidence of any heterogeneity.
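Adjust for multiple comparisons. If you must test many subgroups, correct for it. Bonferroni (or Holm) controls the chance of even one false positive but is conservative. Benjamini-Hochberg controls the False Discovery Rate (FDR) instead: the expected share of false positives among the subgroups you flag. It’s less conservative and still protects you.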
Validate on a holdout set. Split your experiment data into training (70%) and validation (30%). Find subgroups in training, test them in validation. If the effect doesn’t replicate, it’s noise.
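
The two corrections mentioned in the list above are short enough to write out. This is a minimal NumPy sketch. The p-values are made up for illustration, and the Benjamini-Hochberg guarantee assumes independent (or positively dependent) tests.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: find the largest k with p_(k) <= (k / m) * alpha
    and reject the k smallest p-values. Controls the false discovery rate."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()      # index of the largest passing rank
        reject[order[: k + 1]] = True
    return reject

# Hypothetical subgroup p-values, in arbitrary order.
pvals = [0.038, 0.001, 0.2, 0.012, 0.025]
print("Bonferroni rejects:", bonferroni(pvals).tolist())
print("BH rejects:        ", benjamini_hochberg(pvals).tolist())
```

With these five p-values Bonferroni keeps one finding while Benjamini-Hochberg keeps four, which is the conservatism gap in practice. Which error rate you want to control depends on how costly it is to act on a single false subgroup.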

When to Look for HTE—and When Not To

Look for HTE when:

You have a large sample (N > 5,000) with rich covariates.
You have strong theoretical reasons that effects might differ (e.g., power users vs. new users).
You’re willing to act on the heterogeneity (e.g., target the feature to a subset).
You can pre-register your analysis plan.

Do not look for HTE when:

Your sample is small (N < 1,000). You’ll find patterns that don’t exist.
Your treatment effect is near zero (ATE ≈ 0) and you have no prior reason to expect offsetting effects. Heterogeneity is possible, but you’ll likely overfit to noise.
You’re not willing to restrict the rollout. If you’re going to ship to everyone anyway, don’t bother—you’re just adding complexity and false confidence.

The Bottom Line

The ATE is a starting point, not a destination. The +2% average hides the story of the power user who loves your feature and the new user who churns because of it. Causal forests, meta-learners, and tools like GATES/BLP give you the machinery to find those stories—but only if you use them with discipline.

Pre-register your subgroups. Test globally before slicing. Validate your findings. And always remember: the goal isn’t to find any subgroup with a big effect—it’s to find the right subgroup that you can confidently target.

Your next A/B test will show a +2% average. But this time, you’ll know the story behind it.