DoOperator Research · May 17, 2026
Your A/B test just came back. The average treatment effect (ATE) is +2%. Statistically significant at p < 0.05. The team is ready to ship. But then you slice the data: power users show +8% lift. New users show -3%. The “average” is not just hiding the story—it’s actively misleading you. If you roll out the feature to everyone based on that +2%, you’ll lose new users while patting yourself on the back.
This is the fundamental problem with relying on the ATE alone. The ATE is a single population-wide average, so acting on it as if it applied to everyone implicitly treats the effect as constant across all units. In reality, heterogeneous treatment effects (HTE) are the norm, not the exception. People, products, and contexts differ. The question isn’t whether effects vary; it’s how to find where they vary, and whether you can trust what you find.
Here are four key insights to move beyond the ATE and start finding who truly benefits from your interventions.
The first step beyond ATE is estimating the Conditional Average Treatment Effect (CATE): the expected effect for a person with a given set of characteristics (covariates). The naive approach is to split your data into subgroups and estimate ATEs within each. This fails because you’ll overfit, especially with many subgroups: each cell gets smaller and noisier, and the largest apparent effects are the ones most likely to be noise.
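For readers who prefer symbols, here is one common way to write this down. The notation is standard potential-outcomes shorthand and is an assumption of this sketch, not something defined earlier in the article: Y(1) and Y(0) are a unit’s outcomes with and without treatment, and X is its covariate vector.

```latex
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right],
\qquad
\mathrm{ATE} = \mathbb{E}\left[\, \tau(X) \,\right]

% Illustration only: suppose the population consisted of just two segments,
% power users (share p, effect +8) and new users (share 1 - p, effect -3).
% Then the +2 average pins down the mix:
8p - 3(1 - p) = 2 \;\Longrightarrow\; p = \tfrac{5}{11} \approx 0.45
```

The ATE is just the CATE averaged over whoever happens to be in your sample. Under the hypothetical two-segment assumption above, a +2% average is consistent with roughly 45% of users gaining 8% while the remaining 55% lose 3%.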
Causal forests, developed by Wager & Athey (2018), solve this by adapting Breiman’s random forests to causal inference. Instead of predicting outcomes, each tree in the forest estimates treatment effects by splitting on covariates in a way that maximizes heterogeneity (the difference in treatment effects across leaves), not just outcome prediction. The forest then averages across trees, giving you a stable, honest CATE estimate for each individual.
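To make the splitting idea concrete, here is a toy sketch of a single split, not a causal forest. It scores each candidate threshold by how different the estimated effects are on either side and keeps the best one. Everything here is an assumption for illustration: the simulated data, the names `simulate`, `diff_in_means`, and `best_split`, and the particular scoring rule. A real causal forest also uses honesty (separate subsamples for choosing splits and estimating effects) and averages many trees, both of which this sketch omits.

```python
import random
from statistics import mean


def simulate(n=4000, seed=1):
    """Simulated randomized experiment (hypothetical data).

    x is a single covariate in [0, 1]; the true effect is assumed to be
    -3 for x <= 0.5 and +8 for x > 0.5.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        t = rng.random() < 0.5                      # coin-flip assignment
        tau = 8.0 if x > 0.5 else -3.0              # assumed true effect
        y = 50.0 + (tau if t else 0.0) + rng.gauss(0.0, 5.0)
        rows.append((x, t, y))
    return rows


def diff_in_means(rows):
    """Treatment effect estimate: mean treated outcome minus mean control outcome."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def best_split(rows, thresholds):
    """Return (score, threshold, tau_left, tau_right) for the split that
    maximizes the size-weighted squared gap between child effects."""
    n = len(rows)
    best = None
    for c in thresholds:
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        tau_l, tau_r = diff_in_means(left), diff_in_means(right)
        score = len(left) * len(right) / n**2 * (tau_l - tau_r) ** 2
        if best is None or score > best[0]:
            best = (score, c, tau_l, tau_r)
    return best


score, threshold, tau_left, tau_right = best_split(
    simulate(), [i / 10 for i in range(1, 10)]
)
print(f"split at x <= {threshold}: left {tau_left:+.2f}, right {tau_right:+.2f}")
```

Notice that the score ignores how well the split predicts the outcome itself. It only rewards separating units whose effects differ, which is the shift in objective that distinguishes this family of methods from an ordinary regression tree.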
When they work: Causal forests shine when you have moderate-to-large samples (as a rough guide, 5,000+ observations) and a rich set of covariates. They automatically handle non-linearities and interactions. The catch: they’re data-hungry. With fewer than about 1,000 observations, the variance usually overwhelms the signal. They also assume unconfoundedness: every confounder that affects both treatment assignment and the outcome must be measured. A randomized A/B test satisfies this by design; observational data may not.
Practical tip: Use causal forests as a screening tool, not a final decision. They’ll tell you which subgroups seem to have different effects, but you should validate those subgroups with a holdout set or a separate experiment.
If causal forests feel like a black box, meta-learners offer a modular approach. You decompose the problem into sub-problems that standard machine learning models can solve.
Tradeoff summary: T-learner for balanced, large samples. X-learner when treatment is rare. R-learner when you have many controls. S-learner as a quick baseline, but don’t trust it blindly.
You’ve estimated CATEs. Now: is the heterogeneity real, or just noise? And how do you communicate it to stakeholders who don’t want to hear about “CATE distributions”?
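Chernozhukov et al. (2018) gave us two tools: Sorted Group Average Treatment Effects (GATES), which reports the average effect within groups ranked by predicted CATE, and the Best Linear Predictor (BLP), which tests whether the predicted CATEs actually track the observed variation in effects.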
Why this matters: GATES and BLP protect you from over-interpreting noise. They use cross-fitting to avoid overfitting bias. If your causal forest says the top decile has +12% lift but GATES says only +2% (not significant), the forest was likely overfitting.
This is where most practitioners get burned. You slice your data by age, region, device, subscription tier, referral source—20 subgroups. You find one with p = 0.03. You celebrate. But with 20 tests, you expect one “significant” result by chance alone (at α = 0.05).
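The arithmetic behind that claim is short enough to check directly. This sketch assumes the 20 tests are independent and that no subgroup has a real effect. The function names are hypothetical, and the Bonferroni threshold is included as one common correction, not something the article prescribes.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of 'significant' results among m true-null tests."""
    return m * alpha


def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / m


m = 20
expected = expected_false_positives(m)
fwer = familywise_error_rate(m)
cutoff = bonferroni_threshold(m)
print(f"expected false positives: {expected:.2f}")
print(f"chance of at least one:   {fwer:.2%}")
print(f"Bonferroni cutoff:        {cutoff:.4f}")
```

With 20 subgroups you expect one false positive on average, and the chance of seeing at least one is roughly 64%. Against a Bonferroni cutoff of 0.0025, the celebrated p = 0.03 is nowhere near significant.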
The real danger isn’t just false positives—it’s that you’ll act on them. You’ll target the subgroup that appeared to benefit, but the effect was noise. Next experiment, it disappears.
How to handle it:
Look for HTE when:
Do not look for HTE when:
The ATE is a starting point, not a destination. The +2% average hides the story of the power user who loves your feature and the new user who churns because of it. Causal forests, meta-learners, and tools like GATES/BLP give you the machinery to find those stories—but only if you use them with discipline.
Pre-register your subgroups. Test globally before slicing. Validate your findings. And always remember: the goal isn’t to find any subgroup with a big effect—it’s to find the right subgroup that you can confidently target.
Your next A/B test will show a +2% average. But this time, you’ll know the story behind it.