Estimation and Inference of Heterogeneous Treatment Effects using Random Forests — DoOperator Research

Authors	Stefan Wager, Susan Athey
Year	2015

What Problem It Solves

This paper addresses the challenge of estimating heterogeneous treatment effects (HTE) — how the causal effect of a treatment varies across individuals with different observed characteristics — in settings with many covariates and complex interactions. Classical nonparametric methods for HTE estimation, such as nearest-neighbor matching, kernel methods, and series estimation, perform well with few covariates but break down as the dimensionality increases due to the curse of dimensionality. Meanwhile, machine learning methods like random forests excel at high-dimensional prediction but lack the inferential guarantees needed for causal inference: researchers need confidence intervals and hypothesis tests for treatment effects, not just point predictions. The paper bridges this gap by developing causal forests — a modification of Breiman's random forests — that provide consistent, asymptotically normal estimates of heterogeneous treatment effects under unconfoundedness, along with valid confidence intervals. This enables researchers to explore treatment effect heterogeneity without pre-specifying subgroups, while avoiding the pitfalls of data dredging and false discovery.

How it works

The core idea of causal forests is to adapt random forests, which average predictions from many regression trees, to estimate treatment effects directly rather than outcomes. The key insight is that a random forest can be interpreted as an adaptive nearest-neighbor method: for a test point \(x\), the forest prediction is a weighted average of training outcomes, where the weights reflect how often each training point falls into the same leaf as \(x\) across all trees. Causal forests extend this by constructing weights that are tailored to estimating the difference in conditional expectations between treated and control groups.
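The adaptive nearest-neighbor view can be made concrete with a minimal sketch. It assumes the leaf assignments are already known; the function name `forest_weights` and the toy leaf ids are hypothetical and only illustrate the weighting, not how the trees are grown.

```python
def forest_weights(train_leaves, test_leaves):
    """Forest-induced weights w_i(x) for one test point x.

    train_leaves[b][i]: leaf id of training point i in tree b (assumed given).
    test_leaves[b]:     leaf id of the test point x in tree b.
    Each tree spreads weight 1 evenly over the training points that share
    x's leaf; the forest averages these per-tree weights.
    """
    n_trees = len(train_leaves)
    weights = [0.0] * len(train_leaves[0])
    for leaves_b, leaf_x in zip(train_leaves, test_leaves):
        neighbors = [i for i, leaf in enumerate(leaves_b) if leaf == leaf_x]
        for i in neighbors:
            weights[i] += 1.0 / (len(neighbors) * n_trees)
    return weights


# Toy example: two trees, four training points.
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]
test_leaves = [0, 1]
y = [1.0, 2.0, 3.0, 4.0]

w = forest_weights(train_leaves, test_leaves)
weighted_prediction = sum(wi * yi for wi, yi in zip(w, y))

# The same number obtained tree by tree: the average of the leaf means.
leaf_means = [(1.0 + 2.0) / 2, (2.0 + 3.0 + 4.0) / 3]
average_of_trees = sum(leaf_means) / len(leaf_means)
```

In this toy case the weighted average of outcomes and the average of per-tree leaf means coincide, which is the equivalence the paragraph above relies on.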

Intuition in three steps:

Build many causal trees. Each tree in the forest is grown using a subsample of the data. The tree recursively partitions the covariate space, and at each leaf, it estimates the treatment effect as the difference in mean outcomes between treated and control units within that leaf. Crucially, the splitting criterion is designed to maximize heterogeneity in treatment effects, not in outcomes. The tree uses "honest" splitting: one subsample determines the tree structure, and a separate subsample (or the out-of-bag portion) estimates the leaf-level effects.
Average across trees. For a test point \(x\), the causal forest prediction \(\hat{\tau}(x)\) is the average, over all trees, of the treatment effect estimate in the leaf that contains \(x\). This averaging reduces variance and smooths the piecewise constant tree predictions.
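Construct adaptive weights. The forest induces a weight \(w_i(x)\) for each training observation \(i\), measuring how often \(i\) is in the same leaf as \(x\). In its propensity-weighted form, the estimator can be written as: \[ \hat{\tau}(x) = \sum_{i=1}^n w_i(x) \cdot \frac{Y_i \cdot (W_i - \hat{e}(X_i))}{\hat{e}(X_i)(1 - \hat{e}(X_i))} \] where \(Y_i\) is the observed outcome, \(W_i \in \{0, 1\}\) is the treatment indicator, and \(\hat{e}(x)\) is an estimate of the propensity score. This is a forest-weighted version of the inverse-propensity weighting (IPW) estimator: under unconfoundedness, the transformed outcome has conditional mean \(\tau(x)\) when the true propensity score is used, so as written the estimator relies on the propensity model being correct. The augmented (AIPW) form replaces \(Y_i\) with residuals from fitted outcome regressions and adds back the regression-based effect estimate. It is this augmentation that provides robustness to misspecification of the propensity score and achieves the semiparametric efficiency bound under certain conditions.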
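The first two steps can be illustrated with a toy sketch under strong simplifying assumptions: a single scalar covariate, one split per tree (a "stump"), a randomized treatment with probability 0.5, and a splitting score equal to the size-weighted squared difference between the two leaf effects. All names (`grow_honest_stump`, `grow_forest`, `simulate`) and the simulated data are hypothetical; this is not the paper's implementation.

```python
import random


def diff_in_means(rows):
    """Treated-minus-control mean outcome, or None if an arm is empty."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)


def grow_honest_stump(sample, rng, min_arm=5):
    """One-split causal tree: one half picks the cut, the other half estimates."""
    rows = list(sample)
    rng.shuffle(rows)
    half = len(rows) // 2
    split_half, est_half = rows[:half], rows[half:]

    best_score, best_cut = -1.0, None
    for cut in sorted({x for x, _, _ in split_half}):
        left = [r for r in split_half if r[0] <= cut]
        right = [r for r in split_half if r[0] > cut]
        arms = [sum(1 for r in side if r[1] == w)
                for side in (left, right) for w in (0, 1)]
        if min(arms) < min_arm:
            continue  # require both arms in both children
        # Heterogeneity score (an assumption of this sketch): reward cuts
        # whose children have different estimated treatment effects.
        gap = diff_in_means(left) - diff_in_means(right)
        score = len(left) * len(right) * gap ** 2
        if score > best_score:
            best_score, best_cut = score, cut
    if best_cut is None:
        return None

    # Honest estimation: leaf effects come only from the held-out half.
    tau_left = diff_in_means([r for r in est_half if r[0] <= best_cut])
    tau_right = diff_in_means([r for r in est_half if r[0] > best_cut])
    if tau_left is None or tau_right is None:
        return None
    return best_cut, tau_left, tau_right


def grow_forest(data, n_trees=100, subsample=200, seed=0):
    """Grow stumps on random subsamples drawn without replacement."""
    rng = random.Random(seed)
    stumps = (grow_honest_stump(rng.sample(data, subsample), rng)
              for _ in range(n_trees))
    return [s for s in stumps if s is not None]


def predict(forest, x):
    """Average the leaf-level effect of the leaf containing x over all trees."""
    preds = [tl if x <= cut else tr for cut, tl, tr in forest]
    return sum(preds) / len(preds)


def simulate(n=2000, seed=1):
    """Randomized experiment: the true effect is 2 for x > 0.5 and 0 otherwise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        w = 1 if rng.random() < 0.5 else 0
        tau = 2.0 if x > 0.5 else 0.0
        data.append((x, w, x + w * tau + rng.gauss(0.0, 0.5)))
    return data


data = simulate()
forest = grow_forest(data)
tau_low, tau_high = predict(forest, 0.1), predict(forest, 0.9)
print(round(tau_low, 2), round(tau_high, 2))
```

Because treatment is randomized here, the within-leaf difference in means needs no propensity adjustment; with observational data the propensity-weighted form from the third step would be the relevant variant.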

Theoretical mechanics:

The paper establishes two main theoretical results for causal forests:

Consistency: Under unconfoundedness and the paper's regularity conditions, \(\hat{\tau}(x) \xrightarrow{p} \tau(x)\) as \(n \to \infty\), provided the number of trees grows sufficiently fast and the leaf size grows slower than \(n\). The rate of convergence depends on the effective dimension of the problem, which can be much smaller than the number of covariates \(d\) if the true treatment effect function depends only on a few covariates.
Asymptotic normality and inference: The estimator \(\hat{\tau}(x)\) is asymptotically Gaussian and centered at the true \(\tau(x)\): \[ \frac{\hat{\tau}(x) - \tau(x)}{\sqrt{\hat{V}(x)}} \xrightarrow{d} \mathcal{N}(0, 1) \] where \(\hat{V}(x)\) is a variance estimate based on the infinitesimal jackknife (Efron, 2014; Wager et al., 2014). This variance estimator accounts for both the sampling variability from the subsampling and the uncertainty from the tree-growing process. The key innovation is that the variance can be estimated consistently using only the forest output, without requiring additional resampling.
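A minimal sketch of how such an interval might be computed from forest output is given below. It assumes the per-tree predictions at \(x\) and the subsample membership of each tree are available, and it uses the infinitesimal jackknife in the form of a sum of squared covariances between tree predictions and inclusion indicators, scaled by a finite-sample factor for subsampling without replacement. The function names are hypothetical, the scaling factor should be checked against the paper before use, and the Monte Carlo bias correction needed when the number of trees is small is omitted.

```python
from statistics import NormalDist, fmean


def infinitesimal_jackknife_variance(tree_preds, inclusion, subsample_size):
    """Sketch of an infinitesimal-jackknife variance estimate at one point x.

    tree_preds[b]:   prediction of tree b at x.
    inclusion[b][i]: 1 if observation i was in tree b's subsample, else 0.
    """
    n_trees = len(tree_preds)
    n = len(inclusion[0])
    mean_pred = fmean(tree_preds)
    total = 0.0
    for i in range(n):
        col = [inclusion[b][i] for b in range(n_trees)]
        mean_inc = fmean(col)
        # Covariance across trees between "i was used" and the prediction.
        cov = sum((t - mean_pred) * (c - mean_inc)
                  for t, c in zip(tree_preds, col)) / n_trees
        total += cov ** 2
    s = subsample_size
    # Finite-sample scaling for subsamples of size s (assumed form).
    return (n - 1) / n * (n / (n - s)) ** 2 * total


def confidence_interval(tau_hat, variance, level=0.95):
    """Gaussian interval tau_hat +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * variance ** 0.5
    return tau_hat - half_width, tau_hat + half_width


# Tiny hand-checkable example: two trees, three observations, subsamples of 2.
v_toy = infinitesimal_jackknife_variance(
    tree_preds=[1.0, 3.0],
    inclusion=[[1, 1, 0], [0, 1, 1]],
    subsample_size=2,
)
lower, upper = confidence_interval(tau_hat=1.0, variance=4.0)
```

The interval is only as good as the asymptotic approximation behind it, so with few trees or small samples the nominal coverage should not be taken for granted.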

The proof strategy builds on the adaptive nearest-neighbor interpretation of random forests (Lin & Jeon, 2006) and uses Hájek projections and Hoeffding decompositions to establish that the forest predictions are asymptotically linear, meaning they can be expressed as a sum of independent contributions plus a negligible remainder. This asymptotic linearity is what enables Gaussian inference.
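For reference, the Hájek projection of a statistic \(T\) computed from independent observations \(Z_1, \dots, Z_n\) is its best approximation by a sum of functions of single observations. The notation below is the standard textbook form and is not taken from the paper's own statement:

\[ \mathring{T} = \mathbb{E}[T] + \sum_{i=1}^n \left( \mathbb{E}[T \mid Z_i] - \mathbb{E}[T] \right) \]

If \(\operatorname{Var}[\mathring{T}] / \operatorname{Var}[T] \to 1\), then \(T\) and its projection are asymptotically equivalent, so a central limit theorem for the sum of independent terms carries over to \(T\). This is the sense in which the forest prediction is "asymptotically linear".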

When to use it

Prefer causal forests over classical methods (nearest-neighbor matching, kernel regression) when:

The number of covariates is moderate to large (say, \(d > 5\)), especially when many covariates are irrelevant or have complex interactions with the treatment effect. Causal forests automatically perform variable selection by splitting on the most informative covariates.
You need valid confidence intervals for conditional average treatment effects (CATE) at specific covariate values, not just for subgroup averages. Causal forests provide pointwise confidence intervals with asymptotic coverage guarantees.
The treatment effect function is believed to be smooth but potentially high-dimensional and nonlinear. The forest adapts to the local structure of the data.
You have a randomized experiment or an observational study where unconfoundedness is plausible and you have rich covariate information.

Prefer classical methods over causal forests when:

The number of covariates is very small (say, \(d \leq 3\)) and the sample size is large. In this regime, kernel methods with bandwidth selection can achieve the optimal nonparametric rate and may be simpler to implement and interpret.
You need inference on the average treatment effect (ATE) rather than heterogeneous effects. For ATE, simpler methods like inverse-propensity weighting or doubly robust estimators are more efficient and have better finite-sample properties.
The treatment effect is known to be linear or has a simple parametric form. In that case, a linear model with interactions is more interpretable and statistically efficient.
You have a very small sample size (say, \(n < 500\)). Causal forests require sufficient data to grow deep trees and estimate leaf-level effects reliably; with small samples, simpler methods may dominate.

Prefer causal forests over other machine learning methods for HTE (e.g., BART, causal boosting, meta-learners) when:

You need formal statistical inference (confidence intervals, hypothesis tests) with proven asymptotic guarantees. Causal forests are among the few methods with established asymptotic normality results.
You want to avoid tuning priors or MCMC convergence diagnostics. Causal forests have fewer tuning parameters and are more straightforward to fit.
The propensity score is unknown and must be estimated. Causal forests can incorporate propensity score estimates in a doubly robust fashion.

Prefer other methods over causal forests when:

You have strong prior information that can be encoded in a Bayesian framework (e.g., BART with informative priors).
You need to estimate the entire conditional average treatment effect function with high-dimensional sparse structure (e.g., lasso-based methods may be more appropriate).
You are primarily interested in optimal treatment assignment policies rather than effect estimation per se (consider policy learning methods).
