| Authors | Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins |
| Year | 2016 |
What Problem It Solves
This paper addresses the challenge of performing valid inference on a low-dimensional causal or structural parameter (e.g., an average treatment effect, a regression coefficient in a partially linear model, or a policy parameter) when the nuisance functions, such as outcome regressions, propensity scores, or instrument propensity models, must be estimated with high-dimensional or flexible machine learning methods. Classical semiparametric theory assumes that nuisance functions can be estimated at sufficiently fast rates and that the nuisance function class has low complexity (e.g., Donsker conditions hold). In modern settings with many covariates, ML methods like lasso, random forests, boosting, and neural nets are natural choices for nuisance estimation, but naively plugging them into estimating equations introduces two sources of bias that destroy root-n consistency and valid inference: (1) **regularization bias**, from the shrinkage or penalization that keeps the nuisance estimator from converging at the root-n rate, and (2) **overfitting bias**, from using the same data to estimate both the nuisance functions and the parameter of interest. The paper shows that combining Neyman-orthogonal moment conditions with cross-fitting (an efficient form of sample splitting) removes both biases. The resulting estimators are root-n consistent and asymptotically normal, and they admit valid confidence intervals under weak rate conditions on the nuisance estimators.
The core insight of DML is that two simple modifications to the standard "plug-in" approach—using Neyman-orthogonal scores and cross-fitting—eliminate the bias that otherwise plagues ML-based causal inference. The intuition proceeds in three steps.
Step 1: The bias problem with naive plug-in. Consider the partially linear regression model: Y = Dθ₀ + g₀(X) + U, where D is the treatment, X are covariates, and θ₀ is the parameter of interest. A naive approach estimates g₀ using ML on one sample split, then estimates θ₀ by regressing Y - ĝ(X) on D in the other split. The scaled estimation error √n(θ̂ - θ₀) contains a term proportional to (1/√n) Σ D_i (g₀(X_i) - ĝ(X_i)). Because D is correlated with X (through the confounding function m₀(X) = E[D|X]), this term has non-zero mean. Regularized ML estimators of g₀ converge at rates slower than root-n (e.g., n^{-1/3} for random forests under smoothness), so this term diverges. The estimator therefore fails to be root-n consistent: it inherits the slow convergence rate of ĝ, and confidence intervals built on a normal approximation are invalid.
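The divergence in Step 1 can be seen in a small simulation. This is a minimal sketch under stated assumptions, not the paper's experiment: the data-generating process, the constant `THETA0`, and the helper `naive_scaled_error` are all hypothetical, and instead of fitting a real learner the code uses the true g₀ plus an error of order n^{-1/3} that is correlated with m₀(X).

```python
import numpy as np

THETA0 = 0.5  # true parameter in the simulated partially linear model


def naive_scaled_error(n, rng):
    """Return sqrt(n) * (theta_naive - THETA0) for one simulated sample."""
    x = rng.normal(size=n)
    m0 = np.sin(x)                      # m0(X) = E[D | X]
    g0 = x ** 2                         # g0(X)
    d = m0 + rng.normal(size=n)         # treatment
    y = THETA0 * d + g0 + rng.normal(size=n)
    # ASSUMPTION: stand-in for an ML fit of g0 from a separate split, with an
    # error that shrinks like n^(-1/3) and is correlated with m0(X).
    g_hat = g0 + n ** (-1 / 3) * m0
    # Naive plug-in: regress Y - g_hat(X) on D (no intercept).
    theta_naive = np.sum(d * (y - g_hat)) / np.sum(d * d)
    return np.sqrt(n) * (theta_naive - THETA0)


rng = np.random.default_rng(0)
mean_scaled_error = {
    n: float(np.mean([naive_scaled_error(n, rng) for _ in range(200)]))
    for n in (500, 50_000)
}
print(mean_scaled_error)
```

Under these assumptions the average scaled error moves further from zero as n grows, which is the signature of a bias that shrinks more slowly than n^{-1/2}.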
Step 2: Orthogonalization via Neyman scores. The solution is to construct an estimating equation that is "orthogonal" to the nuisance parameters. For the partially linear model, the orthogonal score is ψ(W; θ, η) = (D - m(X))(Y - Dθ - g(X)), where η = (m, g) and the true value is η₀ = (m₀, g₀). The key property is that the Gateaux derivative of E[ψ(W; θ₀, η)] with respect to η, evaluated at the true η₀, is zero. This means that small errors in estimating m₀ and g₀ have only second-order effects on the moment condition. Concretely, the bias term becomes (1/√n) Σ (m̂(X_i) - m₀(X_i))(ĝ(X_i) - g₀(X_i)), which is the product of two estimation errors. If each error converges at rate n^{-φ}, their product converges at rate n^{-2φ}, which is o(n^{-1/2}) whenever φ > 1/4. Each nuisance estimator therefore only needs to converge faster than n^{-1/4}, a rate far slower than root-n that many ML methods can attain.
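The orthogonality property can be checked directly for this score. The following is a short derivation sketch, assuming E[U | D, X] = 0 and writing V = D - m₀(X), so that E[V | X] = 0. The perturbation directions δ_m = m - m₀ and δ_g = g - g₀ and the path function Ψ(r) are notation introduced here for illustration, with Ψ(r) denoting the expected score at θ₀ along the path η₀ + r(δ_m, δ_g):

$$
\begin{aligned}
\Psi(r) &= E\big[(V - r\,\delta_m(X))\,(U - r\,\delta_g(X))\big], \\
\frac{\partial \Psi}{\partial r}\Big|_{r=0} &= -E\big[\delta_m(X)\,U\big] - E\big[V\,\delta_g(X)\big] = 0, \\
\Psi(r) &= r^{2}\,E\big[\delta_m(X)\,\delta_g(X)\big].
\end{aligned}
$$

Both first-order terms vanish by the two conditional mean restrictions, and E[VU] = 0 for the same reason, so only the product of the two nuisance errors survives.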
Step 3: Cross-fitting to control overfitting. Even with orthogonal scores, using the same data to estimate both η and θ creates overfitting bias. The solution is K-fold cross-fitting: (a) Split the data into K folds of roughly equal size. (b) For each fold k, estimate the nuisance functions η̂_k using all data except fold k. (c) Using only fold k data, compute the empirical moment condition ψ(W_i; θ, η̂_k) and solve for θ. (d) Average the K estimates. Cross-fitting ensures that the nuisance estimates are independent of the data used to evaluate the moment condition, so terms like (1/√n) Σ V_i (ĝ(X_i) - g₀(X_i)), where V_i = D_i - m₀(X_i), have conditional mean zero and a variance that shrinks as the nuisance estimation error shrinks.
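Steps (a) to (d) above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the helpers `basis`, `ridge_coef`, and `dml1_plr` are hypothetical names, the nuisance learner is a simple ridge regression on polynomial features of a scalar X standing in for an ML method, and ĝ is taken (as an assumption) from a partially linear series fit of Y on D and the polynomial terms.

```python
import numpy as np


def basis(x, degree=3):
    # Polynomial features [x^3, x^2, x, 1] of a scalar covariate.
    return np.vander(x, degree + 1)


def ridge_coef(A, t, penalty=1e-3):
    # Ridge least squares; a hypothetical stand-in for any ML learner.
    return np.linalg.solve(A.T @ A + penalty * np.eye(A.shape[1]), A.T @ t)


def dml1_plr(x, d, y, n_folds=5, seed=0):
    n = len(y)
    # (a) Random folds of roughly equal size.
    fold_of = np.random.default_rng(seed).permutation(n) % n_folds
    fold_estimates = []
    for k in range(n_folds):
        test, train = fold_of == k, fold_of != k
        # (b) Nuisance estimates from all data except fold k.
        m_coef = ridge_coef(basis(x[train]), d[train])
        yg_coef = ridge_coef(np.column_stack([d[train], basis(x[train])]), y[train])
        m_hat = basis(x[test]) @ m_coef
        g_hat = basis(x[test]) @ yg_coef[1:]   # drop the coefficient on D
        # (c) Solve the orthogonal moment condition on fold k only.
        v_hat = d[test] - m_hat
        theta_k = np.sum(v_hat * (y[test] - g_hat)) / np.sum(v_hat * d[test])
        fold_estimates.append(theta_k)
    # (d) Average the K fold-level estimates.
    return float(np.mean(fold_estimates)), fold_of


# Simulated data with true theta = 0.5 (all choices here are illustrative).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
d = np.sin(x) + rng.normal(size=n)
y = 0.5 * d + x ** 2 + rng.normal(size=n)
theta_hat, fold_of = dml1_plr(x, d, y)
print(theta_hat)
```

The point of the structure is that `m_hat` and `g_hat` for each observation are always computed from a model that never saw that observation.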
The DML estimator for the partially linear model takes the form:
θ̂ = ( (1/n) Σ_{i=1}^n (D_i - m̂_{-k(i)}(X_i)) D_i )^{-1} × (1/n) Σ_{i=1}^n (D_i - m̂_{-k(i)}(X_i)) (Y_i - ĝ_{-k(i)}(X_i))
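where k(i) denotes the fold containing observation i, and the subscript -k(i) indicates the nuisance function estimated without that fold. In this form the moment condition is solved once over the pooled folds rather than by averaging K per-fold estimates; the paper calls the two variants DML2 and DML1, and they are asymptotically equivalent. The estimator is root-n consistent and asymptotically normal, with a variance that can be estimated by the empirical variance of the influence function.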
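The estimator and its plug-in standard error can be written as a short function of the out-of-fold predictions. This is a sketch under stated assumptions: `dml_plr_pooled` is a hypothetical helper, and the demo does not run a real learner but simulates cross-fitted predictions as the true nuisance functions plus an error of order n^{-1/3}.

```python
import numpy as np


def dml_plr_pooled(y, d, g_hat, m_hat):
    """Pooled DML estimate and standard error for the partially linear model.

    g_hat and m_hat are assumed to be out-of-fold predictions, i.e. the value
    for observation i comes from a fit that excluded i's fold.
    """
    v_hat = d - m_hat
    jacobian = np.mean(v_hat * d)                    # (1/n) sum (D - m_hat) D
    theta = np.mean(v_hat * (y - g_hat)) / jacobian
    psi = v_hat * (y - d * theta - g_hat)            # orthogonal score at theta
    # Empirical variance of the influence function psi / jacobian, over n.
    se = np.sqrt(np.mean(psi ** 2)) / abs(jacobian) / np.sqrt(len(y))
    return theta, se


rng = np.random.default_rng(2)
n, theta0 = 5000, 0.5
x = rng.normal(size=n)
m0, g0 = np.sin(x), x ** 2
d = m0 + rng.normal(size=n)
y = theta0 * d + g0 + rng.normal(size=n)
# ASSUMPTION: simulated nuisance error of order n^(-1/3), standing in for
# cross-fitted ML predictions.
err = n ** (-1 / 3) * np.cos(x)
theta_hat, se = dml_plr_pooled(y, d, g_hat=g0 + err, m_hat=m0 + err)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, se, ci)
```

Even though each simulated nuisance error is far larger than n^{-1/2}, only their product enters the bias of this estimator, so the normal-approximation interval remains usable in this setup.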
The paper extends this framework to four canonical settings: (1) partially linear regression (treatment effect under unconfoundedness, e.g. with a continuous treatment), (2) partially linear instrumental variables (endogenous continuous treatment), (3) average treatment effect (ATE) under unconfoundedness with binary treatment, and (4) local average treatment effect (LATE) with binary instrument and binary treatment. For each setting, the authors derive the appropriate orthogonal score and verify the product rate condition.
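As an illustration of setting (3), the orthogonal score for the binary-treatment ATE takes the familiar doubly robust (augmented inverse probability weighting) form. In the display below, g(d, X) stands for E[Y | D = d, X] and m(X) for the propensity score P(D = 1 | X); reusing the symbols g and m this way is a notational assumption of this sketch:

$$
\psi(W;\theta,\eta) = g(1,X) - g(0,X) + \frac{D\,\big(Y - g(1,X)\big)}{m(X)} - \frac{(1-D)\,\big(Y - g(0,X)\big)}{1 - m(X)} - \theta .
$$

As in the partially linear case, errors in g and m enter the expected score only through their product, which is what makes the product rate condition sufficient.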
Prefer DML over classical semiparametric methods (e.g., Robinson's double residual regression, AIPW estimators with parametric models) when:
Prefer DML over naive "plug-in ML" approaches (estimate nuisance functions with ML, then plug into standard estimators) when:
Prefer DML over targeted maximum likelihood estimation (TMLE) when:
Prefer DML over instrumental variables approaches (2SLS, LIML) when:
Prefer DML over Bayesian approaches (BART, Gaussian processes for causal inference) when:
Prefer alternatives over DML when: