A Practitioner's Guide to Statistical Foundations

You've just finished collecting data from a randomized experiment with 1,200 participants. Your treatment effect estimate looks promising—a 3.2 percentage point reduction in the primary outcome. But when you compute the 95% confidence interval using the standard Wald formula, you get [-0.8%, 7.2%]. The interval is too wide to be useful, and you suspect your variance estimator is inflating the uncertainty.

The problem isn't your sample size. It's that you're using a variance estimator that assumes your outcome model is correctly specified, when in fact you used a flexible machine learning model to adjust for covariates. The standard sandwich estimator doesn't account for the fact that your nuisance function (the outcome regression) was estimated on the same data you're using for inference. This is the central challenge that modern semiparametric inference addresses: how to get valid confidence intervals when you've used flexible methods to estimate nuisance parameters.

The Core Problem: When Plug-In Estimation Fails

Consider a partially linear regression model:

$Y = \theta D + g(X) + \varepsilon$

where $D$ is a binary treatment, $X$ is a high-dimensional vector of covariates, and $g(\cdot)$ is an unknown function you'll estimate with a random forest or neural network. The parameter of interest is $\theta$ , the average treatment effect.

The naive approach—estimate $\hat{g}(X)$ on the full sample, then regress $Y - \hat{g}(X)$ on $D$ —produces an estimator that is biased at rate $O_p(n^{-1/4})$ rather than the desired $O_p(n^{-1/2})$ . This is because the regularization bias from the ML estimator doesn't vanish fast enough. Chernozhukov et al. (2018) prove that this bias can be characterized explicitly: the error in $\hat{\theta}$ contains a term proportional to the product of the estimation errors in the nuisance functions, which converges at rate $r_n \times s_n$ where $r_n$ and $s_n$ are the convergence rates of the two nuisance estimators.

The Solution: Neyman Orthogonality and Cross-Fitting

The Double/Debiased Machine Learning (DML) framework solves this through two mechanisms. First, Neyman orthogonality: the moment condition used to estimate $\theta$ must satisfy the property that the derivative of the moment function with respect to the nuisance parameter, evaluated at the true nuisance function, is zero. This ensures that small errors in nuisance estimation have only second-order effects on the parameter estimate.

Second, cross-fitting: split the sample into $K$ folds (typically $K = 2$ or $K = 5$ ). Estimate the nuisance functions on each fold using the out-of-fold data, then compute the parameter estimate using the in-fold data with the out-of-fold nuisance estimates. This breaks the dependence between nuisance estimation and parameter estimation that causes overfitting bias.

The key theoretical guarantee is that under the product rate condition $r_n \times s_n = o(n^{-1/2})$ , the DML estimator is $\sqrt{n}$ -consistent, asymptotically normal, and admits valid confidence intervals. This is dramatically weaker than requiring each nuisance estimator to be $\sqrt{n}$ -consistent individually. For example, if your outcome regression converges at rate $n^{-1/3}$ and your propensity score converges at rate $n^{-1/3}$ , their product is $n^{-2/3} = o(n^{-1/2})$ , and the DML estimator achieves root-n consistency.

Worked Example: Estimating the Effect of a Job Training Program

Suppose you're evaluating a job training program with $n = 5,000$ participants. You have 200 covariates including demographics, prior earnings, and employment history. You decide to use a gradient-boosted tree to estimate both the outcome regression $\hat{\mu}(X) = \mathbb{E}[Y|X, D]$ and the propensity score $\hat{\pi}(X) = \mathbb{P}(D=1|X)$ .

Step 1: Cross-fitting. Split the data into 5 folds. For fold $k$ , estimate $\hat{\mu}$ and $\hat{\pi}$ {-k}(X) $\overset{π}{^}_{- k} (X)$ using all data not in fold $k$ .

Step 2: Construct the orthogonal score. The doubly robust score for the average treatment effect is:

$\psi(W; \theta, \eta) = \left( \frac{D}{\pi(X)} - \frac{1-D}{1-\pi(X)} \right)(Y - \mu(X)) + \mu_1(X) - \mu_0(X) - \theta$

where $\eta = (\mu, \pi)$ are the nuisance functions and $\mu_d(X) = \mathbb{E}[Y|X, D=d]$ .

Step 3: Estimate $\theta$ . For each observation $i$ in fold $k$ , plug in the out-of-fold nuisance estimates $\hat{\eta}$ . Solve $\frac{1}{n}\sum$ {i=1}^n \psi(W_i; \hat{\theta}, \hat{\eta}_{-k(i)}) = 0 $\frac{1}{n} \sum_{i = 1}^{n} ψ (W_{i}; \hat{θ}, \overset{η}{^}_{- k (i)}) = 0$ to get:

$\hat{\theta} = \frac{1}{n}\sum_{i=1}^n \left[ \left( \frac{D_i}{\hat{\pi}$

Step 4: Compute the variance. The asymptotic variance is $\frac{1}{n}\mathbb{E}[\psi(W; \theta_0, \eta_0)^2]$ , which you estimate by the sample variance of the estimated scores.

In this example, even if your gradient-boosted trees converge at rate $n^{-1/4}$ (typical for high-dimensional nonparametric regression), the product rate is $n^{-1/2}$ , which just barely satisfies the condition. In practice, you'd want to verify this using cross-validated MSE diagnostics.

Contrast with Common Alternatives

Naive plug-in estimation: As discussed, this produces biased estimates and invalid confidence intervals. The bias is $O_p(r_n s_n)$ , which for typical ML rates is $O_p(n^{-1/4})$ —too large for root-n inference.

One-step estimation without cross-fitting: Using the orthogonal score but estimating nuisance functions on the full sample. This eliminates regularization bias but introduces overfitting bias. The resulting estimator has variance inflation and may not achieve nominal coverage. Chernozhukov et al. (2018) show that cross-fitting is essential for the theoretical guarantees to hold without imposing Donsker conditions on the nuisance function classes.

Standard doubly robust estimation with parametric models: If you use correctly specified parametric models for both the outcome regression and propensity score, you get root-n consistency without cross-fitting. But this requires knowing the true functional forms—a luxury rarely available in practice. DML lets you use flexible ML methods while maintaining valid inference.

Bayesian approaches with Gaussian processes: These can provide valid uncertainty quantification under correct model specification, but they scale poorly to high dimensions and require strong prior assumptions. DML is more robust to model misspecification and scales to settings with hundreds of covariates.

The Most Common Failure Mode: Ignoring the Product Rate Condition

Practitioners frequently apply DML without checking whether their nuisance estimators converge fast enough. The product rate condition $r_n \times s_n = o(n^{-1/2})$ is not automatically satisfied—it depends on the complexity of your ML methods and the smoothness of the true functions.

Consider a common scenario: you use a deep neural network with many layers to estimate both the outcome regression and propensity score. Neural networks can achieve fast convergence rates for very smooth functions (e.g., $n^{-1/2}$ for infinitely smooth functions), but for functions with limited smoothness, the rate can be as slow as $n^{-1/8}$ or worse. If both nuisance functions converge at $n^{-1/8}$ , their product is $n^{-1/4}$ , which is not $o(n^{-1/2})$ . Your DML estimator will be biased and your confidence intervals will undercover.

The solution is to diagnose this before trusting your results. Compute cross-validated MSE for each nuisance estimator and compare it to $n^{-1/2}$ . If the MSE decays slower than $n^{-1/2}$ , you have a problem. Options include: (1) using simpler ML methods with faster convergence rates, (2) increasing the sample size, or (3) using a different identification strategy that requires fewer nuisance functions.

Practical Checklist

Before deploying DML in your study, verify the following:

Check the product rate condition. Compute cross-validated MSE for each nuisance estimator and verify that the product of the convergence rates is $o(n^{-1/2})$ . If you can't estimate rates directly, at least verify that both nuisance estimators have MSE decaying faster than $n^{-1/4}$ .
Validate cross-fitting stability. Run the DML procedure with at least 5 different random fold assignments. The estimates should be stable across splits. Large variation indicates that the nuisance estimators are too variable or the sample size is insufficient.
Test Neyman orthogonality empirically. Compute the derivative of the score function with respect to the nuisance parameters at the estimated values. If this derivative is far from zero, your score may not be properly orthogonalized.
Assess overlap. Check that the estimated propensity scores are bounded away from 0 and 1 for all observations. Extreme weights (above 10 or below 0.1) indicate practical positivity violations that can destabilize the estimator.
Compare with a parametric benchmark. Fit a simple parametric model (e.g., logistic regression for the propensity score, linear regression for the outcome) and compare the DML estimate to the parametric estimate. Large discrepancies suggest either model misspecification in the parametric approach or instability in the DML estimator.
Verify the number of folds. Use at least $K = 5$ folds. With $K = 2$ , the effective sample size for nuisance estimation is halved, which can slow convergence rates and violate the product rate condition. With $K = 10$ , the folds are small and the nuisance estimates may be noisy.

Part of the DoOperator Research series on Statistical Foundations. Browse the full paper corpus at dooperator.ai/research/statistical_foundations.