← Blog
NoteStatistical Foundations

A Practitioner's Guide to Statistical Foundations

You've just finished collecting data from a randomized experiment with 1,200 participants. Your treatment effect estimate looks promising—a 3.2 percentage point reduction in the primary outcome. But when you compute the 95% confidence interval using the standard Wald formula, you

DoOperator Research · May 5, 2026

Decision takeaway

You've just finished collecting data from a randomized experiment with 1,200 participants. Your treatment effect estimate looks promising—a 3.2 percentage point reduction in the primary outcome. But when you compute the 95% confidence interval using the standard Wald formula, you

A Practitioner's Guide to Statistical Foundations

You've just finished collecting data from a randomized experiment with 1,200 participants. Your treatment effect estimate looks promising—a 3.2 percentage point reduction in the primary outcome. But when you compute the 95% confidence interval using the standard Wald formula, you get [-0.8%, 7.2%]. The interval is too wide to be useful, and you suspect your variance estimator is inflating the uncertainty.

The problem isn't your sample size. It's that you're using a variance estimator that assumes your outcome model is correctly specified, when in fact you used a flexible machine learning model to adjust for covariates. The standard sandwich estimator doesn't account for the fact that your nuisance function (the outcome regression) was estimated on the same data you're using for inference. This is the central challenge that modern semiparametric inference addresses: how to get valid confidence intervals when you've used flexible methods to estimate nuisance parameters.

The Core Problem: When Plug-In Estimation Fails

Consider a partially linear regression model:

Y=θD+g(X)+εY = \theta D + g(X) + \varepsilon

where DD is a binary treatment, XX is a high-dimensional vector of covariates, and g()g(\cdot) is an unknown function you'll estimate with a random forest or neural network. The parameter of interest is θ\theta, the average treatment effect.

The naive approach—estimate g^(X)\hat{g}(X) on the full sample, then regress Yg^(X)Y - \hat{g}(X) on DD—produces an estimator that is biased at rate Op(n1/4)O_p(n^{-1/4}) rather than the desired Op(n1/2)O_p(n^{-1/2}). This is because the regularization bias from the ML estimator doesn't vanish fast enough. Chernozhukov et al. (2018) prove that this bias can be characterized explicitly: the error in θ^\hat{\theta} contains a term proportional to the product of the estimation errors in the nuisance functions, which converges at rate rn×snr_n \times s_n where rnr_n and sns_n are the convergence rates of the two nuisance estimators.

The Solution: Neyman Orthogonality and Cross-Fitting

The Double/Debiased Machine Learning (DML) framework solves this through two mechanisms. First, Neyman orthogonality: the moment condition used to estimate θ\theta must satisfy the property that the derivative of the moment function with respect to the nuisance parameter, evaluated at the true nuisance function, is zero. This ensures that small errors in nuisance estimation have only second-order effects on the parameter estimate.

Second, cross-fitting: split the sample into KK folds (typically K=2K = 2 or K=5K = 5). Estimate the nuisance functions on each fold using the out-of-fold data, then compute the parameter estimate using the in-fold data with the out-of-fold nuisance estimates. This breaks the dependence between nuisance estimation and parameter estimation that causes overfitting bias.

The key theoretical guarantee is that under the product rate condition rn×sn=o(n1/2)r_n \times s_n = o(n^{-1/2}), the DML estimator is n\sqrt{n}-consistent, asymptotically normal, and admits valid confidence intervals. This is dramatically weaker than requiring each nuisance estimator to be n\sqrt{n}-consistent individually. For example, if your outcome regression converges at rate n1/3n^{-1/3} and your propensity score converges at rate n1/3n^{-1/3}, their product is n2/3=o(n1/2)n^{-2/3} = o(n^{-1/2}), and the DML estimator achieves root-n consistency.

Worked Example: Estimating the Effect of a Job Training Program

Suppose you're evaluating a job training program with n=5,000n = 5,000 participants. You have 200 covariates including demographics, prior earnings, and employment history. You decide to use a gradient-boosted tree to estimate both the outcome regression μ^(X)=E[YX,D]\hat{\mu}(X) = \mathbb{E}[Y|X, D] and the propensity score π^(X)=P(D=1X)\hat{\pi}(X) = \mathbb{P}(D=1|X).

Step 1: Cross-fitting. Split the data into 5 folds. For fold kk, estimate μ^k(X)\hat{\mu}{-k}(X) and π^k(X)\hat{\pi}{-k}(X) using all data not in fold kk.

Step 2: Construct the orthogonal score. The doubly robust score for the average treatment effect is:

ψ(W;θ,η)=(Dπ(X)1D1π(X))(Yμ(X))+μ1(X)μ0(X)θ\psi(W; \theta, \eta) = \left( \frac{D}{\pi(X)} - \frac{1-D}{1-\pi(X)} \right)(Y - \mu(X)) + \mu_1(X) - \mu_0(X) - \theta

where η=(μ,π)\eta = (\mu, \pi) are the nuisance functions and μd(X)=E[YX,D=d]\mu_d(X) = \mathbb{E}[Y|X, D=d].

Step 3: Estimate θ\theta. For each observation ii in fold kk, plug in the out-of-fold nuisance estimates η^k(i)\hat{\eta}{-k(i)}. Solve 1ni=1nψ(Wi;θ^,η^k(i))=0\frac{1}{n}\sum{i=1}^n \psi(W_i; \hat{\theta}, \hat{\eta}_{-k(i)}) = 0 to get:

θ^=1ni=1n[(Diπ^k(i)(Xi)1Di1π^k(i)(Xi))(Yiμ^k(i)(Xi))+μ^1,k(i)(Xi)μ^0,k(i)(Xi)]\hat{\theta} = \frac{1}{n}\sum_{i=1}^n \left[ \left( \frac{D_i}{\hat{\pi}{-k(i)}(X_i)} - \frac{1-D_i}{1-\hat{\pi}{-k(i)}(X_i)} \right)(Y_i - \hat{\mu}{-k(i)}(X_i)) + \hat{\mu}{1,-k(i)}(X_i) - \hat{\mu}_{0,-k(i)}(X_i) \right]

Step 4: Compute the variance. The asymptotic variance is 1nE[ψ(W;θ0,η0)2]\frac{1}{n}\mathbb{E}[\psi(W; \theta_0, \eta_0)^2], which you estimate by the sample variance of the estimated scores.

In this example, even if your gradient-boosted trees converge at rate n1/4n^{-1/4} (typical for high-dimensional nonparametric regression), the product rate is n1/2n^{-1/2}, which just barely satisfies the condition. In practice, you'd want to verify this using cross-validated MSE diagnostics.

Contrast with Common Alternatives

Naive plug-in estimation: As discussed, this produces biased estimates and invalid confidence intervals. The bias is Op(rnsn)O_p(r_n s_n), which for typical ML rates is Op(n1/4)O_p(n^{-1/4})—too large for root-n inference.

One-step estimation without cross-fitting: Using the orthogonal score but estimating nuisance functions on the full sample. This eliminates regularization bias but introduces overfitting bias. The resulting estimator has variance inflation and may not achieve nominal coverage. Chernozhukov et al. (2018) show that cross-fitting is essential for the theoretical guarantees to hold without imposing Donsker conditions on the nuisance function classes.

Standard doubly robust estimation with parametric models: If you use correctly specified parametric models for both the outcome regression and propensity score, you get root-n consistency without cross-fitting. But this requires knowing the true functional forms—a luxury rarely available in practice. DML lets you use flexible ML methods while maintaining valid inference.

Bayesian approaches with Gaussian processes: These can provide valid uncertainty quantification under correct model specification, but they scale poorly to high dimensions and require strong prior assumptions. DML is more robust to model misspecification and scales to settings with hundreds of covariates.

The Most Common Failure Mode: Ignoring the Product Rate Condition

Practitioners frequently apply DML without checking whether their nuisance estimators converge fast enough. The product rate condition rn×sn=o(n1/2)r_n \times s_n = o(n^{-1/2}) is not automatically satisfied—it depends on the complexity of your ML methods and the smoothness of the true functions.

Consider a common scenario: you use a deep neural network with many layers to estimate both the outcome regression and propensity score. Neural networks can achieve fast convergence rates for very smooth functions (e.g., n1/2n^{-1/2} for infinitely smooth functions), but for functions with limited smoothness, the rate can be as slow as n1/8n^{-1/8} or worse. If both nuisance functions converge at n1/8n^{-1/8}, their product is n1/4n^{-1/4}, which is not o(n1/2)o(n^{-1/2}). Your DML estimator will be biased and your confidence intervals will undercover.

The solution is to diagnose this before trusting your results. Compute cross-validated MSE for each nuisance estimator and compare it to n1/2n^{-1/2}. If the MSE decays slower than n1/2n^{-1/2}, you have a problem. Options include: (1) using simpler ML methods with faster convergence rates, (2) increasing the sample size, or (3) using a different identification strategy that requires fewer nuisance functions.

Practical Checklist

Before deploying DML in your study, verify the following:

  1. Check the product rate condition. Compute cross-validated MSE for each nuisance estimator and verify that the product of the convergence rates is o(n1/2)o(n^{-1/2}). If you can't estimate rates directly, at least verify that both nuisance estimators have MSE decaying faster than n1/4n^{-1/4}.

  2. Validate cross-fitting stability. Run the DML procedure with at least 5 different random fold assignments. The estimates should be stable across splits. Large variation indicates that the nuisance estimators are too variable or the sample size is insufficient.

  3. Test Neyman orthogonality empirically. Compute the derivative of the score function with respect to the nuisance parameters at the estimated values. If this derivative is far from zero, your score may not be properly orthogonalized.

  4. Assess overlap. Check that the estimated propensity scores are bounded away from 0 and 1 for all observations. Extreme weights (above 10 or below 0.1) indicate practical positivity violations that can destabilize the estimator.

  5. Compare with a parametric benchmark. Fit a simple parametric model (e.g., logistic regression for the propensity score, linear regression for the outcome) and compare the DML estimate to the parametric estimate. Large discrepancies suggest either model misspecification in the parametric approach or instability in the DML estimator.

  6. Verify the number of folds. Use at least K=5K = 5 folds. With K=2K = 2, the effective sample size for nuisance estimation is halved, which can slow convergence rates and violate the product rate condition. With K=10K = 10, the folds are small and the nuisance estimates may be noisy.


Part of the DoOperator Research series on Statistical Foundations. Browse the full paper corpus at dooperator.ai/research/statistical_foundations.

More from the blog

Correlation Was Never the Problem"Correlation is not causation" is one of the most-repeated phrases in empirical research. It is also, as usually understood, a dramatic understatement of the actual difficulty. The real challenge is not distinguishing correlation from causation — it is identifying which causal story is correct when several are consistent with the same data.May 29, 2026The Illusion of Control: Why Most A/B Tests Mislead More Than They InformOrganizations run thousands of A/B tests every year and congratulate themselves on being data-driven. Most of those tests are statistically invalid. Here is why — and what rigorous experimentation actually requires.May 27, 2026What N-of-1 Trials Get Right That Population Studies Get WrongRandomized trials on populations measure average effects in heterogeneous groups. N-of-1 trials measure what actually happens to one specific person. For individual decision-making, the latter is usually more relevant.May 26, 2026
A Practitioner's Guide to Statistical Foundations — DoOperator Research | DoOperator