DoOperator Research · May 2, 2026
Decision takeaway
Your product team just ran an A/B test. The treatment group saw a 12% lift in conversion. You're about to ship it. But your data scientist says: "The lift only appeared in the last three days of the experiment, and those days had a holiday promotion running for the control group but not the treatment."
You have a decision to make: ship the feature or kill it. The data is observational now, even though you started with an experiment. The holiday promotion confounded the treatment effect. You need to know: did the feature cause the lift, or was it the promotion?
This is the core problem causal inference solves: separating correlation from causation when the assignment mechanism is not under your control—or when it breaks mid-experiment.
Causal inference is not a single method. It's a framework for answering counterfactual questions: what would have happened to the same units under a different treatment assignment? The fundamental problem is that you never observe the same unit under both treatment and control simultaneously.
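A small simulation makes the fundamental problem concrete. The sketch below assumes a made-up population in which the simulator knows both potential outcomes for every unit, which never happens with real data. The names (`simulate_units`, `y0`, `y1`) and all numbers are illustrative assumptions.

```python
import random

def simulate_units(n, seed=0):
    """Simulate units with both potential outcomes (only possible in a simulation)."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(0.10, 0.02)   # outcome the unit would have under control
        y1 = y0 + 0.03               # outcome under treatment (planted effect = 0.03)
        w = rng.random() < 0.5       # randomized assignment
        # Only one of the two potential outcomes is ever observed.
        units.append({"y0": y0, "y1": y1, "w": w, "y_obs": y1 if w else y0})
    return units

def avg(values):
    values = list(values)
    return sum(values) / len(values)

units = simulate_units(20_000)

# Needs both columns, so it cannot be computed from real data.
true_ate = avg(u["y1"] - u["y0"] for u in units)

# Uses only observed outcomes; valid here because assignment is randomized.
estimated_ate = (avg(u["y_obs"] for u in units if u["w"])
                 - avg(u["y_obs"] for u in units if not u["w"]))

print(f"true ATE = {true_ate:.4f}, difference in means = {estimated_ate:.4f}")
```

With randomized assignment the difference in means lands close to the planted effect. Once assignment depends on something that also drives the outcome, that guarantee is gone.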
The do-calculus (Pearl, 2012) provides the theoretical backbone: given a causal graph encoding your assumptions about the data-generating process, you can determine whether a causal effect is identifiable from observational data. The key result is that the do-calculus is complete—if a causal effect cannot be expressed using its three rules, no nonparametric identification strategy exists without additional assumptions.
But completeness doesn't mean the answer is easy. It means the assumptions are explicit.
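One of the simplest identification results that follows from such a graph is backdoor adjustment. The sketch below assumes a hypothetical three-node graph (X affects W and Y, and W affects Y) with a single measured binary confounder. The variable names, probabilities, and the planted effect of 0.05 are illustrative assumptions, not values from any study.

```python
import random

def simulate(n, seed=1):
    """Graph assumed for illustration: X -> W, X -> Y, W -> Y, all binary."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5                    # confounder, e.g. a high-engagement flag
        w = rng.random() < (0.8 if x else 0.2)    # X makes treatment more likely
        p_y = 0.10 + 0.05 * w + 0.20 * x          # planted effect of W on Y is +0.05
        y = rng.random() < p_y
        rows.append((x, w, y))
    return rows

def rate(rows, w, x=None):
    """Empirical P(Y=1 | W=w), or P(Y=1 | W=w, X=x) when x is given."""
    sel = [y for (xi, wi, y) in rows if wi == w and (x is None or xi == x)]
    return sum(sel) / len(sel)

rows = simulate(200_000)

# Naive contrast: mixes the effect of W with the effect of X.
naive = rate(rows, True) - rate(rows, False)

# Backdoor adjustment: P(y | do(w)) = sum over x of P(y | w, x) * P(x)
p_x = {x: sum(1 for r in rows if r[0] == x) / len(rows) for x in (False, True)}
adjusted = sum((rate(rows, True, x) - rate(rows, False, x)) * p_x[x]
               for x in (False, True))

print(f"naive = {naive:.3f}, backdoor-adjusted = {adjusted:.3f}")
```

The adjustment recovers the planted effect only because the assumed graph is correct and X is measured. With an unmeasured confounder, the same code would return a confident but wrong number.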
Suppose you want to know not just whether your feature works, but for whom. This is heterogeneous treatment effect (HTE) estimation, and it's where most practitioners go wrong.
The standard approach, splitting your data into subgroups and running separate regressions, is only reliable when the subgroups are fixed in advance. When subgroups are chosen after looking at the data, the largest estimated effects are inflated by selection and the nominal confidence intervals lose their stated coverage. Wager and Athey (2017) addressed this with causal forests, which modify Breiman's random forests to produce asymptotically normal HTE estimates.
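Here's how it works in practice. You have 10,000 users, half treated, half control, with 50 covariates. You want to know if the treatment effect varies by user engagement level. A causal forest grows many trees on random subsamples, chooses splits that separate users with different treatment effects, and averages the trees into a per-user effect estimate.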
Concrete scenario: You run the causal forest and find that the treatment effect is +15% for high-engagement users (95% CI: [8%, 22%]) but -2% for low-engagement users (95% CI: [-7%, 3%]). The forest's asymptotic normality guarantee (Theorem 3, Wager & Athey, 2017) means you can construct valid confidence intervals without assuming a parametric form for the effect heterogeneity.
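The sketch below illustrates one ingredient of that guarantee, honest estimation, on simulated data shaped like the scenario above. It is deliberately not a causal forest: it places a single split on one covariate using half the data, then estimates the effect in each leaf on the other half. The data-generating numbers, the candidate thresholds, and the function names are assumptions made for illustration. A real analysis would use a tested implementation such as the grf R package.

```python
import random
from statistics import mean, variance

def simulate(n=10_000, seed=2):
    """Hypothetical data: engagement in [0, 1], randomized treatment, binary conversion."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        w = rng.random() < 0.5
        tau = 0.15 if x > 0.5 else -0.02      # planted effect, known only to the simulator
        y = 1.0 if rng.random() < 0.30 + tau * w else 0.0
        rows.append((x, w, y))
    return rows

def effect(rows):
    """Difference in means with a normal-approximation 95% confidence interval."""
    t = [y for _, w, y in rows if w]
    c = [y for _, w, y in rows if not w]
    est = mean(t) - mean(c)
    se = (variance(t) / len(t) + variance(c) / len(c)) ** 0.5
    return est, (est - 1.96 * se, est + 1.96 * se)

rows = simulate()
random.Random(3).shuffle(rows)
split_half, est_half = rows[:5000], rows[5000:]

def gap(c):
    """Absolute difference in estimated effect across threshold c (splitting half only)."""
    left = [r for r in split_half if r[0] <= c]
    right = [r for r in split_half if r[0] > c]
    return abs(effect(right)[0] - effect(left)[0])

# Step 1: choose the threshold using the splitting half only.
threshold = max((c / 10 for c in range(2, 9)), key=gap)

# Step 2: estimate leaf effects on the held-out half, so that the search
# for a split does not contaminate the estimates or their intervals.
low = effect([r for r in est_half if r[0] <= threshold])
high = effect([r for r in est_half if r[0] > threshold])

print(f"threshold = {threshold:.1f}")
print(f"low engagement:  {low[0]:+.3f}, 95% CI [{low[1][0]:+.3f}, {low[1][1]:+.3f}]")
print(f"high engagement: {high[0]:+.3f}, 95% CI [{high[1][0]:+.3f}, {high[1][1]:+.3f}]")
```

Estimating on the same half that chose the split would tend to exaggerate the gap between leaves, because the split was picked precisely where that gap looked largest.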
Linear regression with interactions: This assumes the treatment effect is a linear function of covariates. If the true effect is nonlinear—say, U-shaped in engagement—linear interactions will miss it entirely. Causal forests are nonparametric and adapt to arbitrary nonlinearities.
Propensity score matching with subgroup analysis: Matching estimates the average treatment effect (ATE) but requires pre-specifying subgroups for heterogeneity analysis. This invites data dredging: you test 20 subgroups, find one significant effect, and report it. Causal forests avoid this by estimating effects for every unit and then letting you inspect patterns.
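The arithmetic behind the data-dredging risk is short. Assuming the 20 subgroup tests are independent and no subgroup has a real effect (both simplifying assumptions), the chance of at least one "significant" result at the 5% level is:

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests with no true effects."""
    return 1 - (1 - alpha) ** n_tests

n_subgroups = 20
print(f"{familywise_error(n_subgroups):.2f}")   # about 0.64

# A Bonferroni correction tests each subgroup at alpha / n_tests instead.
bonferroni_alpha = 0.05 / n_subgroups
print(f"{familywise_error(n_subgroups, bonferroni_alpha):.3f}")   # stays below 0.05
```

So even with no real heterogeneity, a false finding is more likely than not when 20 subgroups are tested without correction.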
Bayesian additive regression trees (BART): BART can estimate HTE but does not provide the same theoretical guarantees for confidence intervals. Causal forests give asymptotically valid inference; BART gives posterior distributions that are only calibrated under correct model specification.
The single biggest mistake practitioners make is treating a causal forest like a prediction model. You feed in covariates X, outcomes Y, and a treatment indicator W, and the forest outputs conditional average treatment effect (CATE) estimates. But the method assumes unconfoundedness: treatment assignment is independent of potential outcomes conditional on covariates.
This is not testable from data. You must justify it from the study design.
In our opening example—the A/B test with a holiday promotion—unconfoundedness is violated because the promotion affected the control group but not the treatment group. The promotion is a confounder that is not captured by any covariate in your experiment logs.
What do you do? You have three options:
Option 3 is the most common fallback, but it requires the exclusion restriction: the instrument affects the outcome only through the treatment. This is untestable and must be defended on institutional grounds.
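To see what the exclusion restriction buys you, here is a minimal sketch of the simplest instrumental-variable estimator, the Wald ratio, on simulated data. The setup is hypothetical: a binary instrument Z, an unobserved confounder U, and a planted constant effect of 2.0. The exclusion restriction holds by construction because Z appears in the outcome equation only through W.

```python
import random
from statistics import fmean

def simulate(n=200_000, seed=5):
    """Hypothetical setup: binary instrument Z, unobserved confounder U, planted effect 2.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                       # instrument, assigned at random
        u = rng.gauss(0, 1)                          # unobserved confounder
        # Treatment uptake depends on both Z and U.
        w = 1.0 if (0.8 * z + u + rng.gauss(0, 1)) > 0.5 else 0.0
        # Z affects Y only through W (exclusion restriction, true by construction).
        y = 2.0 * w + 1.5 * u + rng.gauss(0, 1)
        rows.append((z, w, y))
    return rows

rows = simulate()

# Naive contrast: biased because U drives both W and Y.
naive = fmean(y for _, w, y in rows if w) - fmean(y for _, w, y in rows if not w)

def mean_by_z(index, z):
    """Mean of column `index` among rows where the instrument equals z."""
    return fmean(r[index] for r in rows if r[0] == z)

# Wald estimator: (effect of Z on Y) / (effect of Z on W).
reduced_form = mean_by_z(2, True) - mean_by_z(2, False)
first_stage = mean_by_z(1, True) - mean_by_z(1, False)
wald = reduced_form / first_stage

print(f"naive = {naive:.2f}, IV (Wald) = {wald:.2f}")
```

If Z also affected Y directly, the ratio would absorb that direct path and the estimate would be wrong. Nothing in the output would reveal it, which is why the restriction has to be argued from how the instrument arises.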
Causal inference is harder than prediction because it requires assumptions about the data-generating process. Use it when:
Do not use causal inference when:
Before applying causal inference to your study, verify:
Is the treatment well-defined? Can you describe exactly what changes when a unit moves from control to treatment? If not, the causal effect is ambiguous.
Is unconfoundedness plausible? List all confounders that affect both treatment assignment and the outcome. If any are unmeasured, you need a different identification strategy, such as instrumental variables (IV), difference-in-differences (DiD), or regression discontinuity (RDD).
Is there sufficient overlap? Check the propensity score distribution. If any covariate profile has near-zero or near-one probability of receiving treatment, your estimates will be unstable or rely on extrapolation.
Is the sample size adequate for heterogeneity analysis? Causal forests typically need thousands of treated and control units overall, with enough of both in every region of covariate space where you want to compare effects. With small samples, focus on the ATE, not HTE.
Have you separated split data from estimation data? If using tree-based methods, ensure "honesty" is enforced. Many implementations (including the grf R package) do this automatically—check your software's documentation.
Have you validated the method on synthetic data? Before applying to real data, simulate a known treatment effect and verify your method recovers it. This catches implementation errors and assumption violations.
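The last check on the list can be sketched in a few lines. The example below plants a known effect in simulated randomized data, runs the estimator many times, and checks both bias and confidence-interval coverage. The estimator here is a plain difference in conversion rates, standing in for whatever method you actually plan to use. The planted effect, sample size, and number of replications are arbitrary assumptions.

```python
import random

TRUE_EFFECT = 0.05   # planted effect; a made-up value for this check

def estimate(rows):
    """Estimator under test: difference in conversion rates with a 95% normal-approximation CI."""
    t = [y for w, y in rows if w]
    c = [y for w, y in rows if not w]
    p_t, p_c = sum(t) / len(t), sum(c) / len(c)
    se = (p_t * (1 - p_t) / len(t) + p_c * (1 - p_c) / len(c)) ** 0.5
    est = p_t - p_c
    return est, est - 1.96 * se, est + 1.96 * se

def simulate(n, rng):
    """Randomized assignment with a known, constant treatment effect."""
    rows = []
    for _ in range(n):
        w = rng.random() < 0.5
        y = 1.0 if rng.random() < 0.20 + TRUE_EFFECT * w else 0.0
        rows.append((w, y))
    return rows

rng = random.Random(6)
results = [estimate(simulate(2_000, rng)) for _ in range(500)]

# Average estimate should sit near the planted effect.
bias = sum(est for est, _, _ in results) / len(results) - TRUE_EFFECT

# A nominal 95% interval should contain the planted effect about 95% of the time.
coverage = sum(1 for _, lo, hi in results if lo <= TRUE_EFFECT <= hi) / len(results)

print(f"bias = {bias:+.4f}, 95% CI coverage = {coverage:.2f}")
```

Repeating the exercise with a deliberately violated assumption, for example assignment that depends on an omitted variable, shows how badly the method fails when the assumption does not hold.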
Part of the DoOperator Research series on Causal Inference. Browse the full paper corpus at dooperator.ai/research/causal_inference.