A Practitioner's Guide to Causal Inference

Your product team just ran an A/B test. The treatment group saw a 12% lift in conversion. You're about to ship it. But your data scientist says: "The lift only appeared in the last three days of the experiment, and those days had a holiday promotion running for the control group but not the treatment."

You have a decision to make: ship the feature or kill it. The data is observational now, even though you started with an experiment. The holiday promotion confounded the treatment effect. You need to know: did the feature cause the lift, or was it the promotion?

This is the core problem causal inference solves: separating correlation from causation when the assignment mechanism is not under your control—or when it breaks mid-experiment.

What Causal Inference Actually Requires

Causal inference is not a single method. It's a framework for answering counterfactual questions: what would have happened to the same units under a different treatment assignment? The fundamental problem is that you never observe the same unit under both treatment and control simultaneously.
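A simulation makes the fundamental problem concrete. The sketch below is illustrative only: the sample size, the constant effect of 0.5, and all variable names are assumptions chosen for the example. Because the data is generated by hand, both potential outcomes exist for every unit, which is never true of real data; random assignment then reveals only one of them per unit.

```python
import random
import statistics

random.seed(0)

# Hypothetical simulation: every user has BOTH potential outcomes,
# which is only possible because we generate the data ourselves.
n = 20_000
true_effect = 0.5  # assumed constant effect, chosen for illustration

y0 = [random.gauss(0.0, 1.0) for _ in range(n)]  # outcome under control
y1 = [y + true_effect for y in y0]               # outcome under treatment

# Random assignment: a fair coin decides which potential outcome is observed.
w = [random.random() < 0.5 for _ in range(n)]
observed = [y1[i] if w[i] else y0[i] for i in range(n)]

# The true ATE needs both columns; the estimate uses only observed data.
true_ate = statistics.fmean(y1[i] - y0[i] for i in range(n))
treated_mean = statistics.fmean(observed[i] for i in range(n) if w[i])
control_mean = statistics.fmean(observed[i] for i in range(n) if not w[i])
estimated_ate = treated_mean - control_mean

print(f"true ATE      = {true_ate:.3f}")
print(f"estimated ATE = {estimated_ate:.3f}")
```

Under randomization the difference in observed means lands close to the true average effect even though no individual effect is ever observed. Once assignment depends on something that also drives the outcome, that shortcut stops working.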

The do-calculus (Pearl, 2012) provides the theoretical backbone: given a causal graph encoding your assumptions about the data-generating process, you can determine whether a causal effect is identifiable from observational data. The key result is that the do-calculus is complete—if a causal effect cannot be expressed using its three rules, no nonparametric identification strategy exists without additional assumptions.
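As one well-known special case of what identification looks like, suppose a set of measured covariates satisfies the back-door criterion for the effect of a treatment on an outcome. The symbols here are assumptions introduced for illustration: x is the treatment, y the outcome, and z the adjustment set. The interventional distribution then reduces to quantities that can be estimated from observational data:

```latex
P(y \mid \mathrm{do}(x)) = \sum_{z} P(y \mid x, z)\, P(z)
```

The left side asks what happens when x is set by intervention; the right side uses only ordinary conditional and marginal probabilities. Whether a valid z exists depends on the causal graph, not on the data.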

But completeness doesn't mean the answer is easy. It means the assumptions are explicit.

The Workhorse: Causal Forests for Heterogeneous Effects

Suppose you want to know not just whether your feature works, but for whom. This is heterogeneous treatment effect (HTE) estimation, and it's where most practitioners go wrong.

The standard approach, splitting your data into subgroups and running separate regressions, produces noisy estimates. When the subgroups are chosen after looking at the data, the largest estimated effects are exaggerated and their confidence intervals no longer have nominal coverage. Wager and Athey (2017) addressed this with causal forests, which adapt Breiman's random forests to produce asymptotically normal HTE estimates.

Here's how it works in practice. You have 10,000 users, half treated, half control, with 50 covariates. You want to know if the treatment effect varies by user engagement level. A causal forest:

Grows "honest" trees: the data used to determine splits is separate from the data used to estimate treatment effects within leaves. This reduces bias and is critical for the asymptotic theory.
Uses a modified splitting criterion that maximizes the variance of treatment effects across leaves, not the variance of outcomes.
Produces point estimates and confidence intervals for each unit's conditional average treatment effect (CATE).
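The honesty idea in the list above can be sketched with a single split on a single covariate. This is a deliberately reduced illustration, not a causal forest and not the grf implementation: the data-generating process, the size-weighted splitting score, and every function name are assumptions made for the example. One half of the data chooses the split; the other half estimates the effect in each leaf.

```python
import random
import statistics

random.seed(1)

def simulate(n):
    """Hypothetical data: the effect is 1.0 when engagement > 0.6, else 0.0."""
    rows = []
    for _ in range(n):
        x = random.random()              # engagement score in [0, 1)
        w = random.random() < 0.5        # randomized treatment
        tau = 1.0 if x > 0.6 else 0.0    # true CATE, known only in simulation
        y = x + tau * w + random.gauss(0.0, 1.0)
        rows.append((x, w, y))
    return rows

def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, w, y in rows if w]
    control = [y for _, w, y in rows if not w]
    if len(treated) < 20 or len(control) < 20:
        return None                      # too few units to estimate
    return statistics.fmean(treated) - statistics.fmean(control)

def best_split(rows, candidates):
    """Pick the threshold whose two children differ most in estimated effect."""
    best, best_score = None, -1.0
    for c in candidates:
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        tl, tr = effect(left), effect(right)
        if tl is None or tr is None:
            continue
        # Size-weighted squared gap between child effects (assumed criterion).
        score = len(left) * len(right) / len(rows) ** 2 * (tl - tr) ** 2
        if score > best_score:
            best, best_score = c, score
    return best

data = simulate(20_000)
split_half, estimate_half = data[:10_000], data[10_000:]

# Honesty: the split is chosen on one half, effects are estimated on the other.
threshold = best_split(split_half, [i / 10 for i in range(1, 10)])
left_effect = effect([r for r in estimate_half if r[0] <= threshold])
right_effect = effect([r for r in estimate_half if r[0] > threshold])

print(f"chosen threshold: {threshold}")
print(f"effect at or below threshold: {left_effect:.2f}")
print(f"effect above threshold:       {right_effect:.2f}")
```

Because the estimation half played no part in choosing the threshold, the leaf estimates are not inflated by the search over candidate splits. A real forest repeats this over many subsampled trees and many covariates.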

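Concrete scenario: You run the causal forest and find that the estimated treatment effect is +15% for high-engagement users (95% CI: [8%, 22%]) but -2% for low-engagement users (95% CI: [-7%, 3%]). The first interval excludes zero; the second does not, so the data support a benefit for high-engagement users but cannot distinguish a small harm from a small benefit for low-engagement users. The forest's asymptotic normality guarantee (Theorem 3, Wager & Athey, 2017) means these intervals are asymptotically valid without assuming a parametric form for the effect heterogeneity.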

Contrast with Common Alternatives

Linear regression with interactions: This assumes the treatment effect is a linear function of covariates. If the true effect is nonlinear—say, U-shaped in engagement—linear interactions will miss it entirely. Causal forests are nonparametric and adapt to arbitrary nonlinearities.
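The U-shaped case is easy to reproduce. The following sketch assumes a simulated effect of 2x² on a centered engagement score x, with randomized treatment; all numbers and names are illustrative. It relies on the fact that in a model y ~ 1 + w + x + w·x, the interaction coefficient equals the treated-arm slope minus the control-arm slope.

```python
import random
import statistics

random.seed(2)

def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rows = []
for _ in range(20_000):
    x = random.uniform(-1.0, 1.0)   # centered engagement score
    w = random.random() < 0.5       # randomized treatment
    tau = 2.0 * x * x               # U-shaped true effect (assumed)
    y = 0.5 * x + tau * w + random.gauss(0.0, 1.0)
    rows.append((x, w, y))

treated = [(x, y) for x, w, y in rows if w]
control = [(x, y) for x, w, y in rows if not w]

# Interaction coefficient of the fully interacted linear model.
interaction = slope(*zip(*treated)) - slope(*zip(*control))

def arm_gap(lo, hi):
    """Difference in mean outcome between arms for x in [lo, hi)."""
    t = [y for x, y in treated if lo <= x < hi]
    c = [y for x, y in control if lo <= x < hi]
    return statistics.fmean(t) - statistics.fmean(c)

edge_effect = (arm_gap(-1.0, -0.6) + arm_gap(0.6, 1.0)) / 2
middle_effect = arm_gap(-0.2, 0.2)

print(f"linear interaction coefficient: {interaction:.3f}")
print(f"effect near the edges:  {edge_effect:.2f}")
print(f"effect near the middle: {middle_effect:.2f}")
```

The interaction term comes out near zero, which a linear model would read as "no heterogeneity", while the binned comparison shows a large effect at both extremes and almost none in the middle.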

Propensity score matching with subgroup analysis: Matching estimates the average treatment effect (ATE) but requires pre-specifying subgroups for heterogeneity analysis. This invites data dredging: you test 20 subgroups, find one significant effect, and report it. Causal forests avoid this by estimating effects for every unit and then letting you inspect patterns.
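The data-dredging risk can be quantified with a short simulation. The sketch below assumes 20 independent subgroups with no true effect in any of them, 50 users per arm per subgroup, and a two-sided test at the nominal 5% level; these settings and the function names are illustrative choices, not taken from any particular study.

```python
import math
import random
import statistics

random.seed(3)

def z_stat(treated, control):
    """Two-sample z statistic for a difference in means."""
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return (statistics.fmean(treated) - statistics.fmean(control)) / se

def any_false_positive(n_subgroups=20, n_per_arm=50):
    """Simulate one experiment with NO true effect in any subgroup."""
    for _ in range(n_subgroups):
        treated = [random.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        if abs(z_stat(treated, control)) > 1.96:  # nominal 5% test
            return True
    return False

n_experiments = 200
hits = sum(any_false_positive() for _ in range(n_experiments))
share = hits / n_experiments

print(f"experiments with at least one 'significant' subgroup: {share:.0%}")
print(f"rate implied by 20 independent 5% tests: {1 - 0.95 ** 20:.0%}")
```

With 20 looks at pure noise, finding at least one "significant" subgroup is the expected outcome rather than a surprise, which is why an uncorrected subgroup finding carries little evidential weight.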

Bayesian additive regression trees (BART): BART can estimate HTE but does not provide the same theoretical guarantees for confidence intervals. Causal forests give asymptotically valid inference; BART gives posterior distributions that are only calibrated under correct model specification.

The Most Common Failure Mode: Ignoring the Assignment Mechanism

The single biggest mistake practitioners make is treating a causal forest like a prediction model. You feed in X, Y, and W (treatment indicator), and the forest outputs CATE estimates. But the method assumes unconfoundedness: treatment assignment is independent of potential outcomes conditional on covariates.

This is not testable from data. You must justify it from the study design.

In our opening example—the A/B test with a holiday promotion—unconfoundedness is violated because the promotion affected the control group but not the treatment group. The promotion is a confounder that is not captured by any covariate in your experiment logs.

What do you do? You have three options:

Re-run the experiment with proper randomization and no differential promotions.
Model the confounder if you have data on which users saw the promotion.
Use instrumental variables if you have a valid instrument—something that affects treatment assignment but not the outcome except through treatment.

The third option, instrumental variables, is the most common fallback, but the exclusion restriction it relies on is untestable and must be defended on institutional grounds.
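To show what the instrumental-variables option buys you, here is a minimal sketch of the Wald estimator on simulated data. Everything in it is an assumption made for illustration: the instrument is a hypothetical randomized encouragement z, the confounder u is unobserved, the effect is a constant 1.0, and the exclusion restriction holds by construction because z enters the outcome only through treatment uptake w.

```python
import random
import statistics

random.seed(4)

TRUE_EFFECT = 1.0  # known only because the data is simulated

rows = []
for _ in range(100_000):
    u = random.gauss(0.0, 1.0)   # unobserved confounder
    z = random.random() < 0.5    # instrument: randomized encouragement
    # Uptake depends on the instrument AND the confounder.
    w = 0.8 * z + 0.8 * u + random.gauss(0.0, 1.0) > 0.4
    # Outcome depends on treatment and the confounder, never on z directly.
    y = TRUE_EFFECT * w + 1.5 * u + random.gauss(0.0, 1.0)
    rows.append((z, w, y))

def mean_where(key_index, key, value_index):
    """Mean of one column over rows where another column equals key."""
    return statistics.fmean(r[value_index] for r in rows if r[key_index] == key)

# Naive comparison of treated vs. untreated users (confounded by u).
naive = mean_where(1, True, 2) - mean_where(1, False, 2)

# Wald estimator: effect of z on the outcome divided by effect of z on uptake.
reduced_form = mean_where(0, True, 2) - mean_where(0, False, 2)
first_stage = mean_where(0, True, 1) - mean_where(0, False, 1)
wald = reduced_form / first_stage

print(f"naive estimate: {naive:.2f}")
print(f"Wald estimate:  {wald:.2f}")
print(f"first stage:    {first_stage:.2f}")
```

The naive comparison overstates the effect because users with high u are both more likely to take the treatment and more likely to have good outcomes. The Wald ratio recovers the simulated effect here only because the effect is constant and the exclusion restriction was built in; with heterogeneous effects it estimates the effect among users whose uptake responds to the instrument.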

When to Choose Causal Inference Over Prediction

Causal inference is harder than prediction because it requires assumptions about the data-generating process. Use it when:

You need to make a decision about changing a policy, feature, or treatment
You have a clear intervention (treatment) and outcome
You can articulate why treatment assignment might be confounded

Do not use causal inference when:

You only need to predict outcomes, not understand causes
You cannot defend unconfoundedness or an alternative identification assumption (IV, DiD, RDD)
Your treatment is poorly defined (e.g., "using the platform more")

Practical Checklist

Before applying causal inference to your study, verify:

Is the treatment well-defined? Can you describe exactly what changes when a unit moves from control to treatment? If not, the causal effect is ambiguous.
Is unconfoundedness plausible? List all confounders that affect both treatment assignment and the outcome. If any are unmeasured, you need a different identification strategy (IV, DiD, RDD).
Is there sufficient overlap? Check the propensity score distribution. If any covariate profile has near-zero or near-one probability of receiving treatment, your estimates will be unstable or rely on extrapolation.
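Is the sample size adequate for heterogeneity analysis? Individual leaves can be small, but a causal forest needs enough treated and control units in every region of covariate space where you want an estimate, which in practice usually means thousands of units overall. With small samples, focus on the ATE, not HTE.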
Have you separated split data from estimation data? If using tree-based methods, ensure "honesty" is enforced. Many implementations (including the grf R package) do this automatically—check your software's documentation.
Have you validated the method on synthetic data? Before applying to real data, simulate a known treatment effect and verify your method recovers it. This catches implementation errors and assumption violations.
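Two of the checklist items, overlap and synthetic validation, can be combined in one short sketch. The setup is hypothetical: a single measured covariate x raises both the treatment probability and the outcome, the true effect is fixed at 2.0, and adjustment is done by simple stratification into ten bins rather than by any particular library. Swap in your own estimator where `adjusted` is computed.

```python
import random
import statistics

random.seed(6)

TRUE_EFFECT = 2.0  # known only because the data is simulated
N_BINS = 10

# Hypothetical data: covariate x raises both treatment probability and outcome.
rows = []
for _ in range(40_000):
    x = random.random()
    w = random.random() < 0.2 + 0.6 * x
    y = TRUE_EFFECT * w + 3.0 * x + random.gauss(0.0, 1.0)
    rows.append((x, w, y))

def diff_in_means(sample):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, w, y in sample if w]
    control = [y for _, w, y in sample if not w]
    return statistics.fmean(treated) - statistics.fmean(control)

# Group units into equal-width bins of x.
bins = [[] for _ in range(N_BINS)]
for row in rows:
    bins[min(int(row[0] * N_BINS), N_BINS - 1)].append(row)

# Overlap check: the treated share in each bin should stay away from 0 and 1.
treated_share = [sum(w for _, w, _ in b) / len(b) for b in bins]
overlap_ok = all(0.05 < p < 0.95 for p in treated_share)

# Validation: compare the naive and the stratified estimate to the known truth.
naive = diff_in_means(rows)
adjusted = sum(len(b) / len(rows) * diff_in_means(b) for b in bins)

print(f"overlap ok: {overlap_ok}")
print(f"true effect:        {TRUE_EFFECT:.2f}")
print(f"naive estimate:     {naive:.2f}")
print(f"adjusted estimate:  {adjusted:.2f}")
```

If the adjusted estimate does not land near the known effect on data you generated yourself, the problem is in the pipeline or in the assumptions, and it is far cheaper to find that out here than on production data.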

Part of the DoOperator Research series on Causal Inference. Browse the full paper corpus at dooperator.ai/research/causal_inference.