Causal Estimation: Choosing the Right Method When You Can't Randomize

The scenario is painfully familiar: Your product team spent months building Feature X. Leadership greenlit a global rollout—no A/B test, no staggered holdout, just a big switch flipped for everyone. Now the VP wants to know: Did it actually move the needle on retention?

You can't go back in time and randomize. The data you have is observational, messy, and biased by the very decision to launch. But the question is causal: Would retention have been different if we hadn't shipped Feature X?

This is causal estimation. And when you can't randomize, it's your only lifeline.

What Causal Estimation Actually Is

Causal estimation is the art of constructing a credible counterfactual—what would have happened without the treatment—using observational data. It's not magic. Every method makes assumptions. The trick is choosing the method whose assumptions least violate your specific reality.

If you get the method wrong, you're not just imprecise—you're confidently wrong. And that's worse than admitting you don't know.

The Toolkit: 5 Methods for Retrospective Causal Analysis

1. Difference-in-Differences (DiD) with Staggered Rollouts

When to use: You have panel data (repeated observations over time) and a comparison group that didn't receive treatment. Classic example: a state implements a policy, neighboring states don't.

The key assumption: Parallel trends. The treated and untreated groups would have followed the same trajectory in the absence of treatment. This is untestable, but you can check pre-treatment trends for plausibility.
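To make the arithmetic concrete, here is a minimal sketch of the basic two-group, two-period DiD calculation. The retention figures and the `did_estimate` helper are invented for illustration; the result is only a causal estimate if parallel trends holds.

```python
from statistics import mean

# Hypothetical retention rates (%), for illustration only.
# Keys: (group, period) -> unit-level outcomes.
outcomes = {
    ("treated", "pre"): [40.0, 42.0, 41.0],
    ("treated", "post"): [47.0, 49.0, 48.0],
    ("control", "pre"): [35.0, 36.0, 34.0],
    ("control", "post"): [38.0, 39.0, 37.0],
}

def did_estimate(data):
    """2x2 difference-in-differences from group-period means."""
    treated_change = mean(data[("treated", "post")]) - mean(data[("treated", "pre")])
    control_change = mean(data[("control", "post")]) - mean(data[("control", "pre")])
    # Under parallel trends, the control group's change stands in for what
    # the treated group would have done without treatment.
    return treated_change - control_change

effect = did_estimate(outcomes)
print(f"DiD estimate: {effect:.1f} points")
```

Here the treated group rises by 7 points and the control group by 3, so the estimate is 4 points.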

The modern fix: If treatment rolls out at different times for different units (staggered adoption), traditional two-way fixed effects can be biased when treatment effects vary across cohorts or over time. Use the Callaway-Sant'Anna estimator, which handles staggered treatment by comparing each treated cohort only to never-treated or not-yet-treated units.

First assumption to check: Parallel trends. Plot pre-treatment trends for both groups. If they're already diverging before treatment, DiD is dead in the water.
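Alongside the plot, one rough numeric check is the slope of the treated-minus-control gap across the pre-treatment periods. The sketch below uses made-up period means and a hypothetical helper, `pre_trend_gap_slope`. A slope near zero is consistent with parallel pre-trends but does not prove the assumption holds after treatment.

```python
# Hypothetical mean outcomes per period; treatment starts at period 5.
treated_means = {1: 10.0, 2: 11.0, 3: 12.0, 4: 13.0, 5: 17.0, 6: 18.0}
control_means = {1: 8.0, 2: 9.0, 3: 10.0, 4: 11.0, 5: 12.0, 6: 13.0}
TREATMENT_START = 5

def pre_trend_gap_slope(treated, control, start):
    """OLS slope of the treated-minus-control gap over pre-treatment periods."""
    periods = sorted(t for t in treated if t < start)
    gaps = [treated[t] - control[t] for t in periods]
    t_bar = sum(periods) / len(periods)
    g_bar = sum(gaps) / len(gaps)
    num = sum((t - t_bar) * (g - g_bar) for t, g in zip(periods, gaps))
    den = sum((t - t_bar) ** 2 for t in periods)
    return num / den

slope = pre_trend_gap_slope(treated_means, control_means, TREATMENT_START)
print(f"Pre-period gap slope: {slope:.3f}")
```

In this toy data the gap is constant before treatment, so the slope is zero. A clearly nonzero slope would mean the groups were already drifting apart.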

Biggest mistake: Using already-treated units as controls for later-treated units without accounting for treatment effect heterogeneity over time. This is the silent killer in staggered designs.
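The "clean controls" idea can be sketched with a simplified group-time effect: for each cohort and period, compare the cohort's change since its last untreated period with the same change among units that are never treated or not yet treated. This is a stripped-down illustration on noise-free, invented data. It has no covariates and no inference, so it is not the full Callaway-Sant'Anna estimator, and `att_gt` here is a local helper, not a library call.

```python
from statistics import mean

PERIODS = range(1, 7)
# Hypothetical units: name -> first treated period (None = never treated).
cohort = {"a": 3, "b": 3, "c": 5, "d": 5, "e": None, "f": None}
unit_level = {"a": 10.0, "b": 12.0, "c": 8.0, "d": 9.0, "e": 11.0, "f": 7.0}

def simulate(unit, t):
    """Noise-free outcome: unit level + common trend + effect growing with exposure."""
    g = cohort[unit]
    effect = float(t - g + 1) if g is not None and t >= g else 0.0
    return unit_level[unit] + 0.5 * t + effect

y = {(u, t): simulate(u, t) for u in cohort for t in PERIODS}

def att_gt(g, t):
    """Effect for cohort g at period t, using only clean controls."""
    base = g - 1  # last untreated period for cohort g
    treated = [u for u in cohort if cohort[u] == g]
    # Clean controls: never treated, or not yet treated at period t.
    controls = [u for u in cohort if cohort[u] is None or cohort[u] > t]
    treated_change = mean(y[(u, t)] - y[(u, base)] for u in treated)
    control_change = mean(y[(u, t)] - y[(u, base)] for u in controls)
    return treated_change - control_change

for t in (3, 4, 5):
    print(f"ATT(g=3, t={t}) = {att_gt(3, t):.1f}")
```

The simulated effect grows by one unit per period of exposure, and the clean comparison recovers 1.0, 2.0, and 3.0. At period 5 the cohort treated in period 5 drops out of the control pool, which is exactly the contamination the paragraph above warns about.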

2. Synthetic Control

When to use: You have one treated unit (one state, one company, one country) and a pool of potential controls. The treated unit is unique, so no single comparison unit looks like it.

How it works: You construct a weighted combination of control units that mimics the pre-treatment trajectory of the treated unit. That synthetic unit becomes your counterfactual.
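A minimal sketch of that weighting step, assuming three hypothetical donor units and made-up outcome paths. Real implementations solve a constrained optimization; the coarse grid search here is only meant to show the constraint (non-negative weights that sum to 1) and the objective (pre-treatment fit).

```python
from itertools import product

# Hypothetical pre-treatment outcome paths (one value per period).
treated_pre = [10.0, 11.0, 13.0, 14.0]
donors_pre = {
    "donor_a": [8.0, 9.0, 12.0, 12.0],
    "donor_b": [12.0, 13.0, 14.0, 16.0],
    "donor_c": [20.0, 18.0, 17.0, 15.0],
}

def synthetic_path(weights, donors):
    """Weighted average of donor paths, period by period."""
    names = list(donors)
    n_periods = len(donors[names[0]])
    return [sum(w * donors[n][t] for w, n in zip(weights, names))
            for t in range(n_periods)]

def fit_weights(treated, donors, steps=20):
    """Grid search over non-negative weights summing to 1 (coarse but transparent)."""
    best_weights, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(donors)):
        if sum(combo) != steps:
            continue  # keep only grid points on the simplex
        weights = [c / steps for c in combo]
        path = synthetic_path(weights, donors)
        loss = sum((a - b) ** 2 for a, b in zip(treated, path))
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss

weights, loss = fit_weights(treated_pre, donors_pre)

# Post-treatment: the synthetic path is the counterfactual.
treated_post = [18.0, 19.0]
donors_post = {"donor_a": [13.0, 14.0], "donor_b": [17.0, 18.0], "donor_c": [14.0, 13.0]}
counterfactual = synthetic_path(weights, donors_post)
gaps = [obs - cf for obs, cf in zip(treated_post, counterfactual)]
print(weights, gaps)
```

In this toy data the treated unit is an exact 50/50 blend of the first two donors before treatment, so the search recovers those weights and the post-treatment gap is 3.0 in both periods.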

The key assumption: The weighted combination captures the latent factors that drive the outcome. A close match over a long pre-treatment period makes that credible; a close match over only a handful of periods may just be overfitting.

First assumption to check: Does the synthetic control actually track the treated unit's pre-treatment path? If not, your weights are useless.

Biggest mistake: Including control units that were themselves affected by the treatment (spillover). If your "control" state also changed its policy because your treated state did, your synthetic control is contaminated.

3. Instrumental Variables (IV)

When to use: There's an unobserved confounder affecting both treatment and outcome. You need an "instrument"—a variable that affects treatment but affects the outcome only through treatment.

The key assumptions:

Relevance: The instrument strongly predicts treatment.
Exogeneity: The instrument is as good as randomly assigned (uncorrelated with unobserved confounders).
Exclusion restriction: The instrument affects the outcome only through the treatment.
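A small simulation shows what a valid instrument buys you. Everything here is invented: the data-generating process, the true effect of 2.0, and the variable names. The instrument `z` satisfies all three assumptions by construction, which you can never verify this cleanly on real data.

```python
import random

random.seed(0)
N = 20000
TRUE_EFFECT = 2.0

z, x, y = [], [], []
for _ in range(N):
    zi = random.choice([0.0, 1.0])      # instrument, as good as random
    ui = random.gauss(0, 1)             # unobserved confounder
    xi = zi + ui + random.gauss(0, 1)   # treatment depends on z and u
    yi = TRUE_EFFECT * xi + 3.0 * ui + random.gauss(0, 1)  # z enters only via x
    z.append(zi)
    x.append(xi)
    y.append(yi)

def cov(a, b):
    """Sample covariance."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)

ols = cov(x, y) / cov(x, x)  # biased: x is correlated with the confounder u
iv = cov(z, y) / cov(z, x)   # IV ratio: reduced form divided by first stage

print(f"OLS: {ols:.2f}  IV: {iv:.2f}  truth: {TRUE_EFFECT}")
```

The naive regression absorbs the confounder and overshoots, while the IV ratio should land close to the true effect. With one instrument and one treatment, this ratio is what two-stage least squares reduces to.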

First assumption to check: Relevance. If your instrument is weak (a low first-stage F-statistic; below 10 is a common rule-of-thumb warning sign), your estimates will be biased toward the confounded OLS estimate and unstable.
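The relevance check can be sketched as the F-statistic on the instrument in a regression of treatment on instrument. The helper `first_stage_f` and the simulated data below are illustrative assumptions; the formula assumes a single instrument, no other covariates, and homoskedastic errors.

```python
import random

def first_stage_f(z, x):
    """F-statistic for the slope in a regression of treatment x on instrument z."""
    n = len(z)
    z_bar, x_bar = sum(z) / n, sum(x) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    slope = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x)) / szz
    intercept = x_bar - slope * z_bar
    ssr = sum((xi - intercept - slope * zi) ** 2 for zi, xi in zip(z, x))
    slope_var = (ssr / (n - 2)) / szz
    return slope ** 2 / slope_var  # with one instrument, F equals t squared

random.seed(1)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
x_strong = [1.0 * zi + e for zi, e in zip(z, noise)]   # z moves treatment a lot
x_weak = [0.02 * zi + e for zi, e in zip(z, noise)]    # z barely moves treatment

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong instrument F: {f_strong:.1f}, weak instrument F: {f_weak:.1f}")
```

The strong instrument should produce an F-statistic in the thousands, while the weak one should produce a far smaller value.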

Biggest mistake: Pretending the exclusion restriction holds when it clearly doesn't. Example: using "distance to a hospital" as an instrument for treatment—distance also affects health outcomes directly through access to emergency care. That's a violation.

4. Regression Discontinuity (RD)

When to use: Treatment is assigned based on whether a continuous variable crosses a known threshold. Examples: scholarship for test scores above 80, subsidy for income below $50k.

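The key assumption: Units just above and just below the threshold are comparable, so the only thing that changes discontinuously at the cutoff is treatment. This requires that units cannot precisely manipulate which side of the threshold they land on.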

First assumption to check: Is there bunching just above or just below the threshold? If people can manipulate their score to get treatment (or avoid it), RD is invalid.
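A crude first look at bunching is to count observations in equal-width windows on either side of the cutoff. The scores, the window width, and the helper `counts_near_cutoff` are all hypothetical. This is a heuristic screen, not a formal density test.

```python
def counts_near_cutoff(running, cutoff, width):
    """Count observations in equal-width windows just below and just above the cutoff."""
    below = sum(1 for r in running if cutoff - width <= r < cutoff)
    above = sum(1 for r in running if cutoff <= r < cutoff + width)
    return below, above

# Hypothetical test scores with a scholarship cutoff at 80.
scores = [72, 74, 75, 76, 77, 78, 78, 79, 80, 80, 80, 80, 80,
          81, 81, 81, 82, 83, 84, 88]

below, above = counts_near_cutoff(scores, cutoff=80, width=3)
ratio = above / below if below else float("inf")
print(f"just below: {below}, just above: {above}, ratio: {ratio:.2f}")
```

Here 9 scores sit just above the cutoff against 4 just below. A lopsided count like that, especially a pile-up exactly at the threshold, is a reason to suspect manipulation and investigate before trusting an RD estimate.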

Biggest mistake: Using a global polynomial fit instead of local linear regression near the cutoff. Global polynomials can produce wildly misleading estimates—stick to local linear with a bandwidth chosen by data-driven methods (e.g., MSE-optimal bandwidth).
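A minimal sketch of the local linear approach: fit a separate straight line on each side of the cutoff, using only observations within the bandwidth, and take the difference of the two fits at the cutoff. The data are noise-free and invented, and the bandwidth of 10 is an arbitrary placeholder rather than an MSE-optimal choice. The sketch also uses a uniform kernel for simplicity.

```python
def ols_line(xs, ys):
    """Intercept and slope of a least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

def rd_local_linear(running, outcome, cutoff, bandwidth):
    """Jump at the cutoff from separate linear fits on each side (uniform kernel)."""
    left = [(r - cutoff, o) for r, o in zip(running, outcome)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, o) for r, o in zip(running, outcome)
             if cutoff <= r <= cutoff + bandwidth]
    left_intercept, _ = ols_line(*zip(*left))
    right_intercept, _ = ols_line(*zip(*right))
    # The running variable is centered, so each intercept is the fit at the cutoff.
    return right_intercept - left_intercept

# Hypothetical data: outcome rises 0.5 per point and jumps by 5 at a score of 80.
scores = list(range(60, 101))
outcomes = [20.0 + 0.5 * s + (5.0 if s >= 80 else 0.0) for s in scores]

jump = rd_local_linear(scores, outcomes, cutoff=80, bandwidth=10)
print(f"Estimated jump at cutoff: {jump:.2f}")
```

Because only observations near the cutoff enter the fit, curvature far from the threshold cannot distort the estimate, which is the main argument against global polynomials.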

5. Double/Debiased Machine Learning (DML)

When to use: You have many covariates (high-dimensional data) and complex relationships, but you still need a causal estimate. Think: user-level telemetry data with hundreds of features.

How it works: Use machine learning to flexibly model both the outcome and the treatment as functions of the covariates, then regress the outcome residuals on the treatment residuals to get a debiased estimate of the treatment effect. The "double" part means you estimate both nuisance functions; the "debiased" part corrects for regularization bias.

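The key assumption: There are no unobserved confounders once you condition on your covariates, and you can estimate the nuisance functions (outcome model and propensity score) reasonably well. DML is robust to some misspecification of those models, but not to complete failure, and no amount of ML fixes a confounder you never measured.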

First assumption to check: Do you have enough data? DML requires large samples—it's not for small-N settings.

Biggest mistake: Using the same data to fit the ML models and estimate the treatment effect without cross-fitting. This leads to overfitting bias. Always use cross-fitting (sample splitting).
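The cross-fitting recipe can be sketched in plain Python. To keep it dependency-free, a simple bin-means regressor stands in for the ML model; any flexible learner would slot into the same place. The simulation, the true effect of 1.0, and the function names are all assumptions made for illustration, and the single covariate `x` is the only confounder by construction.

```python
import math
import random

random.seed(0)
N, THETA, BINS = 4000, 1.0, 20

xs = [random.random() for _ in range(N)]                                  # covariate
ds = [math.sin(2 * math.pi * x) + random.gauss(0, 1) for x in xs]         # treatment
ys = [THETA * d + 2 * math.sin(2 * math.pi * x) + random.gauss(0, 1)      # outcome
      for x, d in zip(xs, ds)]

def fit_bin_means(x_train, t_train):
    """Stand-in 'ML' learner: predict the mean target within each bin of x."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for x, t in zip(x_train, t_train):
        b = min(int(x * BINS), BINS - 1)
        sums[b] += t
        counts[b] += 1
    overall = sum(t_train) / len(t_train)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * BINS), BINS - 1)]

def dml_partial_out(xs, ds, ys, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the treatment effect."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    num = den = 0.0
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predict_d = fit_bin_means([xs[i] for i in train], [ds[i] for i in train])
        predict_y = fit_bin_means([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:  # residualize only on data the models never saw
            d_res = ds[i] - predict_d(xs[i])
            y_res = ys[i] - predict_y(xs[i])
            num += d_res * y_res
            den += d_res ** 2
    return num / den

d_bar, y_bar = sum(ds) / N, sum(ys) / N
naive = (sum((d - d_bar) * (y - y_bar) for d, y in zip(ds, ys))
         / sum((d - d_bar) ** 2 for d in ds))
theta_hat = dml_partial_out(xs, ds, ys)
print(f"naive: {naive:.2f}  cross-fitted: {theta_hat:.2f}  truth: {THETA}")
```

The naive slope absorbs the confounding through `x` and overshoots, while the cross-fitted estimate should sit close to the true effect. Each observation's residuals come from models trained on the other fold, which is what removes the overfitting bias.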

A Decision Framework: 3 Questions Before You Pick a Method

Before you even open a Python notebook, ask yourself these three questions. They'll narrow your options dramatically.

Question 1: Do I have a clear comparison group?

Yes, with panel data → DiD (if the rollout is staggered, use Callaway-Sant'Anna)
Yes, with one treated unit → Synthetic Control
No, but I have a threshold → RD
No, but I have an instrument → IV
No, but I have rich covariates → DML
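The branching above can be written down as a small helper, which is handy as a checklist in a notebook. The function name and flags are invented for illustration, and the single-treated-unit case is checked first as a judgment call. The output is a starting candidate, not a verdict: Questions 2 and 3 still apply.

```python
def suggest_method(has_comparison_group, has_panel_data=False,
                   single_treated_unit=False, staggered_rollout=False,
                   has_threshold=False, has_instrument=False,
                   has_rich_covariates=False):
    """Map answers to Question 1 onto a candidate method."""
    if has_comparison_group:
        if single_treated_unit:
            return "Synthetic Control"
        if has_panel_data:
            return "DiD (Callaway-Sant'Anna)" if staggered_rollout else "DiD"
    if has_threshold:
        return "RD"
    if has_instrument:
        return "IV"
    if has_rich_covariates:
        return "DML"
    return "No method fits; reconsider the data or the question"

print(suggest_method(True, has_panel_data=True, staggered_rollout=True))
print(suggest_method(False, has_threshold=True))
```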

Question 2: Can I credibly argue the key assumption holds?

This is the hard part. For each candidate method, write down the key assumption in plain English. Then ask: Does my domain knowledge support this? If you can't convince a skeptical colleague, the method won't survive peer review (or your VP's scrutiny).

Question 3: What's my risk tolerance?

Low tolerance (regulatory, public-facing) → Use synthetic control or RD, which are more transparent and have fewer moving parts.
Medium tolerance (internal product decisions) → DiD with robustness checks (placebo tests, sensitivity analysis).
High tolerance (exploratory, early-stage) → DML if you have the data; IV if you have a strong instrument.

The Bottom Line

You can't randomize. But you can still estimate causal effects—if you're honest about your assumptions. The worst mistake isn't choosing the wrong method; it's pretending your chosen method has no assumptions.

Start by understanding your data structure. Then pick the method whose assumptions you can defend. Run robustness checks. And when in doubt, use multiple methods and see if they converge. If they don't, your answer is: We don't know yet.

That's not failure. That's rigor.