Observational vs. Experimental Data When Making Automated Decisions Using Machine Learning — DoOperator Research

Authors	Carlos Fernández-Loría, F. Provost
Journal	INFORMS Journal on Data Science
Year	2025
DOI	10.1287/ijds.2023.0012
Citations	3

What problem it solves

This paper addresses a fundamental tension in data-driven decision-making: when should a decision-maker rely on cheap, abundant observational data rather than expensive, scarce experimental data to optimize interventions? Observational data suffers from confounding bias, while experimental data (e.g., A/B tests, randomized controlled trials) is often cost-prohibitive, logistically infeasible, or too slow to collect at scale. Existing approaches typically frame the choice as a bias-variance tradeoff in which one must either accept confounding bias or pay for unbiased estimates. The paper shows that for decision-making (specifically, determining whether a causal effect exceeds a threshold) confounded observational data can sometimes outperform experimental data. The key insight is that the objective is not unbiased estimation of treatment effects but correct ranking of effects relative to a decision threshold. When confounding systematically overestimates larger effects, or when larger sample sizes reduce variance enough to offset the bias, observational data can yield better decisions. The paper provides theoretical conditions under which this occurs, empirical heuristics to test for them, and validation across 77 scenarios from the 2016 Atlantic Causal Inference Conference competition.

How it works

The paper reframes the decision problem from "estimate the causal effect as accurately as possible" to "determine whether the causal effect exceeds a threshold." This shift is crucial because it changes the loss function from squared error (or similar) to a 0-1 decision loss: with τ denoting the decision threshold, all that matters is getting the sign of (CATE - τ) correct.

Intuition: Imagine you're deciding whether to show an ad to a customer. The true effect of the ad on purchase probability is 0.05 (5 percentage points), and your threshold is 0.03. An unbiased estimator from a small experiment is centered on 0.05 but has high variance (standard error 0.02): a single estimate could easily come out at 0.04, and under a normal approximation it falls below 0.03 about 16% of the time, in which case you would incorrectly decide not to show the ad. A biased observational estimator might be centered on 0.06 (overestimating by 0.01) but with much lower variance (standard error 0.005) thanks to its larger sample size, so its estimates almost never fall below the threshold. The biased estimator reliably places the effect above the threshold, while the unbiased one sometimes does not. Here the bias does no harm to the decision: only the ranking relative to τ matters.
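
The numbers in the ad example can be checked directly. The sketch below assumes each estimator is approximately normally distributed around its mean; that assumption, and the function name prob_wrong_side, are introduced here for illustration and are not taken from the paper.

```python
from statistics import NormalDist


def prob_wrong_side(true_effect, bias, std_error, threshold):
    """Probability that an estimate lands on the wrong side of the threshold.

    Assumes (for illustration only) that the estimate is normally distributed
    with mean true_effect + bias and standard deviation std_error.
    """
    estimate = NormalDist(mu=true_effect + bias, sigma=std_error)
    p_at_or_below = estimate.cdf(threshold)
    # If the true effect clears the threshold, an estimate at or below it is
    # an error; otherwise an estimate above it is an error.
    return p_at_or_below if true_effect > threshold else 1.0 - p_at_or_below


# Values from the ad example: true effect 0.05, threshold 0.03.
p_experiment = prob_wrong_side(0.05, bias=0.0, std_error=0.02, threshold=0.03)
p_observational = prob_wrong_side(0.05, bias=0.01, std_error=0.005, threshold=0.03)

print(f"experiment:    {p_experiment:.4f}")
print(f"observational: {p_observational:.2e}")
```

Under this approximation the unbiased experimental estimate ends up on the wrong side of the threshold roughly 16% of the time, while the biased observational estimate does so with negligible probability.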

Core mechanics:

The paper formalizes this through the concept of decision risk. Let CATE(x) be the true conditional average treatment effect at covariates x. The ideal decision rule is: intervene if CATE(x) > τ. With an estimator CATÊ(x), the rule actually applied is: intervene if CATÊ(x) > τ. A decision error occurs when the sign of (CATÊ(x) - τ) differs from the sign of (CATE(x) - τ).
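
As a minimal sketch of this definition, the function below counts how often the estimated decision disagrees with the ideal one. The function name and the toy numbers are hypothetical, and in practice the true CATE is unknown, so a check like this is only possible in simulations or benchmarks where ground truth is available.

```python
def decision_error_rate(true_cate, estimated_cate, threshold):
    """Share of units where the estimate and the truth fall on opposite
    sides of the threshold (0-1 decision loss, averaged over units)."""
    if len(true_cate) != len(estimated_cate):
        raise ValueError("inputs must have the same length")
    errors = sum(
        (truth > threshold) != (estimate > threshold)
        for truth, estimate in zip(true_cate, estimated_cate)
    )
    return errors / len(true_cate)


# Toy values (made up): every estimate is off, but only two flip the decision.
true_cate = [0.01, 0.02, 0.04, 0.05, 0.08]
estimated = [0.02, 0.035, 0.06, 0.025, 0.11]

rate = decision_error_rate(true_cate, estimated, threshold=0.03)
print(rate)  # 0.4
```

Note that the first, third, and fifth estimates are inaccurate yet cost nothing under this loss, because they stay on the correct side of the threshold.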

The paper shows that the decision error rate for observational data depends on two quantities:

The bias function b(x) = E[CATÊ_obs(x)] - CATE(x)
The variance function v(x) = Var(CATÊ_obs(x))

For experimental data, b(x) = 0 by design, but v(x) is typically larger due to smaller sample size.
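
To see how the two quantities enter the error rate, it can help to write the pointwise error probability under one simplifying assumption: that the estimator at x is approximately normal with mean CATE(x) + b(x) and variance v(x). This is a sketch under that assumption, not a formula quoted from the paper; Φ denotes the standard normal CDF and s(x) is a symbol introduced here for convenience.

```latex
\[
s(x) = \operatorname{sign}\bigl(\mathrm{CATE}(x) - \tau\bigr), \qquad
\Pr\bigl[\text{decision error at } x\bigr]
  = \Phi\!\left( -\,\frac{s(x)\,\bigl(\mathrm{CATE}(x) + b(x) - \tau\bigr)}{\sqrt{v(x)}} \right)
\]
```

Under this approximation the error probability falls when b(x) has the same sign as CATE(x) - τ or when v(x) shrinks, and rises when the bias pulls the estimator's mean toward or across τ. Setting b(x) = 0 gives the experimental case.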

The key theoretical result (Theorem 1) provides conditions under which the observational estimator dominates the experimental estimator in terms of decision error rate. The central condition is that the bias is monotone non-decreasing in the true effect: larger true effects are more likely to be overestimated. Under this condition, the bias actually helps decision-making because it pushes estimates away from the threshold in the correct direction for large effects.

The monotone confounding condition:

Formally, let F_x be the distribution of the true CATE across the covariate space. The condition is:

Cov(b(x), CATE(x)) ≥ 0

where b(x) is the bias and the covariance is taken across the same covariate space. This holds whenever the bias function is non-decreasing in CATE(x), although a non-negative covariance alone does not guarantee monotonicity at every point. When the bias is non-decreasing, the observational estimator's bias amplifies the signal rather than obscuring it.
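
A covariance sign is a summary over the whole population, so it is worth checking pointwise monotonicity separately. The sketch below uses made-up values and hypothetical helper names to show a case where the covariance is positive even though the bias dips for one unit.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def is_non_decreasing_in(xs, ys):
    """True if ys never decreases when the points are ordered by xs."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Made-up true effects and biases for four covariate profiles.
cate = [0.00, 0.01, 0.02, 0.03]
bias = [0.000, 0.005, 0.003, 0.010]

print(covariance(bias, cate) >= 0)       # True: the covariance condition holds
print(is_non_decreasing_in(cate, bias))  # False: the bias dips at cate = 0.02
```

With real data neither the true CATE nor the bias is observed, so both inputs would have to be replaced by estimates, for example from an experimental holdout.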

Heuristic tests:

The paper proposes two empirical heuristics to assess whether observational data satisfies the monotone confounding condition:

Rank correlation test: Compute the rank correlation between the observational CATE estimates and the experimental CATE estimates (on a holdout experimental dataset). If the rank correlation is high (e.g., > 0.7), the observational data preserves the ranking of effects, which is sufficient for threshold-based decisions.
Bias monotonicity test: Partition the covariate space into strata based on experimental CATE estimates. Within each stratum, compute the average bias (observational estimate minus experimental estimate). Test whether the bias is monotone increasing across strata (e.g., using a Jonckheere-Terpstra test or simple linear regression of bias on stratum rank).
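
The rank correlation heuristic can be sketched with the standard library alone. The code below computes a Spearman correlation (average ranks for ties, then a Pearson correlation of the ranks); the choice of Spearman, the helper names, and the example estimates are assumptions made for illustration, since the summary above does not say which rank correlation the paper uses.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical CATE estimates for five segments of an experimental holdout.
observational = [0.02, 0.05, 0.09, 0.04, 0.12]
experimental = [0.01, 0.03, 0.06, 0.04, 0.08]

rho = spearman(observational, experimental)
print(round(rho, 3))  # 0.9
```

Here only two neighbouring segments swap places, so the correlation stays above the illustrative 0.7 cutoff mentioned in the text.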

These tests require a small experimental dataset for validation, but the paper shows they can be effective with as few as 100-200 experimental observations.
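
The bias monotonicity heuristic can be sketched along the same lines, using the simple-regression variant mentioned above: stratify by experimental estimate, average the bias within each stratum, and fit a least-squares slope of mean bias on stratum rank. The function name, the equal-size strata, and the example data are assumptions for illustration.

```python
def stratum_bias_trend(obs_est, exp_est, n_strata=4):
    """Mean bias per stratum (strata ordered by experimental estimate) and
    the least-squares slope of mean bias on stratum rank."""
    n = len(exp_est)
    if len(obs_est) != n or n < n_strata:
        raise ValueError("need equal-length inputs with at least n_strata units")
    order = sorted(range(n), key=lambda i: exp_est[i])
    mean_bias = []
    for s in range(n_strata):
        members = order[s * n // n_strata:(s + 1) * n // n_strata]
        # Bias proxy: observational estimate minus experimental estimate.
        mean_bias.append(sum(obs_est[i] - exp_est[i] for i in members) / len(members))
    ranks = list(range(1, n_strata + 1))
    mean_rank = sum(ranks) / n_strata
    mean_b = sum(mean_bias) / n_strata
    slope = (
        sum((r - mean_rank) * (b - mean_b) for r, b in zip(ranks, mean_bias))
        / sum((r - mean_rank) ** 2 for r in ranks)
    )
    return mean_bias, slope


# Hypothetical estimates where the observational bias grows with the effect.
exp_est = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
obs_est = [0.015, 0.030, 0.045, 0.060, 0.075, 0.090, 0.105, 0.120]

mean_bias, slope = stratum_bias_trend(obs_est, exp_est)
print([round(b, 4) for b in mean_bias])  # [0.0075, 0.0175, 0.0275, 0.0375]
print(slope > 0)                         # True: bias rises across strata
```

One caveat worth keeping in mind: when the same noisy experimental estimates are used both to form the strata and to compute the bias, the noise pushes the slope downward, so a positive slope in this sketch is conservative evidence while a negative one may partly reflect noise.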

The bias-variance-decision tradeoff:

The paper derives a decision-theoretic bound. Let R_obs and R_exp be the decision error rates for observational and experimental data respectively. Under the monotone confounding condition:

R_obs ≤ R_exp + O(σ²_obs / n_obs - σ²_exp / n_exp)

where σ²_obs and σ²_exp are the conditional variances and n_obs and n_exp are the sample sizes. When n_obs >> n_exp, the variance reduction can dominate the bias penalty, making R_obs < R_exp.
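
The tradeoff can be illustrated numerically without relying on the bound itself. The sketch below averages pointwise error probabilities over a small grid of true effects, assuming normally distributed estimators with standard error σ/√n; the grid, σ, the sample sizes, and the two bias functions are all hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def average_error_rate(cates, bias_fn, sigma, n, threshold):
    """Average decision error over a grid of true effects, assuming each
    estimate is normal with mean cate + bias_fn(cate) and std sigma/sqrt(n)."""
    std_error = sigma / sqrt(n)
    total = 0.0
    for cate in cates:
        p_at_or_below = NormalDist(cate + bias_fn(cate), std_error).cdf(threshold)
        total += p_at_or_below if cate > threshold else 1.0 - p_at_or_below
    return total / len(cates)


TAU = 0.03
CATES = [0.00, 0.01, 0.02, 0.04, 0.05, 0.06]  # true effects on both sides of TAU
SIGMA = 0.5                                   # assumed outcome noise level

# Small unbiased experiment versus a 16x larger observational sample.
r_exp = average_error_rate(CATES, lambda c: 0.0, SIGMA, n=625, threshold=TAU)
# Bias that grows with the true effect (monotone confounding).
r_obs = average_error_rate(CATES, lambda c: 0.2 * c, SIGMA, n=10_000, threshold=TAU)
# Bias that shrinks large effects (anti-monotone confounding).
r_obs_anti = average_error_rate(CATES, lambda c: -0.5 * c, SIGMA, n=10_000, threshold=TAU)

print(f"R_exp               = {r_exp:.3f}")
print(f"R_obs (monotone)    = {r_obs:.3f}")
print(f"R_obs (anti-monot.) = {r_obs_anti:.3f}")
```

In this toy setup the observational estimator with increasing bias makes far fewer decision errors than the small experiment, while the same sample size with effects shrunk toward zero does worse than the experiment, which is consistent with the role the monotone confounding condition plays in the text.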

When to use it

Prefer observational data over experimental data when:

You are making threshold-based decisions (e.g., "show ad if effect > 3%") rather than estimating effects for scientific understanding.
The observational dataset is at least 10-100x larger than the experimental dataset you could afford.
You have reason to believe confounding is monotone in the true effect (e.g., selection bias where more responsive units are more likely to be treated).
The decision threshold τ is not near zero (i.e., you're looking for effects that are clearly positive or clearly negative, not borderline).
You can validate the monotone confounding condition using a small experimental holdout (e.g., 100-200 observations).

Prefer experimental data over observational data when:

You need unbiased estimates for scientific reporting or regulatory purposes (e.g., FDA submissions, academic publication).
The decision threshold is near zero, where even small bias can flip the decision.
Confounding is likely to be anti-monotone (larger effects are underestimated) or non-monotone.
The observational dataset is only modestly larger than the experimental dataset (e.g., 2-5x).
You are doing continuous optimization (e.g., bandit algorithms) rather than threshold-based decisions.
The cost of a wrong decision is asymmetric and large (e.g., medical treatment with severe side effects).

Prefer hybrid approaches (e.g., doubly robust estimation, data fusion) when:

You have both datasets and can afford the computational overhead.
The monotone confounding condition is questionable but you still want to leverage observational data.
You need both unbiased estimates (for some subgroups) and large sample sizes (for others).

Prefer no intervention at all when:

Both datasets are too small or too noisy to make reliable threshold decisions.
The effect sizes are all near the threshold and variance is high.

Limitations and failure modes

Monotone confounding is a strong, untestable condition: The heuristic tests are suggestive but not definitive. In practice, confounding can be non-monotone (e.g., larger effects are underestimated due to floor/ceiling effects, or confounding varies non-monotonically with effect size). The paper provides no guarantees when the condition fails.
Threshold must be known and fixed: The entire framework collapses if the threshold is uncertain, estimated from data, or changes over time. In many real-world settings, the threshold is itself a decision variable (e.g., "what ROI do we require?"), creating a circular dependency.
Covariate shift between observational and experimental populations: The paper assumes the same covariate distribution, but in practice, experimental data often comes from a different population (e.g., opt-in users, specific geographic regions). The heuristic tests may not detect this shift.
Small experimental validation set: The heuristic tests require experimental data. If the experimental dataset is too small (e.g., < 50 observations), the tests have low power and may give misleading results. The paper's simulations use at least 100 experimental observations.
No dynamic or sequential decisions: The paper considers a single, static decision. In reinforcement learning or bandit settings, decisions affect future data collection, and the bias-variance-decision tradeoff changes fundamentally.
Ignoring interference and spillover: The paper assumes no interference between units. In advertising or social network settings, treatment effects can spill over, violating the consistency assumption.
No uncertainty quantification for decisions: The paper provides decision rules but no confidence intervals
