Why Your Organization's Experiments Are Probably Confounded — DoOperator Research

Most organizational experiments are confounded. Not all of them. But enough that the confident conclusions drawn from them deserve skepticism.

A confounder is a variable that influences both which condition a unit ends up in and what the outcome is. In a well-run randomized controlled trial, confounders are neutralized by the randomization. The coin flip, the random assignment — these break the link between confounders and condition assignment. On average, confounders are equally distributed across conditions. They add noise but not bias.

The problem is that most business experiments are not well-run randomized controlled trials.

The typical A/B test

The canonical organizational experiment: a product team changes a button color, a pricing page, an onboarding flow. They split traffic randomly. They run for two weeks. They find a 4% lift in conversion. They ship it.

This sounds like it should work. Random split, controlled test, measured outcome. What could go wrong?

Several things.

Novelty effects. Users who see the new design may respond to its newness, not its quality. The effect can be positive for the first two weeks and then decay as the novelty fades. In that case the 4% lift measured surprise, not preference.

Sample ratio mismatch. If the observed split deviates from the intended allocation by more than chance would explain, something in the allocation or logging mechanism is biasing who ends up in which variant, and the comparison can no longer be trusted. This happens more often than teams catch, especially when allocation is keyed to session cookies, which lose track of users who clear cookies or switch devices.

Interaction effects. The organization runs ten experiments simultaneously. Unless each experiment randomizes independently of the others, users assigned to variant A in experiment 1 may not be evenly spread across the variants in experiments 2 through 10. Even with independent assignment, the changes themselves can interact. The measured effect of any one experiment may then be partly the effect of another.

Temporal confounders. When variants are exposed over different periods (a staged rollout, a variant launched mid-week, a before-and-after comparison), the comparison is effectively between something like a Tuesday-Wednesday cohort and a Friday-Saturday cohort. These will often show different results, not because the variant works differently, but because the populations are different. Day-of-week effects, promotional periods, news events, and seasonal patterns all create temporal confounders that a two-week window may not average out.

None of these are exotic problems. They are routine features of how experiments run in practice.
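Of these problems, sample ratio mismatch is the easiest to test for mechanically. The sketch below is one way to do it, assuming a two-arm experiment with a known intended split. It runs a chi-square goodness-of-fit test on the observed arm counts. The function names, the 0.001 alert threshold, and the counts are illustrative assumptions, not a standard.

```python
import math

# Assumed alert threshold for this sketch; pick one that suits your own alerting.
SRM_ALPHA = 0.001


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control, n_treatment, expected_treatment_share=0.5, alpha=SRM_ALPHA):
    """True when the observed split is unlikely under the intended allocation."""
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < alpha


# Illustrative counts only, not data from a real experiment.
print(has_srm(5_000, 5_050))    # small imbalance, consistent with chance
print(has_srm(50_000, 51_500))  # imbalance too large to attribute to chance
```

A flagged mismatch does not say what went wrong, only that the arms were not populated the way the design intended. The usual next step is to investigate the allocation and logging path before reading the outcome metric at all.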

The higher-stakes version

The confounding problem is worse in organizational experiments that fall outside the clean A/B template.

Sales process experiments. A company wants to know whether a new outreach sequence produces better conversion than the standard one. Sales reps are trained on the new sequence. Some reps adopt it enthusiastically. Others use it selectively, on prospects they believe are already warm. The condition — new sequence vs. old — is now correlated with prospect quality, because the reps who choose to use the new sequence are choosing it for prospects where they expect it to work. The experiment is confounded by rep judgment.

Pricing tests. A company tests two price points by routing different customer segments to different pricing pages. But segments are not equivalent. The high-price segment may have higher average intent to purchase, different demographic profiles, and different referral sources. What looks like a price sensitivity test is measuring a combination of price sensitivity and segment differences.

Operational experiments. A warehouse manager wants to test a new picking process. One shift uses the new process; another uses the old one. But shifts differ: in staffing levels, in the product mix being fulfilled, in the experience level of the workers. The shift is a confounder. The experiment cannot tell you whether the process works; it can only tell you how the process and the shift performed in combination.

In each case, the experiment looks like a comparison of two conditions. It is actually a comparison of two conditions plus an unknown mixture of confounders.

Why this matters more than it seems

The standard response to confounding is: run a larger experiment, collect more data, let the noise average out.

This is wrong, and the mistake matters.

Noise is random variation. More data reduces noise. With enough trials, a real effect emerges through the noise. A noisy experiment gives you an uncertain answer.

Confounding is systematic bias. More data makes the biased estimate more precise. You get a very confident wrong answer.

A sufficiently large confounded experiment can produce a statistically significant result in the wrong direction with high certainty. This happens whenever the bias is opposite in sign to the true effect and larger than it. The p-value will be tiny. The confidence interval will be narrow. The conclusion will be wrong. Nothing about the statistical output will flag the problem.

This is why "statistically significant" is not the same as "correctly estimated." Statistical significance is a statement about noise. It says nothing about confounding. An effect that is real but confounded will appear statistically significant. An effect that is purely confounding will appear statistically significant. The test cannot tell the difference.
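The difference between noise and bias is easy to see in a small simulation. The sketch below is loosely modeled on the sales example: the new sequence has no true effect, but it is used far more often on warm prospects, who convert at a higher rate anyway. All the rates, the 80/20 usage pattern, and the function name are assumptions chosen for illustration.

```python
import math
import random


def simulate(n, true_effect=0.0, seed=0):
    """Return (naive difference in conversion rate, its standard error)."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        warm = rng.random() < 0.5
        # Confounded assignment: warm prospects are far more likely to get the new sequence.
        gets_new = rng.random() < (0.8 if warm else 0.2)
        p_convert = 0.10 + (0.10 if warm else 0.0) + (true_effect if gets_new else 0.0)
        converted = 1 if rng.random() < p_convert else 0
        (treated if gets_new else control).append(converted)
    p_t = sum(treated) / len(treated)
    p_c = sum(control) / len(control)
    se = math.sqrt(p_t * (1 - p_t) / len(treated) + p_c * (1 - p_c) / len(control))
    return p_t - p_c, se


# The true effect is zero in every run; only the sample size changes.
for n in (2_000, 20_000, 200_000):
    diff, se = simulate(n)
    print(f"n={n:>7}  naive lift={diff:+.4f}  std. error={se:.4f}  z={diff / se:.1f}")
```

Under these assumptions the naive lift settles near six percentage points regardless of sample size, while the standard error keeps shrinking. The larger run does not get closer to the true effect of zero. It only becomes more certain about the wrong number.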

The organizational implications

Three practices reduce confounding in organizational experiments:

Randomize at the right level. If your experiment randomizes by session but your conversion outcome is user-level, you have a problem — a user who sees both variants is in both conditions. Randomize at the unit that aligns with your outcome measure.
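One common way to randomize at the user level is to derive the variant from a hash of a stable user identifier rather than from anything tied to the session. The sketch below assumes such an identifier exists (for example, an account ID). The function name, the experiment names, and the two-variant default are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    # Salting with the experiment name gives each experiment its own split.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same variant of a given experiment,
# no matter how many sessions or devices they use.
print(assign_variant("user-42", "pricing-page"))
print(assign_variant("user-42", "pricing-page"))
print(assign_variant("user-42", "onboarding-flow"))
```

Because the assignment depends only on the user and the experiment name, a returning user cannot drift into the other condition. Using a different salt per experiment also keeps a user's variant in one experiment from determining their variant in another. It does nothing about interactions between the changes themselves.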

Pre-register your analysis. Decide before the experiment what the primary outcome is, what the exclusion criteria are, and how you will handle edge cases. Post-hoc decisions about what to measure, who to include, and when to stop are a common source of false positives in organizational experimentation. Pre-registration is not bureaucracy; it is protection against your own motivated reasoning.

Track potential confounders. Before running the experiment, list the variables that might influence both condition assignment and outcomes. Day of week. Traffic source. User tenure. Product line. Measure them. Check, after the experiment, whether they are balanced across conditions. If they are not, adjust for them in the analysis, for example by comparing conditions within each level of the variable, and report the raw and adjusted estimates side by side.
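A minimal sketch of the check-and-adjust step, assuming a single categorical confounder (traffic source here) and a binary outcome. It compares a naive difference in means with a stratified estimate that compares conditions within each traffic source and then weights by stratum size. The helper names and the data are invented for illustration.

```python
from collections import defaultdict


def group_by_stratum(rows):
    """rows: list of (stratum, treated, outcome) tuples."""
    cells = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in rows:
        cells[stratum][bool(treated)].append(outcome)
    return cells


def treated_share(rows):
    """Balance check: share of each stratum that received the treatment."""
    return {
        stratum: len(arms[True]) / (len(arms[True]) + len(arms[False]))
        for stratum, arms in group_by_stratum(rows).items()
    }


def mean(values):
    return sum(values) / len(values)


def naive_effect(rows):
    """Difference in mean outcome, ignoring the stratum entirely."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def stratified_effect(rows):
    """Within-stratum differences, weighted by stratum size."""
    weighted_sum, total = 0.0, 0
    for arms in group_by_stratum(rows).values():
        if not arms[True] or not arms[False]:
            continue  # no comparison possible here; report skipped strata separately
        n = len(arms[True]) + len(arms[False])
        weighted_sum += n * (mean(arms[True]) - mean(arms[False]))
        total += n
    if total == 0:
        raise ValueError("no stratum contains both conditions")
    return weighted_sum / total


def cell(stratum, treated, n, conversions):
    """Build n rows for one stratum and condition, with the given number of conversions."""
    return [(stratum, treated, 1.0 if i < conversions else 0.0) for i in range(n)]


# Invented data: the variant is shown mostly to paid traffic, which converts better anyway.
rows = (
    cell("paid", True, 80, 16) + cell("paid", False, 20, 4)
    + cell("organic", True, 20, 2) + cell("organic", False, 80, 8)
)

print(treated_share(rows))
print(round(naive_effect(rows), 3))
print(round(stratified_effect(rows), 3))
```

In this invented data the naive comparison shows a lift of about six percentage points, while the stratified estimate is zero: within each traffic source, the two conditions convert at the same rate. Stratification only corrects for confounders that were measured, so it reduces the problem rather than removing it.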

The goal is not to eliminate confounding — in complex organizations, some confounding is unavoidable. The goal is to measure it, report it, and be honest about what you can and cannot conclude from data that contains it.

What good looks like

The organizations that do this well treat their experimental results with productive skepticism. They ask not just "what did the experiment show?" but "what are the three most plausible confounds, and how much of the effect could each one explain?"

They build mechanisms for this: pre-registration templates, analysis checklists, review processes that specifically prompt analysts to consider alternative explanations before conclusions are written.

And they maintain an intellectual culture that can tolerate uncertainty — one that can say "we have a positive signal but we're not confident enough to scale yet, and here is why" without treating that honesty as a failure of the analysis.

The alternative — confident conclusions from confounded experiments, shipped at scale, with no mechanism for detection or correction — is what most organizations are doing.

It is more expensive than it looks.