Organizations run thousands of A/B tests every year and congratulate themselves on being data-driven. Most of those tests are statistically invalid. Here is why — and what rigorous experimentation actually requires.
DoOperator Research · May 27, 2026
Organizations run thousands of A/B tests every year and congratulate themselves on being data-driven. Product teams celebrate statistically significant results. Executives cite lift percentages in board decks. The implicit claim is that these experiments are producing reliable knowledge about what works.
Most of them are not.
The gap between running an experiment and running a valid experiment is wider than almost anyone admits. This gap is not a matter of effort or intention — it is structural, rooted in systematic misunderstandings of what statistical inference actually guarantees.
The most common mistake in A/B testing is also the most invisible one: looking at results before the experiment is complete and stopping when you see something you like.
This practice, often called peeking or optional stopping, violates a foundational assumption of fixed-sample frequentist hypothesis testing. A significance threshold of 0.05 means that, if the null hypothesis is true and you analyze the data exactly once at a predetermined sample size, you will see a result at least this extreme no more than 5% of the time by chance. That guarantee does not carry over to a live experiment where you checked results fourteen times and stopped on the third significant reading.
Simulation studies consistently find that repeatedly checking interim results and stopping when p < 0.05 can push false positive rates to 20–30% when checks are frequent, even though the nominal threshold is 5%. The inflation grows with the number of looks. The experiment looks like science. It produces the artifacts of science: a test statistic, a confidence interval, a p-value. But the inferential guarantees that give those numbers meaning have been violated.
The fix is either to commit to a fixed sample size in advance and not look until you reach it, or to use sequential testing methods — group sequential designs, alpha spending functions, or Bayesian adaptive approaches — that explicitly account for interim analyses. These methods exist. They are rarely used, because stopping early when things look good feels like efficiency, not bias.
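The inflation described above, and the effect of accounting for interim looks, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a production sequential design: both arms are drawn from the same unit-variance normal distribution (so every "significant" result is a false positive), looks are equally spaced, and the correction shown is a crude Bonferroni split of alpha across looks, which is more conservative than the group sequential and alpha spending methods named above. The function name and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks, per_look_alpha, batch_size=50,
                                n_sims=10000, seed=0):
    """Estimate how often a null A/B test is called significant at any look."""
    rng = random.Random(seed)
    # Two-sided critical value applied at every look.
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    # Difference between the two arms' batch sums is N(0, 2 * batch_size)
    # when each observation is N(0, 1) in both arms (no true effect).
    batch_sd = math.sqrt(2 * batch_size)
    false_positives = 0
    for _ in range(n_sims):
        diff_sum = 0.0
        n_per_arm = 0
        for _ in range(n_looks):
            diff_sum += rng.gauss(0, batch_sd)   # collect one more batch
            n_per_arm += batch_size
            z = diff_sum / math.sqrt(2 * n_per_arm)
            if abs(z) > z_crit:
                false_positives += 1
                break                            # stop at first "win"
    return false_positives / n_sims


# One planned analysis, fourteen naive peeks, fourteen corrected peeks.
print(peeking_false_positive_rate(1, 0.05))
print(peeking_false_positive_rate(14, 0.05))
print(peeking_false_positive_rate(14, 0.05 / 14))
```

Under these assumptions the single-look rate should sit near the nominal 5%, the naive fourteen-look rate should come out well above it, and the corrected rate should stay below 5%. The price of the Bonferroni split is lost power, which is the reason purpose-built sequential boundaries are usually preferred in practice.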
Organizations do not run one experiment. They run many, across product surfaces, user segments, time periods, and metric families. Each test is evaluated at a 5% significance threshold. If you run 20 independent tests of genuinely null effects, you should expect one false positive on average by chance alone, and the probability of seeing at least one is about 64%. If you run 100, you expect five.
In practice, this is compounded by within-experiment multiplicity: testing the primary metric, three secondary metrics, and four user segments simultaneously, then reporting whichever combination reached significance. The result is what statisticians call the garden of forking paths — an enormous space of analyses that could have been run, only one of which gets reported, selected precisely because it crossed the significance threshold.
This is not fraud. It is how most experimentation actually operates. The analysis is conducted in good faith. The problem is that "good faith" and "valid inference" are different standards, and the machinery of p-values and confidence intervals delivers the second only under conditions that most A/B testing programs do not meet.
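The arithmetic of multiplicity can be made concrete in a few lines. The sketch below computes the chance of at least one false positive across independent null tests, then applies the Holm step-down correction to a set of p-values. Holm is a standard technique swapped in here for illustration; the article does not prescribe a specific correction. The eight p-values are hypothetical, standing in for a primary metric, three secondary metrics, and four segment cuts.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted


print(round(familywise_error_rate(20), 3))   # 20 independent null tests
print(round(familywise_error_rate(100), 3))  # 100 independent null tests

# Hypothetical p-values from eight analyses of one experiment.
raw = [0.04, 0.30, 0.012, 0.51, 0.08, 0.65, 0.22, 0.91]
print([round(p, 3) for p in holm_adjust(raw)])
```

In this made-up example two of the raw p-values fall below 0.05, but neither survives adjustment, which is the situation the forking-paths problem describes: a result that looks significant only because it was one of many chances to find something.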
Suppose your experiment is validly designed and honestly reported. You achieve statistical significance with a measured effect of +8% on your primary metric. You ship the feature. What effect should you expect in production?
Less than 8%. Possibly much less. Possibly nothing.
This is the winner's curse: statistically significant results systematically overestimate true effect sizes. The intuition is simple. To cross a significance threshold, a measured effect must be large enough — relative to your sample size and variance — to be declared significant. In small samples, random noise makes some estimates too high and some too low. The ones that cross the significance threshold are disproportionately those inflated by noise. The true effect is smaller.
In large-scale replication studies, reported effect sizes often shrink by roughly 30–50% on average when experiments are independently repeated. This phenomenon is not a pathology of bad research. It is an expected consequence of using significance thresholds as filters.
The practical implication: significant results in underpowered experiments should be treated with particular skepticism. An experiment with 200 observations that finds a significant effect at p = 0.03 is more likely to be reporting an inflated estimate than one with 10,000 observations finding the same p-value, because in the small sample only an unusually large estimate can reach significance at all.
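The winner's curse is easy to reproduce in simulation. The following is a minimal sketch, assuming unit-variance outcomes, a normal approximation for the estimated lift, and a ship rule that only acts on significant results in the positive direction. The function name, the true effect of 0.1 standard deviations, and the sample sizes are all hypothetical choices for illustration.

```python
import math
import random
from statistics import NormalDist, mean


def winners_curse(true_effect, n_per_arm, n_sims=20000, alpha=0.05, seed=1):
    """Return (power, average estimate among significant positive results)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference in means with unit-variance outcomes.
    se = math.sqrt(2 / n_per_arm)
    significant = []
    for _ in range(n_sims):
        # Each experiment's estimate is the true effect plus sampling noise.
        estimate = rng.gauss(true_effect, se)
        if estimate / se > z_crit:       # significant and positive
            significant.append(estimate)
    power = len(significant) / n_sims
    avg = mean(significant) if significant else float("nan")
    return power, avg


for n in (200, 10000):
    power, avg = winners_curse(true_effect=0.1, n_per_arm=n)
    print(n, round(power, 2), round(avg / 0.1, 2))  # exaggeration ratio
```

Under these assumptions the small experiment rarely reaches significance, and when it does the average significant estimate is a large multiple of the true effect. The large experiment is almost always significant and its estimates average close to the truth. The filter, not the analyst, produces the exaggeration.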
None of this is an argument against experimentation. It is an argument against the theater of experimentation — the appearance of rigor without its substance.
Rigorous experimentation requires several things that most organizations skip:
Preregistration. Specify the primary metric, sample size, analysis plan, and stopping rule before you begin. This forces a distinction between confirmatory and exploratory conclusions that the data alone cannot provide.
Adequate power. Most A/B tests are underpowered. An experiment designed to detect a 5% lift when the true effect is 1% will mostly produce null results indistinguishable from a genuine null effect. The result is an organization that thinks it has tested something when it has mostly generated noise.
Honest null results. The standard in most experimentation programs is to ship when significant and shelve when not. This creates a biased record of what has been tried and learned. Null results are information. An experiment that finds no effect — validly, with adequate power — is telling you something important.
Replication. A single significant result is weak evidence. A replicated significant result in an independent experiment is strong evidence. Few organizations have the discipline to require replication before acting on experimental findings.
These requirements are standard practice in clinical medicine, psychology replication efforts, and every field that has confronted the reproducibility crisis. The lesson from those fields is not that experimentation is unreliable — it is that experimentation without methodological discipline produces unreliable results that feel reliable.
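The adequate-power requirement above can be checked before an experiment starts. The sketch below uses the standard normal-approximation formula for a two-sided, two-proportion z-test; it is an approximation, not a replacement for a full power analysis. The 10% baseline conversion rate and the function names are assumptions introduced for illustration.

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def achieved_power(baseline_rate, relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of the same test at a given per-arm sample size."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se
    # Probability of landing in either rejection region.
    return _norm.cdf(shift - z_alpha) + _norm.cdf(-shift - z_alpha)


n_for_5pct = sample_size_per_arm(0.10, 0.05)   # sized for a 5% lift
n_for_1pct = sample_size_per_arm(0.10, 0.01)   # what a 1% lift would need
print(n_for_5pct, n_for_1pct)
print(round(achieved_power(0.10, 0.01, n_for_5pct), 3))
```

With these assumed inputs, detecting a 1% relative lift needs roughly 25 times the sample required for a 5% lift, and an experiment sized for the 5% lift has under a 10% chance of detecting a true 1% effect. A null result from that experiment says very little about whether the feature works.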
The prevalence of invalid A/B testing is not primarily a statistical literacy problem. It is an incentive problem. Null results do not get celebrated in team meetings. Stopping an experiment because it failed to recruit enough participants looks like weakness. Preregistration requires committing to an answer before you see the data, which feels less flexible than the alternative.
Experimentation done well is slower, more expensive, and produces more uncertainty than experimentation done carelessly. The careless version produces confident, reportable numbers. The careful version produces honest, calibrated conclusions. Organizations that genuinely want to learn — rather than to produce the appearance of learning — have to choose the harder path.
The methods exist. The choice is whether to use them.