DoOperator Research · May 14, 2026
A/B tests are the gold standard of evidence-based decision making. Yet most experiments fail—not because the treatment didn't work, but because the design was broken from the start. Wrong sample size, peeking at results, sample ratio mismatches, and missing guardrails silently invalidate thousands of experiments every day. The result? Confident decisions built on statistical illusions.
Here’s how to design experiments that actually work, structured across the three critical phases: pre-experiment, during, and post-experiment.
The most common mistake in experiment design is asking: “How many users do I need?” The right question is: “What’s the smallest effect worth acting on?”
Minimum Detectable Effect (MDE) is the smallest true lift your test can reliably detect given your sample size, significance level (α), and statistical power (1−β). But MDE isn’t a mathematical given—it’s a business decision.
Work backwards:
If your business MDE is a 1% lift but your sample can only detect a 5% lift, you have two honest options: collect more data (more traffic or a longer run), or accept that the test cannot answer the question you care about. Most teams choose neither and run the test anyway. That’s how “statistically significant” results emerge from underpowered designs: when such a test does cross the significance threshold, the result is disproportionately likely to be a false positive or an exaggerated effect size.
Key insight: Use power calculators (e.g., Evan Miller’s, R’s pwr package) before you randomize a single user. And remember: power is about planning, not post-hoc justification.
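To make the planning step concrete, here is a minimal sketch of the standard normal-approximation sample size formula for comparing two conversion rates. The function name, the 10% baseline rate, and the defaults (two-sided α = 0.05, 80% power) are illustrative assumptions, not values from any particular tool; a dedicated power calculator may apply slightly different corrections.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate under the smallest lift worth acting on
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate; compare a 1% and a 5% relative MDE.
n_small_lift = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.01)
n_large_lift = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.05)
print(f"1% MDE: {n_small_lift:,} per arm; 5% MDE: {n_large_lift:,} per arm")
```

Because the required sample grows with the inverse square of the effect, shrinking the MDE from 5% to 1% multiplies the traffic you need by more than twenty.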
You’ve launched the experiment. On day three, you check the dashboard. The p-value is 0.04. Excited, you call the team: “We have a winner!”
Stop. That p-value is meaningless.
Every time you peek at a fixed-horizon test and are prepared to stop on a significant result, you give chance another opportunity to produce a false positive. With enough looks, the actual Type I error rate can exceed 30%, even at a nominal α=0.05. This is the peeking problem, and it’s the most widespread malpractice in industry experimentation.
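The inflation is easy to reproduce. The sketch below simulates A/A tests (no true effect) in which an analyst runs a z-test after every batch and stops at the first p < 0.05. The batch size, number of looks, and unit-variance noise are assumptions chosen for illustration; to keep it fast, each batch is drawn directly as the difference of two batch sums.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, batch_size=100, alpha=0.05, simulations=2000, seed=0):
    """Share of A/A tests declared significant at ANY of `looks` interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        cumulative_diff = 0.0  # sum(treatment) - sum(control); the true effect is zero
        for look in range(1, looks + 1):
            # Difference of two batch sums of unit-variance noise ~ N(0, 2 * batch_size).
            cumulative_diff += rng.gauss(0.0, sqrt(2 * batch_size))
            n_per_arm = look * batch_size
            z = cumulative_diff / sqrt(2 * n_per_arm)
            if abs(z) > z_crit:  # the analyst stops at the first "significant" peek
                false_positives += 1
                break
    return false_positives / simulations


rate_single_look = false_positive_rate(looks=1)
rate_many_looks = false_positive_rate(looks=100)
print(f"one look: {rate_single_look:.3f}, 100 looks: {rate_many_looks:.3f}")
```

A single look at the end stays near the nominal 5%, while checking after each of 100 batches pushes the false positive rate several times higher.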
The solution: Sequential (always-valid) tests.
Methods such as the mSPRT (mixture Sequential Probability Ratio Test) produce always-valid p-values, which let you monitor results continuously without inflating the Type I error rate. You can stop early if the effect is clear, or keep running if it’s not, all while maintaining correct statistical guarantees.
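As a rough illustration of how an always-valid p-value can be computed, here is a sketch of the mSPRT for normally distributed differences with a normal mixing prior over the effect. It assumes the variance of each difference is known (in practice it would be estimated) and that `tau_sq`, the prior variance, is picked by the analyst; both are simplifying assumptions, and production implementations differ in detail.

```python
import random
from math import exp, log


def always_valid_p_values(diffs, variance, tau_sq):
    """Running mSPRT p-values for H0: mean(diffs) == 0.

    diffs    : differences (treatment_i - control_i), in arrival order
    variance : variance of a single difference, assumed known here
    tau_sq   : variance of the N(0, tau_sq) mixing prior over the true effect
    """
    p_values, p, running_sum = [], 1.0, 0.0
    for n, d in enumerate(diffs, start=1):
        running_sum += d
        mean = running_sum / n
        scale = variance + n * tau_sq
        # Log of the mixture likelihood ratio: "effect ~ N(0, tau_sq)" vs "effect = 0".
        log_lr = 0.5 * log(variance / scale) + (n * mean) ** 2 * tau_sq / (2 * variance * scale)
        p = min(p, exp(-log_lr))  # the p-value may only go down, never back up
        p_values.append(p)
    return p_values


rng = random.Random(7)
effect_stream = [rng.gauss(0.3, 1.0) for _ in range(2000)]  # simulated true lift of 0.3
p_path = always_valid_p_values(effect_stream, variance=1.0, tau_sq=0.1)
first_hit = next((i + 1 for i, p in enumerate(p_path) if p < 0.05), None)

# A/A check: how often does ANY look dip below 0.05 when there is no effect?
null_hits = sum(
    min(always_valid_p_values([rng.gauss(0.0, 1.0) for _ in range(500)], 1.0, 0.1)) < 0.05
    for _ in range(300)
)
print(f"stopped at n = {first_hit}; A/A false-alarm rate = {null_hits / 300:.3f}")
```

Unlike the fixed-horizon p-value, this one can be checked after every observation: the chance that it ever drops below α when there is no effect is bounded by α. The price is that it needs more data than a fixed-horizon test to detect the same effect.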
Several commercial experimentation platforms support sequential testing; Optimizely’s Stats Engine, for example, is built on the mSPRT. If your tool doesn’t support sequential testing, set a fixed sample size and do not look until data collection is complete. Use a “time-locked” dashboard or a trusted colleague to enforce discipline.
Sample Ratio Mismatch (SRM) occurs when the observed split of users between control and treatment differs significantly from the expected split (e.g., 50/50). At large sample sizes, even a 0.5% deviation is far more than chance can explain, and it can invalidate your results.
Why SRM happens:
How to detect SRM:
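Run a chi-squared goodness-of-fit test comparing observed vs. expected counts. A very small p-value means the observed split is implausible under correct randomization. Because this check runs on every experiment, many teams use a strict threshold such as p < 0.001 rather than 0.05 to avoid constant false alarms. If the check fires, stop the test and do not analyze results: any observed effect could be due to the imbalance, not the treatment.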
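A two-arm SRM check needs nothing beyond the standard library. The sketch below assumes exactly two variants and a known intended split; the function name and the example counts are hypothetical. With one degree of freedom, the chi-squared tail probability has a closed form via the complementary error function.

```python
from math import erfc, sqrt


def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_count + treatment_count
    expected = (total * expected_control_share, total * (1 - expected_control_share))
    observed = (control_count, treatment_count)
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_sq / 2))


# The same 50.25% / 49.75% split at two different sample sizes.
p_small = srm_p_value(5_025, 4_975)
p_large = srm_p_value(502_500, 497_500)
print(f"10k users: p = {p_small:.3f}; 1M users: p = {p_large:.2e}")
```

The identical percentage imbalance is unremarkable at ten thousand users and overwhelming evidence of a broken split at a million, which is why the check should be a statistical test on counts rather than a fixed percentage rule.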
Pro tip: Build an automated SRM check into your experiment pipeline. It should fire within hours of launch, not days. And never trust a result from a test with a detected SRM—even if the p-value looks great.
You improved conversion by 3% (p=0.01). Time to ship, right? Maybe not. What if that improvement came at the cost of a 10% increase in page load time, a 5% drop in user retention, or a spike in customer support tickets?
Guardrail metrics are the unsung heroes of experiment design. These are secondary metrics that you monitor to ensure the treatment doesn’t cause unintended harm. They’re not your primary hypothesis—they’re your safety net.
Examples:
How to set guardrails:
Guardrails prevent the silent accumulation of technical debt and user harm. They transform experiments from isolated metric tests into holistic product evaluations.
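One possible way to turn a guardrail into a ship/no-ship rule is a non-inferiority check: compute a one-sided confidence bound on the harm and require it to stay within a tolerance agreed in advance. This is a sketch of that one approach, not a prescribed method; the function name, the `max_harm` tolerance, and the example numbers are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def guardrail_ok(control_mean, control_se, treatment_mean, treatment_se,
                 max_harm, higher_is_worse=True, alpha=0.05):
    """True if the one-sided confidence bound on harm stays within `max_harm`."""
    harm = treatment_mean - control_mean
    if not higher_is_worse:
        harm = -harm  # e.g. retention, where a drop is the harmful direction
    se_diff = sqrt(control_se ** 2 + treatment_se ** 2)
    worst_case_harm = harm + NormalDist().inv_cdf(1 - alpha) * se_diff
    return worst_case_harm <= max_harm


# Page load time in ms (higher is worse), tolerating at most 20 ms of slowdown.
load_time_small_change = guardrail_ok(1000, 5, 1002, 5, max_harm=20)
load_time_ten_percent = guardrail_ok(1000, 5, 1100, 5, max_harm=20)
# Retention rate (lower is worse), tolerating at most a 1 point drop.
retention_drop = guardrail_ok(0.40, 0.004, 0.38, 0.004, max_harm=0.01,
                              higher_is_worse=False)
print(load_time_small_change, load_time_ten_percent, retention_drop)  # True False False
```

Framing the guardrail this way puts the burden of proof on the treatment: a noisy, inconclusive guardrail fails the check instead of passing by default.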
Before you randomize a single user, run through this checklist:
A well-designed experiment is a commitment to truth, not convenience. It starts with a business-driven MDE, continues with disciplined monitoring (or sequential testing), and ends with a holistic view that includes guardrails and effect sizes—not just p-values.
Most A/B tests fail silently because their design was broken before the first user was randomized. Don’t let yours be one of them. Build the foundation, respect the process, and ship decisions you can defend—not just with statistics, but with good science.
Now go run an experiment that actually works.