
Experiment Design That Actually Works: From Power Calculations to Guardrail Metrics

DoOperator Research · May 14, 2026

A/B tests are the gold standard of evidence-based decision making. Yet most experiments fail—not because the treatment didn't work, but because the design was broken from the start. Wrong sample size, peeking at results, sample ratio mismatches, and missing guardrails silently invalidate thousands of experiments every day. The result? Confident decisions built on statistical illusions.

Here’s how to design experiments that actually work, structured across the three critical phases: pre-experiment, during, and post-experiment.


Phase 1: Pre-Experiment — Build the Foundation

1. Power and MDE: Start with Business Significance, Not Statistical Significance

The most common mistake in experiment design is asking: “How many users do I need?” The right question is: “What’s the smallest effect worth acting on?”

Minimum Detectable Effect (MDE) is the smallest true lift your test can reliably detect given your sample size, significance level (α), and statistical power (1−β). But MDE isn’t a mathematical given—it’s a business decision.

Work backwards:

  • Define the smallest revenue lift that justifies the engineering cost of shipping the feature.
  • Define the smallest conversion drop that would make you pause a rollout.
  • Then calculate the required sample size to detect that effect with 80% power at α=0.05.
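
The last step can be sketched in a few lines. This is a minimal sketch, assuming a conversion-rate metric, a two-sided two-proportion z-test, and equal-sized arms; the function name and the example numbers (10% baseline, 5% relative MDE) are illustrative, not taken from any particular tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.10
    relative_mde:  smallest relative lift worth detecting, e.g. 0.05 for +5%
    """
    p_control = baseline_rate
    p_treatment = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_treatment - p_control) ** 2
    return ceil(n)


# Hypothetical inputs: 10% baseline conversion, 5% relative MDE.
n_needed = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.05)
print(f"Users needed per arm: {n_needed:,}")
```

Note how quickly the requirement grows: halving the MDE roughly quadruples the sample size, because the effect enters the formula squared.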

If your business MDE is 1% but your sample size can only detect 5%, you have two honest options: get more data (more traffic or a longer run), or accept that the test cannot answer your question. Most teams choose neither and run the test anyway. That’s how misleading “statistically significant” results emerge from underpowered designs: when a low-powered test does cross the significance threshold, the result is disproportionately likely to be a false positive or an exaggerated estimate of the true effect.
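
To check which situation you are in, you can invert the power calculation and ask what lift your available traffic can detect. A rough sketch, assuming a conversion-rate metric and approximating both arms with the baseline variance; the function name and the traffic figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline_rate, n_per_arm, alpha=0.05, power=0.80):
    """Approximate relative MDE for a two-sided two-proportion z-test.

    Assumption: both arms have roughly the baseline variance, which is
    reasonable when the detectable lift is small relative to the baseline.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    std_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    absolute_mde = (z_alpha + z_beta) * std_error
    return absolute_mde / baseline_rate


# Hypothetical inputs: 10% baseline, 5,000 users per arm, 1% business MDE.
business_mde = 0.01
achievable = detectable_relative_lift(baseline_rate=0.10, n_per_arm=5_000)
print(f"Detectable relative lift: {achievable:.1%}")
print("Adequately powered" if achievable <= business_mde else "Underpowered")
```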

Key insight: Use power calculators (e.g., Evan Miller’s, R’s pwr package) before you randomize a single user. And remember: power is about planning, not post-hoc justification.


Phase 2: During the Experiment — Avoid the Traps

2. The Peeking Problem and Why Fixed-Horizon Tests Break

You’ve launched the experiment. On day three, you check the dashboard. The p-value is 0.04. Excited, you call the team: “We have a winner!”

Stop. That p-value is meaningless.

Every time you peek at a fixed-horizon test and are prepared to stop on a significant result, you give random noise another chance to cross the threshold, which inflates the probability of a false positive. With continuous monitoring, the actual Type I error rate can exceed 30%, even with a nominal α=0.05. This is the peeking problem, and it’s the most widespread malpractice in industry experimentation.
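
You can see the inflation in a small simulation. The sketch below runs A/A tests (no true effect) and compares two policies: declare a winner the first time any look is significant, versus test once at the end. It assumes equal-sized batches between looks and normally distributed batch statistics; the number of looks and simulations are arbitrary choices.

```python
import random
from math import sqrt


def false_positive_rates(n_sims=4000, n_looks=20, z_crit=1.96, seed=0):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each batch contributes an independent standard-normal z-score
            # under the null; the cumulative z-score is the scaled sum.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit  # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"False positive rate with peeking:    {peeking_rate:.1%}")
print(f"False positive rate with one look:   {fixed_rate:.1%}")
```

The single-look rate stays near the nominal 5%, while the peeking rate climbs with every additional look.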

The solution: Sequential (always-valid) tests.
Methods like the mSPRT (mixture Sequential Probability Ratio Test) or Always-Valid p-values allow you to monitor results continuously without inflating error rates. You can stop early if the effect is clear, or keep running if it’s not—all while maintaining correct statistical guarantees.
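
For intuition, here is a minimal sketch of an always-valid p-value built from the mSPRT. It assumes a stream of paired differences (treatment minus control) that are normally distributed with known variance, and a normal mixing distribution with variance `tau_sq`, which is a tuning parameter chosen arbitrarily here. Production implementations handle unknown variance, unequal arms, and non-normal metrics, so treat this as an illustration of the idea rather than a drop-in tool.

```python
import random
from math import exp, log, sqrt


def always_valid_p_values(differences, variance, tau_sq=1.0):
    """Running always-valid p-values for H0: mean difference = 0.

    differences: iterable of per-pair differences (treatment - control)
    variance:    known variance of a single difference
    tau_sq:      variance of the normal mixing distribution (assumed value)
    """
    p_values = []
    p = 1.0
    total = 0.0
    for n, d in enumerate(differences, start=1):
        total += d
        denom = variance + n * tau_sq
        # Log of the mixture likelihood ratio for normal data, normal mixture.
        log_lr = 0.5 * log(variance / denom) + tau_sq * total ** 2 / (2 * variance * denom)
        # The p-value is the running minimum of 1 / likelihood ratio.
        p = min(p, exp(-log_lr))
        p_values.append(p)
    return p_values


# Simulated stream with a true effect (hypothetical numbers, unit variance per arm).
rng = random.Random(7)
stream = [rng.gauss(1.0, sqrt(2.0)) for _ in range(300)]
p_path = always_valid_p_values(stream, variance=2.0)
stop_at = next((i for i, p in enumerate(p_path, start=1) if p < 0.05), None)
print(f"First observation at which p < 0.05: {stop_at}")
```

Because the p-value never increases, you can check it after every observation and stop the first time it drops below α.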

Some commercial platforms, such as Optimizely’s Stats Engine, are built on these methods. If your tool doesn’t support sequential testing, set a fixed sample size and do not look until the data collection is complete. Use a “time-locked” dashboard or a trusted colleague to enforce discipline.


3. Sample Ratio Mismatch (SRM): The Silent Invalidator

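SRM occurs when the observed split of users between control and treatment differs significantly from the expected split (e.g., 50/50). How large a deviation matters depends on sample size: with a few thousand users, a 0.5% deviation is ordinary noise, while with millions of users it is a strong signal that something is broken and the results cannot be trusted.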

Why SRM happens:

  • Bugs in randomization logic (e.g., hashing errors, caching)
  • Differential attrition (one group drops out more)
  • Network effects (users in the same household being split across variants)
  • Cookie or device ID churn

How to detect SRM:
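Run a chi-squared goodness-of-fit test comparing observed vs. expected counts. A very small p-value is strong evidence that assignment or logging is broken. Because this check runs on every experiment, many teams use a strict threshold such as p < 0.001 rather than 0.05 to limit false alarms. If the check fails, stop the test and do not analyze results: any observed effect could be due to the imbalance, not the treatment.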
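
The check itself is short. A sketch for the two-variant case, using only the standard library: with one degree of freedom the chi-squared tail probability can be written with the complementary error function. The function names are illustrative, and the strict default threshold is an assumption you should tune to your own tolerance for false alarms.

```python
from math import erfc, sqrt


def srm_p_value(control_count, treatment_count, expected_treatment_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-variant split."""
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi_sq = (
        (control_count - expected_control) ** 2 / expected_control
        + (treatment_count - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return erfc(sqrt(chi_sq / 2))


def has_srm(control_count, treatment_count, threshold=0.001):
    """Flag a mismatch; the threshold is an assumed, deliberately strict default."""
    return srm_p_value(control_count, treatment_count) < threshold


# The same 50.5 / 49.5 split at two hypothetical scales.
print(has_srm(990, 1_010))          # small test
print(has_srm(990_000, 1_010_000))  # large test
```

The same percentage deviation is unremarkable in the small test and damning in the large one, which is why the check has to be a statistical test rather than a fixed percentage rule.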

Pro tip: Build an automated SRM check into your experiment pipeline. It should fire within hours of launch, not days. And never trust a result from a test with a detected SRM—even if the p-value looks great.


Phase 3: Post-Experiment — Don’t Stop at the Primary Metric

4. Guardrail Metrics: Protect Against Hidden Harm

You improved conversion by 3% (p=0.01). Time to ship, right? Maybe not. What if that improvement came at the cost of a 10% increase in page load time, a 5% drop in user retention, or a spike in customer support tickets?

Guardrail metrics are the unsung heroes of experiment design. These are secondary metrics that you monitor to ensure the treatment doesn’t cause unintended harm. They’re not your primary hypothesis—they’re your safety net.

Examples:

  • For a checkout flow change: guardrail on error rate, page load time, and support contact rate
  • For a recommendation algorithm: guardrail on content diversity, user session length, and churn
  • For a pricing test: guardrail on refund rate, negative feedback, and repeat purchase rate

How to set guardrails:

  • Define thresholds for “acceptable degradation” (e.g., page load time increase < 200ms)
  • Run guardrail tests with higher α (e.g., 0.10) to catch early warning signs, accepting more false alarms in exchange for better sensitivity to harm
  • If a guardrail metric is statistically significant and practically harmful, do not ship—even if the primary metric wins
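
The decision rule in the last bullet can be written down explicitly. This is a sketch under simplifying assumptions: the guardrail difference is approximately normal, you already have its standard error, and the metric is oriented so that a positive difference means “worse”. The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def guardrail_blocks_ship(delta, std_error, harm_threshold, alpha=0.10):
    """Return True if the guardrail should block the launch.

    delta:          treatment minus control, oriented so positive = worse
    std_error:      standard error of delta
    harm_threshold: smallest degradation considered practically harmful
    alpha:          significance level for the guardrail (looser than 0.05)
    """
    z = delta / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    statistically_significant = delta > 0 and p_value < alpha
    practically_harmful = delta >= harm_threshold
    return statistically_significant and practically_harmful


# Hypothetical page-load guardrail with a 200 ms acceptable-degradation limit.
print(guardrail_blocks_ship(delta=250, std_error=50, harm_threshold=200))   # large, clear slowdown
print(guardrail_blocks_ship(delta=50, std_error=10, harm_threshold=200))    # real but tolerable
print(guardrail_blocks_ship(delta=250, std_error=400, harm_threshold=200))  # too noisy to call
```

The third case deserves attention in practice: a large but noisy estimate does not block the launch under this rule, yet it also does not show the feature is safe, so it may warrant a longer run.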

Guardrails prevent the silent accumulation of technical debt and user harm. They transform experiments from isolated metric tests into holistic product evaluations.


Common Mistakes That Still Happen

  1. Under-powered tests: Running a two-week test when you need six weeks of traffic. Result: noisy data, false negatives, or inflated effect sizes.
  2. Testing on convenience samples: Using only “power users” or “new users” when the feature affects everyone. Your results won’t generalize.
  3. Ignoring network effects: In social products, splitting users within the same network (e.g., Facebook friends) violates the independence assumption. Use cluster-randomized designs instead.
  4. Over-relying on p-values: A p-value of 0.04 with a tiny effect size is less actionable than a p-value of 0.06 with a large, business-meaningful effect. Focus on effect size and confidence intervals.
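
To act on the last point, report the estimated lift with its confidence interval alongside the p-value. A minimal sketch for a conversion metric, using the normal (Wald) approximation for a difference in proportions; the function name and the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def lift_with_ci(conv_control, n_control, conv_treatment, n_treatment, confidence=0.95):
    """Absolute lift in conversion rate with a normal-approximation interval."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c
    std_error = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * std_error, diff + z * std_error


# Hypothetical result: 1,000 vs 1,100 conversions out of 10,000 users each.
diff, low, high = lift_with_ci(1_000, 10_000, 1_100, 10_000)
print(f"Absolute lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

Compare the whole interval, not just the point estimate, against your business MDE: an interval that is mostly below the MDE tells you the feature is probably not worth shipping even if the result is significant.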

Practical Guidance: A 5-Step Pre-Experiment Checklist

Before you randomize a single user, run through this checklist:

  1. Define the business MDE. What’s the smallest effect that would change your decision?
  2. Calculate required sample size. Use a power calculator with α=0.05 and power=0.80.
  3. Check randomization integrity. Run an A/A test (both groups get the same experience) to verify that the split matches the intended ratio and that the pipeline does not report spurious differences.
  4. Set guardrail thresholds. List 3–5 secondary metrics with acceptable degradation limits.
  5. Choose your testing method. If you must monitor continuously, use sequential testing. If not, commit to a fixed horizon and no peeking.
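
For step 3, a quick offline check can catch assignment bugs before any real traffic is involved. The sketch below assumes a common pattern, deterministic hash-based bucketing keyed on an experiment-specific salt and the user ID; the function name, the salt, and the synthetic user IDs are all hypothetical, and your platform’s assignment logic may differ.

```python
import hashlib


def assign_variant(user_id, experiment_salt, treatment_share=0.5):
    """Deterministically map a user to a variant via a salted hash."""
    key = f"{experiment_salt}:{user_id}".encode()
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Dry run on synthetic user IDs to confirm the split is close to 50/50.
counts = {"control": 0, "treatment": 0}
for user_id in range(100_000):
    counts[assign_variant(user_id, "aa-test-demo")] += 1

treatment_fraction = counts["treatment"] / sum(counts.values())
print(counts, f"treatment share = {treatment_fraction:.2%}")
```

Salting by experiment keeps assignments independent across experiments, and determinism means a returning user always sees the same variant. A dry run like this only validates the hashing step; a live A/A test is still needed to catch logging, caching, and attrition problems.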

The Bottom Line

A well-designed experiment is a commitment to truth, not convenience. It starts with a business-driven MDE, continues with disciplined monitoring (or sequential testing), and ends with a holistic view that includes guardrails and effect sizes—not just p-values.

Most A/B tests fail silently because their design was broken before the first user was randomized. Don’t let yours be one of them. Build the foundation, respect the process, and ship decisions you can defend—not just with statistics, but with good science.

Now go run an experiment that actually works.
