The Statistical Foundations of Rigorous Experimentation

Most experiments fail not because the product didn't work, but because the analysis was wrong. I've seen teams celebrate a 12% lift at p=0.04, only to watch the same metric revert to baseline in production. I've watched analysts run 47 metrics, highlight the three that reached significance, and present them as "key learnings." The problem isn't the experiment—it's the statistical scaffolding around it. Here's the foundation every practitioner needs.

1. Frequentist vs. Bayesian Inference

Frequentist: You ask "If the null hypothesis is true, how unlikely is this data?" The answer is a p-value. It doesn't tell you the probability the treatment works. It tells you the probability of seeing data at least this extreme if the null were true. This is subtle and often misinterpreted. Use it when you need a sharp decision boundary, regulatory compliance, or a widely understood standard.

Bayesian: You ask "Given my prior beliefs and this data, what's my updated belief about the treatment effect?" The answer is a posterior distribution. It gives you direct probabilities: "There's an 87% chance the true lift is between 1% and 4%." Use it when you have strong priors (e.g., from previous experiments), when you need to make decisions under uncertainty, or when you want to communicate intuitive results.
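To make the posterior idea concrete, here is a minimal sketch for conversion data, assuming independent Beta priors on each arm's rate (a uniform Beta(1, 1) by default). The function name, the counts, and the number of Monte Carlo draws are illustrative assumptions, not something any particular platform prescribes.

```python
import random

def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t,
                                 prior_a=1.0, prior_b=1.0,
                                 draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_t > rate_c | data) under independent Beta priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # A Beta prior plus binomial data gives a Beta posterior (conjugacy).
        p_c = rng.betavariate(prior_a + conv_c, prior_b + n_c - conv_c)
        p_t = rng.betavariate(prior_a + conv_t, prior_b + n_t - conv_t)
        wins += p_t > p_c
    return wins / draws

# Hypothetical counts: 480/10,000 conversions in control, 540/10,000 in treatment.
print(prob_treatment_beats_control(conv_c=480, n_c=10000, conv_t=540, n_t=10000))
```

If you have results from earlier experiments, you would encode them through prior_a and prior_b instead of the uniform default.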

When to use which: Frequentist for high-stakes decisions with strict alpha control (clinical trials, financial audits). Bayesian for iterative product experiments where you want to incorporate prior knowledge and make continuous decisions. Many modern platforms mix both: frequentist for hypothesis testing, Bayesian for effect size estimation.

2. Multiple Testing

Running 20 independent metrics at α=0.05 when no real effects exist gives you a 64% chance of at least one false positive (1 - 0.95^20 ≈ 0.64). This is not a theoretical nicety; it's a daily threat.

Bonferroni correction: Divide α by the number of tests. Simple, safe, brutally conservative. α=0.05/20 = 0.0025. You lose power but control family-wise error rate (FWER). Use when false positives are catastrophic.
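The arithmetic behind the 64% figure and the Bonferroni threshold can be checked in a few lines. This is a sketch that assumes the tests are independent and every null is true; the function names are my own.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests when every null is true."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i <= alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

print(round(family_wise_error_rate(0.05, 20), 3))       # uncorrected, 20 metrics
print(round(family_wise_error_rate(0.05 / 20, 20), 3))  # per-test alpha of 0.0025
```

The second number lands just under 0.05, which is the point: the per-test threshold is tightened until the family as a whole stays at the nominal level.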

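Benjamini-Hochberg (BH-FDR): Control the false discovery rate (FDR), the expected proportion of false positives among rejected hypotheses. Sort the m p-values in ascending order, find the largest rank i whose p-value is at or below (i/m)*α (here α is the target FDR level, often written q), and reject every hypothesis up to that rank. Less conservative than Bonferroni, with more power. Use when you want to discover signals and can tolerate some false positives.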
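The step-up logic is easy to get subtly wrong, so here is a minimal sketch of the procedure in plain Python. The function name and the return format (one boolean per input p-value) are my own choices.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max, even if some
    # intermediate p-value sat above its own threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Note the last step: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list clears its threshold.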

Practical rule: If you're running an exploratory analysis with 50 metrics, use BH-FDR at q=0.1. If you're running a confirmatory analysis with 3 primary metrics, use Bonferroni at α=0.05/3.

3. Sequential Testing

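The "peeking problem": checking results daily and stopping the first time p<0.05 inflates Type I error far above the nominal 5%. With around 20 looks it reaches roughly 25%, and it keeps climbing the more often you check. You're not testing once; you're testing at every peek.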
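You can see the inflation for yourself with a small A/A simulation. This sketch assumes unit-variance observations, equally spaced peeks, and a naive z-test at each look; all parameter names and defaults are illustrative.

```python
import random
from statistics import NormalDist

def false_positive_rate_with_peeking(n_peeks, n_per_peek=100, sims=20000, seed=1):
    """Simulate A/A tests (no true effect) and stop at the first peek with |z| > 1.96."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    false_positives = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(n_peeks):
            # The sum of n_per_peek independent N(0, 1) draws is N(0, n_per_peek),
            # so one draw stands in for a whole batch of new observations.
            total += rng.gauss(0.0, n_per_peek ** 0.5)
            n += n_per_peek
            if abs(total / n ** 0.5) > z_crit:
                false_positives += 1
                break
    return false_positives / sims

print(false_positive_rate_with_peeking(n_peeks=1))   # a single look at the end
print(false_positive_rate_with_peeking(n_peeks=20))  # twenty looks, stop at first "win"
```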

Always-valid p-values: Methods like the mixture Sequential Probability Ratio Test (mSPRT) and e-values maintain valid inference under continuous monitoring. They produce "always-valid" p-values that don't require a fixed sample size. The trade-off: at any given sample size they have less power than a fixed-horizon test, so small effects take longer to confirm.

SPRT: Sequential Probability Ratio Test. Efficient for large effects. Requires specifying an effect size of interest. Use when you have a clear minimum detectable effect.
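As an illustration, here is a sketch of Wald's SPRT for a stream of 0/1 outcomes, testing a baseline rate p0 against a single alternative p1. The boundaries use Wald's standard approximations; the function name, the return format, and the default alpha and beta are assumptions for the example.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.

    Returns (decision, n_used), where decision is "reject_h0", "accept_h0",
    or "continue" if the stream ended before either boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above: reject H0
    lower = math.log(beta / (1 - alpha))   # crossing below: accept H0
    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Add this observation's log-likelihood ratio.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n
```

The need to pick p1 up front is exactly the "effect size of interest" requirement: the test is tuned to that one alternative.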

mSPRT: More robust than SPRT because it doesn't require a single effect size. Mixes over a range of alternatives. Widely used in tech (Microsoft, Netflix, Amazon).
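For intuition, here is a sketch of the mSPRT in its simplest closed-form case: observations assumed normal with known variance sigma2, and a normal mixing distribution with variance tau2 centered on the null value. Real implementations estimate the variance and tune tau2; both are taken as given here, and the function name is my own.

```python
import math

def msprt_always_valid_p(diffs, sigma2, tau2, theta0=0.0):
    """Always-valid p-values from a normal-mixture SPRT.

    diffs : stream of observations assumed N(theta, sigma2) with known sigma2
    tau2  : variance of the N(theta0, tau2) mixing distribution over alternatives
    Returns the running p-value after each observation.
    """
    p, total, out = 1.0, 0.0, []
    for n, x in enumerate(diffs, start=1):
        total += x
        mean = total / n
        # Closed-form mixture likelihood ratio for the normal/normal case.
        log_lr = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                  + (n ** 2 * tau2 * (mean - theta0) ** 2)
                  / (2 * sigma2 * (sigma2 + n * tau2)))
        # The p-value is the running minimum of 1 / LR, capped at 1.
        p = min(p, math.exp(-log_lr))
        out.append(p)
    return out
```

Because the p-value is a running minimum, it never goes back up, and you can stop the first time it drops below α without inflating Type I error.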

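E-values: A newer framework. An e-value is a nonnegative statistic whose expected value is at most 1 when the null is true, so large values count as evidence against the null: an e-value of 10 or more turns up with probability at most 10% under the null. E-values from independent experiments combine by simple multiplication, and sequentially constructed e-values (e-processes) stay valid under optional stopping. Still emerging but promising.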
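For readers who want the definition behind the "e-value of 10" reading, here it is in symbols. The notation (E for the e-value, H_0 for the null, k independent experiments) is my own shorthand.

```latex
% E is an e-value for the null H_0 if it is nonnegative and
\mathbb{E}_{H_0}[E] \le 1 .
% Markov's inequality then bounds the chance of a large value under the null:
\Pr_{H_0}\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha ,
\qquad \text{so } E \ge 10 \text{ occurs with probability at most } 0.10 .
% Independent e-values E_1, \dots, E_k for the same null combine by multiplication:
\mathbb{E}_{H_0}\!\Big[\prod_{j=1}^{k} E_j\Big]
  = \prod_{j=1}^{k} \mathbb{E}_{H_0}[E_j] \le 1 .
```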

Practical rule: Don't peek without a sequential method. If you must peek, use mSPRT. If you can't implement it, commit to a fixed sample size and don't look until the end.

4. Effect Size vs. p-value

p=0.03 tells you something is probably not zero. It tells you nothing about whether that something matters.

Standardized effect sizes: Cohen's d (mean difference / pooled standard deviation), Hedges' g (bias-corrected Cohen's d), or the raw lift in business units. d=0.2 is "small," 0.5 "medium," 0.8 "large." But context matters: a 0.1% lift in revenue on a billion-dollar platform is enormous; a 5% lift in a niche metric may be noise.
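Both standardized measures take only a few lines to compute. This sketch uses the common approximate correction factor for Hedges' g; the function names are my own.

```python
from statistics import mean, variance

def cohens_d(treatment, control):
    """Mean difference divided by the pooled (sample) standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = ((n_t - 1) * variance(treatment)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5

def hedges_g(treatment, control):
    """Cohen's d multiplied by the usual approximate small-sample correction."""
    df = len(treatment) + len(control) - 2
    return cohens_d(treatment, control) * (1 - 3 / (4 * df - 1))
```

With the sample sizes typical of online experiments the correction is negligible; it matters mainly when each arm has only a handful of units.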

Practical significance: Does the effect size change a decision? A lift of 0.3% with p=0.001 is statistically significant but may not cover implementation cost. A lift of 8% with p=0.06 may be practically significant but statistically uncertain.

Decision rule: Report both p-value and effect size with confidence intervals. "The treatment increased conversion by 1.2% (95% CI: 0.3% to 2.1%, p=0.009)." Then decide: is 1.2% worth the engineering effort?
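A report line like the one above can be produced with a short helper. This is a sketch for conversion rates using a Wald interval and a pooled two-proportion z-test, which assumes samples large enough for the normal approximation; the function name and the counts are hypothetical.

```python
from statistics import NormalDist

def compare_proportions(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Absolute difference in conversion rates with a Wald CI and a two-sided z-test p-value."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error for the interval.
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Pooled standard error for the test of "no difference".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_null = (pooled * (1 - pooled) * (1 / n_c + 1 / n_t)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_null))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Hypothetical counts: 1,000/20,000 in control, 1,120/20,000 in treatment.
diff, (lo, hi), p = compare_proportions(conv_c=1000, n_c=20000, conv_t=1120, n_t=20000)
print(f"Lift: {diff:.2%} (95% CI: {lo:.2%} to {hi:.2%}, p={p:.3f})")
```

The lift here is in absolute percentage points. If you report relative lift instead, say so explicitly, because the two are easy to confuse.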

5. Hierarchical Models

Running 20 experiments in parallel? Each with small sample sizes? Hierarchical (multi-level) models let you borrow strength across experiments without assuming they're identical.

Partial pooling: Instead of analyzing each experiment independently (no pooling) or assuming they're all the same (full pooling), partial pooling shrinks estimates toward the global mean. Experiments with small samples get more shrinkage; large experiments retain their individual estimates. This reduces variance and improves accuracy.

When to use: A/B tests across multiple cities, product features, or time periods. You have 10 experiments, each with 500 users. Independently, none reach significance. Hierarchically, you might detect a small but consistent effect across all 10.

Implementation: Bayesian hierarchical models (e.g., using Stan or brms) or frequentist mixed-effects models. The key is specifying a distribution of effects (e.g., normal with unknown mean and variance) rather than treating each as fixed.
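Before reaching for Stan or a mixed-effects package, it helps to see partial pooling in its simplest form. The sketch below is an empirical-Bayes shortcut, not a full hierarchical fit: it assumes each experiment reports an effect estimate with a known, positive standard error, and it estimates the between-experiment variance with a crude method-of-moments step. All names are my own.

```python
from statistics import mean, variance

def partial_pool(estimates, std_errors):
    """Empirical-Bayes shrinkage of per-experiment effects toward a shared mean."""
    s2 = [se ** 2 for se in std_errors]
    # Method-of-moments estimate of the between-experiment variance, floored at 0.
    tau2 = max(0.0, variance(estimates) - mean(s2))
    # Precision-weighted grand mean.
    weights = [1 / (v + tau2) for v in s2]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled = []
    for y, v in zip(estimates, s2):
        shrink = v / (v + tau2)  # 1 = full pooling, 0 = no pooling
        pooled.append(shrink * mu + (1 - shrink) * y)
    return mu, tau2, pooled
```

Noisier experiments (larger standard errors) get a shrink factor closer to 1 and are pulled harder toward the grand mean. When the estimated between-experiment variance is 0, every estimate collapses to the grand mean, which is full pooling.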

Practical Traps

Underpowering: The most common error. Say you size an experiment for 80% power to detect a 5% lift, but the true effect is 1%. Your power against that 1% effect is only about 9% (two-sided test at α=0.05), so you'll miss it more than 90% of the time. Always compute power for the smallest effect you care about.
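The power arithmetic is worth running yourself. This sketch assumes a two-sided z-test on an approximately normal estimate; the function names are my own.

```python
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the estimate is N(effect, se^2)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

def se_for_power(effect, power=0.80, alpha=0.05):
    """Standard error at which `effect` is detected with the requested power
    (the negligible far-tail rejection probability is ignored)."""
    z = NormalDist()
    return effect / (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power))

se = se_for_power(effect=0.05)              # design sized for a 5% lift
print(round(power_two_sided(0.05, se), 3))  # power at the design effect
print(round(power_two_sided(0.01, se), 3))  # power if the true lift is only 1%
```

Since the standard error scales with one over the square root of the sample size, detecting an effect one-fifth the size at the same power needs roughly 25 times the sample.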

Variance inflation from stratification: Stratification (e.g., by country, device) reduces variance, but only if you use the matching variance formula. Applying the unstratified formula to stratified data folds between-stratum variation back into the standard error, overstating it and throwing away the power you gained. Always use the stratified standard error.
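The gap between the two formulas is easy to demonstrate. In this sketch each stratum is a pair of raw value lists (treatment, control), and strata are weighted by their share of all units; the data layout and function names are assumptions for the example.

```python
from statistics import mean, variance

def stratified_diff(strata):
    """strata: list of (treatment_values, control_values) pairs, one per stratum.

    Weights each stratum by its share of all units and returns
    (difference, standard_error).
    """
    total = sum(len(t) + len(c) for t, c in strata)
    diff, var = 0.0, 0.0
    for t, c in strata:
        w = (len(t) + len(c)) / total
        diff += w * (mean(t) - mean(c))
        # Only within-stratum variance enters the stratified standard error.
        var += w ** 2 * (variance(t) / len(t) + variance(c) / len(c))
    return diff, var ** 0.5

def unstratified_diff(strata):
    """Same data analysed as if no stratification had happened."""
    t_all = [x for t, _ in strata for x in t]
    c_all = [x for _, c in strata for x in c]
    se = (variance(t_all) / len(t_all) + variance(c_all) / len(c_all)) ** 0.5
    return mean(t_all) - mean(c_all), se
```

When strata have very different baselines, say one market averaging 10 and another averaging 100, the unstratified standard error is dominated by the gap between markets even though that gap says nothing about the treatment.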

SUTVA violations: Stable Unit Treatment Value Assumption—units don't interfere with each other. Violated in network effects (social media, marketplace, ad platforms). If treatment users affect control users, your estimates are biased. Use cluster-randomized designs or network-robust estimators.

A 5-Point Checklist for Rigorous Experimental Analysis

1. Pre-register your primary metric, effect size, and stopping rule. Not optional. Write it down before you see any data.
2. Correct for multiple testing. If you have >1 primary metric, use Bonferroni. If >5 secondary metrics, use BH-FDR. Report all tests, not just significant ones.
3. Report effect sizes with confidence intervals. Not just p-values. Include standardized effect sizes for comparability.
4. Check assumptions. SUTVA, no interference, no peeking, correct variance formula. Run a sensitivity analysis: what if the data had been slightly different?
5. Interpret in business context. "The treatment increased conversion by 1.2% (95% CI: 0.3% to 2.1%, p=0.009). This translates to an estimated $340K annual revenue increase, exceeding the $200K implementation cost. Decision: launch."

The best experimenters aren't the ones with the most sophisticated models. They're the ones who avoid the common statistical traps, understand what their methods actually mean, and communicate uncertainty honestly. Build your foundation on these five concepts, and your experiments will fail for the right reasons—not because your analysis was wrong.