| Authors | Ramesh Johari, Leo Pekelis, David J. Walsh |
| Year | 2015 |
| Citations | 101 |
What Problem It Solves
This paper addresses the fundamental failure of classical frequentist inference when users continuously monitor A/B test results and decide when to stop based on the evolving p-value. In standard practice, a user runs an A/B test, watches the p-value drop over time, and stops the experiment the moment it crosses 0.05 — a procedure known as "peeking" or continuous monitoring. This behavior invalidates the fixed-sample-size assumptions underlying traditional p-values and confidence intervals, inflating Type I error rates dramatically (e.g., from 5% to 25% or higher even with moderate sample sizes). The core challenge is that A/B testing platforms provide real-time dashboards showing p-values, creating an irresistible feedback loop: users see promising results and stop early, or see disappointing results and continue, both of which break the statistical guarantees. The paper develops "always valid" p-values and confidence intervals that remain valid under any data-dependent stopping rule, preserving the simple user interface of threshold-based decision-making while restoring correct error control. This is a problem of sequential inference under optional stopping, where the stopping time is unknown and potentially adversarial from the statistician's perspective.
The core insight is to construct p-values and confidence intervals that are valid at any stopping time, not just at a pre-specified sample size. The key technical tool is the duality between always valid p-values and sequential tests of power one — tests that never accept the null hypothesis in finite time but eventually reject it if the null is false.
Intuition: A traditional p-value is computed as if the sample size were fixed in advance. If you peek repeatedly, you're effectively conducting multiple hypothesis tests on overlapping data, inflating the chance of a false positive. The always valid p-value corrects for this by using a test statistic that accounts for the fact that you could have stopped at any earlier time.
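The inflation described above can be illustrated with a small simulation. This is a minimal sketch and not code from the paper: it assumes standardised N(0, 1) treatment-minus-control differences under the null (an A/A test), checks a two-sided z-test after every observation, and compares that against a single look at the final sample size. The function name and its parameters are hypothetical.

```python
import numpy as np


def peeking_false_positive_rate(n_experiments=2000, n_max=1000, first_look=10,
                                z_crit=1.96, seed=0):
    """Simulate A/A tests; compare peeking with one fixed-horizon look."""
    rng = np.random.default_rng(seed)
    # Each entry is one standardised treatment-minus-control difference.
    # Under the null (no effect) these are i.i.d. N(0, 1) by assumption.
    diffs = rng.standard_normal((n_experiments, n_max))
    n = np.arange(1, n_max + 1)
    # z-statistic after each observation: cumulative sum divided by sqrt(n).
    z = np.cumsum(diffs, axis=1) / np.sqrt(n)
    significant = np.abs(z) > z_crit
    # Peeking: declare a win if ANY look from first_look onward is significant.
    peeking_rate = significant[:, first_look - 1:].any(axis=1).mean()
    # Fixed horizon: look exactly once, at n_max.
    fixed_rate = significant[:, -1].mean()
    return float(peeking_rate), float(fixed_rate)
```

Calling `peeking_false_positive_rate()` returns the two empirical false positive rates; the fixed-horizon rate should sit near the nominal 5%, while the peeking rate should be several times larger.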
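The mixture sequential probability ratio test (mSPRT): For testing a null hypothesis H₀: θ = θ₀ against a composite alternative H₁: θ ≠ θ₀, the mSPRT works as follows: choose a mixing distribution H over the alternative values of θ; after each observation n, compute the mixture likelihood ratio Λ_n = ∫ ∏_{i=1}^n [f(X_i | θ) / f(X_i | θ₀)] dH(θ); and report the always valid p-value p_n = min(p_{n−1}, 1/Λ_n), starting from p₀ = 1. The null is rejected the first time p_n ≤ α.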
The key property is that for any stopping time τ, P_θ₀(p_τ ≤ α) ≤ α. This holds because the mixture likelihood ratio {Λ_n} is a nonnegative martingale with initial value 1 under the null, so Ville's inequality bounds the probability that Λ_n ever reaches 1/α by α. Equivalently, the probability that 1/Λ_n ever falls to α or below is at most α.
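As a hedged illustration of the mSPRT, the sketch below assumes one-sample normal data with known variance σ² and a normal mixing distribution H = N(θ₀, τ²). Under those assumptions the mixture likelihood ratio has the closed form Λ_n = sqrt(σ² / (σ² + nτ²)) · exp(τ² S_n² / (2σ²(σ² + nτ²))), where S_n is the sum of (X_i − θ₀), and the always valid p-value is the running minimum of 1/Λ_n. The function name and defaults are hypothetical.

```python
import numpy as np


def msprt_p_values(x, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Always valid p-values for normal data with a N(theta0, tau2) mixture.

    Assumes i.i.d. observations with known variance sigma2.
    Returns one p-value per observation; the sequence is non-increasing.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, x.size + 1)
    s = np.cumsum(x - theta0)          # S_n = sum of (X_i - theta0)
    v = sigma2 + n * tau2
    # log of the closed-form mixture likelihood ratio Lambda_n
    log_lambda = 0.5 * np.log(sigma2 / v) + tau2 * s**2 / (2.0 * sigma2 * v)
    # p_n = min(p_{n-1}, 1 / Lambda_n) with p_0 = 1
    candidate = np.minimum(1.0, np.exp(-log_lambda))
    return np.minimum.accumulate(candidate)
```

A user watching this sequence can stop the first time the p-value drops to α or below, or at any other data-dependent time, and the Type I error bound still applies under the stated assumptions.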
For the two-sample A/B test: The paper extends this to comparing two proportions (or means). Let X_i be observations from control (distribution F₀) and Y_i from treatment (distribution F₁). The null is that F₀ = F₁. The mSPRT is constructed by mixing over the treatment effect δ = θ₁ - θ₀, treating the baseline parameter as a nuisance. The test statistic becomes:
\[ \Lambda_n = \int \exp\left( \sum_{i=1}^n \log \frac{f(Y_i \mid \theta_0 + \delta)}{f(Y_i \mid \theta_0)} + \sum_{i=1}^n \log \frac{f(X_i \mid \theta_0)}{f(X_i \mid \theta_0)} \right) \, dH(\delta) \]
With θ₀ held fixed, the control observations have the same density under both hypotheses, so the second sum is identically zero and only the treatment terms contribute. In practice, for Bernoulli data, this simplifies to a closed-form expression involving the Beta function, making computation efficient.
Always valid confidence intervals: By inverting the test, one obtains confidence intervals that remain valid under continuous monitoring. The interval at time n is the set of candidate values δ₀ for which the always valid p-value for testing H₀: δ = δ₀ still exceeds α at time n.
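The inversion can be sketched for a one-sample normal model. This is an illustrative sketch under assumptions not taken from the paper: known variance σ², and a mixing distribution N(θ₀, τ²) recentred on each candidate null θ₀. In that case Λ_n(θ₀) < 1/α reduces to |X̄_n − θ₀| < r_n for a radius r_n, and because the p-value is a running minimum, the interval at time n is the intersection of all intervals up to n. The function name is hypothetical.

```python
import numpy as np


def always_valid_interval(x, alpha=0.05, sigma2=1.0, tau2=1.0):
    """Running-intersection confidence sequence for a normal mean.

    Assumes i.i.d. data with known variance sigma2 and a N(theta0, tau2)
    mixture centred at each candidate null value theta0.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, x.size + 1)
    mean = np.cumsum(x) / n
    v = sigma2 + n * tau2
    # Lambda_n(theta0) < 1/alpha  <=>  |mean - theta0| < radius
    radius = np.sqrt(2.0 * sigma2 * v / (n**2 * tau2)
                     * (np.log(1.0 / alpha) + 0.5 * np.log(v / sigma2)))
    # Intersect with all earlier intervals so the sequence is nested.
    lower = np.maximum.accumulate(mean - radius)
    upper = np.minimum.accumulate(mean + radius)
    return lower, upper
```

The returned arrays give the lower and upper endpoints after each observation. The intersection can in principle become empty, which under the stated assumptions happens with probability at most α.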
Multiple testing: Always valid p-values can be plugged into standard multiple testing procedures (Bonferroni, Benjamini-Hochberg) while preserving error control, because the validity holds at any stopping time. This is a major practical advantage: users can run many experiments simultaneously, monitor them continuously, and still control family-wise error rate or false discovery rate.
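To make the plug-in step concrete, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a snapshot of p-values, one per experiment. It is a generic implementation and not code from the paper; the function name is hypothetical, and any p-values passed to it in an example are made up for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected by BH at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m   # k * q / m for rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank whose p-value is under its
        # threshold and reject every hypothesis up to and including that rank.
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject
```

For instance, `benjamini_hochberg([0.001, 0.2, 0.012, 0.04, 0.9])` rejects the first and third hypotheses at q = 0.05, since only 0.001 and 0.012 fall under their rank thresholds of 0.01 and 0.02.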
Prefer this method over fixed-horizon testing when:
Prefer fixed-horizon testing over this method when:
Prefer other sequential methods (e.g., group sequential designs, alpha-spending functions) when:
Prefer Bayesian methods (e.g., Bayes factors, posterior probabilities) when:
What breaks this method:
Non-i.i.d. data: The martingale property that guarantees validity relies on independent observations. Time series data with autocorrelation, user-level repeated measures, or network interference (e.g., social media experiments where users influence each other) will invalidate the test. In such cases, the Type I error can be inflated even with always valid p-values.
Very small effect sizes: The mSPRT can take an extremely long time to detect tiny effects. If the true effect is exactly zero, the test rejects with probability at most α, so most null experiments never stop on their own. If the effect is non-zero but very small, the expected stopping time grows on the order of 1/δ² (up to logarithmic factors), which can be impractically large. Setting a maximum sample size M is essential.
Multiple testing with dependent tests: While the paper shows that always valid p-values can be used with Benjamini-Hochberg for FDR control, this requires that the tests are independent or have positive regression dependence. In practice, experiments on the same platform may share users (if users are randomized across experiments), creating dependence that can inflate FDR.
Mixing distribution misspecification: If the mixing distribution is poorly chosen (e.g., too narrow when the true effect is large), the test can be very slow to reject. The paper shows robustness to moderate misspecification, but extreme misspecification can lead to practical power loss.
Two-sample complications: The extension to two samples (control vs. treatment) requires treating the baseline parameter as a nuisance. If the baseline is poorly estimated (e.g., very rare events with few observations), the test may have inflated Type I error in