Always Valid Inference: Bringing Sequential Analysis to A/B Testing — DoOperator Research

Authors	Ramesh Johari, Leo Pekelis, David J. Walsh
Year	2015
Citations	101

What problem it solves

This paper addresses the fundamental failure of classical frequentist inference when users continuously monitor A/B test results and decide when to stop based on the evolving p-value. In standard practice, a user runs an A/B test, watches the p-value drop over time, and stops the experiment the moment it crosses 0.05 — a procedure known as "peeking" or continuous monitoring. This behavior invalidates the fixed-sample-size assumptions underlying traditional p-values and confidence intervals, inflating Type I error rates dramatically (e.g., from 5% to 25% or higher even with moderate sample sizes). The core challenge is that A/B testing platforms provide real-time dashboards showing p-values, creating an irresistible feedback loop: users see promising results and stop early, or see disappointing results and continue, both of which break the statistical guarantees. The paper develops "always valid" p-values and confidence intervals that remain valid under any data-dependent stopping rule, preserving the simple user interface of threshold-based decision-making while restoring correct error control. This is a problem of sequential inference under optional stopping, where the stopping time is unknown and potentially adversarial from the statistician's perspective.
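The inflation caused by peeking can be checked with a small simulation. The sketch below is an illustration, not taken from the paper: it assumes standard normal observations with known variance and a true effect of zero, and a hypothetical user who recomputes a two-sided z-test after every observation and stops the first time p < α. The function names, the burn-in of 10 observations, and the default sizes are all assumptions.

```python
import math
import random


def z_test_p_value(mean, n, sigma=1.0):
    """Two-sided fixed-horizon p-value for H0: mean = 0 with known sigma."""
    z = abs(mean) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def peeking_false_positive_rate(n_max=500, n_sims=1000, alpha=0.05,
                                burn_in=10, seed=0):
    """Fraction of null experiments (no true effect) that a peeking user
    declares significant at some look between burn_in and n_max."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            # The user checks the dashboard after every new observation.
            if n >= burn_in and z_test_p_value(total / n, n) < alpha:
                rejections += 1
                break
    return rejections / n_sims
```

Calling `peeking_false_positive_rate()` returns the empirical false positive rate under continuous monitoring. With a single look at `n_max` it would sit near α; with a look after every observation it is typically several times larger.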

How it works

The core insight is to construct p-values and confidence intervals that are valid at any stopping time, not just at a pre-specified sample size. The key technical tool is the duality between always valid p-values and sequential tests of power one — tests that never accept the null hypothesis in finite time but eventually reject it if the null is false.

Intuition: A traditional p-value is computed as if the sample size were fixed in advance. If you peek repeatedly, you're effectively conducting multiple hypothesis tests on overlapping data, inflating the chance of a false positive. The always valid p-value corrects for this by using a test statistic that accounts for the fact that you could have stopped at any earlier time.

The mixture sequential probability ratio test (mSPRT): For testing a null hypothesis H₀: θ = θ₀ against a composite alternative H₁: θ ≠ θ₀, the mSPRT works as follows:

Choose a mixing distribution H over the alternative parameter space (e.g., a Gaussian centered at zero with some variance τ²).
At each observation n, compute the mixture likelihood ratio: \[ \Lambda_n = \int \prod_{i=1}^n \frac{f(X_i \mid \theta)}{f(X_i \mid \theta_0)} \, dH(\theta) \]
The always valid p-value at time n is the running minimum p_n = min(1, 1 / max_{t ≤ n} Λ_t), so p_n never increases as data accumulate.
Stop and reject H₀ the first time p_n ≤ α, which is the first time Λ_n ≥ 1/α.
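The steps above can be sketched for the simplest case. This is a minimal illustration under stated assumptions, not the paper's implementation: observations are taken to be normal with known variance `sigma2`, and the mixing distribution H is Gaussian with variance `tau2` centered at the null value `theta0`. Under those assumptions the mixture integral has the closed form Λ_n = sqrt(σ² / (σ² + nτ²)) · exp(τ² S_n² / (2σ²(σ² + nτ²))), where S_n is the sum of (X_i − θ₀). The function and parameter names are hypothetical.

```python
import math


def msprt_log_lambda(sum_x, n, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Log of the Gaussian-mixture likelihood ratio after n observations.

    Assumes X_i ~ N(theta, sigma2) with known sigma2 and a N(theta0, tau2)
    mixing distribution over theta.
    """
    s = sum_x - n * theta0          # S_n = sum of (X_i - theta0)
    scale = sigma2 + n * tau2
    return (0.5 * math.log(sigma2 / scale)
            + tau2 * s * s / (2.0 * sigma2 * scale))


def always_valid_p_values(xs, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Running-minimum p-values p_n = min(1, 1 / max_{t <= n} Lambda_t)."""
    p = 1.0
    total = 0.0
    out = []
    for n, x in enumerate(xs, start=1):
        total += x
        log_lam = msprt_log_lambda(total, n, theta0, sigma2, tau2)
        # exp(min(0, -log_lam)) equals min(1, 1 / Lambda_n) and avoids
        # overflow when Lambda_n is very small.
        p = min(p, math.exp(min(0.0, -log_lam)))
        out.append(p)
    return out


if __name__ == "__main__":
    data = [0.4, 1.1, 0.7, 0.9, 1.3, 0.2, 0.8, 1.0]
    print(always_valid_p_values(data))
```

The sequence can be compared with α after every observation; because it is a running minimum, a rejection made at one look is never undone by later data.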

The key property is that for any stopping time T, finite or not, P_θ₀(p_T ≤ α) ≤ α. This holds because the mixture likelihood ratio {Λ_n} is a nonnegative martingale under the null with initial value 1, so Ville's inequality bounds the probability that it ever reaches 1/α by α.
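The argument can be written out in three short steps. The derivation below is a sketch that assumes i.i.d. observations and densities sharing a common support; T denotes an arbitrary stopping time and Λ_0 = 1 by convention.

```latex
% Step 1: under H_0 the mixture likelihood ratio is a nonnegative martingale.
\[
\mathbb{E}_{\theta_0}\!\left[\Lambda_{n+1} \mid X_1,\dots,X_n\right]
= \int \prod_{i=1}^{n} \frac{f(X_i \mid \theta)}{f(X_i \mid \theta_0)}
  \; \mathbb{E}_{\theta_0}\!\left[\frac{f(X_{n+1} \mid \theta)}{f(X_{n+1} \mid \theta_0)}\right] dH(\theta)
= \Lambda_n ,
\]
% since each inner expectation equals \int f(x \mid \theta)\,dx = 1.

% Step 2: Ville's inequality for a nonnegative martingale started at 1.
\[
\mathbb{P}_{\theta_0}\!\left(\sup_{n \ge 1} \Lambda_n \ge \frac{1}{\alpha}\right)
\le \alpha \, \mathbb{E}_{\theta_0}[\Lambda_0] = \alpha .
\]

% Step 3: the p-value falls to alpha only if Lambda has reached 1/alpha.
\[
\{\, p_T \le \alpha \,\}
= \Bigl\{ \max_{n \le T} \Lambda_n \ge \tfrac{1}{\alpha} \Bigr\}
\subseteq \Bigl\{ \sup_{n \ge 1} \Lambda_n \ge \tfrac{1}{\alpha} \Bigr\}
\quad\Longrightarrow\quad
\mathbb{P}_{\theta_0}(p_T \le \alpha) \le \alpha .
\]
```

Step 3 does not use any property of T, which is why the guarantee survives data-dependent stopping rules.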

For the two-sample A/B test: The paper extends this to comparing two proportions (or means). Let X_i be observations from control (distribution F₀) and Y_i from treatment (distribution F₁). The null is that F₀ = F₁. The mSPRT is constructed by mixing over the treatment effect δ = θ₁ - θ₀, treating the baseline parameter θ₀ as a nuisance. The test statistic becomes: \[ \Lambda_n = \int \exp\left( \sum_{i=1}^n \log \frac{f(Y_i \mid \theta_0 + \delta)}{f(Y_i \mid \theta_0)} + \sum_{i=1}^n \log \frac{f(X_i \mid \theta_0)}{f(X_i \mid \theta_0)} \right) dH(\delta) \] The control term is identically zero as written, so control observations enter only through the estimate of θ₀. In practice, for Bernoulli data, this simplifies to a closed-form expression involving the Beta function, making computation efficient.
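A hedged sketch of a two-sample version for conversion data follows. It does not implement the Beta-function expression mentioned above. It swaps in a simpler normal approximation: control and treatment outcomes are paired by arrival order, the difference in conversion rates is treated as approximately normal, and the variance of one paired difference is replaced by the plug-in estimate V_n = X̄(1 − X̄) + Ȳ(1 − Ȳ). Because the variance is estimated, the always-valid guarantee holds only approximately. The function name and the default `tau2` are assumptions.

```python
import math


def ab_test_p_values(control, treatment, tau2=0.01):
    """Approximate always valid p-values for H0: equal conversion rates.

    control, treatment: equal-length sequences of 0/1 outcomes.
    tau2: variance of the Gaussian mixing distribution over the effect.
    """
    p = 1.0
    sum_x = sum_y = 0
    out = []
    for n, (x, y) in enumerate(zip(control, treatment), start=1):
        sum_x += x
        sum_y += y
        x_bar, y_bar = sum_x / n, sum_y / n
        # Plug-in variance of a single paired difference Y_i - X_i.
        v = x_bar * (1.0 - x_bar) + y_bar * (1.0 - y_bar)
        if v > 0.0:  # skip looks where the variance estimate degenerates
            d = y_bar - x_bar
            scale = v + n * tau2
            log_lam = (0.5 * math.log(v / scale)
                       + n * n * tau2 * d * d / (2.0 * v * scale))
            p = min(p, math.exp(min(0.0, -log_lam)))
        out.append(p)
    return out


if __name__ == "__main__":
    control = [0, 1] * 200          # 50% conversion
    treatment = [1, 1, 1, 0] * 100  # 75% conversion
    print(ab_test_p_values(control, treatment)[-1])
```

The choice of `tau2` plays the role of the mixing variance: it should be on the scale of the squared effect sizes one expects to see, though the test remains usable when it is moderately off.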

Always valid confidence intervals: By inverting the test, one obtains confidence intervals that remain valid under continuous monitoring. The interval at time n is the set of candidate values δ₀ for which the always valid p-value for testing H₀: δ = δ₀ still exceeds α.
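For intuition, the inversion can be carried out in closed form in the simplest setting. The sketch below assumes a one-sample normal model with known variance `sigma2` and a Gaussian mixing distribution with variance `tau2` centered at each candidate null value; these are illustrative assumptions, and the names are hypothetical. Solving Λ_n(θ₀) < 1/α for θ₀ gives an interval around the running mean, and intersecting the intervals over time matches the running-minimum p-value.

```python
import math


def always_valid_interval(xs, alpha=0.05, sigma2=1.0, tau2=1.0):
    """Running intersection of the sets {theta0 : Lambda_t(theta0) < 1/alpha}.

    Returns (lo, hi). With small probability the intersection can be
    empty, in which case lo > hi.
    """
    lo, hi = -math.inf, math.inf
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        mean = total / n
        scale = sigma2 + n * tau2
        # Lambda_n(theta0) < 1/alpha  <=>  |mean - theta0| < radius
        radius = math.sqrt(
            2.0 * sigma2 * scale / (n * n * tau2)
            * (math.log(1.0 / alpha) + 0.5 * math.log(scale / sigma2))
        )
        lo = max(lo, mean - radius)
        hi = min(hi, mean + radius)
    return lo, hi


if __name__ == "__main__":
    print(always_valid_interval([0.5] * 200))
```

The interval is wider than a fixed-horizon interval at the same n; the extra width is the price of validity at every look.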

Multiple testing: Always valid p-values can be plugged into standard multiple testing procedures (Bonferroni, Benjamini-Hochberg) while preserving error control, because the validity holds at any stopping time. This is a major practical advantage: users can run many experiments simultaneously, monitor them continuously, and still control family-wise error rate or false discovery rate.
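As an illustration of the plug-in idea, the sketch below applies the standard Benjamini-Hochberg step-up procedure to a snapshot of always valid p-values, one per running experiment. The function name and the example numbers are made up for illustration, and the usual dependence conditions for Benjamini-Hochberg are assumed to hold.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return the indices of hypotheses rejected by Benjamini-Hochberg
    at false discovery rate level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its line.
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])


if __name__ == "__main__":
    # Hypothetical snapshot of always valid p-values from five experiments.
    snapshot = [0.001, 0.02, 0.04, 0.5, 0.9]
    print(benjamini_hochberg(snapshot, q=0.10))
```

Because each always valid p-value is nonincreasing over time, an experiment that enters the rejected set at one snapshot can only move further down the ranking at later snapshots.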

When to use it

Prefer this method over fixed-horizon testing when:

You cannot pre-specify sample size because you don't know the effect size or your cost of running the experiment is uncertain.
Users will inevitably peek at results and may stop early based on what they see (this is the default in most commercial A/B testing platforms).
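You want the freedom to stop early either way: as soon as the treatment clearly works, or once the confidence interval shows that any effect is too small to matter. The mSPRT itself never accepts the null, but the inference stays valid whenever you choose to stop.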
You are running many simultaneous experiments and need to control multiple testing error rates under continuous monitoring.
The opportunity cost of running experiments is high — detecting effects faster has real business value.

Prefer fixed-horizon testing over this method when:

You have a hard constraint on sample size (e.g., you can only run the experiment for two weeks due to business cycles).
You need maximum power for a given sample size: a fixed-horizon test spends its whole error budget on a single look, while sequential tests sacrifice some power at that sample size for the ability to stop early.
The effect size is known precisely from prior experiments, allowing optimal sample size planning.
Regulatory or compliance requirements mandate a pre-registered analysis plan with fixed sample size (common in clinical trials).

Prefer other sequential methods (e.g., group sequential designs, alpha-spending functions) when:

You want to specify a small number of pre-planned interim analyses (e.g., 3-5 looks) rather than continuous monitoring — group sequential tests are more powerful at those specific looks.
You need to control the maximum sample size strictly — the mSPRT can theoretically run forever if the effect is very small, though in practice a maximum time is set.
You are in a regulated environment (e.g., clinical trials) where group sequential designs are the accepted standard.

Prefer Bayesian methods (e.g., Bayes factors, posterior probabilities) when:

You have strong prior information about effect sizes that you want to incorporate formally.
You need to make decisions under uncertainty with explicit loss functions rather than hypothesis tests.
You are comfortable with subjective probability and do not need frequentist error rate guarantees.

Limitations and failure modes

What breaks this method:

Non-i.i.d. data: The martingale property that guarantees validity relies on independent observations. Time series data with autocorrelation, user-level repeated measures, or network interference (e.g., social media experiments where users influence each other) will invalidate the test. In such cases, the Type I error can be inflated even with always valid p-values.
Very small effect sizes: The mSPRT can take an extremely long time to detect tiny effects. If the true effect is exactly zero, the test rejects with probability at most α and otherwise runs indefinitely; if the effect is non-zero but very small, the expected stopping time grows roughly as 1/δ², which can be impractically large. Setting a maximum sample size M is essential.
Multiple testing with dependent tests: While the paper shows that always valid p-values can be used with Benjamini-Hochberg for FDR control, this requires that the tests are independent or have positive regression dependence. In practice, experiments on the same platform may share users (if users are randomized across experiments), creating dependence that can inflate FDR.
Mixing distribution misspecification: If the mixing distribution is poorly chosen (e.g., too narrow when the true effect is large), the test can be very slow to reject. The paper shows robustness to moderate misspecification, but extreme misspecification can lead to practical power loss.
Two-sample complications: The extension to two samples (control vs. treatment) requires treating the baseline parameter as a nuisance. If the baseline is poorly estimated (e.g., very rare events with few observations), the test may have inflated Type I error in
