DoOperator Research · May 10, 2026
You've just launched a new feature on your platform. Your product team is confident it will increase engagement. The A/B test shows a statistically significant 0.3% lift in daily active users (p = 0.04, N = 500,000). Your VP wants to ship it tomorrow. But you've been burned before—last quarter's "significant" result reversed after two weeks, and the quarter before that, three concurrent experiments interacted in ways that took months to untangle.
The decision isn't whether to run experiments. It's whether you can trust the ones you're already running.
This is the central challenge of industry experimentation at scale: how to maintain rigorous causal inference when you're running hundreds of concurrent experiments across millions of users, with engineering teams shipping weekly, and where the cost of a false positive isn't just a p-value but a product roadmap.
The foundational paper on this topic—Kohavi et al.'s "Online controlled experiments at large scale" (2013)—documents Microsoft's experience running over 200 concurrent experiments daily on Bing's ~100 million monthly active users. The key insight is that scaling experimentation isn't primarily a statistical problem; it's an operational one. Classical experimental design assumes you run one experiment at a time, with clean randomization, no interference, and a fixed sample size. At scale, every one of these assumptions is violated daily.
Consider the interference problem. When Bing changed its search results page, the treatment affected not just the treated users but also what content was available for other users to see—a violation of the Stable Unit Treatment Value Assumption (SUTVA). The paper reports that this interference can bias treatment effect estimates by 10-20% for metrics like click-through rate, and the bias is not detectable through standard balance checks. You cannot test for SUTVA violations directly; you must design around them.
The most common failure mode in industry experimentation isn't small samples or multiple testing—it's interaction between concurrent experiments. When your team runs experiments on the recommendation algorithm, the pricing page, and the notification system simultaneously, the treatment effects are not additive. The recommendation change might increase engagement only when combined with the new notification design.
Kohavi et al. document that at Microsoft, approximately 5-10% of concurrent experiment pairs show statistically significant interactions. The practical solution isn't to avoid concurrent experiments—that would kill velocity—but to implement an overlapping experiment framework with interaction detection. The paper recommends running pairwise interaction tests for all concurrent experiments and flagging any pair where the interaction term is significant at α = 0.01 (not 0.05, to account for multiple testing). When an interaction is detected, the affected experiments should be re-run in isolation or with the interacting experiment held constant.
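One way to picture a single pairwise check is the sketch below. It assumes you already have per-cell summary statistics for the four combinations of two experiments, and it uses a normal approximation on the difference-in-differences of cell means. The names `CellStats`, `interaction_z_test`, and `flag_interaction` are hypothetical illustrations, not the paper's tooling.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    """Summary statistics for one cell of the 2x2 layout (hypothetical container)."""
    mean: float  # sample mean of the metric in this cell
    var: float   # sample variance of the metric in this cell
    n: int       # number of users in this cell


def interaction_z_test(ctrl_ctrl, a_only, b_only, both):
    """Return (estimate, z, two-sided p) for the A-by-B interaction.

    The interaction is the difference-in-differences of cell means:
    (both - b_only) - (a_only - ctrl_ctrl).
    """
    estimate = (both.mean - b_only.mean) - (a_only.mean - ctrl_ctrl.mean)
    # The four cells contain disjoint users, so the variances of the means add.
    se = math.sqrt(sum(c.var / c.n for c in (ctrl_ctrl, a_only, b_only, both)))
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return estimate, z, p


def flag_interaction(p_value, alpha=0.01):
    """Flag a pair at the stricter threshold described above."""
    return p_value < alpha


# Illustrative numbers only: purely additive effects give an interaction of zero.
additive = interaction_z_test(
    CellStats(10.0, 4.0, 400), CellStats(11.0, 4.0, 400),
    CellStats(12.0, 4.0, 400), CellStats(13.0, 4.0, 400),
)
# Here the combined cell is higher than the two main effects alone would predict.
synergy = interaction_z_test(
    CellStats(10.0, 4.0, 400), CellStats(11.0, 4.0, 400),
    CellStats(12.0, 4.0, 400), CellStats(15.0, 4.0, 400),
)
```

In a real pipeline you would run this over every concurrent pair and every key metric, which is why the stricter α matters: the number of tests grows quadratically with the number of experiments.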
Let me walk you through a concrete scenario I've seen play out multiple times.
The setup: Your company runs an e-commerce platform. Team A launches an experiment that changes the product recommendation algorithm (Treatment: new collaborative filtering). Team B launches an experiment that changes the checkout flow (Treatment: one-click purchase). Both run for two weeks, and each treats an independently randomized 10% of users. That leaves about 81% of users in control for both experiments, about 9% in each single-treatment group, and roughly 1% receiving both treatments.
The result: Team A sees a 1.2% lift in revenue (p = 0.03). Team B sees a 0.8% lift in conversion (p = 0.04). Both teams celebrate and prepare to ship.
The problem: Users in Team A's treatment group who also land in Team B's treatment group experience both changes simultaneously. The one-click checkout makes the new recommendations more actionable because the friction to purchase is lower. The interaction effect is 2.1%, larger than either main effect. But neither team's analysis accounts for this. Team A attributes the full lift to their recommendation change, when part of it comes from the interaction with the easier checkout.
The fix: Before shipping either change, run the interaction test. The interaction term is significant (p = 0.008). Now you have three options: (1) ship both together (the interaction is positive), (2) ship only one and re-run the other in isolation, or (3) design a joint experiment that randomizes users to all four combinations of the two treatments, allowing you to estimate both main effects and the interaction simultaneously.
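Option (3) depends on assigning each user to the two treatments independently. A minimal sketch of one way to do that is salted hashing, shown below. The salts `"recs-v2"` and `"one-click"`, the function names, and the 50/50 split are all assumptions made for illustration, not a description of any particular platform's assignment service.

```python
import hashlib
from collections import Counter
from typing import Tuple


def in_treatment(user_id: str, experiment_salt: str, treated_share: float) -> bool:
    """Deterministically assign a user via a salted hash (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # approximately uniform in [0, 1)
    return bucket < treated_share


def factorial_cell(user_id: str, share_a: float = 0.5, share_b: float = 0.5) -> Tuple[bool, bool]:
    """Return (gets_new_recommendations, gets_one_click_checkout) for a user.

    Different salts make the two assignments effectively independent,
    so all four combinations are populated.
    """
    return (
        in_treatment(user_id, "recs-v2", share_a),
        in_treatment(user_id, "one-click", share_b),
    )


# Count how 100,000 synthetic user ids fall into the four cells.
counts = Counter(factorial_cell(f"user-{i}") for i in range(100_000))
for cell, count in sorted(counts.items()):
    print(cell, count)
```

With a 50/50 split on each factor, every cell holds about a quarter of users, which gives the interaction estimate far more data than two independent 10% experiments would, where only about 1% of users see both treatments.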
The standard alternative to concurrent experimentation is the "one experiment at a time" approach—sequential testing where each experiment runs to completion before the next begins. This is the default in academic research and small-scale industry settings. Its advantage is simplicity: no interaction concerns, clean randomization, straightforward analysis.
But at scale, sequential testing is a non-starter. If each experiment takes two weeks and you have 200 experiments per quarter, sequential testing would require 400 weeks—nearly eight years. The opportunity cost of not running experiments in parallel far exceeds the bias from interactions.
A second alternative is the "holdout" approach: maintain a permanent control group that never receives any treatment, and compare all treated users against this holdout. This solves the interaction problem because the holdout provides a clean baseline. But it introduces a new problem: the holdout group becomes increasingly unrepresentative over time as the platform evolves. Users in the holdout experience a progressively outdated product, making them less engaged and less comparable to treated users. Kohavi et al. report that holdout groups at Microsoft showed 5-15% lower engagement than the control group after six months, violating the comparability assumption.
The recommended approach is the overlapping experiment framework with interaction detection: run experiments concurrently, test for pairwise interactions systematically, and when interactions are found, either ship the combination or re-run in isolation. This preserves velocity while maintaining rigor.
Another failure mode I see often isn't statistical at all: it's metric design. Teams define success metrics that are easily gamed or that conflate multiple causal pathways.
Consider a common scenario: you're testing a new onboarding flow. Your primary metric is "7-day retention." The treatment group shows a 2% improvement. You ship it. Three weeks later, retention has returned to baseline. What happened?
The new onboarding flow was more engaging initially, but it also temporarily held on to users who would have churned anyway: they engaged for a few days and then left. The 7-day retention metric captured the short-term engagement boost but missed the long-term churn. The metric was polluted by a selection effect: the treatment changed who was retained at day 7, not just how many.
The fix is to use a metric hierarchy: primary metrics that are hard to game (revenue, long-term retention), secondary metrics that provide diagnostic information (short-term engagement, feature adoption), and guardrail metrics that catch negative side effects (customer support tickets, system latency). Kohavi et al. recommend running each experiment through a standardized metric dashboard with at least 10-15 metrics, including at least three guardrail metrics that must not degrade significantly.
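The hierarchy can be encoded as a simple ship rule, as in the sketch below. The tier labels, the `MetricResult` container, the metric names, and the thresholds are illustrative assumptions; a real dashboard would carry many more metrics and its own pre-specified definitions of degradation.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """One row of a hypothetical experiment dashboard."""
    name: str
    tier: str               # "primary", "secondary", or "guardrail"
    relative_change: float  # signed so that positive always means "better"
    p_value: float


def ship_decision(results, alpha=0.05, guardrail_tolerance=0.01):
    """Return (ok_to_ship, reasons) under illustrative rules.

    Blocks shipping if any guardrail degrades by more than the tolerance
    with p < alpha, or if no primary metric improves significantly.
    Secondary metrics are diagnostic only and never block.
    """
    reasons = []
    for r in results:
        degraded = r.relative_change < -guardrail_tolerance and r.p_value < alpha
        if r.tier == "guardrail" and degraded:
            reasons.append(f"guardrail degraded: {r.name}")
    primaries = [r for r in results if r.tier == "primary"]
    if not any(r.relative_change > 0 and r.p_value < alpha for r in primaries):
        reasons.append("no primary metric shows a significant improvement")
    return (not reasons, reasons)


# Made-up numbers: revenue improves, but support tickets get significantly worse.
results = [
    MetricResult("revenue_per_user", "primary", 0.012, 0.03),
    MetricResult("short_term_engagement", "secondary", 0.020, 0.01),
    MetricResult("support_tickets", "guardrail", -0.030, 0.02),
    MetricResult("page_latency", "guardrail", -0.002, 0.40),
    MetricResult("unsubscribe_rate", "guardrail", 0.001, 0.70),
]
ok, reasons = ship_decision(results)
print(ok, reasons)
```

Signing every change so that positive means "better" keeps the rule uniform across metrics where lower is good (tickets, latency) and metrics where higher is good (revenue).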
Before you ship your next experiment at scale, verify these six things:
Interaction check completed. Run pairwise interaction tests between your experiment and all other concurrent experiments. Flag any interaction with p < 0.01. If interactions exist, do not ship either treatment on the strength of its standalone result: ship the combination, or re-run the affected experiments in isolation first.
SUTVA violation risk assessed. Identify any mechanism by which one user's treatment could affect another user's outcome: shared content, network effects, market-level spillovers. If such mechanisms exist, consider cluster-randomized designs or holdout groups.
Metric hierarchy defined. Your primary metric should be a business outcome (revenue, retention, lifetime value), not a proxy (clicks, time on page). You need at least three guardrail metrics that must not degrade. Pre-specify what "degrade" means (e.g., >1% relative decrease with p < 0.05).
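A/A test passed. Run an A/A test (same variant on both sides) on your randomization infrastructure before the experiment. Because each metric has a 5% chance of a false positive at α = 0.05, do not expect zero significant results across a large dashboard. Instead, verify that the share of significant metrics is close to 5% and that p-values from repeated A/A runs look roughly uniform. If significant differences appear far more often than that, your randomization or analysis pipeline is broken, so fix it before proceeding.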
Power analysis for the minimum detectable effect. Compute the sample size needed to detect your minimum economically meaningful effect size at 80% power and α = 0.05. If your experiment is underpowered, extend the duration or accept that you can only detect large effects.
Pre-registration of analysis plan. Document your primary metric, secondary metrics, guardrail metrics, stopping rule, and interaction testing procedure before the experiment starts. This prevents p-hacking and post-hoc rationalization. The registration should be timestamped and immutable.
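For the power analysis item, a rough sketch of the calculation is shown below. It assumes a binary metric (such as conversion), a two-sided fixed-horizon test, equal-sized arms, and the standard normal-approximation formula for comparing two proportions. The function name and the example baseline are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift in a rate.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed example: 10% baseline conversion, 2% relative lift as the minimum
# economically meaningful effect.
n_needed = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.02)
print(n_needed)
```

Under these assumptions the requirement runs to several hundred thousand users per arm, and halving the minimum detectable effect roughly quadruples it. That scaling is why an underpowered experiment usually needs a longer duration rather than a small top-up of traffic.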
Part of the DoOperator Research series on Industry Experiments. Browse the full paper corpus at dooperator.ai/research/industry_experiments.