| Authors | Ron Kohavi, Alex Deng, Brian Frasca, T. Walker, Ya Xu, Nils Pohlmann |
| Journal | Knowledge Discovery and Data Mining |
| Year | 2013 |
| DOI | 10.1145/2487575.2488217 |
| Citations | 424 |
What Problem It Solves
This paper addresses the challenge of scaling online controlled experiments (A/B tests) from isolated, one-off studies to an organization-wide decision-making engine that can run hundreds of concurrent experiments across millions of users. The core problem is that as organizations grow and adopt agile development, they need to evaluate many product ideas simultaneously — often hundreds per quarter — but traditional experimental designs and statistical methods break down under the combinatorial complexity of overlapping experiments, massive data volumes, and the need for rapid, trustworthy decisions. Existing approaches to product evaluation (focus groups, expert reviews, observational data analysis) fail to establish causal relationships reliably, while single-experiment A/B testing frameworks cannot handle the operational demands of concurrent experimentation at web scale. The paper synthesizes lessons from Microsoft's Bing, where over 200 concurrent experiments run daily across ~100 million monthly active users, and provides a framework for building the cultural, engineering, and statistical infrastructure needed to make experimentation a core organizational capability rather than a niche analytical tool.
The paper describes a complete experimentation ecosystem rather than a single statistical method. The core idea is to embed controlled experimentation into the product development lifecycle so that every feature change, from a color tweak to a ranking algorithm overhaul, is evaluated through a randomized experiment before full deployment.
The experimentation pipeline works as follows:
Experiment design and configuration: A product team defines a hypothesis, selects the treatment variant(s), specifies the target user population (e.g., US English desktop users), and sets the traffic allocation (typically 10-20% of eligible users split evenly between control and treatment). The system assigns users to variants using a deterministic hash of their user ID, ensuring consistent experience across sessions.
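The hash-based assignment described above can be sketched as follows. This is a minimal illustration, not the Bing implementation: the choice of SHA-256, the 1,000-bucket granularity, and the names `bucket` and `assign_variant` are all assumptions made for the example.

```python
import hashlib


def bucket(user_id: str, experiment_id: str, buckets: int = 1000) -> int:
    """Map (experiment, user) to a stable bucket in [0, buckets).

    Hashing the experiment ID together with the user ID is an assumption
    here; it keeps assignments in different experiments from lining up.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign_variant(user_id: str, experiment_id: str,
                   traffic_fraction: float = 0.20, buckets: int = 1000) -> str:
    """Return 'control', 'treatment', or 'not_in_experiment'.

    The first `traffic_fraction` of buckets is exposed to the experiment and
    split evenly between control and treatment by bucket parity.
    """
    b = bucket(user_id, experiment_id, buckets)
    exposed = round(traffic_fraction * buckets)
    if b >= exposed:
        return "not_in_experiment"
    return "control" if b % 2 == 0 else "treatment"
```

Because the result depends only on the two IDs, the same user sees the same variant in every session without any stored assignment table.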
Overlapping experiment framework: Rather than running each experiment on disjoint user populations (which would waste traffic), Bing uses an overlapping experiment architecture similar to Google's. Users are simultaneously enrolled in multiple experiments across different "layers" (e.g., one layer for relevance ranking, another for UI changes, another for ads). Each layer has its own randomization key, and the layers are orthogonalized so that experiments in different layers do not systematically confound each other. This allows 90% of eligible users to participate in ~15 concurrent experiments.
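One way to picture the layered design is to give each layer its own hash salt, so that a user's variant in one layer carries no information about their variant in another. The sketch below assumes three hypothetical layers with five variants each; the layer names, the variant counts, and the use of SHA-256 are illustrative choices, not details from the paper.

```python
import hashlib

# Hypothetical layers and number of variants per layer (illustrative only).
LAYERS = {
    "relevance": 5,
    "ui": 5,
    "ads": 5,
}


def layer_variant(user_id: str, layer: str, n_variants: int) -> int:
    """Pick a variant index for one layer, using the layer name as the salt."""
    digest = hashlib.sha256(f"{layer}|{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_variants


def assign_all_layers(user_id: str) -> dict:
    """Enroll the user in every layer at once; returns {layer: variant index}."""
    return {layer: layer_variant(user_id, layer, n) for layer, n in LAYERS.items()}
```

Tabulating the (relevance, ui) pairs over many synthetic user IDs should give roughly equal counts in all 25 cells, which is the practical meaning of the layers being orthogonal.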
Data collection and instrumentation: Every user interaction is logged — queries, clicks, page views, time on site, ad impressions, revenue events, etc. The logging infrastructure must handle billions of events per day with low latency and high reliability. The paper notes that a typical two-week experiment using 20% of traffic processes about 4TB of data.
Automated analysis (scorecards): At the end of the experiment (or continuously while it runs), the system computes the Overall Evaluation Criterion (OEC) and dozens of secondary metrics for each variant. The analysis uses a simple difference-in-means estimator:
\[ \hat{\tau} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}} \]
where \(\bar{Y}\) is the average of the metric (e.g., revenue per user) across all users in each group. The variance is estimated using the standard formula for two-sample t-tests, but with adjustments for the fact that users are clustered and metrics are often heavy-tailed.
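As a rough illustration of the estimator, the sketch below computes the difference in means, an unpooled (Welch-style) standard error, and a two-sided p-value. It assumes one metric value per user and uses a normal approximation in place of the t distribution, which is reasonable only for large samples; it does not include the clustering or heavy-tail adjustments mentioned above. The function name `diff_in_means` is made up for the example.

```python
import math
import statistics


def diff_in_means(treatment, control):
    """Return (tau_hat, standard_error, z, two_sided_p_value).

    treatment, control: sequences with one metric value per user.
    """
    n_t, n_c = len(treatment), len(control)
    tau_hat = statistics.fmean(treatment) - statistics.fmean(control)
    # Unpooled variance of the difference: s_t^2 / n_t + s_c^2 / n_c
    se = math.sqrt(statistics.variance(treatment) / n_t
                   + statistics.variance(control) / n_c)
    z = tau_hat / se
    # Two-sided p-value under a standard normal approximation (assumption).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return tau_hat, se, z, p_value
```

With tiny inputs such as `diff_in_means([2, 4, 6, 8], [1, 3, 5, 7])` the estimate is 1.0 but the p-value is far above 0.05, which shows why the large user counts described in the paper matter for detecting small effects.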
Statistical significance testing: Each metric is tested at α=0.05 using a two-sided t-test. However, because hundreds of metrics are tested simultaneously, the paper explicitly warns about the multiple comparisons problem and recommends using false discovery rate (FDR) control or Bonferroni correction for the primary OEC metrics.
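The two corrections named above can be sketched in a few lines. The FDR procedure shown is Benjamini-Hochberg, which is one common way to control the false discovery rate; the text does not say which FDR procedure is intended, so treat that choice, along with the function names, as an assumption.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject metric i only if p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery rate).

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects the
    k smallest p-values. Returns booleans in the original input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```

For ten metrics with p-values 0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, and 0.216, an uncorrected α=0.05 test would flag five of them, Benjamini-Hochberg flags the first two, and Bonferroni flags only the first.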
Alerting and monitoring: Because there are billions of possible site variants (5^15 ≈ 30 billion combinations), traditional testing and debugging is impossible. Instead, the system uses automated alerts that fire when:
Holdout group: 10% of users are placed in a permanent holdout group that receives no experimental treatments. This group serves as a baseline to measure the overall impact of the experimentation system itself and to detect systemic issues (e.g., if all experiments are degrading the user experience, the holdout group will show better metrics).
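A holdout of this kind can be expressed as a gate that runs before any experiment assignment. The sketch below is an assumption about how such a gate might look: the salt string, the use of SHA-256, and the function names are hypothetical, and only the 10% figure comes from the text.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # from the text: 10% of users


def in_holdout(user_id: str, salt: str = "global-holdout") -> bool:
    """True if the user falls in the permanent holdout.

    The salt is fixed and never rotated, so membership stays stable over time.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return u < HOLDOUT_FRACTION


def eligible_for_experiments(user_id: str) -> bool:
    """Holdout users never receive any experimental treatment."""
    return not in_holdout(user_id)
```

Using a dedicated salt keeps holdout membership independent of any single experiment's randomization.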
Key statistical insight: The paper emphasizes that at scale, the primary challenge is not statistical power (which is abundant with millions of users) but rather:
Prefer this approach over ad-hoc decision making when:
Prefer this over single, isolated A/B tests when:
Prefer single, isolated experiments over this approach when:
Prefer this over observational methods (e.g., before-after comparisons, cohort analysis) when:
Prefer observational methods over this when:
What breaks this method: