Online controlled experiments at large scale

Authors: Ron Kohavi, Alex Deng, Brian Frasca, T. Walker, Ya Xu, Nils Pohlmann
Journal: Knowledge Discovery and Data Mining
Year: 2013
DOI: 10.1145/2487575.2488217
Citations: 424

What problem it solves

This paper addresses the challenge of scaling online controlled experiments (A/B tests) from isolated, one-off studies to an organization-wide decision-making engine that can run hundreds of concurrent experiments across millions of users. The core problem is that as organizations grow and adopt agile development, they need to evaluate many product ideas simultaneously — often hundreds per quarter — but traditional experimental designs and statistical methods break down under the combinatorial complexity of overlapping experiments, massive data volumes, and the need for rapid, trustworthy decisions. Existing approaches to product evaluation (focus groups, expert reviews, observational data analysis) fail to establish causal relationships reliably, while single-experiment A/B testing frameworks cannot handle the operational demands of concurrent experimentation at web scale. The paper synthesizes lessons from Microsoft's Bing, where over 200 concurrent experiments run daily across ~100 million monthly active users, and provides a framework for building the cultural, engineering, and statistical infrastructure needed to make experimentation a core organizational capability rather than a niche analytical tool.

How it works

The paper describes a complete experimentation ecosystem rather than a single statistical method. The core idea is to embed controlled experimentation into the product development lifecycle so that every feature change, from a color tweak to a ranking algorithm overhaul, is evaluated through a randomized experiment before full deployment.

The experimentation pipeline works as follows:

  1. Experiment design and configuration: A product team defines a hypothesis, selects the treatment variant(s), specifies the target user population (e.g., US English desktop users), and sets the traffic allocation (typically 10-20% of eligible users split evenly between control and treatment). The system assigns users to variants using a deterministic hash of their user ID, ensuring consistent experience across sessions.

  2. Overlapping experiment framework: Rather than running each experiment on disjoint user populations (which would waste traffic), Bing uses an overlapping experiment architecture similar to Google's. Users are simultaneously enrolled in multiple experiments across different "layers" (e.g., one layer for relevance ranking, another for UI changes, another for ads). Each layer has its own randomization key, and the layers are orthogonalized so that experiments in different layers do not systematically confound each other. This allows 90% of eligible users to participate in ~15 concurrent experiments.

  3. Data collection and instrumentation: Every user interaction is logged — queries, clicks, page views, time on site, ad impressions, revenue events, etc. The logging infrastructure must handle billions of events per day with low latency and high reliability. The paper notes that a typical two-week experiment using 20% of traffic processes about 4TB of data.

  4. Automated analysis (scorecards): At the end of the experiment (or continuously while it runs), the system computes the Overall Evaluation Criterion (OEC) and dozens of secondary metrics for each variant. The analysis uses a simple difference-in-means estimator:

    \[ \hat{\tau} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}} \]

    where \(\bar{Y}\) is the average of the metric (e.g., revenue per user) across all users in each group. The variance is estimated using the standard formula for two-sample t-tests, but with adjustments for the fact that users are clustered and metrics are often heavy-tailed.

  5. Statistical significance testing: Each metric is tested at α=0.05 using a two-sided t-test. However, because hundreds of metrics are tested simultaneously, the paper explicitly warns about the multiple comparisons problem and recommends using false discovery rate (FDR) control or Bonferroni correction for the primary OEC metrics.

  6. Alerting and monitoring: Because there are billions of possible site variants (5^15 ≈ 30 billion combinations), traditional testing and debugging is impossible. Instead, the system uses automated alerts that fire when:

    • A metric shows a statistically significant degradation (e.g., page load time increases)
    • An experiment interacts with another experiment (detected via pairwise interaction tests)
    • The overall system metrics (e.g., overall revenue) deviate from expected ranges

  7. Holdout group: 10% of users are placed in a permanent holdout group that receives no experimental treatments. This group serves as a baseline to measure the overall impact of the experimentation system itself and to detect systemic issues (e.g., if all experiments are degrading the user experience, the holdout group will show better metrics).
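
The assignment scheme in steps 1 and 2 can be sketched as follows. This is a minimal illustration, not Bing's implementation: the SHA-256 hash, the 1,000-bucket granularity, the layer and experiment names, and the helpers `bucket`, `assign`, and `assign_all` are all assumptions made for the example. Salting the hash with the layer name gives each layer its own randomization key, so a user's bucket in one layer is effectively independent of their bucket in another.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed granularity; not taken from the paper


def bucket(user_id: str, layer_salt: str) -> int:
    """Deterministically map a user to a bucket in [0, NUM_BUCKETS) for one layer."""
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(user_id: str, layer_salt: str, experiments):
    """Return (experiment, variant) for this layer, or None if the user is not enrolled.

    `experiments` is a list of (name, start, end, variants) tuples whose
    bucket ranges [start, end) do not overlap within the layer.
    """
    b = bucket(user_id, layer_salt)
    for name, start, end, variants in experiments:
        if start <= b < end:
            # Split the experiment's bucket range evenly across its variants.
            index = (b - start) * len(variants) // (end - start)
            return name, variants[index]
    return None


# Hypothetical layers: each experiment takes 20% of its layer's traffic.
LAYERS = {
    "ranking": [("new_ranker", 0, 200, ["control", "treatment"])],
    "ui": [("link_color", 0, 200, ["control", "treatment"])],
}


def assign_all(user_id: str) -> dict:
    """Assign a user in every layer, using the layer name as the hash salt."""
    return {layer: assign(user_id, layer, exps) for layer, exps in LAYERS.items()}


print(assign_all("user-42"))
```

Because the result depends only on the user ID and the layer salt, the same user sees the same variants on every visit without any stored assignment state.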

Key statistical insight: The paper emphasizes that at scale, the primary challenge is not statistical power (which is abundant with millions of users) but rather:

  • False positives: With hundreds of experiments running concurrently and each experiment testing dozens of metrics, the expected number of false positives is large. The paper reports that at α=0.05 and without correction, about 5% of metrics with no true effect will appear significant by chance alone.
  • Effect size estimation: Even tiny effects (0.1-0.5% changes) can be statistically significant with large samples, but may not be practically meaningful. The paper advocates for focusing on the OEC and using confidence intervals rather than just p-values.
  • Interaction detection: With overlapping experiments, the number of experiment pairs that could interact (i.e., where the treatment effect of one depends on the other) grows quadratically with the number of experiments. The system runs pairwise interaction tests automatically.
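
The first two points can be illustrated with a short sketch, assuming per-user metric values are already aggregated into plain lists. The function names `compare` and `benjamini_hochberg` are hypothetical, the p-value and interval use a normal approximation (reasonable only for large samples), and the sketch omits the clustering and heavy-tail adjustments the paper mentions. Benjamini-Hochberg is one standard FDR procedure; the source does not say which one the system uses.

```python
import math
from statistics import NormalDist, fmean, variance


def compare(control, treatment, alpha=0.05):
    """Difference in means with a two-sided p-value and confidence interval."""
    diff = fmean(treatment) - fmean(control)
    # Unpooled (Welch-style) standard error of the difference in means.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {"diff": diff, "ci": (diff - half_width, diff + half_width), "p": p_value}


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Made-up p-values for ten metrics from one scorecard.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in p_values), "significant without correction")
print(sum(benjamini_hochberg(p_values)), "significant after FDR control")
```

Reporting the interval alongside the p-value makes it easier to judge whether a statistically significant change is large enough to matter.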

When to use it

Prefer this approach over ad-hoc decision making when:

  • You have at least thousands of active users (the paper offers this as a rough threshold)
  • You need to evaluate many product ideas simultaneously (tens to hundreds per quarter)
  • The effects you care about are small (1-5% changes in key metrics) and would be missed by qualitative methods
  • Your organization is willing to invest in the engineering infrastructure for logging, randomization, and automated analysis
  • You can define a clear OEC that captures long-term business value (not just short-term engagement)
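
To gauge whether a user base is large enough for the effect sizes above, a standard two-sample power calculation is a useful starting point. The sketch below uses the textbook formula n = 2(z_{1-α/2} + z_{power})² σ² / δ² rather than anything taken from this summary; the function name `users_per_variant` and the example values for the metric's standard deviation and target change are hypothetical.

```python
import math
from statistics import NormalDist


def users_per_variant(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users needed in each variant to detect an absolute change `delta`.

    sigma: standard deviation of the per-user metric
    delta: smallest absolute difference in means worth detecting
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical metric: mean 1.0, standard deviation 3.0, target change of 2% of the mean.
print(users_per_variant(sigma=3.0, delta=0.02 * 1.0))
```

With the defaults (α=0.05, 80% power) the multiplier 2(z_{1-α/2} + z_{power})² is about 16, so halving the detectable change roughly quadruples the number of users required per variant.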

Prefer this over single, isolated A/B tests when:

  • You have multiple teams working on different parts of the product simultaneously
  • You want to maximize the learning velocity per user (overlapping experiments use traffic more efficiently)
  • You need to detect interactions between features (e.g., a UI change that affects how users respond to a ranking change)
  • You want to build an organizational culture of experimentation rather than occasional hypothesis testing
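
An interaction between two overlapping experiments can be checked with a 2×2 contrast: the effect of experiment B among users in A's treatment minus its effect among users in A's control. The sketch below is one simple way to run such a test, not necessarily what the paper's system does; `interaction_test` and the cell layout are assumptions, and the p-value relies on a normal approximation that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, fmean, variance


def interaction_test(cells):
    """Estimate the A×B interaction from four groups of per-user metric values.

    `cells` maps (a, b) to a list of values, where a and b are 1 if the user
    is in the treatment of experiment A or B respectively, and 0 otherwise.
    """
    effect_b_with_a = fmean(cells[(1, 1)]) - fmean(cells[(1, 0)])
    effect_b_without_a = fmean(cells[(0, 1)]) - fmean(cells[(0, 0)])
    estimate = effect_b_with_a - effect_b_without_a
    # The four cell means are independent, so their variances add.
    se = math.sqrt(sum(variance(v) / len(v) for v in cells.values()))
    z = estimate / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, p_value


# Made-up data in which the two effects are purely additive (no interaction).
cells = {
    (0, 0): [0.0, 1.0] * 200,
    (1, 0): [1.0, 2.0] * 200,
    (0, 1): [2.0, 3.0] * 200,
    (1, 1): [3.0, 4.0] * 200,
}
print(interaction_test(cells))
```

An estimate near zero means the two effects simply add up; a significant non-zero estimate suggests the experiments should be analyzed together or scheduled on disjoint traffic.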

Prefer single, isolated experiments over this approach when:

  • Your user base is small (fewer than a few thousand active users) — you cannot afford the traffic dilution from overlapping experiments
  • The effects you care about are large (10%+ changes) — you don't need the statistical precision of large-scale experiments
  • You are testing a radical change that fundamentally alters the user experience (e.g., a completely new product) — the stable unit treatment value assumption (SUTVA) is likely violated
  • You cannot maintain persistent user identity (e.g., anonymous browsing with frequent cookie deletion)
  • Your organization lacks the engineering resources to build and maintain the experimentation infrastructure

Prefer this over observational methods (e.g., before-after comparisons, cohort analysis) when:

  • You need causal inference (not just correlation) to make product decisions
  • There are known or unknown confounders that bias observational comparisons
  • You want to detect small effects that would be swamped by temporal trends in observational data

Prefer observational methods over this when:

  • Randomization is impossible (e.g., policy changes that affect all users simultaneously)
  • The cost of running an experiment (engineering time, user experience risk) exceeds the expected value of the information gained
  • You are studying rare events or long-term effects that require years of observation

Limitations and failure modes

What breaks this method:

  1. Network effects and interference: When users interact with each other (e.g., social networks, marketplaces), the SUTVA assumption fails. A treatment that helps one user may harm another through competition or information sharing. The paper