Online controlled experiments at large scale — DoOperator Research

Authors	Ron Kohavi, Alex Deng, Brian Frasca, T. Walker, Ya Xu, Nils Pohlmann
Journal	Knowledge Discovery and Data Mining
Year	2013
DOI	10.1145/2487575.2488217
Citations	424

What Problem It Solves

This paper addresses the challenge of scaling online controlled experiments (A/B tests) from isolated, one-off studies to an organization-wide decision-making engine that can run hundreds of concurrent experiments across millions of users. As organizations grow and adopt agile development, they need to evaluate many product ideas simultaneously, often hundreds per quarter, but traditional experimental designs and statistical methods break down under the combinatorial complexity of overlapping experiments, massive data volumes, and the need for rapid, trustworthy decisions. Other approaches to product evaluation (focus groups, expert reviews, observational data analysis) cannot reliably establish causal relationships, while single-experiment A/B testing frameworks cannot handle the operational demands of concurrent experimentation at web scale. The paper synthesizes lessons from Microsoft's Bing, where over 200 experiments run concurrently across ~100 million monthly active users, and provides a framework for building the cultural, engineering, and statistical infrastructure needed to make experimentation a core organizational capability rather than a niche analytical tool.

How it works

The paper describes a complete experimentation ecosystem rather than a single statistical method. The core idea is to embed controlled experimentation into the product development lifecycle so that every feature change, from a color tweak to a ranking algorithm overhaul, is evaluated through a randomized experiment before full deployment.

The experimentation pipeline works as follows:

Experiment design and configuration: A product team defines a hypothesis, selects the treatment variant(s), specifies the target user population (e.g., US English desktop users), and sets the traffic allocation (typically 10-20% of eligible users split evenly between control and treatment). The system assigns users to variants using a deterministic hash of their user ID, ensuring consistent experience across sessions.
Overlapping experiment framework: Rather than running each experiment on disjoint user populations (which would waste traffic), Bing uses an overlapping experiment architecture similar to Google's. Users are simultaneously enrolled in multiple experiments across different "layers" (e.g., one layer for relevance ranking, another for UI changes, another for ads). Each layer has its own randomization key, and the layers are orthogonalized so that experiments in different layers do not systematically confound each other. This allows 90% of eligible users to participate in ~15 concurrent experiments.
Data collection and instrumentation: Every user interaction is logged — queries, clicks, page views, time on site, ad impressions, revenue events, etc. The logging infrastructure must handle billions of events per day with low latency and high reliability. The paper notes that a typical two-week experiment using 20% of traffic processes about 4TB of data.
Automated analysis (scorecards): At the end of the experiment (or continuously while it runs), the system computes the Overall Evaluation Criterion (OEC) and dozens of secondary metrics for each variant. The analysis uses a simple difference-in-means estimator:

\[ \hat{\tau} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}} \]

where \(\bar{Y}\) is the average of the metric (e.g., revenue per user) across all users in each group. The variance is estimated using the standard formula for two-sample t-tests, but with adjustments for the fact that users are clustered and metrics are often heavy-tailed.
Statistical significance testing: Each metric is tested at α=0.05 using a two-sided t-test. However, because hundreds of metrics are tested simultaneously, the paper explicitly warns about the multiple comparisons problem and recommends using false discovery rate (FDR) control or Bonferroni correction for the primary OEC metrics.
Alerting and monitoring: Because there are billions of possible site variants (5^15 ≈ 30 billion combinations), traditional testing and debugging is impossible. Instead, the system uses automated alerts that fire when:
- A metric shows a statistically significant degradation (e.g., page load time increases)
- An experiment interacts with another experiment (detected via pairwise interaction tests)
- The overall system metrics (e.g., overall revenue) deviate from expected ranges
Holdout group: 10% of users are placed in a permanent holdout group that receives no experimental treatments. This group serves as a baseline to measure the overall impact of the experimentation system itself and to detect systemic issues (e.g., if all experiments are degrading the user experience, the holdout group will show better metrics).
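
The assignment steps in the pipeline above (a deterministic hash of the user ID, with a separate randomization key per layer) can be sketched as follows. This is a minimal illustration, not Bing's implementation: the use of SHA-256, the 1,000-bucket scheme, the layer names, the experiment names, and the bucket ranges are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional, Tuple

NUM_BUCKETS = 1000  # assumed granularity: each bucket is 0.1% of traffic

# Hypothetical configuration: layer -> experiment -> [start, end) bucket range.
LAYERS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "ranking": {"rank_exp_1": (0, 200)},                   # 20% of traffic
    "ui": {"ui_exp_1": (0, 100), "ui_exp_2": (100, 300)},  # 10% and 20%
}


def bucket(user_id: str, layer: str) -> int:
    """Map a user to a stable bucket for one layer.

    The layer name is mixed into the hash input, so the same user lands in
    unrelated buckets in different layers.
    """
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign_in_layer(user_id: str, layer: str) -> Optional[Tuple[str, str]]:
    """Return (experiment, variant) for this layer, or None if not enrolled."""
    b = bucket(user_id, layer)
    for name, (start, end) in LAYERS[layer].items():
        if start <= b < end:
            # Lower half of the range is control, upper half is treatment.
            midpoint = (start + end) // 2
            return name, ("control" if b < midpoint else "treatment")
    return None


def assign_all(user_id: str) -> Dict[str, Optional[Tuple[str, str]]]:
    """A user can be enrolled in one experiment per layer at the same time."""
    return {layer: assign_in_layer(user_id, layer) for layer in LAYERS}


if __name__ == "__main__":
    print(assign_all("user-42"))
```

Because the assignment depends only on the user ID and the layer name, a returning user sees the same variants in every session without any stored state. Experiments within one layer occupy disjoint bucket ranges, while experiments in different layers overlap freely.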

Key statistical insight: The paper emphasizes that at scale, the primary challenge is not statistical power (which is abundant with millions of users) but rather:

False positives: With hundreds of experiments running concurrently and each experiment testing dozens of metrics, the expected number of false positives is large. The paper reports that, without correction, ~5% of metrics with no true effect will still appear significant by chance at α=0.05.
Effect size estimation: Even tiny effects (0.1-0.5% changes) can be statistically significant with large samples, but may not be practically meaningful. The paper advocates for focusing on the OEC and using confidence intervals rather than just p-values.
Interaction detection: With overlapping experiments, the number of experiment pairs that could interact (i.e., where the treatment effect of one depends on the other) grows quadratically with the number of experiments. The system runs pairwise interaction tests automatically.
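
To put rough numbers on the three points above, the sketch below computes the expected count of chance findings, a confidence interval around a difference in means, and the number of experiment pairs that pairwise interaction tests would have to cover. It is an illustration under simplifying assumptions: the interval uses a large-sample normal approximation with unpooled variances and omits the clustering and heavy-tail adjustments mentioned earlier, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant results when no test has a true effect."""
    return num_tests * alpha


def diff_in_means_ci(
    treatment: Sequence[float],
    control: Sequence[float],
    confidence: float = 0.95,
) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for mean(treatment) - mean(control).

    Assumes independent observations and a sample large enough for the
    normal approximation to hold.
    """
    estimate = mean(treatment) - mean(control)
    std_err = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


def experiment_pairs(num_experiments: int) -> int:
    """Number of distinct pairs of experiments, n * (n - 1) / 2."""
    return math.comb(num_experiments, 2)


if __name__ == "__main__":
    # 200 experiments with 50 metrics each (illustrative figures).
    print(expected_false_positives(200 * 50))
    print(diff_in_means_ci([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
    print(experiment_pairs(200))
```

Reporting the interval alongside the point estimate makes it easier to judge whether a statistically significant change is also large enough to matter.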

When to use it

Prefer this approach over ad-hoc decision making when:

You have at least thousands of active users (the paper's rough threshold)
You need to evaluate many product ideas simultaneously (tens to hundreds per quarter)
The effects you care about are small (1-5% changes in key metrics) and would be missed by qualitative methods
Your organization is willing to invest in the engineering infrastructure for logging, randomization, and automated analysis
You can define a clear OEC that captures long-term business value (not just short-term engagement)
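
To see why user counts and effect sizes appear together in the criteria above, the following sketch applies the standard textbook sample-size approximation for comparing two means, n ≈ 2(z_{1-α/2} + z_{power})² σ² / Δ² per variant. The formula is not taken from the paper summary, and the function name, the 80% power default, and the example inputs are assumptions.

```python
import math
from statistics import NormalDist


def users_per_variant(
    baseline_mean: float,
    std_dev: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed in each variant to detect a relative lift.

    Uses the two-sided, equal-variance normal approximation; real metrics
    with heavy tails may need more users than this suggests.
    """
    delta = baseline_mean * relative_lift          # absolute effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical metric whose standard deviation equals its mean.
    for lift in (0.01, 0.05, 0.10):
        print(f"{lift:.0%} lift: {users_per_variant(1.0, 1.0, lift)} users per variant")
```

The required sample grows with the inverse square of the effect size, so under these assumptions detecting a 1% lift takes about 25 times as many users as detecting a 5% lift.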

Prefer this over single, isolated A/B tests when:

You have multiple teams working on different parts of the product simultaneously
You want to maximize the learning velocity per user (overlapping experiments use traffic more efficiently)
You need to detect interactions between features (e.g., a UI change that affects how users respond to a ranking change)
You want to build an organizational culture of experimentation rather than occasional hypothesis testing

Prefer single, isolated experiments over this approach when:

Your user base is small (fewer than a few thousand active users) — you cannot afford the traffic dilution from overlapping experiments
The effects you care about are large (10%+ changes) — you don't need the statistical precision of large-scale experiments
You are testing a radical change that fundamentally alters the user experience (e.g., a completely new product), where the stable unit treatment value assumption (SUTVA) is likely violated
You cannot maintain persistent user identity (e.g., anonymous browsing with frequent cookie deletion)
Your organization lacks the engineering resources to build and maintain the experimentation infrastructure

Prefer this over observational methods (e.g., before-after comparisons, cohort analysis) when:

You need causal inference (not just correlation) to make product decisions
There are known or unknown confounders that bias observational comparisons
You want to detect small effects that would be swamped by temporal trends in observational data

Prefer observational methods over this when:

Randomization is impossible (e.g., policy changes that affect all users simultaneously)
The cost of running an experiment (engineering time, user experience risk) exceeds the expected value of the information gained
You are studying rare events or long-term effects that require years of observation

Limitations and failure modes

What breaks this method:

Network effects and interference: When users interact with each other (e.g., social networks, marketplaces), the SUTVA assumption fails. A treatment that helps one user may harm another through competition or information sharing. The paper
