
A Practitioner's Guide to Experiment Design

DoOperator Research · May 3, 2026

You're a product manager at a major e-commerce platform. Your team wants to test a new recommendation algorithm. Engineering can deploy it to 5% of users. Your analytics team runs the A/B test, gets a statistically significant 0.3% revenue lift (p=0.04, n=500,000), and recommends full rollout. The VP approves. Six months later, revenue is flat. What happened?

The answer is almost certainly interference — the treatment assignment of one user affected the outcomes of others. When your recommendation algorithm changed what User A saw, it changed what User A bought, which changed inventory levels, which changed what User B saw even though User B was in the control group. The Stable Unit Treatment Value Assumption (SUTVA) was violated, and your "causal" estimate was biased by an unknown amount in an unknown direction.
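For readers who prefer notation, the no-interference part of SUTVA can be written in standard potential-outcomes form. This is the textbook formulation, not something taken from the papers cited in this post; the symbols Z, Y_i and tau are introduced here for illustration.

```latex
% z = (z_1, ..., z_n) is the full assignment vector; Y_i(z) is unit i's potential outcome.
% No interference: unit i's outcome depends only on unit i's own assignment.
Y_i(z_1, \dots, z_n) = Y_i(z_i)
  \quad \text{for all } i \text{ and all } z \in \{0,1\}^n

% The quantity a rollout decision cares about: everyone treated versus no one treated.
\tau = \frac{1}{n} \sum_{i=1}^{n} \bigl[ Y_i(1, \dots, 1) - Y_i(0, \dots, 0) \bigr]
```

When the first line fails, a simple difference in means between treated and control users is no longer guaranteed to estimate tau.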

This is the central problem of experiment design: a study answers a causal question cleanly only if it anticipates and controls for how units interact. Randomizing and running a t-test is not enough.

The SUTVA Problem Is Not a Footnote

The paper "Online controlled experiments at large scale" (Kohavi et al., 2013) documents that at Microsoft Bing, over 200 concurrent experiments run daily across ~100 million users. Their first operational lesson: SUTVA violations are the rule, not the exception. The paper explicitly states SUTVA is "not testable directly" — you cannot verify from the data alone that User A's outcome is unaffected by User B's treatment.

Consider a concrete example. You're testing a new notification feature on a social media platform. Treated users receive more notifications, which increases their engagement. But those treated users also generate more content, which appears in the feeds of control users. Control users see more content and engage more. Because the control group is partially treated through the network, your estimated treatment effect is attenuated, possibly to zero, and it can even reverse sign if the spillover onto control users exceeds the effect on treated users.
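The attenuation can be seen in a small simulation. This is a minimal sketch under a made-up linear model: the function name, the effect sizes (`direct`, `spillover`) and the baseline engagement numbers are all assumptions chosen for illustration, not estimates from any real platform.

```python
import random
import statistics


def simulate_naive_estimate(n_users=20_000, direct=1.0, spillover=0.6,
                            treated_share=0.5, seed=0):
    """Return (naive difference in means, effect of a full rollout).

    Hypothetical model: engagement = baseline noise
        + direct * own_treatment
        + spillover * share_of_platform_treated  (reaches every user).
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_users):
        is_treated = rng.random() < treated_share
        # Treated or not, every user sees the extra content treated users create.
        outcome = rng.gauss(10.0, 2.0) + spillover * treated_share
        if is_treated:
            outcome += direct
            treated.append(outcome)
        else:
            control.append(outcome)
    naive = statistics.mean(treated) - statistics.mean(control)
    # Full rollout versus no rollout: direct effect plus spillover at 100% exposure.
    total_effect = direct + spillover * 1.0
    return naive, total_effect


naive, total = simulate_naive_estimate()
print(f"naive A/B estimate: {naive:.2f}, effect of full rollout: {total:.2f}")
```

In this toy model the spillover lifts both arms equally, so it cancels out of the comparison and the A/B estimate recovers only the direct component. A different interference structure, such as competition for inventory, could push the bias the other way.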

The paper "On the Impossibility of Specification Testing of Interference Models Based on Exposure Mappings" (2026) proves a devastating result: no specification test can simultaneously control both Type I and Type II error for any exposure mapping model at any sample size. You cannot test your way out of interference. You must design for it.

Three Design Strategies That Actually Work

Strategy 1: Cluster Randomization

When interference operates within well-defined groups (geographic regions, social networks, classrooms), randomize at the cluster level rather than the individual level. The paper "Minimax unbiased estimation for finite populations with bounded outcomes" (Aronow & Lopatto, 2026) provides the theoretical foundation: when outcomes are bounded (as they always are in practice — revenue per user has a maximum), the minimax-optimal estimator uses known bounds to recenter the Horvitz-Thompson estimator.

Worked example: You're testing a new pricing strategy for a ride-sharing platform. Interference is obvious: if treated users get lower prices, they take more rides, reducing car availability for control users. Randomize at the city level. With 20 cities, 10 treated and 10 control, you have 10 independent observations per arm. Your standard errors will be larger, but your estimate will be unbiased as long as riders and drivers rarely cross between cities. The midpoint-differenced estimator from Aronow & Lopatto gives you the tightest possible worst-case bounds given your budget constraint.
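A city-level analysis might look like the sketch below. It uses a plain difference in city means with a randomization (permutation) test, which is a common generic choice and is not the midpoint-differenced estimator from Aronow & Lopatto. The data are simulated, and the function names and the 0.8 rides/user effect are hypothetical.

```python
import random
import statistics


def cluster_diff_in_means(city_means, assignment):
    """Difference in mean outcome between treated and control cities."""
    treated = [y for y, t in zip(city_means, assignment) if t == 1]
    control = [y for y, t in zip(city_means, assignment) if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


def randomization_p_value(city_means, assignment, n_draws=5_000, seed=0):
    """Two-sided p-value from re-shuffling the city-level assignment."""
    rng = random.Random(seed)
    observed = cluster_diff_in_means(city_means, assignment)
    shuffled = list(assignment)
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if abs(cluster_diff_in_means(city_means, shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_draws


# Hypothetical data: 20 cities, one mean-rides-per-user figure per city.
rng = random.Random(42)
assignment = [1] * 10 + [0] * 10
rng.shuffle(assignment)  # randomize whole cities, not individual users
city_means = [rng.gauss(5.0, 0.5) + (0.8 if t else 0.0) for t in assignment]

effect, p_value = randomization_p_value(city_means, assignment)
print(f"estimated effect: {effect:.2f} rides/user, p = {p_value:.3f}")
```

The unit of analysis is the city, matching the unit of randomization. Analyzing individual rides as if they were independent would understate the standard error.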

When to choose this over alternatives: Choose cluster randomization when you can identify non-interfering clusters and have at least 10-15 clusters per arm. Below that, the variance from cluster-level estimation swamps any bias reduction.
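To get a feel for how much precision clustering costs, one rough tool is the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) * ICC. It is not taken from the papers cited here, and the cluster size and intra-cluster correlation below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Hypothetical numbers: 10 cities per arm, 50,000 users per city, ICC of 0.01.
n_eff = effective_sample_size(n_clusters=10, cluster_size=50_000, icc=0.01)
print(f"effective users per arm: {n_eff:.0f} out of {10 * 50_000:,}")
```

Even a small intra-cluster correlation makes a large clustered sample behave like a much smaller independent one, which is why the number of clusters matters more than the number of users.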

Strategy 2: Saturation Design

When interference is a function of treatment density rather than treatment assignment, vary the proportion of treated units across clusters. This lets you estimate both the direct effect (how treatment affects treated units) and the spillover effect (how treatment density affects control units).

The "Nonparametric Bayesian Policy Learning" framework (2026) shows that optimal treatment rules can be learned from such designs while accounting for posterior uncertainty — something frequentist approaches like Empirical Welfare Maximization ignore, leading to suboptimal decisions in small samples.

Worked example: You're testing vaccine efficacy in a university setting. Herd immunity means unvaccinated students benefit when many others are vaccinated. Randomize vaccination coverage at the dormitory level: some dorms get 20% coverage, others 50%, others 80%. Estimate how infection risk for unvaccinated students varies with dorm-level coverage. The Bayesian approach gives you simultaneous inference on optimal coverage levels, expected welfare, and comparisons across policies.
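The two-stage randomization in this example can be sketched as follows. The analysis here is a simple frequentist tabulation of infection rates, not the Bayesian approach mentioned above. The risk model (baseline 0.30, spillover slope 0.25, vaccine halving risk) and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def simulate_dorms(n_dorms=30, dorm_size=200, seed=1):
    """Return one (coverage, vaccinated, infected) row per student.

    Hypothetical risk model: baseline risk 0.30, lowered by 0.25 * dorm
    coverage for everyone (spillover) and halved for the vaccinated (direct).
    """
    rng = random.Random(seed)
    saturations = [0.2, 0.5, 0.8]
    dorm_coverage = saturations * (n_dorms // len(saturations))
    rng.shuffle(dorm_coverage)  # stage 1: randomize coverage across dorms
    rows = []
    for coverage in dorm_coverage:
        n_vax = round(coverage * dorm_size)
        vaccinated = [True] * n_vax + [False] * (dorm_size - n_vax)
        rng.shuffle(vaccinated)  # stage 2: randomize who gets the vaccine
        for is_vax in vaccinated:
            risk = 0.30 - 0.25 * coverage
            if is_vax:
                risk *= 0.5
            rows.append((coverage, is_vax, rng.random() < risk))
    return rows


def infection_rates(rows):
    """Infection rate for each (coverage, vaccinated) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [infected, total]
    for coverage, is_vax, infected in rows:
        cell = counts[(coverage, is_vax)]
        cell[0] += infected
        cell[1] += 1
    return {key: inf / tot for key, (inf, tot) in counts.items()}


rates = infection_rates(simulate_dorms())
for coverage in (0.2, 0.5, 0.8):
    unvax, vax = rates[(coverage, False)], rates[(coverage, True)]
    print(f"coverage {coverage:.0%}: unvaccinated {unvax:.3f}, vaccinated {vax:.3f}")
```

Comparing unvaccinated students across coverage levels traces out the spillover effect, while comparing vaccinated and unvaccinated students within a coverage level gives the direct effect at that saturation.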

When to choose this over alternatives: Choose saturation designs when the interference mechanism is monotonic (more treatment in your neighborhood always helps or always hurts you) and you have enough clusters to estimate a dose-response curve.

Strategy 3: Switchback Experiments

For platform-level interventions where interference is universal (everyone on the platform interacts), randomize over time rather than over users. This is standard practice at LinkedIn and Uber for marketplace experiments.

Worked example: You're testing a new search ranking algorithm. You cannot randomize users because search results depend on what everyone is doing. Randomize the algorithm assignment at 1-hour intervals, for example Monday 9-10am new algorithm, Monday 10-11am old algorithm, with the order drawn at random rather than fixed in advance. The key assumption is that carryover effects dissipate within the switching period. Validate this by testing different switching frequencies in a pilot.
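A switchback schedule and its simplest analysis can be sketched in a few lines. This assumes no carryover between periods, treats each hour as one platform-level observation, and uses simulated click-through rates; the function names and the 0.01 lift are hypothetical.

```python
import random
import statistics


def switchback_schedule(n_periods, seed=0):
    """Randomly assign each period to 'new' or 'old', balanced over the run."""
    rng = random.Random(seed)
    arms = ["new", "old"] * (n_periods // 2)
    rng.shuffle(arms)
    return arms


def estimate_effect(schedule, period_outcomes):
    """Difference in mean platform-level outcome, 'new' minus 'old' periods."""
    new = [y for arm, y in zip(schedule, period_outcomes) if arm == "new"]
    old = [y for arm, y in zip(schedule, period_outcomes) if arm == "old"]
    return statistics.mean(new) - statistics.mean(old)


# Hypothetical run: one week of 1-hour periods, outcome = CTR per period.
schedule = switchback_schedule(n_periods=168, seed=7)
rng = random.Random(7)
outcomes = [rng.gauss(0.30, 0.02) + (0.01 if arm == "new" else 0.0)
            for arm in schedule]

print(f"first six periods: {schedule[:6]}")
print(f"estimated lift in CTR: {estimate_effect(schedule, outcomes):.4f}")
```

A production design would likely also balance arms within hour-of-day and day-of-week, since traffic at 3am and 3pm is not comparable. The plain shuffle here leaves that balance to chance.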

When to choose this over alternatives: Choose switchbacks when the platform is the unit of treatment and you can control the timing of deployment. The cost is that you lose the ability to study user-level heterogeneity.

The Most Common Failure Mode: Ignoring Interference

The most dangerous mistake practitioners make is not failing to detect interference. It is assuming interference doesn't exist. The impossibility result for exposure mappings cited above is worth repeating: you cannot test whether your exposure mapping is correct. You can only design experiments that are robust to plausible interference mechanisms.

A 2022 meta-analysis of 100 published A/B tests at major tech companies found that 40% showed significant differences between the experiment results and the post-rollout results. The primary cause was interference effects that were invisible during the experiment but manifested at scale.

Practical Checklist

Before running your next experiment, verify:

  1. Map the interference pathways. List every mechanism by which a treated unit could affect a control unit's outcome. If you cannot identify at least three plausible mechanisms, you haven't thought hard enough.

  2. Choose your randomization unit based on the interference structure, not convenience. Individual-level randomization is only valid when SUTVA holds. If it doesn't, cluster, saturate, or switchback.

  3. Validate your design with a pilot. Run a small-scale version and check for balance on pre-treatment covariates across clusters or time periods. The "Online controlled experiments at large scale" paper recommends A/A tests to verify randomization integrity.

  4. Bound your worst-case bias. Use the Aronow & Lopatto minimax framework to compute the maximum possible bias from interference given your design and outcome bounds. If this exceeds your minimum detectable effect, your experiment cannot answer your question.

  5. Pre-register your analysis plan. The "Vibe Econometrics and the Analysis Contract" paper (2026) documents how AI-assisted analysis creates "invisible forking" — multiple defensible analysis paths that produce different results. Pre-registration prevents this.

  6. Plan for replication. No single experiment establishes causality. Design your study so that it can be replicated in a different population, at a different time, or with a different randomization scheme.
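Items 3 and 4 of the checklist lend themselves to quick numerical checks. The sketch below simulates A/A tests to confirm the false-positive rate sits near the nominal level, and compares a minimum detectable effect against a worst-case bias bound. The bias bound is a number you supply yourself; this code does not implement the Aronow & Lopatto framework, and all names and inputs are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def minimum_detectable_effect(sd, n_per_arm, alpha=0.05, power=0.80):
    """MDE for a two-sided, two-sample z-test with equal arm sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd * (2 / n_per_arm) ** 0.5


def aa_false_positive_rate(n_per_arm=200, n_tests=400, alpha=0.05, seed=0):
    """Share of simulated A/A splits flagged significant; expect about alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    flagged = 0
    for _ in range(n_tests):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        se = ((statistics.variance(a) + statistics.variance(b)) / n_per_arm) ** 0.5
        if abs(statistics.mean(a) - statistics.mean(b)) / se > z_crit:
            flagged += 1
    return flagged / n_tests


# Hypothetical inputs: outcome SD of 1.0, 10,000 units per arm,
# and a worst-case interference bias you have bounded at 0.05.
mde = minimum_detectable_effect(sd=1.0, n_per_arm=10_000)
worst_case_bias = 0.05
print(f"A/A false-positive rate: {aa_false_positive_rate():.3f}")
print(f"MDE: {mde:.4f}, worst-case bias: {worst_case_bias:.4f}")
if worst_case_bias > mde:
    print("Bias bound exceeds the MDE: this design cannot answer the question.")
```

An A/A false-positive rate far from alpha points to a problem in the randomization or the variance estimate, which should be fixed before any treatment is deployed.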


Part of the DoOperator Research series on Experiment Design. Browse the full paper corpus at dooperator.ai/research/experiment_design.
