A Practitioner's Guide to Sequential Decision-Making

Your team has been running A/B tests for six months on a recommendation system. Each week, you launch a new feature variant, monitor the p-value dashboard, and stop when the p-value drops below 0.05. Your boss wants to know: how many of your "significant" results are real? The answer, from Johari et al. (2017), is that under continuous monitoring with traditional p-values, your Type I error rate has likely been 20–30%, not 5%. You've been stopping on data-dependent signals while relying on a guarantee that only holds for a single, pre-planned look at the data.

This is the core challenge of sequential decision-making: when your data arrives over time and you act on it as it arrives, the statistical guarantees you learned in graduate school break. This post covers the three frameworks you need—Thompson sampling, information-directed sampling, and always-valid inference—and when to use each.

The Three Pillars of Sequential Decisions

Sequential decision problems fall into three broad classes, each with different mathematical structure and different failure modes:

Bandit problems: You choose actions, observe rewards, and want to maximize cumulative reward. Exploration and exploitation are in tension.
A/B testing with optional stopping: You compare two treatments and want valid inference regardless of when you stop.
Structured bandits: Actions share information (e.g., through features or graph structure), and you must exploit that structure.

The mistake practitioners make is treating these as interchangeable. They are not.

Thompson Sampling: When You Need to Explore

Thompson sampling (Russo et al., 2018) solves the exploration-exploitation tradeoff by maintaining a posterior distribution over reward parameters and sampling actions according to the probability they are optimal. The key theoretical guarantee: for Bernoulli bandits, Thompson sampling achieves expected regret bounded by $O(\sqrt{KT\log T})$ where $K$ is the number of arms, matching the minimax lower bound up to logarithmic factors.

Worked example: You're optimizing click-through rates for 10 ad creatives. You give each creative a Beta(1,1) (uniform) prior on its click-through rate. After 1000 impressions per creative, creative A has 44 clicks and creative B has 37, so the posteriors are Beta(45, 957) for A and Beta(38, 964) for B. Thompson sampling draws one sample from each posterior (say 0.047 for A and 0.039 for B) and serves the creative with the highest sample. Over 10,000 rounds, this procedure automatically allocates more traffic to promising creatives while still exploring alternatives.
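The loop in the worked example can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the three click-through rates, the function names `thompson_step` and `run_thompson`, and the fixed seed are all assumptions made for the example, and it uses only the standard library.

```python
import random

def thompson_step(posteriors, rng):
    """Draw one sample per arm from its Beta posterior; return the argmax arm."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)

def run_thompson(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return (posteriors, pulls)."""
    rng = random.Random(seed)
    posteriors = [[1.0, 1.0] for _ in true_rates]  # Beta(1, 1) prior per arm
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = thompson_step(posteriors, rng)
        clicked = rng.random() < true_rates[arm]    # simulated Bernoulli reward
        posteriors[arm][0 if clicked else 1] += 1   # conjugate Beta update
        pulls[arm] += 1
    return posteriors, pulls

# Hypothetical true click-through rates for three creatives.
rates = [0.045, 0.038, 0.020]
posteriors, pulls = run_thompson(rates, n_rounds=10_000)
print("impressions per creative:", pulls)
```

The traffic split is never set by hand. It emerges from how often each arm's posterior sample happens to be the largest.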

When to choose Thompson sampling over UCB: Thompson sampling naturally handles correlated arms (e.g., through hierarchical priors) and generalizes to complex reward structures. UCB requires problem-specific confidence bounds that become computationally prohibitive for structured problems (Russo et al., 2018, Section 4.3). Choose Thompson sampling when you have prior knowledge you can encode as a Bayesian model, or when the action space has structure (e.g., linear rewards, combinatorial actions).

The most common failure mode: Using a misspecified prior. If your prior is too concentrated (e.g., Beta(100, 900) when true rates vary from 1% to 50%), Thompson sampling will underexplore and converge to a suboptimal arm. Russo et al. (2018, Section 5.2) show that regret bounds require the prior to be proper and the model to be correctly specified. Always run prior predictive checks and sensitivity analyses.
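One lightweight version of the prior check is to draw click rates from each candidate prior and look at the range they imply. The sketch below does this for the uniform prior and for the concentrated Beta(100, 900) prior mentioned above. The helper name `prior_rate_quantiles`, the number of draws, and the chosen quantiles are illustrative assumptions.

```python
import random

def prior_rate_quantiles(alpha, beta, n_draws=10_000, seed=0):
    """Return the (1%, 50%, 99%) sample quantiles of rates drawn from Beta(alpha, beta)."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    pick = lambda q: draws[int(q * (n_draws - 1))]
    return pick(0.01), pick(0.50), pick(0.99)

for name, (a, b) in {"Beta(1, 1)": (1, 1), "Beta(100, 900)": (100, 900)}.items():
    lo, mid, hi = prior_rate_quantiles(a, b)
    print(f"{name}: 1%={lo:.3f}  median={mid:.3f}  99%={hi:.3f}")
```

The concentrated prior places nearly all of its mass between roughly 8% and 12%, so a creative whose true rate is 1% or 50% is effectively ruled out before any data arrives.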

Information-Directed Sampling: When Actions Share Structure

Thompson sampling treats each action's information value implicitly—it explores because the posterior is uncertain. Information-directed sampling (Russo & Van Roy, 2014) makes this explicit by quantifying the information gain from each action about the identity of the optimal action.

The key result: IDS achieves regret bounds that depend on the information ratio $\Gamma_t$, which measures the tradeoff between regret incurred and information gained. For linear bandits, IDS achieves regret $O(d\sqrt{T})$ where $d$ is the feature dimension, matching the optimal rate (Russo & Van Roy, 2014, Theorem 3). Critically, when actions are correlated, as in linear bandits or combinatorial problems, IDS can dramatically outperform Thompson sampling because it recognizes that sampling one action teaches you about many others.
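For reference, the information ratio can be written out as follows. The symbols $\pi$, $\Delta_t$, $g_t$, $A^*$ and $\bar{\Gamma}$ are notation introduced here for illustration and do not appear elsewhere in this post. $\pi$ is a distribution over actions, $\Delta_t(\pi)$ is the expected single-round regret of sampling an action from $\pi$ under the current posterior, and $g_t(\pi)$ is the expected information gained about the optimal action $A^*$.

$$\Gamma_t(\pi) = \frac{\Delta_t(\pi)^2}{g_t(\pi)}, \qquad \pi_t \in \arg\min_{\pi} \Gamma_t(\pi)$$

If the minimized ratio satisfies $\Gamma_t \leq \bar{\Gamma}$ in every round, the general bound takes the form

$$\mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\bar{\Gamma}\, H(A^*)\, T},$$

where $H(A^*)$ is the entropy of the prior distribution of the optimal action. A small ratio means little regret is paid per unit of information, which is why structure that makes each observation more informative tightens the bound.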

Worked example: You're optimizing a news recommendation system with 10,000 articles, each described by a 50-dimensional feature vector (topic, recency, author popularity). Thompson sampling would maintain a posterior over 50 parameters and sample from it—but it doesn't explicitly account for how sampling article A informs beliefs about article B (which shares similar features). IDS computes, for each article, the expected regret of recommending it versus the best alternative, divided by the information it provides about the optimal article. This ratio naturally favors articles that resolve uncertainty across many similar articles simultaneously.
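A toy version of this computation is sketched below, with several simplifications that should be read as assumptions rather than as the paper's procedure. It uses a variance-based proxy for information gain instead of mutual information. It approximates the posterior with Monte Carlo draws. It searches mixtures of two actions on a grid. It stands in four hypothetical articles with two features each for the 10,000-article problem. The function name `ids_action_distribution` and the Gaussian posterior over feature weights are made up for the example.

```python
import random

def ids_action_distribution(reward_samples, n_grid=101):
    """Variance-based IDS computed from posterior samples.

    reward_samples[m][a] is the mean reward of action a under posterior draw m.
    Returns ((a, b, q), ratio): play a with probability q, otherwise play b.
    """
    n_samples, n_actions = len(reward_samples), len(reward_samples[0])
    actions = range(n_actions)
    mean = [sum(s[a] for s in reward_samples) / n_samples for a in actions]
    best = [max(actions, key=s.__getitem__) for s in reward_samples]
    best_value = sum(max(s) for s in reward_samples) / n_samples
    regret = [best_value - m for m in mean]  # expected regret of each action

    # Information proxy: how much E[reward_a | identity of the optimal action]
    # varies across the possible identities of the optimal action.
    gain = [0.0] * n_actions
    for star in set(best):
        group = [s for s, b in zip(reward_samples, best) if b == star]
        weight = len(group) / n_samples
        for a in actions:
            cond_mean = sum(s[a] for s in group) / len(group)
            gain[a] += weight * (cond_mean - mean[a]) ** 2

    # Fall back to the greedy action if no action carries information.
    greedy = min(actions, key=regret.__getitem__)
    choice, best_ratio = (greedy, greedy, 1.0), float("inf")
    for a in actions:
        for b in range(a, n_actions):
            for i in range(n_grid):
                q = i / (n_grid - 1)
                info = q * gain[a] + (1 - q) * gain[b]
                if info <= 1e-12:
                    continue
                ratio = (q * regret[a] + (1 - q) * regret[b]) ** 2 / info
                if ratio < best_ratio:
                    choice, best_ratio = (a, b, q), ratio
    return choice, best_ratio

rng = random.Random(0)
features = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.2, 0.1)]  # hypothetical articles
# Hypothetical independent Gaussian posterior over the two feature weights.
theta_draws = [(rng.gauss(0.5, 0.3), rng.gauss(0.4, 0.3)) for _ in range(2000)]
samples = [[x0 * t0 + x1 * t1 for x0, x1 in features] for t0, t1 in theta_draws]
(a, b, q), ratio = ids_action_distribution(samples)
print(f"play article {a} w.p. {q:.2f}, else article {b}; ratio={ratio:.3f}")
```

Because every article's reward is a function of the same two weights, observing one article shifts the conditional means of the others, and that shared movement is what the `gain` term picks up.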

When to choose IDS over Thompson sampling: When your action space has exploitable structure—linear rewards, graph-structured arms, or combinatorial actions. Russo & Van Roy (2014, Section 5) show IDS achieves state-of-the-art performance on linear bandit benchmarks where Thompson sampling underperforms due to inefficient exploration of the feature space.

The most common failure mode: Computing the information ratio requires solving an optimization problem at each step. For large action spaces, this becomes computationally intractable. The paper provides approximations (Section 4.2), but these can degrade performance. Always benchmark approximation quality against exact computation on a small subset of actions before scaling.

Always-Valid Inference: When You Need Valid p-values

The previous two methods optimize cumulative reward. But sometimes you need valid inference—a p-value or confidence interval that you can trust regardless of when you stop the experiment. This is the problem Johari et al. (2017) solve with the mixture sequential probability ratio test (mSPRT).

The guarantee: the mSPRT produces an "always valid" p-value $p_t$ such that, under the null hypothesis, $\mathbb{P}(p_\tau \leq \alpha) \leq \alpha$ for any stopping time $\tau$ that depends on the data. This holds even if you peek every hour and stop the moment $p_t < 0.05$. The cost is a modest increase in expected sample size: for a 1% treatment effect at 80% power, the mSPRT requires about 15% more samples on average than a fixed-horizon test (Johari et al., 2017, Section 5.2).
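To make the statistic concrete, here is its form in the simplest setting. This sketch assumes i.i.d. normal observations $Y_1, Y_2, \dots$ with known variance $\sigma^2$, a null value $\theta_0$ for the mean, and a $N(\theta_0, \gamma^2)$ mixing distribution. The symbols $\sigma^2$, $\theta_0$, $\gamma^2$, $\bar{Y}_n$ and $\Lambda_n$ are introduced here for illustration.

$$\Lambda_n = \sqrt{\frac{\sigma^2}{\sigma^2 + n\gamma^2}} \, \exp\left( \frac{n^2 \gamma^2 (\bar{Y}_n - \theta_0)^2}{2\sigma^2 (\sigma^2 + n\gamma^2)} \right), \qquad p_n = \min\left( p_{n-1}, \frac{1}{\Lambda_n} \right), \quad p_0 = 1.$$

Under the null, $\Lambda_n$ is a nonnegative martingale with mean 1, so Ville's inequality gives $\mathbb{P}(\sup_n \Lambda_n \geq 1/\alpha) \leq \alpha$. That single inequality is what makes stopping at any data-dependent time safe.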

Worked example: You're testing a new checkout flow against the current one. You plan to run for two weeks but promise stakeholders you'll stop early if results are clear. With traditional methods, stopping at day 3 because $p = 0.04$ is exactly the kind of peeking that inflates your Type I error far beyond the nominal 5%, into the 20–30% range cited above. With the mSPRT, you compute the mixture likelihood ratio using a normal prior on the treatment effect (mean 0, variance scaled to the expected effect size). You stop when the mixture likelihood ratio exceeds $1/\alpha = 20$, or equivalently when the always-valid p-value falls below 0.05, which controls Type I error at 5%. This demands stronger evidence at any single look than the traditional $p < 0.05$ cutoff, but your inference remains valid.
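The contrast in this example can be checked with a small A/A simulation: no true effect, a peek after every observation, and a stop at the first p-value at or below 0.05. The sketch assumes unit-variance normal differences with known variance and a mixing variance of 0.01. Those settings, along with the function names and experiment sizes, are illustrative choices and not values from Johari et al. (2017).

```python
import math
import random

def msprt_p_values(diffs, var_diff, mix_var):
    """Always-valid p-values for H0: mean(diffs) == 0 with known variance var_diff,
    using a N(0, mix_var) mixing distribution over the true mean difference."""
    p, total, out = 1.0, 0.0, []
    for n, d in enumerate(diffs, start=1):
        total += d
        log_lr = (0.5 * math.log(var_diff / (var_diff + n * mix_var))
                  + mix_var * total ** 2 / (2 * var_diff * (var_diff + n * mix_var)))
        p = min(p, math.exp(-log_lr))  # running minimum keeps p non-increasing
        out.append(p)
    return out

def naive_p_values(diffs, var_diff):
    """Two-sided fixed-horizon z-test p-value, recomputed after every observation."""
    total, out = 0.0, []
    for n, d in enumerate(diffs, start=1):
        total += d
        z = total / math.sqrt(n * var_diff)
        out.append(math.erfc(abs(z) / math.sqrt(2)))
    return out

def false_positive_rate(p_value_fn, n_experiments=400, n_obs=500, alpha=0.05, seed=0):
    """Share of A/A experiments in which peeking ever sees p <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # no true effect
        if min(p_value_fn(diffs)) <= alpha:
            hits += 1
    return hits / n_experiments

naive = false_positive_rate(lambda d: naive_p_values(d, 1.0))
valid = false_positive_rate(lambda d: msprt_p_values(d, 1.0, mix_var=0.01))
print(f"false positives with peeking: naive={naive:.2f}, mSPRT={valid:.2f}")
```

The naive rate should land far above 5%, while the mSPRT rate should stay near or below it, up to simulation noise. Rerunning with other values of `mix_var` is a quick way to see how the mixing distribution trades early stopping against sensitivity.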

When to choose always-valid inference over bandit methods: When the primary goal is hypothesis testing (not reward maximization) and stakeholders will monitor results in real-time. Bandit methods optimize for cumulative reward but don't provide valid p-values for the final comparison. If your boss asks "is the treatment significantly better?" after the experiment, you need always-valid inference.

The most common failure mode: Choosing the wrong mixing distribution. The mSPRT requires specifying a prior over effect sizes. Johari et al. (2017, Section 4.2) show that a normal prior with variance matching the expected effect size is near-optimal, but if the true effect is much larger than expected, the test loses power. Always run power simulations under plausible effect sizes before deploying.

Practical Checklist

Before using any sequential decision-making method in a real study, verify:

Is the stopping rule specified in advance? If you plan to stop based on data-dependent criteria (peeking, early stopping for efficacy or futility), you must use always-valid inference. Standard p-values will not protect you.
Is the reward structure stationary? Thompson sampling and IDS assume fixed reward distributions. If user behavior drifts over time (e.g., seasonal effects, changing content), these methods can fail catastrophically. Use change-point detection on held-out data before deploying.
Is the action space structured? If actions share features or graph structure, IDS can substantially outperform Thompson sampling. If actions are independent (e.g., 10 unrelated ad creatives), Thompson sampling is simpler and nearly optimal.
Have you validated the prior? For Bayesian methods, run prior predictive checks: simulate data from the prior and verify the implied reward distributions are plausible. A prior that puts 99% mass on unrealistic effect sizes will ruin exploration.
What is the computational budget? Thompson sampling requires one posterior draw per action per round. IDS requires solving an optimization problem each round. Always-valid inference requires updating a likelihood ratio. If you must make millions of decisions per second, Thompson sampling with conjugate priors is likely the only feasible bandit option.
Are you optimizing for cumulative reward or final inference? Bandit methods (Thompson sampling, IDS) optimize cumulative reward but don't provide valid final p-values. Always-valid inference provides valid p-values but doesn't optimize reward. If you need both, run a two-stage design: explore with Thompson sampling, then run a confirmatory test with always-valid inference on a holdout set.

Part of the DoOperator Research series on Sequential Decision-Making. Browse the full paper corpus at dooperator.ai/research/sequential_decisions.