← Blog
synthesisSequential Decisions

Bandits and Sequential Decisions: When to Explore and When to Exploit

DoOperator Research · May 18, 2026

Bandits and Sequential Decisions: When to Explore and When to Exploit

Every time you scroll through a news feed, click a recommended video, or receive a targeted ad, you're participating in a sequential decision problem. The same math governs clinical trial designs that allocate patients to treatment arms, A/B testing platforms that optimize conversion rates, and reinforcement learning agents that learn to play Atari games. At the heart of all these systems lies a single, deceptively simple question: should I try something new, or stick with what I know works?

This is the exploration-exploitation tradeoff, and the multi-armed bandit framework is the cleanest mathematical abstraction of it.

The Multi-Armed Bandit Problem

Imagine standing in front of a row of slot machines (one-armed bandits). Each machine has an unknown probability of paying out. You have a fixed number of pulls. Do you pull the machine that's paid out most often so far (exploit), or do you try a new machine that might be even better (explore)? The optimal strategy balances gathering information about uncertain arms against maximizing immediate reward.

Formally, at each time step tt, you choose an arm ata_t from KK options, observe reward rtPar_t \sim P_a, and update your beliefs. Your goal is to minimize regret — the difference between the cumulative reward you earned and the reward you would have earned by always pulling the best arm. The algorithms below represent the major families of solutions, each with distinct assumptions and guarantees.

1. ε-Greedy: The Simple Baseline

The simplest approach: with probability ϵ\epsilon, explore uniformly at random; otherwise, exploit the current best arm. It's trivial to implement, requires no distributional assumptions, and works reasonably well when the horizon is short.

When it works best: As a baseline for comparison, or when you need a quick, interpretable solution and can tolerate linear regret. The fixed exploration budget wastes pulls on clearly suboptimal arms — you'll never stop exploring entirely, which means regret grows linearly with time. For most production systems, ε-greedy is a starting point, not a final solution.

2. UCB (Upper Confidence Bound): Optimism in the Face of Uncertainty

UCB takes a radically different approach: "Be optimistic about the unknown." For each arm, UCB constructs an upper confidence bound around its estimated mean reward. At each step, pull the arm with the highest upper bound. This naturally balances exploration and exploitation — uncertain arms get pulled because their confidence intervals are wide, but as data accumulates, their bounds tighten.

The canonical UCB1 algorithm achieves minimax-optimal regret — no algorithm can guarantee better worst-case performance. In practice, UCB works exceptionally well when rewards are stationary and you have a finite, known number of arms.

When it works best: Recommendation systems with stable user preferences, A/B testing with fixed treatment sets, and any setting where you need strong theoretical guarantees and can assume i.i.d. rewards.

3. Thompson Sampling: Bayesian Posterior Sampling

Thompson sampling takes a Bayesian perspective. Maintain a posterior distribution over each arm's expected reward. At each step, sample a value from each posterior, then pull the arm with the highest sampled value. This injects uncertainty naturally — arms with high posterior variance are more likely to produce extreme samples and get pulled.

Empirically, Thompson sampling often outperforms UCB in practice, especially on non-stationary rewards where the best arm changes over time. By incorporating a forgetting factor (e.g., Bayesian updating with exponential decay), Thompson sampling can adapt to drift without explicit change-point detection.

When it works best: News recommendation (user interests shift), dynamic pricing, and any setting with temporal non-stationarity. It's also the go-to method when you have prior domain knowledge you want to encode.

4. EXP3: Adversarial Bandits

All the above algorithms assume rewards are drawn from fixed (stochastic) distributions. EXP3 (Exponential-weight algorithm for Exploration and Exploitation) makes no such assumption. It treats the environment as an adversary that can choose rewards arbitrarily at each time step — even maliciously.

EXP3 maintains a probability distribution over arms, updated via exponential weighting based on observed rewards. It achieves sublinear regret even in worst-case adversarial settings, though the constants are larger than stochastic algorithms.

When it works best: Security applications where an adversary might manipulate rewards, competitive bidding environments, or any setting where you cannot trust the stationarity assumption. If you suspect deliberate non-stationarity, EXP3 is your safest bet.

Contextual Bandits: When You Have Covariates

The algorithms above treat arms as independent. In reality, we often have context — user demographics, time of day, device type — that influences which arm is best. Contextual bandits use features xtx_t to personalize decisions.

LinUCB assumes a linear reward function: E[rx,a]=xTθa\mathbb{E}[r | x, a] = x^T \theta_a. It maintains a ridge regression estimate for each arm and constructs confidence intervals on the linear predictions. It's computationally efficient and works well when the linear assumption holds.

Neural contextual bandits replace the linear model with a deep neural network, often using last-layer features for UCB-style exploration (e.g., NeuralUCB). They excel with high-dimensional, non-linear reward structures (images, text, complex user embeddings) but require careful tuning and more data.

Real-World Complications

Bandit theory is beautiful; practice is messy. Here are the common complications:

  • Delayed feedback: In ad systems, a conversion may occur hours after a click. Standard algorithms assume immediate rewards; delays require modified update rules (e.g., importance-weighted estimators).
  • Non-stationarity: User preferences drift, seasonal effects dominate. Solutions include sliding windows, Bayesian forgetting, or change-point detection.
  • Batch learning: Many systems cannot update per-user in real-time. Batch bandits (e.g., Replay, Off-Policy Evaluation) require unbiased estimators from logged data.
  • Multiple objectives: Optimizing for both click-through rate and revenue simultaneously demands Pareto-optimal exploration or scalarization.

Decision Guide: Which Algorithm for Which Problem

Problem Setting Recommended Algorithm Why
Simple baseline, short horizon ε-greedy Fast, interpretable
Stationary rewards, strong guarantees UCB Minimax-optimal, no tuning
Non-stationary rewards, empirical performance Thompson sampling Adapts naturally, best in practice
Adversarial environment EXP3 No stochastic assumptions
Low-dimensional context, linear rewards LinUCB Efficient, theoretically grounded
High-dimensional context, non-linear rewards Neural contextual bandits Flexible, powerful
Batch feedback, offline evaluation Off-policy methods (e.g., IPS) Unbiased estimation
Multiple objectives Scalarized bandits or Pareto bandits Trade-off control

The Bottom Line

The exploration-exploitation tradeoff isn't just a mathematical curiosity — it's the fundamental tension in any sequential decision system. Start simple with ε-greedy to establish a baseline. Move to UCB or Thompson sampling when you need efficiency and guarantees. Reach for EXP3 when you suspect adversarial dynamics. And always consider context — the features that make each decision unique.

The beauty of the bandit framework is that it forces you to make the tradeoff explicit. Once you do, the algorithms do the rest. The next time you see a recommendation, ask yourself: is this explore or exploit? The math is the same, whether it's slot machines or clinical trials.

More from the blog

Correlation Was Never the Problem"Correlation is not causation" is one of the most-repeated phrases in empirical research. It is also, as usually understood, a dramatic understatement of the actual difficulty. The real challenge is not distinguishing correlation from causation — it is identifying which causal story is correct when several are consistent with the same data.May 29, 2026The Illusion of Control: Why Most A/B Tests Mislead More Than They InformOrganizations run thousands of A/B tests every year and congratulate themselves on being data-driven. Most of those tests are statistically invalid. Here is why — and what rigorous experimentation actually requires.May 27, 2026What N-of-1 Trials Get Right That Population Studies Get WrongRandomized trials on populations measure average effects in heterogeneous groups. N-of-1 trials measure what actually happens to one specific person. For individual decision-making, the latter is usually more relevant.May 26, 2026
Bandits and Sequential Decisions: When to Explore and When to Exploit — DoOperator Research | DoOperator