Bandits and Sequential Decisions: When to Explore and When to Exploit

Every time you scroll through a news feed, click a recommended video, or receive a targeted ad, you're participating in a sequential decision problem. The same math governs clinical trial designs that allocate patients to treatment arms, A/B testing platforms that optimize conversion rates, and reinforcement learning agents that learn to play Atari games. At the heart of all these systems lies a single, deceptively simple question: should I try something new, or stick with what I know works?

This is the exploration-exploitation tradeoff, and the multi-armed bandit framework is the cleanest mathematical abstraction of it.

The Multi-Armed Bandit Problem

Imagine standing in front of a row of slot machines (one-armed bandits). Each machine has an unknown probability of paying out. You have a fixed number of pulls. Do you pull the machine that's paid out most often so far (exploit), or do you try a new machine that might be even better (explore)? The optimal strategy balances gathering information about uncertain arms against maximizing immediate reward.

Formally, at each time step $t$ , you choose an arm $a_t$ from $K$ options, observe reward $r_t \sim P_a$ , and update your beliefs. Your goal is to minimize regret — the difference between the cumulative reward you earned and the reward you would have earned by always pulling the best arm. The algorithms below represent the major families of solutions, each with distinct assumptions and guarantees.

1. ε-Greedy: The Simple Baseline

The simplest approach: with probability $\epsilon$ , explore uniformly at random; otherwise, exploit the current best arm. It's trivial to implement, requires no distributional assumptions, and works reasonably well when the horizon is short.

When it works best: As a baseline for comparison, or when you need a quick, interpretable solution and can tolerate linear regret. The fixed exploration budget wastes pulls on clearly suboptimal arms — you'll never stop exploring entirely, which means regret grows linearly with time. For most production systems, ε-greedy is a starting point, not a final solution.

2. UCB (Upper Confidence Bound): Optimism in the Face of Uncertainty

UCB takes a radically different approach: "Be optimistic about the unknown." For each arm, UCB constructs an upper confidence bound around its estimated mean reward. At each step, pull the arm with the highest upper bound. This naturally balances exploration and exploitation — uncertain arms get pulled because their confidence intervals are wide, but as data accumulates, their bounds tighten.

The canonical UCB1 algorithm achieves minimax-optimal regret — no algorithm can guarantee better worst-case performance. In practice, UCB works exceptionally well when rewards are stationary and you have a finite, known number of arms.

When it works best: Recommendation systems with stable user preferences, A/B testing with fixed treatment sets, and any setting where you need strong theoretical guarantees and can assume i.i.d. rewards.

3. Thompson Sampling: Bayesian Posterior Sampling

Thompson sampling takes a Bayesian perspective. Maintain a posterior distribution over each arm's expected reward. At each step, sample a value from each posterior, then pull the arm with the highest sampled value. This injects uncertainty naturally — arms with high posterior variance are more likely to produce extreme samples and get pulled.

Empirically, Thompson sampling often outperforms UCB in practice, especially on non-stationary rewards where the best arm changes over time. By incorporating a forgetting factor (e.g., Bayesian updating with exponential decay), Thompson sampling can adapt to drift without explicit change-point detection.

When it works best: News recommendation (user interests shift), dynamic pricing, and any setting with temporal non-stationarity. It's also the go-to method when you have prior domain knowledge you want to encode.

4. EXP3: Adversarial Bandits

All the above algorithms assume rewards are drawn from fixed (stochastic) distributions. EXP3 (Exponential-weight algorithm for Exploration and Exploitation) makes no such assumption. It treats the environment as an adversary that can choose rewards arbitrarily at each time step — even maliciously.

EXP3 maintains a probability distribution over arms, updated via exponential weighting based on observed rewards. It achieves sublinear regret even in worst-case adversarial settings, though the constants are larger than stochastic algorithms.

When it works best: Security applications where an adversary might manipulate rewards, competitive bidding environments, or any setting where you cannot trust the stationarity assumption. If you suspect deliberate non-stationarity, EXP3 is your safest bet.

Contextual Bandits: When You Have Covariates

The algorithms above treat arms as independent. In reality, we often have context — user demographics, time of day, device type — that influences which arm is best. Contextual bandits use features $x_t$ to personalize decisions.

LinUCB assumes a linear reward function: $\mathbb{E}[r | x, a] = x^T \theta_a$ . It maintains a ridge regression estimate for each arm and constructs confidence intervals on the linear predictions. It's computationally efficient and works well when the linear assumption holds.

Neural contextual bandits replace the linear model with a deep neural network, often using last-layer features for UCB-style exploration (e.g., NeuralUCB). They excel with high-dimensional, non-linear reward structures (images, text, complex user embeddings) but require careful tuning and more data.

Real-World Complications

Bandit theory is beautiful; practice is messy. Here are the common complications:

Delayed feedback: In ad systems, a conversion may occur hours after a click. Standard algorithms assume immediate rewards; delays require modified update rules (e.g., importance-weighted estimators).
Non-stationarity: User preferences drift, seasonal effects dominate. Solutions include sliding windows, Bayesian forgetting, or change-point detection.
Batch learning: Many systems cannot update per-user in real-time. Batch bandits (e.g., Replay, Off-Policy Evaluation) require unbiased estimators from logged data.
Multiple objectives: Optimizing for both click-through rate and revenue simultaneously demands Pareto-optimal exploration or scalarization.

Decision Guide: Which Algorithm for Which Problem

Problem Setting	Recommended Algorithm	Why
Simple baseline, short horizon	ε-greedy	Fast, interpretable
Stationary rewards, strong guarantees	UCB	Minimax-optimal, no tuning
Non-stationary rewards, empirical performance	Thompson sampling	Adapts naturally, best in practice
Adversarial environment	EXP3	No stochastic assumptions
Low-dimensional context, linear rewards	LinUCB	Efficient, theoretically grounded
High-dimensional context, non-linear rewards	Neural contextual bandits	Flexible, powerful
Batch feedback, offline evaluation	Off-policy methods (e.g., IPS)	Unbiased estimation
Multiple objectives	Scalarized bandits or Pareto bandits	Trade-off control

The Bottom Line

The exploration-exploitation tradeoff isn't just a mathematical curiosity — it's the fundamental tension in any sequential decision system. Start simple with ε-greedy to establish a baseline. Move to UCB or Thompson sampling when you need efficiency and guarantees. Reach for EXP3 when you suspect adversarial dynamics. And always consider context — the features that make each decision unique.

The beauty of the bandit framework is that it forces you to make the tradeoff explicit. Once you do, the algorithms do the rest. The next time you see a recommendation, ask yourself: is this explore or exploit? The math is the same, whether it's slot machines or clinical trials.