DoOperator Research · May 18, 2026
Every time you scroll through a news feed, click a recommended video, or receive a targeted ad, you're participating in a sequential decision problem. The same math governs clinical trial designs that allocate patients to treatment arms, A/B testing platforms that optimize conversion rates, and reinforcement learning agents that learn to play Atari games. At the heart of all these systems lies a single, deceptively simple question: should I try something new, or stick with what I know works?
This is the exploration-exploitation tradeoff, and the multi-armed bandit framework is the cleanest mathematical abstraction of it.
Imagine standing in front of a row of slot machines (one-armed bandits). Each machine has an unknown probability of paying out. You have a fixed number of pulls. Do you pull the machine that's paid out most often so far (exploit), or do you try a new machine that might be even better (explore)? The optimal strategy balances gathering information about uncertain arms against maximizing immediate reward.
Formally, at each time step , you choose an arm from options, observe reward , and update your beliefs. Your goal is to minimize regret — the difference between the cumulative reward you earned and the reward you would have earned by always pulling the best arm. The algorithms below represent the major families of solutions, each with distinct assumptions and guarantees.
The simplest approach: with probability , explore uniformly at random; otherwise, exploit the current best arm. It's trivial to implement, requires no distributional assumptions, and works reasonably well when the horizon is short.
When it works best: As a baseline for comparison, or when you need a quick, interpretable solution and can tolerate linear regret. The fixed exploration budget wastes pulls on clearly suboptimal arms — you'll never stop exploring entirely, which means regret grows linearly with time. For most production systems, ε-greedy is a starting point, not a final solution.
UCB takes a radically different approach: "Be optimistic about the unknown." For each arm, UCB constructs an upper confidence bound around its estimated mean reward. At each step, pull the arm with the highest upper bound. This naturally balances exploration and exploitation — uncertain arms get pulled because their confidence intervals are wide, but as data accumulates, their bounds tighten.
The canonical UCB1 algorithm achieves minimax-optimal regret — no algorithm can guarantee better worst-case performance. In practice, UCB works exceptionally well when rewards are stationary and you have a finite, known number of arms.
When it works best: Recommendation systems with stable user preferences, A/B testing with fixed treatment sets, and any setting where you need strong theoretical guarantees and can assume i.i.d. rewards.
Thompson sampling takes a Bayesian perspective. Maintain a posterior distribution over each arm's expected reward. At each step, sample a value from each posterior, then pull the arm with the highest sampled value. This injects uncertainty naturally — arms with high posterior variance are more likely to produce extreme samples and get pulled.
Empirically, Thompson sampling often outperforms UCB in practice, especially on non-stationary rewards where the best arm changes over time. By incorporating a forgetting factor (e.g., Bayesian updating with exponential decay), Thompson sampling can adapt to drift without explicit change-point detection.
When it works best: News recommendation (user interests shift), dynamic pricing, and any setting with temporal non-stationarity. It's also the go-to method when you have prior domain knowledge you want to encode.
All the above algorithms assume rewards are drawn from fixed (stochastic) distributions. EXP3 (Exponential-weight algorithm for Exploration and Exploitation) makes no such assumption. It treats the environment as an adversary that can choose rewards arbitrarily at each time step — even maliciously.
EXP3 maintains a probability distribution over arms, updated via exponential weighting based on observed rewards. It achieves sublinear regret even in worst-case adversarial settings, though the constants are larger than stochastic algorithms.
When it works best: Security applications where an adversary might manipulate rewards, competitive bidding environments, or any setting where you cannot trust the stationarity assumption. If you suspect deliberate non-stationarity, EXP3 is your safest bet.
The algorithms above treat arms as independent. In reality, we often have context — user demographics, time of day, device type — that influences which arm is best. Contextual bandits use features to personalize decisions.
LinUCB assumes a linear reward function: . It maintains a ridge regression estimate for each arm and constructs confidence intervals on the linear predictions. It's computationally efficient and works well when the linear assumption holds.
Neural contextual bandits replace the linear model with a deep neural network, often using last-layer features for UCB-style exploration (e.g., NeuralUCB). They excel with high-dimensional, non-linear reward structures (images, text, complex user embeddings) but require careful tuning and more data.
Bandit theory is beautiful; practice is messy. Here are the common complications:
| Problem Setting | Recommended Algorithm | Why |
|---|---|---|
| Simple baseline, short horizon | ε-greedy | Fast, interpretable |
| Stationary rewards, strong guarantees | UCB | Minimax-optimal, no tuning |
| Non-stationary rewards, empirical performance | Thompson sampling | Adapts naturally, best in practice |
| Adversarial environment | EXP3 | No stochastic assumptions |
| Low-dimensional context, linear rewards | LinUCB | Efficient, theoretically grounded |
| High-dimensional context, non-linear rewards | Neural contextual bandits | Flexible, powerful |
| Batch feedback, offline evaluation | Off-policy methods (e.g., IPS) | Unbiased estimation |
| Multiple objectives | Scalarized bandits or Pareto bandits | Trade-off control |
The exploration-exploitation tradeoff isn't just a mathematical curiosity — it's the fundamental tension in any sequential decision system. Start simple with ε-greedy to establish a baseline. Move to UCB or Thompson sampling when you need efficiency and guarantees. Reach for EXP3 when you suspect adversarial dynamics. And always consider context — the features that make each decision unique.
The beauty of the bandit framework is that it forces you to make the tradeoff explicit. Once you do, the algorithms do the rest. The next time you see a recommendation, ask yourself: is this explore or exploit? The math is the same, whether it's slot machines or clinical trials.