A Practitioner's Guide to Online Learning

You're the head of experimentation at a large e-commerce platform. Your team has just launched a new recommendation algorithm, but you can't run a standard A/B test because the algorithm needs to learn from user behavior in real time—and the optimal recommendation changes as users' tastes shift throughout the day. Running a fixed treatment for two weeks means serving suboptimal recommendations to millions of users while you wait for statistical significance. Your alternative: deploy an adaptive algorithm that updates its recommendations after every user interaction, learning on the fly. But how do you guarantee this algorithm won't perform worse than a simple static baseline? And how do you measure its performance when the data itself is non-stationary?

This is the problem online learning solves—specifically, the framework of regret minimization through online convex optimization (OCO). It gives you provable guarantees that your adaptive algorithm will compete with the best fixed decision in hindsight, even when the environment is adversarial.

What Regret Minimization Actually Guarantees

The core object in online learning is regret: the difference between your cumulative loss and the loss of the best fixed decision you could have made in hindsight. Formally, after $T$ rounds, if you choose decisions $w_1, \ldots, w_T$ and incur losses $f_1(w_1), \ldots, f_T(w_T)$ , your regret is:

$R_T = \sum_{t=1}^T f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^T f_t(w)$

The goal is to design algorithms where $R_T = o(T)$ —meaning your average performance approaches that of the best fixed decision as $T$ grows. Shalev-Shwartz (2011) shows that for convex loss functions and bounded decision sets, Online Gradient Descent (OGD) achieves $R_T \leq \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}$ , where $D$ is the diameter of the decision set, $G$ is the maximum gradient norm, and $\eta$ is the learning rate. Setting $\eta = \frac{D}{G\sqrt{T}}$ gives $R_T \leq DG\sqrt{T}$ —a sublinear regret guarantee that holds for any sequence of convex loss functions, including those chosen adversarially.

This is not a probabilistic bound. It is deterministic and worst-case. If your loss functions are convex and bounded, OGD will never perform more than $DG\sqrt{T}$ worse than the best fixed decision, regardless of how the environment behaves.

A Concrete Example: Adaptive Pricing

Suppose you're pricing a digital product. Each day $t$ , you choose a price $p_t \in [0, 100]$ . Users arrive and decide whether to purchase based on their willingness to pay, which varies day-to-day. Your profit is $f_t(p_t) = -p_t \cdot \mathbb{I}(\text{purchase})$ , but you don't know the demand function in advance.

You decide to use OGD with learning rate $\eta = 10/\sqrt{T}$ . The decision set $\mathcal{W} = [0, 100]$ has diameter $D = 100$ . The gradient of your loss at each step is bounded by $G = 100$ (since changing the price by 1 changes profit by at most 100). After $T = 10,000$ days, your regret guarantee is $R_T \leq 100 \cdot 100 \cdot \sqrt{10000} = 1,000,000$ . This means your total profit is at most $1M less than the best fixed price you could have chosen with perfect hindsight.

But here's the key: you don't need to know the demand distribution. You don't need stationarity. You don't need i.i.d. assumptions. The guarantee holds even if demand shifts adversarially every day.

When to Choose Online Learning Over Alternatives

Batch optimization (e.g., training a model on historical data) assumes the future will resemble the past. When the data distribution shifts—which it almost always does in practice—batch methods can fail catastrophically. Online learning adapts continuously and provides worst-case guarantees.

Multi-armed bandits (MAB) are a special case of OCO where the loss functions are linear in the decision variable. The monograph by Bubeck and Cesa-Bianchi (2012) shows that for finite action sets, algorithms like UCB achieve $O(\log T)$ regret under stochastic rewards, while EXP3 achieves $O(\sqrt{T})$ under adversarial rewards. OCO generalizes this to continuous decision spaces and arbitrary convex losses.

Bayesian optimization works well for expensive black-box functions but requires a prior and assumes the function is drawn from a Gaussian process. OCO makes no distributional assumptions and scales to high dimensions where Bayesian optimization becomes intractable.

Choose OCO when: (1) your loss functions are convex, (2) you need worst-case guarantees, (3) you have a continuous decision space, and (4) you can compute gradients efficiently. Choose bandits when you have a finite set of discrete actions. Choose batch methods only when you're confident the future distribution matches the past.

The Most Common Failure Mode: Ignoring the Convexity Assumption

The most dangerous misapplication of online learning is using it on non-convex loss functions and assuming the regret guarantees still hold. They don't. Hazan (2016) proves that for non-convex losses, the regret can be linear in $T$ —meaning your algorithm performs no better than random guessing.

I've seen practitioners apply OGD to neural network training (which is non-convex) and claim "sublinear regret" based on empirical performance. This is not supported by theory. The regret bounds for OCO require convexity of every loss function $f_t$ . If your loss landscape has local minima, saddle points, or plateaus, the worst-case regret can be arbitrarily bad.

What to do instead: If your problem is non-convex, you need different tools. For deep learning, Adam (Kingma & Ba, 2014) provides adaptive learning rates and momentum but does not provide regret guarantees in the non-convex setting. The theoretical analysis of Adam assumes convexity for its regret bound; in practice, it works well on non-convex problems but without formal guarantees.

Another common failure: using the wrong learning rate schedule. The optimal $\eta = D/(G\sqrt{T})$ requires knowing $T$ in advance. If you don't know $T$ , use a doubling trick—run OGD in epochs of exponentially increasing length, restarting the learning rate each epoch. Hazan (2016) shows this achieves $O(\sqrt{T})$ regret without knowing $T$ in advance, at the cost of a constant factor.

Practical Checklist

Before deploying an online learning algorithm in production, verify each of these:

Is every loss function convex? Check this from domain knowledge, not empirical validation. If any loss function could be non-convex, the regret guarantees do not apply.
Is the decision set bounded? Compute the diameter $D = \max_{w,w' \in \mathcal{W}} |w - w'|$ . If the set is unbounded, algorithms can diverge. Project onto a bounded set after each update.
Do you have a bound on gradient norms? Estimate $G = \max_{t,w} |\nabla f_t(w)|$ from domain knowledge or empirical monitoring. If gradients can be arbitrarily large, set a gradient clipping threshold.
Are you using the correct learning rate? For OGD, set $\eta = D/(G\sqrt{T})$ if $T$ is known, or use the doubling trick. For adaptive methods like Adam, the default learning rates (e.g., 0.001) are not theoretically justified for regret minimization.
Is the feedback truly full-information? In many practical settings, you observe only the loss of your chosen action (bandit feedback), not the entire loss function. This changes the regret bound from $O(\sqrt{T})$ to $O(T^{2/3})$ or worse, depending on the algorithm.
Do you need to compare against a dynamic benchmark? Standard regret compares against the best fixed decision. If the optimal decision changes over time, you need dynamic regret or adaptive regret—a different class of algorithms with different guarantees.

Part of the DoOperator Research series on Online Learning. Browse the full paper corpus at dooperator.ai/research/online_learning.