You're the head of experimentation at a large e-commerce platform. Your team has just launched a new recommendation algorithm, but you can't run a standard A/B test because the algorithm needs to learn from user behavior in real time—and the optimal recommendation changes as user
DoOperator Research · May 8, 2026
Decision takeaway
You're the head of experimentation at a large e-commerce platform. Your team has just launched a new recommendation algorithm, but you can't run a standard A/B test because the algorithm needs to learn from user behavior in real time—and the optimal recommendation changes as user
You're the head of experimentation at a large e-commerce platform. Your team has just launched a new recommendation algorithm, but you can't run a standard A/B test because the algorithm needs to learn from user behavior in real time—and the optimal recommendation changes as users' tastes shift throughout the day. Running a fixed treatment for two weeks means serving suboptimal recommendations to millions of users while you wait for statistical significance. Your alternative: deploy an adaptive algorithm that updates its recommendations after every user interaction, learning on the fly. But how do you guarantee this algorithm won't perform worse than a simple static baseline? And how do you measure its performance when the data itself is non-stationary?
This is the problem online learning solves—specifically, the framework of regret minimization through online convex optimization (OCO). It gives you provable guarantees that your adaptive algorithm will compete with the best fixed decision in hindsight, even when the environment is adversarial.
The core object in online learning is regret: the difference between your cumulative loss and the loss of the best fixed decision you could have made in hindsight. Formally, after rounds, if you choose decisions and incur losses , your regret is:
The goal is to design algorithms where —meaning your average performance approaches that of the best fixed decision as grows. Shalev-Shwartz (2011) shows that for convex loss functions and bounded decision sets, Online Gradient Descent (OGD) achieves , where is the diameter of the decision set, is the maximum gradient norm, and is the learning rate. Setting gives —a sublinear regret guarantee that holds for any sequence of convex loss functions, including those chosen adversarially.
This is not a probabilistic bound. It is deterministic and worst-case. If your loss functions are convex and bounded, OGD will never perform more than worse than the best fixed decision, regardless of how the environment behaves.
Suppose you're pricing a digital product. Each day , you choose a price . Users arrive and decide whether to purchase based on their willingness to pay, which varies day-to-day. Your profit is , but you don't know the demand function in advance.
You decide to use OGD with learning rate . The decision set has diameter . The gradient of your loss at each step is bounded by (since changing the price by 1 changes profit by at most 100). After days, your regret guarantee is . This means your total profit is at most $1M less than the best fixed price you could have chosen with perfect hindsight.
But here's the key: you don't need to know the demand distribution. You don't need stationarity. You don't need i.i.d. assumptions. The guarantee holds even if demand shifts adversarially every day.
Batch optimization (e.g., training a model on historical data) assumes the future will resemble the past. When the data distribution shifts—which it almost always does in practice—batch methods can fail catastrophically. Online learning adapts continuously and provides worst-case guarantees.
Multi-armed bandits (MAB) are a special case of OCO where the loss functions are linear in the decision variable. The monograph by Bubeck and Cesa-Bianchi (2012) shows that for finite action sets, algorithms like UCB achieve regret under stochastic rewards, while EXP3 achieves under adversarial rewards. OCO generalizes this to continuous decision spaces and arbitrary convex losses.
Bayesian optimization works well for expensive black-box functions but requires a prior and assumes the function is drawn from a Gaussian process. OCO makes no distributional assumptions and scales to high dimensions where Bayesian optimization becomes intractable.
Choose OCO when: (1) your loss functions are convex, (2) you need worst-case guarantees, (3) you have a continuous decision space, and (4) you can compute gradients efficiently. Choose bandits when you have a finite set of discrete actions. Choose batch methods only when you're confident the future distribution matches the past.
The most dangerous misapplication of online learning is using it on non-convex loss functions and assuming the regret guarantees still hold. They don't. Hazan (2016) proves that for non-convex losses, the regret can be linear in —meaning your algorithm performs no better than random guessing.
I've seen practitioners apply OGD to neural network training (which is non-convex) and claim "sublinear regret" based on empirical performance. This is not supported by theory. The regret bounds for OCO require convexity of every loss function . If your loss landscape has local minima, saddle points, or plateaus, the worst-case regret can be arbitrarily bad.
What to do instead: If your problem is non-convex, you need different tools. For deep learning, Adam (Kingma & Ba, 2014) provides adaptive learning rates and momentum but does not provide regret guarantees in the non-convex setting. The theoretical analysis of Adam assumes convexity for its regret bound; in practice, it works well on non-convex problems but without formal guarantees.
Another common failure: using the wrong learning rate schedule. The optimal requires knowing in advance. If you don't know , use a doubling trick—run OGD in epochs of exponentially increasing length, restarting the learning rate each epoch. Hazan (2016) shows this achieves regret without knowing in advance, at the cost of a constant factor.
Before deploying an online learning algorithm in production, verify each of these:
Is every loss function convex? Check this from domain knowledge, not empirical validation. If any loss function could be non-convex, the regret guarantees do not apply.
Is the decision set bounded? Compute the diameter . If the set is unbounded, algorithms can diverge. Project onto a bounded set after each update.
Do you have a bound on gradient norms? Estimate from domain knowledge or empirical monitoring. If gradients can be arbitrarily large, set a gradient clipping threshold.
Are you using the correct learning rate? For OGD, set if is known, or use the doubling trick. For adaptive methods like Adam, the default learning rates (e.g., 0.001) are not theoretically justified for regret minimization.
Is the feedback truly full-information? In many practical settings, you observe only the loss of your chosen action (bandit feedback), not the entire loss function. This changes the regret bound from to or worse, depending on the algorithm.
Do you need to compare against a dynamic benchmark? Standard regret compares against the best fixed decision. If the optimal decision changes over time, you need dynamic regret or adaptive regret—a different class of algorithms with different guarantees.
Part of the DoOperator Research series on Online Learning. Browse the full paper corpus at dooperator.ai/research/online_learning.