Online Learning: Prediction and Decision-Making When the World Changes

The Stationarity Delusion

Most machine learning textbooks begin with a silent assumption: the world is stationary. Training data and test data are drawn from the same distribution. The model you train today will perform identically tomorrow. This assumption is so deeply embedded in our tools that we rarely question it—until the world changes.

Batch machine learning treats the training set as a fixed oracle. You minimize empirical risk, cross-validate, deploy, and hope. But in practice, user preferences shift, ad markets fluctuate, sensor drift occurs, and adversaries adapt. The moment you deploy a batch model, it begins to decay.

Online learning offers a different philosophy: don't assume the world stands still. Instead, learn continuously, adapt every round, and measure performance not against an unattainable optimum, but against what was possible given the information available at each step.

The Online Learning Framework

Online learning is a repeated game between a learner and an environment. At each round t = 1, 2, ..., T:

The learner selects a decision w_t from a convex set W
The environment reveals a loss function f_t(w_t)
The learner suffers loss f_t(w_t) and updates its strategy

The goal is not to minimize the final loss, but to minimize regret:

Regret(T) = Σ_{t=1}^{T} f_t(w_t) - min_{w ∈ W} Σ_{t=1}^{T} f_t(w)

Regret compares the cumulative loss of the online algorithm against the best fixed decision in hindsight. An algorithm is "no-regret" if regret grows sublinearly in T—meaning the average regret per round approaches zero.

This framework is remarkably general. The losses can be convex or non-convex, stochastic or adversarial, fully observed or partially revealed. The only constant is adaptation.

Five Key Algorithms

1. Online Gradient Descent (OGD)

OGD is the online counterpart to stochastic gradient descent, and it's brutally simple:

w_{t+1} = Proj_W(w_t - η_t ∇f_t(w_t))

At each round, take a gradient step on the current loss, then project back into the feasible set. For convex losses with bounded gradients, OGD achieves regret O(√T), which is optimal for arbitrary convex functions.

The beauty of OGD is its universality. It works for any convex loss, requires no distributional assumptions, and the projection step can often be computed efficiently. When the world is adversarial and losses are convex, OGD is your baseline.

2. Follow-the-Regularized-Leader (FTRL)

FTRL generalizes a family of online algorithms under a single framework. At each round, choose:

w_t = argmin_w Σ_{s=1}^{t-1} f_s(w) + R(w)

Where R(w) is a regularization term that ensures stability. The key insight: different regularizers yield different algorithms.

FTRL with L2 regularization reproduces OGD
FTRL with adaptive regularization (diagonal preconditioning) yields AdaGrad
FTRL with exponential gradient updates yields Adam (when combined with momentum)

AdaGrad and Adam are often presented as deep learning optimizers, but they are fundamentally online learning algorithms. AdaGrad adapts the learning rate per-coordinate based on the cumulative gradient magnitude, making it ideal for sparse features. Adam adds momentum and bias correction, trading theoretical guarantees for empirical speed.

Understanding FTRL unifies these methods and reveals why adaptive methods work: they implicitly estimate the geometry of the loss landscape and adjust accordingly.

3. Hedge / Multiplicative Weights

When the decision set is discrete—choose among N experts—gradient descent is not the right tool. Enter the Hedge algorithm (also called multiplicative weights or exponential weights):

p_t(i) ∝ exp(-η Σ_{s=1}^{t-1} f_s(i))

Maintain a probability distribution over experts, update multiplicatively based on observed losses. Hedge achieves regret O(√T log N), which is optimal for the experts setting.

This is the algorithm behind boosting, online portfolio selection, and many game-theoretic applications. It's also the foundation for the weighted majority algorithm, which works with binary losses and achieves the same regret bound.

4. EXP3 and Adversarial Bandits

What if you only observe the loss of the expert you chose, not all experts? This is the adversarial multi-armed bandit problem, and it's fundamentally harder.

EXP3 (Exponential-weight algorithm for Exploration and Exploitation) solves this by combining Hedge with importance-weighted estimates:

ℓ̂_t(i) = f_t(i) / p_t(i) if i was chosen, else 0

Then update using ℓ̂_t instead of the true loss. The exploration term in the denominator ensures that every action is tried often enough. EXP3 achieves regret O(√T N log N), which is optimal for adversarial bandits.

This matters for recommendation systems where you only see feedback on recommended items, for A/B testing in dynamic environments, and for any system where feedback is partial and the environment may be actively adversarial.

5. Online-to-Batch Conversion

Perhaps the most surprising result in online learning: any no-regret online algorithm can be converted into a batch learning algorithm with generalization guarantees.

The trick: run the online algorithm on the training data in any order, then output the average of the iterates w̄ = (1/T) Σ w_t. For convex losses, the expected loss of w̄ on a new sample is bounded by the regret plus the empirical loss of the best comparator.

This means online learning is not just for streaming data—it provides a path to generalization bounds without VC dimension or Rademacher complexity. It's a fundamentally different way to think about learning.

Non-Stationary Settings

The standard regret formulation compares against the best fixed decision. But what if the optimal decision itself changes over time?

Tracking regret compares against a sequence of comparators that can change a limited number of times. Algorithms like fixed-share or dynamic mirror descent achieve regret O(√T(1 + K)) where K is the number of changes.

Sleeping experts handle the case where experts may be unavailable at certain rounds—crucial for recommendation systems where items are added or removed.

Concept drift detection methods (ADWIN, DDM) explicitly detect when the underlying distribution has changed, triggering resets or reweighting of old data.

In non-stationary environments, the goal shifts from minimizing regret to minimizing dynamic regret—tracking a moving target rather than converging to a static point.

When Online Learning Matters

Online learning is not a curiosity—it's the correct framework for many production systems:

Recommendation systems operate in a non-stationary world. User tastes evolve, new content appears, and popularity shifts. Batch models retrained weekly cannot capture these dynamics. Online learning with bandit feedback (e.g., LinUCB, Thompson sampling) adapts in real-time.

Ad bidding is a repeated auction where the environment (other bidders, user valuations) changes constantly. Online convex optimization with budget constraints gives bidding strategies that adapt to market conditions while respecting spending limits.

Real-time pricing in e-commerce or energy markets requires setting prices dynamically based on demand signals. Online learning with demand functions that shift due to seasonality, competitors, or external shocks is the natural formulation.

Robotics and control systems must adapt to changing physical conditions—motor wear, surface friction, sensor calibration. Online learning provides the theoretical foundation for adaptive control.

The Takeaway

Batch machine learning is a powerful abstraction, but it's an abstraction built on a lie—the lie of stationarity. Online learning doesn't pretend the world is static. It embraces change, adapts continuously, and provides rigorous guarantees even in adversarial environments.

The next time you deploy a model, ask yourself: will the world look the same tomorrow as it did when I trained this model? If the answer is no, you need online learning.