DoOperator Research · May 20, 2026
RL has beaten world champions at Go and the strongest engines at Chess, but deploying it in business systems is fundamentally different.
In 2016, AlphaGo defeated Lee Sedol with a move so unexpected it was initially dismissed as a mistake. In 2017, AlphaZero taught itself Chess from scratch and crushed Stockfish. These moments cemented reinforcement learning (RL) as the poster child for AI dominance. But here’s the uncomfortable truth: the same algorithms that conquered board games struggle to survive a single week in a live production system. At the table, the environment is deterministic, fully observable, and resets after every game. In business — inventory management, clinical trials, ad bidding — the environment is stochastic, partially observable, and the cost of a wrong action isn't a lost game, it's real money, real lives, or real reputations.
This post bridges that gap. We'll walk through the canonical RL framework, then confront the messy reality of deployment.
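Every RL problem begins as a Markov Decision Process (MDP). Formally, an MDP is defined by a tuple (S, A, P, R, γ): a set of states S, a set of actions A, transition probabilities P(s' | s, a), a reward function R(s, a), and a discount factor γ between 0 and 1.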
The goal: learn a policy that maximizes expected cumulative discounted reward.
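To make "cumulative discounted reward" concrete, here is a minimal sketch of the return for a single episode. The function name `discounted_return` and the toy reward list are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    g = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A reward of 1 that arrives two steps late is worth gamma**2 today.
late_reward_value = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
print(late_reward_value)  # 0.25
```

A policy is then judged by the expectation of this quantity over the episodes it generates.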
The critical difference from supervised learning: In supervised learning, you have labeled data (x, y) and you minimize prediction error. In RL, you have no labels, only rewards, and those rewards are often delayed and sparse. You don't know the right action; you only know whether the outcome was good or bad, often after many steps. This is the credit assignment problem: which of the many actions taken actually caused the reward? That's why RL is usually harder to get right, and why you should not use it unless supervised learning or simple heuristics fail.
Value-based methods learn a value function — how good it is to be in a state (V) or take an action in a state (Q). The core idea: if you know the optimal Q-function, you can act greedily: π(s) = argmax_a Q*(s, a).
Q-learning updates Q-values using the Bellman equation: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
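The update rule above can be sketched for a tabular Q-function as follows. The helper name `q_learning_update`, the toy table size, and the `done` flag (which drops the bootstrap term on terminal transitions) are assumptions added for illustration.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # Terminal transitions have no future value to bootstrap from.
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    td_error = r + bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


Q = np.zeros((3, 2))  # toy problem: 3 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False, alpha=0.5)

# Acting greedily: pi(s) = argmax_a Q(s, a)
greedy_action = int(np.argmax(Q[0]))
```

After this single update, `Q[0, 1]` has moved halfway toward the observed reward and the greedy action in state 0 is action 1.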
Deep Q-Networks (DQN) scaled this to high-dimensional state spaces (e.g., Atari pixels) using neural networks, experience replay, and a target network. DQN was a breakthrough — but it's fragile.
The Deadly Triad (Sutton & Barto): instability arises when you combine function approximation, bootstrapping (updating estimates from other estimates), and off-policy learning.
This triad can cause Q-values to diverge. DQN works best when actions are discrete, rewards are reasonably dense, and you have a simulator that makes experience cheap. In real systems with sparse rewards and no simulator? It often fails.
When action spaces are continuous (robot joint torques, bidding prices, dosage amounts), Q-learning breaks down because the max over actions in the update becomes an optimization problem over infinitely many candidates. Enter policy gradient methods.
REINFORCE is the simplest: sample trajectories, compute cumulative returns, and update the policy parameters to increase probability of actions that led to high returns. High variance, but unbiased.
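A minimal sketch of the REINFORCE update, assuming a stateless softmax policy over two actions (effectively a bandit) so the gradient fits in a few lines. The reward scheme, batch size, and learning rate are toy assumptions chosen for illustration only.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def reinforce_step(theta, actions, returns, lr=0.1):
    """One REINFORCE update for a softmax policy with logits theta.

    For a softmax policy, grad log pi(a) = one_hot(a) - pi.
    """
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for a, g in zip(actions, returns):
        one_hot = np.zeros_like(theta)
        one_hot[a] = 1.0
        grad += g * (one_hot - probs)
    return theta + lr * grad / len(actions)


rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(200):
    actions = rng.choice(2, size=16, p=softmax(theta))
    # Toy environment: action 1 pays 1, action 0 pays 0.
    returns = np.where(actions == 1, 1.0, 0.0)
    theta = reinforce_step(theta, actions, returns)

final_probs = softmax(theta)  # probability mass shifts toward action 1
```

Each update only uses sampled returns, which is where the high variance comes from; subtracting a baseline from `returns` is the usual first remedy.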
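Proximal Policy Optimization (PPO) tackles the instability of large policy updates by clipping the probability ratio between the new and old policy, so that no single step moves the policy too far. Variance is usually reduced separately, with a learned value baseline and advantage estimates. PPO is a common default for continuous control: stable, simple to tune, and widely used in robotics and game AI, although as an on-policy method it needs a lot of fresh samples.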
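The clipping idea can be sketched as a standalone objective function. The name `ppo_clip_objective` and the example numbers are illustrative assumptions; a real implementation would compute this on batches of log-probabilities from a policy network and maximize it with gradient ascent.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_eps, 1 + clip_eps].
    return float(np.mean(np.minimum(unclipped, clipped)))


logp_old = np.zeros(2)
logp_new = np.log(np.array([1.5, 0.5]))  # ratios of 1.5 and 0.5
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(logp_new, logp_old, advantages)
```

In this example the first sample is credited as if the ratio were 1.2 rather than 1.5, and the second is penalized as if the ratio were 0.8 rather than 0.5, so the objective is about 0.2.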
Soft Actor-Critic (SAC) adds entropy maximization: the policy is trained to maximize both reward and the entropy of its own action distribution. This encourages exploration and prevents premature convergence. Because SAC is off-policy, it can also reuse old experience from a replay buffer. It is often a strong off-the-shelf choice for continuous action spaces with dense rewards.
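To show what "reward plus entropy" means numerically, here is a small sketch of the entropy-regularized state value. It assumes a discrete action set for readability (SAC itself is normally applied to continuous actions), and the function name and temperature `alpha` are illustrative.

```python
import numpy as np


def entropy_regularized_value(q_values, probs, alpha=0.2):
    """Soft state value: E_a[ Q(s, a) - alpha * log pi(a|s) ]."""
    q_values = np.asarray(q_values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Clip to avoid log(0); zero-probability actions contribute nothing anyway.
    log_probs = np.log(np.clip(probs, 1e-12, 1.0))
    return float(np.sum(probs * (q_values - alpha * log_probs)))


q = [1.0, 1.0]  # two equally good actions
v_deterministic = entropy_regularized_value(q, [1.0, 0.0])
v_uniform = entropy_regularized_value(q, [0.5, 0.5])
```

With equal Q-values, the uniform policy scores higher than the deterministic one by `alpha * ln(2)`, which is the pressure that keeps the policy from collapsing early.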
When to use policy gradients: Continuous actions, stochastic policies needed, or when you need to model multi-modal behavior (e.g., a robot can grasp an object from multiple angles).
In most real-world deployments, you cannot run an RL agent in the environment to collect data. The cost of exploration is too high — you can't let an inventory agent try random reorder quantities for a month. You have logged data from previous policies (e.g., human decisions, rule-based systems). This is offline RL (also called batch RL).
The core challenge: distributional shift. The policy you're learning will encounter states and actions that are underrepresented in the offline dataset. If the Q-function overestimates the value of unseen actions, the agent will choose catastrophic actions.
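The overestimation effect is easy to reproduce in isolation. The sketch below assumes every action has a true value of 0 and that the learned Q-function sees each value plus zero-mean Gaussian noise; all sizes and the noise level are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 10, 10_000, 1.0

# Assumption for the demo: true Q is 0 everywhere, estimates are noisy.
noisy_q = rng.normal(loc=0.0, scale=noise_std, size=(n_trials, n_actions))

# Any single estimate is unbiased, so its average stays near 0.
mean_single_estimate = float(noisy_q[:, 0].mean())

# The greedy agent picks the largest estimate, which is biased upward.
mean_greedy_estimate = float(noisy_q.max(axis=1).mean())

print(mean_single_estimate, mean_greedy_estimate)
```

Online, the agent would try the overrated action and correct the estimate. Offline, there is no such feedback, so the error persists and gets propagated through bootstrapping.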
Solutions:
When offline RL works: You have large, high-quality logged datasets (e.g., recommendation systems, clinical trials). The data must cover the state-action space reasonably well. If your dataset is small or covers only a narrow slice of behavior, offline RL is likely to fail; imitation learning on the logged decisions is usually the safer starting point.
Deploying RL in safety-critical systems requires more than a well-tuned algorithm.
Reward shaping pitfalls: The reward function defines the objective. Get it wrong, and the agent will hack it. Example: a cleaning robot rewarded for "dirt collected" learns to dump its bin and re-collect the same dirt. In clinical settings, a reward for "patient survival at 30 days" might lead to aggressive treatments that harm long-term outcomes. Reward design is the hardest part of RL deployment.
Distributional shift during deployment: Even if your offline RL policy works in simulation, the real world changes. Customer behavior shifts, hardware degrades, new products appear. The policy must be robust or retrained. This is why online monitoring and fallback policies are essential.
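One simple way to combine monitoring with a fallback is to wrap the learned policy in a guard. The sketch below is an assumption-heavy illustration: the class name `GuardedPolicy`, the per-feature z-score test, and the threshold of 3 are hypothetical choices, and a production system would likely use a stronger out-of-distribution detector.

```python
import numpy as np


class GuardedPolicy:
    """Use the learned policy only on states that resemble the training data."""

    def __init__(self, learned_policy, fallback_policy, train_states, z_max=3.0):
        self.learned_policy = learned_policy
        self.fallback_policy = fallback_policy
        self.mean = train_states.mean(axis=0)
        self.std = train_states.std(axis=0) + 1e-8  # avoid division by zero
        self.z_max = z_max
        self.fallback_count = 0  # export this as a monitoring metric

    def act(self, state):
        z = np.abs((np.asarray(state, dtype=float) - self.mean) / self.std)
        if np.any(z > self.z_max):
            # State is far from anything seen in training: defer to the rule.
            self.fallback_count += 1
            return self.fallback_policy(state)
        return self.learned_policy(state)


train_states = np.array([[0.0], [1.0], [2.0]])
policy = GuardedPolicy(lambda s: "learned", lambda s: "fallback", train_states)
in_range = policy.act([1.2])
out_of_range = policy.act([50.0])
```

A rising `fallback_count` is itself a useful drift signal and a reasonable trigger for retraining.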
Constrained MDPs (CMDPs): Add a cost function that must stay below a threshold. For example, a trading agent must maximize profit (reward) while keeping drawdown below 5% (cost). Algorithms like CPO (Constrained Policy Optimization) and Lagrangian methods handle this.
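The Lagrangian approach can be sketched in two small functions: the agent optimizes a penalized reward, and the multiplier is raised whenever the measured cost exceeds the limit. The function names, learning rate, and numbers are illustrative assumptions, not taken from a specific library.

```python
def penalized_reward(reward, cost, lmbda):
    """Reward the policy actually optimizes under the Lagrangian relaxation."""
    return reward - lmbda * cost


def lagrangian_step(lmbda, avg_cost, cost_limit, lr=0.05):
    """Dual ascent on the multiplier, projected to stay non-negative."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


# Drawdown example: the limit is 5%, and the current policy averages 8%.
lmbda = 0.0
lmbda_after_violation = lagrangian_step(lmbda, avg_cost=0.08, cost_limit=0.05)

# If the policy is well within the limit, the multiplier stays at zero.
lmbda_when_safe = lagrangian_step(lmbda, avg_cost=0.01, cost_limit=0.05)
```

Alternating policy updates on `penalized_reward` with `lagrangian_step` pushes the policy back toward the feasible region, though the constraint can still be violated during training.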
Practical deployment checklist:
Despite the hype, RL has succeeded in a few high-value niches:
| Problem Type | Recommended Approach | Why |
|---|---|---|
| Static prediction (e.g., churn, fraud) | Supervised learning (XGBoost, neural nets) | You have labels; no sequential decisions needed. |
| Simple decision rules (e.g., reorder point) | Heuristics, optimization (LP, dynamic programming) | Interpretable, no training required. |