Reinforcement Learning for Decision-Making: From MDPs to Real-World Deployment

RL has beaten a Go world champion and the strongest chess engines, but deploying it in business systems is a fundamentally different problem.

In 2016, AlphaGo defeated Lee Sedol with a move so unexpected it was initially dismissed as a mistake. In 2017, AlphaZero taught itself chess from scratch and beat Stockfish. These moments cemented reinforcement learning (RL) as the poster child for AI dominance. But here's the uncomfortable truth: the same algorithms that conquered board games often struggle in a live production system. On the board, the environment is deterministic, fully observable, and resets after every game. In business (inventory management, clinical trials, ad bidding), the environment is stochastic and partially observable, and the cost of a wrong action isn't a lost game: it's real money, real lives, or real reputations.

This post bridges that gap. We'll walk through the canonical RL framework, then confront the messy reality of deployment.

1. MDPs and the RL Problem: States, Actions, Rewards, Policy

Every RL problem begins as a Markov Decision Process (MDP). Formally, an MDP is defined by:

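States (S): the situation the agent is in (e.g., inventory levels, patient vitals, board position); in an MDP the agent observes the state in full
Actions (A): what the agent can do (reorder stock, prescribe a drug, move a piece)
Rewards (R): scalar feedback signal (profit, patient survival rate, win/loss)
Transition dynamics (P): probability of moving from state s to s' given action a
Discount factor (γ): how much a future reward counts relative to an immediate one, with 0 ≤ γ ≤ 1

Policy (π): the agent's strategy, a mapping from states to actions. It is not part of the MDP itself; it is what the agent has to learn.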

The goal: learn a policy that maximizes expected cumulative discounted reward.
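To make "cumulative discounted reward" concrete, here is a minimal sketch for a single episode. The function name `discounted_return` and the reward values are made up for illustration; `gamma` is the discount factor γ, which controls how much later rewards count.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Iterate backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A sparse-reward episode: nothing until a payoff of 10 at the last step.
episode_rewards = [0.0, 0.0, 0.0, 10.0]
g = discounted_return(episode_rewards, gamma=0.9)
print(round(g, 3))  # 10 * 0.9**3 = 7.29
```

The policy's job is to maximize the expectation of this quantity over the episodes it generates.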

The critical difference from supervised learning: In supervised learning, you have labeled data (x, y) and you minimize prediction error. In RL, you have no labels, only rewards that are often delayed and sparse. You don't know the right action; you only know whether the outcome was good or bad, often after many steps. This is the credit assignment problem: which of the earlier actions caused the reward? That's why RL is fundamentally harder, and why you should not use it unless supervised learning or simple heuristics fail.

2. Value-Based Methods: Q-Learning, DQN, and the Deadly Triad

Value-based methods learn a value function — how good it is to be in a state (V) or take an action in a state (Q). The core idea: if you know the optimal Q-function, you can act greedily: π(s) = argmax_a Q*(s, a).

Q-learning updates Q-values with a sampled version of the Bellman optimality equation: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)], where α is the learning rate and γ is the discount factor.
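The update rule above can be sketched in tabular form. The five-state chain environment, the `step` and `train` helpers, and all hyperparameters below are assumptions chosen to keep the example tiny; they are not taken from any library.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right


def step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state])                  # exploit, ties broken at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Terminal states have no future value to bootstrap from.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # [1, 1, 1, 1]: always move right
```

Note that the target uses the max over next actions regardless of which action the epsilon-greedy behaviour actually takes next. That is what makes Q-learning off-policy.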

Deep Q-Networks (DQN) scaled this to high-dimensional state spaces (e.g., Atari pixels) using neural networks, experience replay, and a target network. DQN was a breakthrough — but it's fragile.

The Deadly Triad (Sutton & Barto): instability arises when you combine:

Function approximation (neural networks)
Bootstrapping (updating estimates using other estimates)
Off-policy learning (learning from data not generated by current policy)

This triad can cause Q-values to diverge. DQN works best when actions are discrete and few, rewards are reasonably dense, and you have a simulator that makes data cheap. In real systems with continuous actions, sparse rewards, and expensive data? It often fails.
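A toy calculation shows how the three ingredients interact. It is a sketch in the spirit of the small two-state divergence example discussed by Sutton & Barto, not a reproduction of it. Two states share one weight `w` (function approximation), the target for the first state is built from the estimate of the second (bootstrapping), and only that one transition is ever updated, which is a data distribution no actual policy would produce (off-policy). The function name and constants are assumptions.

```python
def run_two_state_example(steps=100, alpha=0.1, gamma=0.99, w=1.0):
    """Repeated semi-gradient TD(0) updates on a single transition A -> B.

    State A has feature 1 (value estimate w), state B has feature 2
    (value estimate 2 * w). Every reward is 0, so the true values are 0.
    """
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - w   # bootstrapped target minus estimate
        w += alpha * td_error * 1.0              # gradient of A's estimate w.r.t. w is 1
        history.append(w)
    return history


weights = run_two_state_example()
print(weights[0], weights[-1])  # the weight grows instead of shrinking toward 0
```

Each update multiplies `w` by 1 + α(2γ - 1), which exceeds 1 whenever γ > 0.5, so the estimate moves away from the true value at every step.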

3. Policy Gradient Methods: REINFORCE, PPO, SAC

When action spaces are continuous (robot joint torques, bidding prices, dosage amounts), Q-learning breaks down because the max over actions can no longer be computed by enumerating them. Enter policy gradient methods, which optimize the policy directly.

REINFORCE is the simplest: sample trajectories, compute cumulative returns, and update the policy parameters to increase the probability of actions that led to high returns. The gradient estimate is unbiased but has high variance.
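Here is a minimal REINFORCE sketch on a one-step problem (a two-armed bandit) with a softmax policy. The reward values, learning rate, and function names are assumptions for illustration. A real implementation would handle multi-step trajectories and usually subtract a baseline to reduce variance.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a one-step problem: arm 1 pays 1.0, arm 0 pays 0.0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                      # one preference per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1
        ret = 1.0 if action == 1 else 0.0   # the return of this one-step episode
        # Gradient of log softmax: indicator(k == action) - probs[k].
        for k in range(2):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * ret * grad_log_prob
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)  # probability mass shifts toward arm 1
```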

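Proximal Policy Optimization (PPO) addresses the instability of large policy updates by clipping the probability ratio between the new and old policy, so a single update cannot move the policy too far. Variance is handled separately, typically with a learned value baseline and advantage estimation. It's a common default for continuous control: stable and easy to tune, though as an on-policy method it is not especially sample-efficient. It is widely used in robotics and game AI.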
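The clipping idea fits in a few lines. This sketch computes the clipped surrogate for a single sample, where `ratio` is the new policy's probability of the action divided by the old policy's, and `advantage` says how much better the action was than expected. The function name and the default `clip_eps=0.2` are assumptions for illustration.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): gains from raising its probability are capped at 1 + eps.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))    # 1.2
# Bad action (A < 0): pushing its probability below 1 - eps earns nothing extra.
print(ppo_clipped_term(ratio=0.5, advantage=-1.0))   # -0.8
# Moving the wrong way is still fully penalised (no clipping on that side).
print(ppo_clipped_term(ratio=1.5, advantage=-1.0))   # -1.5
```

Training maximizes the average of this term over a batch. Once the ratio leaves the clip range in the favourable direction, the gradient for that sample is zero, which is what keeps updates small.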

Soft Actor-Critic (SAC) adds entropy maximization: the policy is trained to maximize expected reward plus the entropy of its own action distribution, weighted by a temperature. This encourages exploration and prevents premature convergence. Because it is off-policy and reuses past data, SAC is often a strong off-the-shelf choice for continuous action spaces with dense rewards.
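To see what the entropy term does, here is a discrete-action toy, not SAC itself (which uses neural networks and continuous actions). It computes the entropy-regularised value of a state and the policy that maximizes it for fixed Q-values. The names `soft_policy` and `soft_value`, the temperature `alpha`, and the Q-values are assumptions for illustration.

```python
import math


def soft_policy(q_values, alpha):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_value(q_values, probs, alpha):
    """Entropy-regularised value: sum_a pi(a) * (Q(a) - alpha * log pi(a))."""
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(probs, q_values) if p > 0)


q = [1.0, 0.9, -2.0]   # two nearly equal good actions, one bad action
for alpha in (0.05, 1.0):
    pi = soft_policy(q, alpha)
    print(alpha, [round(p, 3) for p in pi], round(soft_value(q, pi, alpha), 3))
```

With a small temperature the policy is close to greedy. With a larger one it keeps both good actions in play, which is the multi-modal behaviour the entropy bonus is meant to preserve.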

When to use policy gradients: Continuous actions, stochastic policies needed, or when you need to model multi-modal behavior (e.g., a robot can grasp an object from multiple angles).

4. Offline RL: Learning from Logged Data Without Environment Interaction

In most real-world deployments, you cannot run an RL agent in the environment to collect data. The cost of exploration is too high — you can't let an inventory agent try random reorder quantities for a month. You have logged data from previous policies (e.g., human decisions, rule-based systems). This is offline RL (also called batch RL).

The core challenge: distributional shift. The policy you're learning will visit states and choose actions that are rare or absent in the offline dataset. If the Q-function overestimates the value of those unseen actions, the agent will pick them, and with no new data to correct the error, the result can be catastrophic.
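A small simulation shows why overestimation is the default rather than bad luck. Assume every action is truly worth 0 and the Q-estimates carry zero-mean noise, as they do for actions the data never covered. Taking the max then picks whichever action has the most optimistic error. The function name and noise model are assumptions.

```python
import random


def mean_max_estimate(n_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Average of max_a Q_hat(s, a) when every true Q(s, a) is exactly 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(noisy_q)   # what a greedy policy believes it will get
    return total / trials


bias = mean_max_estimate()
print(round(bias, 2))  # clearly above the true value of 0
```

Online, the agent would try the overrated action and the estimate would be corrected. Offline, nothing corrects it.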

Solutions:

Conservative Q-Learning (CQL): penalizes Q-values for out-of-distribution actions, making the learned policy conservative.
Implicit Q-Learning (IQL): avoids querying actions outside the dataset by using expectile regression.
TD3+BC: a simple trick — add a behavioral cloning term to TD3 to stay close to the data.
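Two of the ideas in the list above reduce to short formulas, sketched here for a discrete-action case. `cql_penalty` is the regularizer CQL adds to the usual TD loss for one state (log-sum-exp of all Q-values minus the Q-value of the logged action). `expectile_loss` is the asymmetric squared loss IQL uses to fit its value function. The function names and numbers are illustrative assumptions, and the full algorithms involve more than these terms.

```python
import math


def cql_penalty(q_values, logged_action):
    """log-sum-exp over all actions minus Q of the action actually in the data."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is target minus prediction, so diff > 0 means the prediction is too low.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


# The penalty is large when an action absent from the data has an inflated Q.
print(round(cql_penalty([1.0, 1.0, 8.0], logged_action=0), 3))
print(round(cql_penalty([1.0, 1.0, 1.0], logged_action=0), 3))
# With tau > 0.5, predictions that are too low cost more than ones that are too high.
print(expectile_loss(1.0), expectile_loss(-1.0))
```

Minimizing the CQL penalty pushes down Q-values that tower over the logged action. The expectile loss lets IQL estimate an upper expectile of the values of actions seen in the data, without ever evaluating an unseen action.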

When offline RL works: You have large, high-quality logged datasets (e.g., recommendation systems, clinical trials), and the data covers the states and actions the new policy will want to use. If your dataset is small or biased, offline RL is likely to fail; imitation learning is usually the safer starting point.

5. Safe RL and Real-World Deployment: Reward Shaping, Distributional Shift, Constrained MDPs

Deploying RL in safety-critical systems requires more than a well-tuned algorithm.

Reward shaping pitfalls: The reward function defines the objective. Get it wrong, and the agent will exploit the gap between what you rewarded and what you wanted (reward hacking). Example: a cleaning robot rewarded for "dirt collected" learns to dump its bin and re-collect the same dirt. In clinical settings, a reward for "patient survival at 30 days" might lead to aggressive treatments that harm long-term outcomes. Reward design is often the hardest part of RL deployment.
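One well-studied way to add guidance without opening a loophole is potential-based shaping, where the bonus is the discounted change in a potential function over states. A minimal sketch follows; the `potential` function (negative distance to a goal on a line) and the positions are hypothetical.

```python
def shaped_reward(reward, potential_s, potential_next, gamma=0.99):
    """Original reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * potential_next - potential_s


def potential(position, goal=10):
    """Hypothetical potential: negative distance to the goal on a line."""
    return -abs(goal - position)


# Moving toward the goal earns a small bonus, moving away a small penalty.
toward = shaped_reward(0.0, potential(3), potential(4))
away = shaped_reward(0.0, potential(3), potential(2))
print(round(toward, 2), round(away, 2))  # 1.06 -0.92

# Going there and back collects no net bonus (shown with gamma = 1 for clarity).
loop_bonus = (shaped_reward(0.0, potential(3), potential(4), gamma=1.0)
              + shaped_reward(0.0, potential(4), potential(3), gamma=1.0))
print(loop_bonus)  # 0.0
```

Because the bonuses cancel around any loop, the agent cannot farm them the way the cleaning robot farms "dirt collected". This constrains how the shaping is added. It does not fix a base reward that measures the wrong thing.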

Distributional shift during deployment: Even if your offline RL policy works in simulation, the real world changes. Customer behavior shifts, hardware degrades, new products appear. The policy must be robust or retrained. This is why online monitoring and fallback policies are essential.
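Here is a sketch of what "online monitoring and fallback" can look like at its simplest: compare the mean of one input feature in a live window against the training data, and hand control to a rule-based policy when it drifts. The function names, the action labels, the data, and the threshold of 3 standard errors are all assumptions. A production monitor would track many features and use more robust tests.

```python
import math
import statistics


def drift_z_score(reference, live):
    """How many standard errors the live mean sits from the reference mean."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    std_error = ref_std / math.sqrt(len(live))
    return abs(statistics.fmean(live) - ref_mean) / std_error


def choose_action(policy_action, fallback_action, z_score, threshold=3.0):
    """Use the learned policy only while inputs look like the training data."""
    return fallback_action if z_score > threshold else policy_action


reference_demand = [0.0, 1.0] * 50          # feature values seen in training
live_ok = [0.0, 1.0] * 10                   # same distribution
live_shifted = [x + 1.0 for x in live_ok]   # mean moved from 0.5 to 1.5

for window in (live_ok, live_shifted):
    z = drift_z_score(reference_demand, window)
    print(round(z, 1), choose_action("rl_reorder", "rule_based_reorder", z))
```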

Constrained MDPs (CMDPs): Add a cost function whose expected cumulative value must stay below a threshold. For example, a trading agent must maximize profit (reward) while keeping drawdown below 5% (cost). Algorithms like CPO (Constrained Policy Optimization) and Lagrangian methods handle this by trading reward against constraint violation.
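The Lagrangian approach can be sketched in two functions: the policy optimizes reward minus a multiplier times cost, and the multiplier rises while the constraint is violated. The function names, step size, and the sequence of measured costs are assumptions for illustration; the 0.05 limit mirrors the 5% drawdown example.

```python
def lagrangian_reward(reward, cost, lam):
    """Scalarised objective the policy optimiser sees: reward - lambda * cost."""
    return reward - lam * cost


def update_multiplier(lam, avg_cost, cost_limit, lr=0.5):
    """Dual ascent: raise lambda while the constraint is violated, never below 0."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))


lam = 0.0
# Hypothetical average episode costs measured after successive policy updates.
for avg_cost in (0.09, 0.08, 0.06, 0.04, 0.03):
    lam = update_multiplier(lam, avg_cost, cost_limit=0.05)
    print(round(lam, 3))
```

The multiplier acts as an automatically tuned penalty weight: it grows until the policy respects the limit and relaxes once there is slack. Constraints enforced this way hold on average after training converges, not at every step during learning.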

Practical deployment checklist:

Can you simulate the environment accurately?
Can you reset the environment if the agent fails?
Is there a human-in-the-loop override?
Can you detect distributional shift online?

Where RL Actually Works in Practice

Despite the hype, RL has succeeded in a few high-value niches:

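Recommendation systems: Large platforms such as YouTube, Netflix, and TikTok are commonly cited as using RL or bandit-style methods to optimize long-term user engagement. Learning from logged user interactions, rather than from live exploration, is the usual setup.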
Robotics: Dexterous manipulation (e.g., OpenAI Dactyl), locomotion (Boston Dynamics), and warehouse picking (Amazon). Sim-to-real transfer with domain randomization is a key enabler.
Chip design: Google's RL-based chip floorplanning (published in Nature) reported producing floorplans in hours rather than the weeks or months a manual process takes. The state space is massive, but the reward (wirelength, power) is well-defined.
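Clinical treatment: Optimizing sepsis treatment protocols (e.g., the AI Clinician, trained with offline RL on retrospective ICU data). So far this is a research result evaluated on historical records, and any clinical use requires careful safety constraints.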

What Makes Real Deployment Hard

Partial observability: The agent doesn't see the full state. In trading, you don't know market sentiment. Use POMDPs or recurrent policies (LSTM, Transformer) — but this increases complexity.
Sparse rewards: In robotics, the reward might be "grasp the object" — binary and rare. Use reward shaping, curiosity-driven exploration, or hierarchical RL.
Sim-to-real gap: Simulation is never perfect. Domain randomization (varying physics, textures, lighting during training) helps, but doesn't eliminate the gap. System identification (matching sim to real dynamics) is an active research area.
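For the partial-observability item above, the classical alternative to a recurrent policy is to track a belief: a probability distribution over the hidden state, updated after every observation. Here is a minimal Bayes-filter sketch for a two-state hidden market regime. The function name and all probabilities are hypothetical.

```python
def belief_update(belief, transition, obs_likelihood):
    """One Bayes-filter step: predict with the dynamics, then weight by the observation.

    belief[i]          : P(state = i) before the step
    transition[i][j]   : P(next state = j | state = i) under the chosen action
    obs_likelihood[j]  : P(observation received | next state = j)
    """
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalised = [predicted[j] * obs_likelihood[j] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hidden market regime: 0 = calm, 1 = volatile (hypothetical numbers).
belief = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood_of_big_move = [0.1, 0.6]   # a large price move is likelier when volatile
belief = belief_update(belief, transition, likelihood_of_big_move)
print([round(b, 3) for b in belief])  # [0.169, 0.831]
```

The policy then acts on the belief instead of the raw observation. A recurrent network learns an implicit version of this update when the transition and observation models are unknown.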

Decision Guide: When Should You Consider RL Over Simpler Methods?

Problem Type	Recommended Approach	Why
Static prediction (e.g., churn, fraud)	Supervised learning (XGBoost, neural nets)	You have labels; no sequential decisions needed.
Simple decision rules (e.g., reorder point)	Heuristics, optimization (LP, dynamic programming)	Interpretable, no training required.