| Authors | Sergey Levine, Aviral Kumar, George Tucker, Justin Fu |
| Year | 2020 |
What Problem It Solves
Offline reinforcement learning (also called batch RL) addresses the challenge of learning decision-making policies entirely from a fixed, pre-collected dataset of transitions, without any additional online interaction with the environment. This differs from standard online RL, where the agent iteratively collects new experience by interacting with the environment using its current policy, and from conventional off-policy RL, where the agent maintains a replay buffer that grows over time as new data is collected from evolving policies. The core problem is that standard off-policy RL algorithms, although they can in principle learn from data not generated by the current policy, often fail badly when applied to a static dataset. The cause is distributional shift: the learned policy selects actions that are out-of-distribution relative to the data, and the resulting value function extrapolation errors compound during training. As a result, simply taking a standard deep RL algorithm (like DQN, SAC, or TD3) and training it on a fixed dataset generally does not work well. The paper provides a tutorial on why this failure occurs, surveys methods that mitigate it, and identifies open problems. The practical significance is large: successful offline RL would make it possible to turn existing datasets (healthcare records, robotics logs, autonomous driving data, etc.) into decision-making engines, without the expense, risk, or infeasibility of online data collection.
The paper first establishes why naive application of off-policy RL to offline data fails, then surveys three families of solutions.
The core failure mechanism: Standard off-policy RL algorithms (like DQN or SAC) learn a Q-function Q(s,a) by minimizing the Bellman error E_{(s,a,r,s') ~ D}[(Q(s,a) - (r + γ max_{a'} Q(s',a')))^2]; actor-critic methods such as SAC replace the max with actions sampled from the current policy. During training, the policy π(a|s) is updated to maximize Q(s,a). In the online setting, when π selects an out-of-distribution action a', the agent actually tries that action in the environment, observes the true next state and reward, and corrects the Q-function. In the offline setting, the agent never gets this correction signal. The Q-function can erroneously assign arbitrarily high values to actions not present in the dataset, because there is no data to contradict them. The policy then exploits these spurious high values, and because the Bellman target bootstraps from them, the error propagates to in-distribution state-action pairs as well. The result is a vicious cycle: the policy moves further out-of-distribution, and the Q-function's errors can keep growing.
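The mechanism above can be reproduced in a toy tabular setting. The following is a minimal sketch, not code from the paper: the two-state problem, the dataset, and the function name `offline_q_learning` are all hypothetical. The dataset contains only action 0, so the value of action 1 is never corrected, and the max in the Bellman target propagates whatever value it happens to hold.

```python
import numpy as np

# Hypothetical toy problem: 2 states, 2 actions.
# The fixed dataset only ever contains action 0; action 1 is never observed.
gamma = 0.9
dataset = [(0, 0, 1.0, 1), (1, 0, 1.0, 0)]  # (state, action, reward, next_state)

def offline_q_learning(q_init, dataset, gamma, sweeps=500, lr=0.5):
    """Repeated Q-learning updates over a fixed dataset; no new data is collected."""
    q = q_init.copy()
    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            # The max ranges over ALL actions, including ones the data never covers.
            target = r + gamma * q[s_next].max()
            q[s, a] += lr * (target - q[s, a])
    return q

# Case A: the unseen action starts at 0, so it never wins the max.
q_a = offline_q_learning(np.zeros((2, 2)), dataset, gamma)

# Case B: the unseen action starts with a spurious high value of 50
# (standing in for function-approximation error). Nothing in the data corrects it.
q_init_b = np.zeros((2, 2))
q_init_b[:, 1] = 50.0
q_b = offline_q_learning(q_init_b, dataset, gamma)

print("Case A Q-table:\n", q_a)
print("Case B Q-table:\n", q_b)
print("Case B greedy actions:", q_b.argmax(axis=1))
```

In case A the value of the observed action settles at 1 / (1 - γ) = 10. In case B the same action is driven to 1 + 0.9 · 50 = 46, and the greedy policy prefers the action that never appears in the data.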
Solution family 1: Policy constraint methods. These methods explicitly constrain the learned policy to stay close to the behavior policy that generated the data. The intuition is: if the policy only selects actions that are "supported" by the data, the Q-function will have reasonable estimates. Formally, they modify the policy update to include a divergence penalty:
π_{k+1} = argmax_π E_{s ~ D}[Q(s, π(s)) - α * D(π(·|s) || π_β(·|s))]
where D is some divergence measure (KL divergence, MMD, etc.) and π_β is the behavior policy (estimated from data). Examples include BRAC (Behavior Regularized Actor Critic) and BEAR (Bootstrapping Error Accumulation Reduction). A variant uses a "pessimistic" Q-function that explicitly lower-bounds the true Q-value for out-of-distribution actions.
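For a single state with a small discrete action set, the penalized update can be sketched directly. This is an illustration under stated assumptions, not the BRAC or BEAR implementation: it takes D to be the KL divergence, treats π as a stochastic policy so that Q(s, π(s)) becomes an expectation over actions, and uses hypothetical Q-values and behavior probabilities. With those choices the maximizer has the well-known closed form π(a) ∝ π_β(a) · exp(Q(s,a) / α).

```python
import numpy as np

EPS = 1e-8  # numerical floor to avoid log(0)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    p = np.clip(p, EPS, 1.0)
    q = np.clip(q, EPS, 1.0)
    return float(np.sum(p * np.log(p / q)))

def regularized_objective(q_values, pi, pi_beta, alpha):
    """E_{a ~ pi}[Q(s,a)] - alpha * KL(pi || pi_beta) for one state."""
    return float(np.dot(pi, q_values)) - alpha * kl_divergence(pi, pi_beta)

def kl_regularized_policy(q_values, pi_beta, alpha):
    """Closed-form maximizer of the objective above: pi ∝ pi_beta * exp(Q / alpha)."""
    logits = np.log(np.clip(pi_beta, EPS, 1.0)) + q_values / alpha
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical numbers: action 2 is almost absent from the data (1% of the time),
# yet the learned Q-function assigns it a suspiciously high value.
q_values = np.array([1.0, 1.2, 5.0])
pi_beta = np.array([0.60, 0.39, 0.01])

pi_strong = kl_regularized_policy(q_values, pi_beta, alpha=10.0)  # strong constraint
pi_weak = kl_regularized_policy(q_values, pi_beta, alpha=0.1)     # weak constraint

print("alpha=10 :", np.round(pi_strong, 3))
print("alpha=0.1:", np.round(pi_weak, 3))
```

A large α keeps the policy close to π_β and leaves the rarely seen action with little probability mass, while a small α lets the policy chase the possibly spurious Q-value.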
Solution family 2: Uncertainty-based methods. These methods use uncertainty quantification to penalize actions where the Q-function is uncertain. The idea is that the Q-function should be confident about in-distribution actions and uncertain about out-of-distribution ones. Methods in this family train an ensemble of Q-functions or dynamics models and use the ensemble's disagreement as a penalty; MOPO (Model-based Offline Policy Optimization), for example, penalizes rewards by the uncertainty of a learned dynamics model. For a Q-function ensemble, a generic form of the penalty is:
Q_{penalized}(s,a) = Q_{mean}(s,a) - λ * σ_{ensemble}(s,a)
where σ is the standard deviation across ensemble members. This naturally penalizes out-of-distribution actions because different ensemble members will disagree.
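The penalty is easy to work through numerically. The sketch below assumes a hypothetical ensemble of four Q-functions evaluated at one state with two actions; the numbers and the function name `penalized_q` are made up for illustration and do not come from any specific method.

```python
import numpy as np

def penalized_q(ensemble_q, lam):
    """Mean minus lam times the standard deviation across ensemble members.

    ensemble_q has shape (n_members, n_actions) for a single state.
    """
    q_mean = ensemble_q.mean(axis=0)
    q_std = ensemble_q.std(axis=0)  # population std (ddof=0) across members
    return q_mean - lam * q_std

# Hypothetical ensemble outputs for one state.
# Action 0 is well covered by the data: members agree.
# Action 1 is out-of-distribution: members disagree widely.
ensemble_q = np.array([
    [1.0,  4.0],
    [1.1,  0.5],
    [0.9,  3.0],
    [1.0, -1.0],
])

q_mean = ensemble_q.mean(axis=0)
q_pen = penalized_q(ensemble_q, lam=1.0)

print("mean Q:       ", q_mean, "-> greedy action", q_mean.argmax())
print("penalized Q:  ", q_pen, "-> greedy action", q_pen.argmax())
```

Here the plain ensemble mean favors the out-of-distribution action, while the penalized estimate switches the greedy choice to the action on which the members agree. The coefficient λ controls how much disagreement is tolerated.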
Solution family 3: Model-based methods. These methods learn a dynamics model T̂(s'|s,a) from the offline data, then generate synthetic rollouts for training. The key insight is that the model can supply additional training data, but it will be inaccurate in regions the dataset does not cover. Methods like MOReL (Model-based Offline Reinforcement Learning) and COMBO (Conservative Offline Model-Based Policy Optimization) therefore combine model learning with pessimism: for example, by truncating rollouts when the model uncertainty is high, by adding a penalty for visiting states where the model is uncertain, or, in COMBO's case, by pushing down Q-values on model-generated state-action pairs rather than estimating uncertainty explicitly.
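The two uncertainty-based forms of pessimism described above (truncating a rollout, or penalizing the reward) can be sketched on a one-dimensional toy problem. This is a loose illustration under stated assumptions, not the MOReL or COMBO algorithm: the ensemble members, the reward, the policy, and the function name `rollout` are all hypothetical. The members are hand-written so that they agree inside the "data region" |s| ≤ 1 and disagree outside it, standing in for a learned ensemble.

```python
import numpy as np

# Hypothetical ensemble of dynamics models for the true dynamics s' = s + a.
# Members agree where |s| <= 1 (covered by data) and diverge beyond it.
models = [
    (lambda s, a, c=c: s + a + c * max(0.0, abs(s) - 1.0))
    for c in (-0.5, 0.0, 0.5)
]

def reward(s, a):
    """Hypothetical reward: larger states are better."""
    return s

def policy(s):
    """Hypothetical policy: always push right."""
    return 0.5

def rollout(models, policy, s0, horizon, threshold=None, penalty=0.0):
    """Generate synthetic transitions from the ensemble mean.

    If `threshold` is set, stop as soon as ensemble disagreement exceeds it.
    If `penalty` > 0, subtract penalty * disagreement from each reward.
    """
    s, transitions = s0, []
    for _ in range(horizon):
        a = policy(s)
        preds = np.array([m(s, a) for m in models])
        disagreement = float(preds.std())
        if threshold is not None and disagreement > threshold:
            break  # truncate: the model is not trusted from this state on
        r = reward(s, a) - penalty * disagreement
        s_next = float(preds.mean())
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

truncated = rollout(models, policy, s0=0.0, horizon=5, threshold=0.1)
penalized = rollout(models, policy, s0=0.0, horizon=5, penalty=2.0)

print("truncated rollout length:", len(truncated))
print("penalized rewards:", [round(t[2], 3) for t in penalized])
```

With truncation, the rollout stops once it leaves the region where the members agree. With the penalty, the rollout continues, but rewards collected in the uncertain region are reduced, which discourages a policy trained on these transitions from relying on them.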
Theoretical foundation: The paper discusses the concept of "pessimism in the face of uncertainty" as the unifying principle. An offline RL algorithm should be conservative: it should prefer actions and states that are well-covered by the data, even if that means sacrificing some potential reward. The theoretical motivation is that, with sufficient coverage and appropriate pessimism, the learned policy's performance should approach that of the best policy the dataset can support.
Prefer offline RL over online RL when:
Prefer online RL over offline RL when:
Prefer behavioral cloning (supervised learning) over offline RL when:
Prefer offline RL over behavioral cloning when:
Prefer model-based offline RL over model-free offline RL when:
Prefer model-free offline RL over model-based when:
Catastrophic failure with poor coverage: If the dataset does not cover the state-action space that the optimal policy would visit, offline RL is likely to fail. The learned policy may either be overly conservative (staying too close to the behavior policy) or overestimate the values of poorly covered actions and perform badly when deployed. This is a common failure mode in practice.
Sensitivity to hyperparameters: Offline RL algorithms are notoriously sensitive to hyperparameters (especially the conservatism coefficient in CQL, the expectile in IQL, and the BC coefficient in TD3+BC). Tuning requires a validation
Related papers
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
Nan Jiang, Lihong Li · 2015
Conservative Q-Learning for Offline Reinforcement Learning
Aviral Kumar, Aurick Zhou, George Tucker +1 more · 2020
Offline Reinforcement Learning with Implicit Q-Learning
Ilya Kostrikov, Ashvin Nair, Sergey Levine
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Adith Swaminathan, Thorsten Joachims