Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems — DoOperator Research

Authors	Sergey Levine, Aviral Kumar, George Tucker, Justin Fu
Year	2020

What Problem It Solves

Offline reinforcement learning (also called batch RL) addresses the problem of learning decision-making policies entirely from a fixed, pre-collected dataset of transitions, without any additional online interaction with the environment. This differs from standard online RL, where the agent iteratively collects new experience by acting in the environment with its current policy, and from conventional off-policy RL, where the agent maintains a replay buffer that keeps growing as evolving policies collect new data. The core problem is that standard off-policy RL algorithms, although they can in principle learn from data not generated by the current policy, tend to fail when applied to a static dataset because of distributional shift: the learned policy selects actions that are out-of-distribution relative to the data, and the resulting value function extrapolation errors compound during training. As a result, a standard deep RL algorithm (such as DQN, SAC, or TD3) generally cannot simply be trained on a fixed dataset as-is. The paper provides a tutorial on why this failure occurs, surveys methods that mitigate it, and identifies open problems. The practical stakes are high: successful offline RL would allow large existing datasets (healthcare records, robotics logs, autonomous driving data, etc.) to be turned into decision-making engines without the expense, risk, or infeasibility of online data collection.

How it works

The paper first establishes why naive application of off-policy RL to offline data fails, then surveys three families of solutions.

The core failure mechanism: Standard off-policy RL algorithms (like DQN or SAC) learn a Q-function Q(s,a) by minimizing the Bellman error over sampled transitions (s, a, r, s'): E[(Q(s,a) - (r + γ max_{a'} Q(s',a')))^2]. (Actor-critic methods such as SAC replace the max with a value computed from actions sampled from the current policy.) During training, the policy π(a|s) is updated to maximize Q(s,a). In the online setting, when π selects an out-of-distribution action a', the agent actually tries that action in the environment, observes the true next state and reward, and corrects the Q-function. In the offline setting, the agent never gets this correction signal. The Q-function can erroneously assign arbitrarily high values to actions not present in the dataset, because there is no data to contradict them. The policy then exploits these spurious high values, leading to a vicious cycle: the policy moves further out-of-distribution, and the Q-function's errors keep growing through bootstrapping.
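The mechanism can be seen in a toy tabular sketch. Everything here is an illustrative assumption rather than anything from the paper: a one-state problem with two actions, a dataset that only ever contains action 0 (reward 0), and a Q-table whose entry for the unseen action 1 happens to start at a spuriously high value. The hypothetical `fitted_q` helper compares the usual max-based target with a target that bootstraps only from the action actually logged in the dataset.

```python
import numpy as np

def fitted_q(dataset, q_init, gamma=0.9, lr=0.5, sweeps=200, in_sample=False):
    """Tabular TD learning on a fixed dataset of (s, a, r, s_next, a_next) tuples."""
    q = q_init.copy()
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in dataset:
            if in_sample:
                # Bootstrap only from the action that was logged in the data.
                bootstrap = q[s_next, a_next]
            else:
                # Standard Q-learning target: max over ALL actions, seen or not.
                bootstrap = q[s_next].max()
            target = r + gamma * bootstrap
            q[s, a] += lr * (target - q[s, a])
    return q

# Single state 0; the data only contains action 0, which yields reward 0.
dataset = [(0, 0, 0.0, 0, 0)]
# Action 1 never appears in the data, so its (wrong) initial value is never corrected.
q_init = np.array([[0.0, 1.0]])

q_max = fitted_q(dataset, q_init)                   # max-based target
q_in = fitted_q(dataset, q_init, in_sample=True)    # in-sample target

print("max target:      ", q_max[0])
print("in-sample target:", q_in[0])
```

With the max-based target, the value of the logged action is pulled up toward the uncorrected estimate for the unseen action, and the greedy policy prefers the action that has no data behind it. With the in-sample target, the logged action keeps its true value of 0. With function approximation the same effect can feed on itself rather than settle, which this tabular sketch does not capture.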

Solution family 1: Policy constraint methods. These methods explicitly constrain the learned policy to stay close to the behavior policy that generated the data. The intuition is: if the policy only selects actions that are "supported" by the data, the Q-function will have reasonable estimates. Formally, they modify the policy update to include a divergence penalty:

π_{k+1} = argmax_π E_{s ~ 𝒟}[ E_{a ~ π(·|s)}[Q(s, a)] - α * D(π(·|s) || π_β(·|s)) ]

where 𝒟 is the offline dataset, D is some divergence measure (KL divergence, MMD, etc.), α controls the strength of the constraint, and π_β is the behavior policy (estimated from data). Examples include BRAC (Behavior Regularized Actor Critic) and BEAR (Bootstrapping Error Accumulation Reduction). A related approach, used by CQL (Conservative Q-Learning), regularizes the Q-function itself instead, so that its estimates for out-of-distribution actions are pessimistic.
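For intuition, the penalized update above has a closed form in one special case. The sketch below assumes a single state, a small discrete action set, a known behavior policy, and KL divergence as the penalty; under those assumptions the maximizer is proportional to π_β(a) · exp(Q(a) / α). The function name `kl_regularized_policy` and the numbers are hypothetical, and this is not the BRAC or BEAR implementation.

```python
import numpy as np

def kl_regularized_policy(q_values, behavior_probs, alpha):
    """Maximize sum_a pi(a) * Q(a) - alpha * KL(pi || pi_beta) over discrete actions.

    Closed-form solution: pi(a) is proportional to pi_beta(a) * exp(Q(a) / alpha).
    """
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    with np.errstate(divide="ignore"):
        log_beta = np.log(behavior_probs)   # -inf for actions never taken
    logits = log_beta + q_values / alpha
    logits -= logits.max()                  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

q = [1.0, 2.0, 10.0]      # action 2 has a suspiciously high Q-value...
beta = [0.5, 0.5, 0.0]    # ...but the behavior policy never took it

pi_small_alpha = kl_regularized_policy(q, beta, alpha=1.0)
pi_large_alpha = kl_regularized_policy(q, beta, alpha=1e6)
print(pi_small_alpha)
print(pi_large_alpha)
```

The unsupported action receives zero probability regardless of its Q-value, a small α shifts mass toward the better supported action, and a very large α recovers the behavior policy.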

Solution family 2: Uncertainty-based methods. These methods use uncertainty quantification to penalize actions where the value estimate is uncertain. The idea is that the Q-function should be confident about in-distribution actions and uncertain about out-of-distribution ones. Methods in this family train an ensemble of Q-functions or dynamics models and use the disagreement between ensemble members as a penalty; MOPO (Model-based Offline Policy Optimization), for example, penalizes rewards by the estimated uncertainty of a learned dynamics model. For a Q-function ensemble, the penalized value is:

Q_{penalized}(s,a) = Q_{mean}(s,a) - λ * σ_{ensemble}(s,a)

where σ is the standard deviation across ensemble members. This naturally penalizes out-of-distribution actions because different ensemble members will disagree.
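A minimal NumPy sketch of this penalty for a single state is shown below. The ensemble values, the `penalized_q` name, and the choice λ = 1 are made up for illustration; the standard deviation is taken across ensemble members, as in the formula above.

```python
import numpy as np

def penalized_q(ensemble_q, lam):
    """Mean-minus-std penalty; ensemble_q has shape (n_members, n_actions)."""
    q_mean = ensemble_q.mean(axis=0)
    q_std = ensemble_q.std(axis=0)   # disagreement across ensemble members
    return q_mean - lam * q_std

# Four ensemble members, two actions in one state.
# Action 0 is well covered (members agree); action 1 is not (members disagree).
ensemble_q = np.array([
    [1.0, 0.5],
    [1.1, 3.0],
    [0.9, -1.0],
    [1.0, 2.5],
])

q_mean = ensemble_q.mean(axis=0)
q_pen = penalized_q(ensemble_q, lam=1.0)
print("mean Q:     ", q_mean, "-> greedy action", int(q_mean.argmax()))
print("penalized Q:", q_pen, "-> greedy action", int(q_pen.argmax()))
```

In this example the plain ensemble mean prefers the action on which the members disagree, while the penalized value prefers the action with consistent estimates.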

Solution family 3: Model-based methods. These methods learn a dynamics model T̂(s'|s,a) from the offline data, then generate synthetic rollouts for training. The key insight is that the model can be used to generate additional data, but the model itself will be inaccurate in regions not covered by the data. Methods like MOReL (Model-based Offline Reinforcement Learning) and COMBO (Conservative Offline Model-Based Policy Optimization) combine model learning with pessimism: they either truncate rollouts when the model uncertainty is high, or add a penalty for visiting states where the model is uncertain.
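The rollout-truncation idea can be sketched as follows. This is an illustration under stated assumptions, not the MOReL or COMBO algorithm: the "learned" models are hand-written toy functions whose predictions agree near the origin and diverge further away, and `rollout_with_truncation`, the disagreement measure, and the threshold are all hypothetical choices.

```python
import numpy as np

def rollout_with_truncation(models, policy, s0, horizon, threshold):
    """Roll out a policy in a model ensemble, stopping once members disagree too much."""
    s = np.asarray(s0, dtype=float)
    states = [s]
    for _ in range(horizon):
        a = policy(s)
        preds = np.stack([m(s, a) for m in models])       # (n_models, state_dim)
        disagreement = np.linalg.norm(preds.std(axis=0))  # uncertainty proxy
        if disagreement > threshold:
            break                                         # leave the trusted region
        s = preds.mean(axis=0)
        states.append(s)
    return np.stack(states)

# Toy 1-D ensemble: members agree near s = 0 and diverge as |s| grows.
models = [lambda s, a, e=e: s + a + e * s**2 for e in (-0.1, 0.0, 0.1)]
policy = lambda s: np.array([0.5])   # always push the state to the right

traj = rollout_with_truncation(models, policy, s0=[0.0], horizon=10, threshold=0.1)
full = rollout_with_truncation(models, policy, s0=[0.0], horizon=10, threshold=np.inf)
print("truncated rollout length:", len(traj))
print("untruncated rollout length:", len(full))
```

The truncated rollout stops after a few steps, once the ensemble spread exceeds the threshold, so no synthetic transitions are generated in the region where the models no longer agree.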

Theoretical foundation: The paper discusses the concept of "pessimism in the face of uncertainty" as the unifying principle. The optimal offline RL algorithm should be conservative: it should prefer actions and states that are well-covered by the data, even if that means sacrificing some potential reward. The theoretical guarantee is that, with sufficient coverage and appropriate pessimism, the learned policy's performance approaches that of the best policy whose states and actions are supported by the dataset.
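A common way to illustrate pessimism is a lower-confidence-bound rule in a bandit setting. The sketch below is an assumption-laden illustration, not a construction from the paper: the penalty c / sqrt(n) and the hypothetical `lcb_values` helper are chosen for simplicity, and actions with no data are assigned a bound of negative infinity.

```python
import numpy as np

def lcb_values(means, counts, c=1.0):
    """Lower confidence bound: empirical mean minus c / sqrt(sample count)."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        penalty = c / np.sqrt(counts)   # infinite for actions with no data
    return np.where(counts > 0, means - penalty, -np.inf)

# Action 0: 100 samples with mean reward 0.6.
# Action 1: a single lucky sample with reward 1.0.
# Action 2: never observed in the dataset.
means = [0.6, 1.0, 0.0]
counts = [100, 1, 0]

greedy_action = int(np.argmax(means))
pessimistic_action = int(np.argmax(lcb_values(means, counts)))
print("greedy on empirical means:", greedy_action)
print("pessimistic (LCB) choice: ", pessimistic_action)
```

The greedy rule trusts the single lucky sample, while the pessimistic rule prefers the action whose estimate is backed by many observations.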

When to use it

Prefer offline RL over online RL when:

Data collection is expensive, dangerous, or time-consuming (healthcare, autonomous driving, robotics in the wild)
You have access to a large historical dataset but cannot deploy a learning agent to collect more data
You need to learn a policy from a fixed corpus (e.g., medical records, logged robot demonstrations)
The cost of a single failure during training is unacceptable (e.g., clinical decision support)

Prefer online RL over offline RL when:

You have a high-fidelity simulator where interaction is cheap and fast
The task is well-understood and sim-to-real transfer is feasible
You need to explore novel strategies not present in any existing dataset
The dataset has poor coverage of the state-action space (offline RL will be severely limited)

Prefer behavioral cloning (supervised learning) over offline RL when:

The behavior policy that generated the data is already near-optimal
You only need to mimic the data, not improve upon it
The reward function is poorly specified or unavailable
The dataset is small and high-quality (offline RL needs large, diverse data)

Prefer offline RL over behavioral cloning when:

The behavior policy is suboptimal and you want to improve upon it
You have a well-defined reward function
The dataset contains diverse behaviors that can be combined to form better policies
You need to control the shift between the learned policy and the data distribution (offline RL methods are explicitly designed to limit or penalize this shift)

Prefer model-based offline RL over model-free offline RL when:

The environment dynamics are relatively smooth and low-dimensional
You have a good understanding of the system's physics
You need to generate synthetic data for policy improvement
The dataset is large but sparse (model can generalize better)

Prefer model-free offline RL over model-based when:

The dynamics are complex, discontinuous, or high-dimensional
Model errors would compound catastrophically
You have a very large dataset of transitions (model-free can scale better)

Limitations and failure modes

Catastrophic failure with poor coverage: If the dataset does not cover the state-action space that the optimal policy would visit, offline RL will fail. The learned policy may either be overly conservative (staying too close to the behavior policy) or overestimate values and fail. This is the most common failure mode in practice.
Sensitivity to hyperparameters: Offline RL algorithms are notoriously sensitive to hyperparameters (especially the conservatism coefficient in CQL, the expectile in IQL, and the BC coefficient in TD3+BC). Tuning requires a validation
