| Authors | Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine |
| Year | 2020 |
What Problem It Solves
Offline reinforcement learning (also called batch RL) aims to learn effective policies from static, previously-collected datasets without any additional interaction with the environment. This is critical for real-world applications where online data collection is expensive, dangerous, or logistically infeasible — such as healthcare, robotics, autonomous driving, and recommendation systems. The central challenge is distributional shift: the learned policy π(a|s) inevitably deviates from the behavior policy πβ(a|s) that generated the dataset, causing standard off-policy RL algorithms to query the Q-function on out-of-distribution (OOD) actions. Because the Q-function has never observed the true returns for these OOD actions, it tends to produce erroneously optimistic value estimates. These overestimations compound during bootstrapping, leading the policy to exploit spurious high-value actions that are actually poor. Existing offline RL methods attempt to mitigate this by constraining the learned policy to stay close to the behavior policy (e.g., via KL divergence or MMD penalties), but these approaches can be overly conservative, computationally expensive, and often fail on complex, multi-modal data distributions. Conservative Q-Learning (CQL) addresses this by directly learning a Q-function that provably lower-bounds the true value of the policy, preventing overestimation at its source without requiring explicit policy constraints.
CQL operates on a simple but powerful insight: instead of trying to prevent the policy from selecting OOD actions (as in policy constraint methods), directly learn a Q-function that is conservative — meaning its expected value under the learned policy lower-bounds the true policy value. This prevents the overestimation that plagues standard offline RL while avoiding the brittleness of explicit policy constraints.
The core idea is to augment the standard Bellman error objective with a regularizer that penalizes high Q-values on actions that are likely under a distribution µ(a|s) (which we want to be pessimistic about) while rewarding high Q-values on actions that are likely under the behavior policy πβ(a|s) (which we have data for). This creates a Q-function that is intentionally pessimistic about unseen actions but accurate for actions observed in the data.
The basic CQL evaluation (Equation 1): For a fixed target policy π, the Q-function is updated by solving:
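$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \, \mathbb{E}_{s \sim D,\, a \sim \mu(a|s)}\big[Q(s,a)\big] + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim D}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big]$$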
The first term pushes Q-values down under the distribution µ, making the Q-function conservative. The second term is the standard Bellman error, which keeps the Q-function a good approximation on transitions seen in the data. Theorem 3.1 shows that, with sufficiently large α, the resulting Q-function lower-bounds the true Q-function pointwise, at every state in the dataset and for every action.
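To see where the pointwise pessimism comes from, it can help to work out the minimizer of Equation 1 in the tabular case. The sketch below assumes an exact minimization, writes d(s) for the state marginal of the dataset (a symbol introduced here for illustration), and assumes the empirical behavior policy gives nonzero probability to the action in question.

```latex
% Tabular case: differentiate the Equation 1 objective with respect to Q(s,a)
% and set the derivative to zero. d(s) is the dataset's state marginal (assumed notation).
d(s)\Big[\alpha\,\mu(a|s)
  + \hat{\pi}_\beta(a|s)\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)\Big] = 0
\;\Longrightarrow\;
\hat{Q}^{k+1}(s,a) = \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)
  - \alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}
```

Since µ(a|s) ≥ 0 and α > 0, every iterate sits at or below the ordinary Bellman backup, and the gap grows with α and with how much µ favors actions that are rare in the data.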
The tighter CQL evaluation (Equation 2): The pointwise lower bound from Equation 1 is overly conservative. For policy evaluation and improvement, we only need the expected value under the policy to be a lower bound, not every individual Q-value. CQL achieves a tighter bound by adding a maximization term:
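$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \, \Big( \mathbb{E}_{s \sim D,\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim D,\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big] \Big) + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim D}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big]$$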
The key change is subtracting the expected Q-value under the empirical behavior policy π̂β(a|s). Because the objective is minimized, this subtracted term acts as a maximization term: it keeps the Q-function from being overly pessimistic about actions that are actually present in the data. When µ = π (the learned policy), Theorem 3.2 shows that the expected value E_{π(a|s)}[Q̂^π(s,a)] lower-bounds the true value V^π(s). Individual Q-values need not be pointwise lower bounds: they can be overestimated for actions common in the data and underestimated for rare actions, as long as the expectation under π is conservative.
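A small numeric sketch may make the "expectation is conservative, individual values are not" point concrete. It assumes a tabular Q at a single state, µ = π, and an exact minimization of Equation 2, in which case setting the derivative to zero gives Q(a) = target(a) − α (π(a)/π̂β(a) − 1). The function names and all numbers are hypothetical, and `bellman_target` is only a stand-in for the backed-up values.

```python
def cql_tabular_backup(bellman_target, pi, pi_beta, alpha):
    """One tabular Equation 2 backup at a single state, with mu = pi.

    Per action: Q(a) = target(a) - alpha * (pi(a) / pi_beta(a) - 1).
    Assumes pi_beta(a) > 0 for every action.
    """
    return [
        t - alpha * (p / b - 1.0)
        for t, p, b in zip(bellman_target, pi, pi_beta)
    ]


def expectation(probs, values):
    """Expected value of `values` under the distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))


# Hypothetical numbers for one state with three actions.
bellman_target = [1.0, 1.0, 1.0]  # stand-in for the backed-up values
pi_beta = [0.6, 0.3, 0.1]         # empirical behavior policy
pi = [0.1, 0.3, 0.6]              # learned policy, shifted toward the rare action
alpha = 0.5

q_cql = cql_tabular_backup(bellman_target, pi, pi_beta, alpha)
v_cql = expectation(pi, q_cql)
v_target = expectation(pi, bellman_target)

print("Q after backup:", q_cql)
print("E_pi[Q]:", v_cql, "vs E_pi[target]:", v_target)
```

In this example the Q-value of the common action (index 0) ends up above its target, the rare action (index 2) ends up well below it, and the expectation under π is still lower than the expectation of the targets.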
Intuition for why this works: Think of the Q-function as being pulled in two directions. The minimization term under µ pushes Q-values down for actions the learned policy might take (which could be OOD). The maximization term under π̂β pushes Q-values up for actions that actually appear in the data. The Bellman error term anchors the Q-function to observed transitions. The net effect is that the Q-function learns to be accurate for in-distribution actions but pessimistic for OOD actions, and the degree of pessimism is controlled by α. When the learned policy π deviates from the behavior policy, the minimization term dominates for those OOD actions, producing conservative estimates that prevent the policy from exploiting spurious high values.
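The three forces described above can be written down as a sample estimate of the Equation 2 objective. This is a minimal sketch for a tabular Q stored as nested lists, not the authors' implementation: the function name, the argument layout, and the tiny batch are all assumptions, and terminal states are ignored for brevity.

```python
def cql_loss(q, q_prev, batch, pi, mu, alpha, gamma):
    """Sample estimate of the Equation 2 objective for a tabular Q.

    q, q_prev : q[s][a] tables (current iterate and previous iterate)
    batch     : list of (s, a, r, s_next) transitions from the dataset
    pi, mu    : pi[s][a] and mu[s][a] action probabilities
    Returns (total, regularizer, bellman_term).
    """
    n = len(batch)
    push_down = 0.0  # E_{s~D, a~mu}[Q(s, a)]
    push_up = 0.0    # E_{s~D, a~pi_beta}[Q(s, a)], using the logged actions
    bellman = 0.0    # squared Bellman error on logged transitions
    for s, a, r, s_next in batch:
        push_down += sum(m * qv for m, qv in zip(mu[s], q[s]))
        push_up += q[s][a]
        # Empirical backup for policy pi, evaluated with the previous iterate.
        backup = r + gamma * sum(
            p * qv for p, qv in zip(pi[s_next], q_prev[s_next])
        )
        bellman += (q[s][a] - backup) ** 2
    regularizer = alpha * (push_down - push_up) / n
    bellman_term = 0.5 * bellman / n
    return regularizer + bellman_term, regularizer, bellman_term


# Hypothetical two-state, two-action example.
q = [[1.0, 2.0], [0.0, 0.0]]
batch = [(0, 0, 1.0, 1), (0, 1, 0.0, 1)]
pi = [[0.5, 0.5], [0.5, 0.5]]
mu = [[0.0, 1.0], [0.5, 0.5]]  # mu concentrates on the highest-valued action

total, regularizer, bellman_term = cql_loss(
    q, q, batch, pi, mu, alpha=1.0, gamma=0.9
)
print(total, regularizer, bellman_term)
```

Here the regularizer is positive because µ puts all its mass on the action with the larger Q-value while the logged actions are split evenly, so minimizing the loss lowers that Q-value relative to the logged average.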
For offline RL (policy improvement): CQL can be combined with any Q-learning or actor-critic framework. The Q-function is trained using Equation 2 with µ = π (the current policy), and the policy is updated to maximize the conservative Q-values. This creates a self-consistent loop: the Q-function is conservative with respect to the current policy, and the policy is optimized under this conservative estimate. The authors prove a safe policy improvement guarantee for this procedure: with high probability, the learned policy's true return is at least that of the behavior policy, up to an error term that depends on sampling error in the dataset.
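The alternation between a conservative Q update and a policy update can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that are not from the paper: a single state with one-step rewards (γ = 0), a tabular Q trained by plain gradient descent on the Equation 2 objective with the policy held fixed inside each step, and a softmax over the current Q-values standing in for the policy improvement step. All names and numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fit_q(rewards, pi_beta, q_init, alpha, lr=0.1, steps=2000):
    """Alternate a softmax policy step with one gradient step on the tabular
    Equation 2 objective (one state, gamma = 0, mu = pi).

    alpha = 0 recovers plain regression on the logged rewards.
    """
    q = list(q_init)
    for _ in range(steps):
        pi = softmax(q)  # policy step: favor actions with high current Q
        for a in range(len(q)):
            # Bellman term only touches actions that appear in the data.
            bellman_grad = pi_beta[a] * (q[a] - rewards[a])
            # Regularizer: push down under mu = pi, push up under pi_beta.
            reg_grad = alpha * (pi[a] - pi_beta[a])
            q[a] -= lr * (bellman_grad + reg_grad)
    return q


def greedy(q):
    """Index of the action with the highest Q-value."""
    return max(range(len(q)), key=lambda a: q[a])


# Hypothetical logged data: action 2 never appears, so its reward is unknown
# and its initial Q-value (5.0) is a spurious, optimistic guess.
rewards = [1.0, 0.5, 0.0]  # entry for action 2 is unused (pi_beta[2] == 0)
pi_beta = [0.5, 0.5, 0.0]
q_init = [0.0, 0.0, 5.0]

q_naive = fit_q(rewards, pi_beta, q_init, alpha=0.0)
q_cql = fit_q(rewards, pi_beta, q_init, alpha=0.5)

print("naive Q:", q_naive, "greedy action:", greedy(q_naive))
print("CQL Q:  ", q_cql, "greedy action:", greedy(q_cql))
```

Without the regularizer, nothing in the data corrects the optimistic value of the unseen action, so the greedy policy selects it. With α > 0, that value is pushed down for as long as the policy puts mass on the action, and the greedy choice falls back to the best action that is supported by the data.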
Theoretical guarantees: The key theoretical results are:
Prefer CQL over alternative offline RL methods when:
Prefer alternative methods over CQL when:
Comparison to specific alternatives: