
A Practitioner's Guide to Reinforcement Learning

DoOperator Research · May 9, 2026

You're the CTO of a robotics startup. Your warehouse robot has logged 10,000 hours of pick-and-place trajectories, stored as state-action-reward-next_state tuples on a NAS drive. Your competitor is deploying a new fleet next quarter. You need a policy that outperforms your current hand-tuned controller—but you cannot afford a single catastrophic failure during training. The robot costs $80,000, and a dropped server rack costs more.

This is the decision that reinforcement learning practitioners face daily: how to learn a good policy when data is abundant but interaction is expensive or dangerous. The answer depends critically on whether you can still interact with the environment, and on the structure of your data.

When You Can Still Interact: On-Policy Methods

If you can afford to run your robot (or simulator) to collect fresh data, you have two families of policy gradient methods: trust-region approaches and their first-order approximations.

Trust Region Policy Optimization (TRPO) provides a monotonic improvement guarantee in the exact setting. Theorem 1 of Schulman et al. (2015), restated in terms of KL divergence, bounds the true return of the new policy from below by a surrogate objective minus a penalty: η(π_new) ≥ L_{π_old}(π_new) − C·D_KL^max(π_old, π_new), with C = 4ϵγ/(1−γ)² and ϵ = max_{s,a} |A_π(s,a)|. If you constrain the maximum KL divergence between successive policies to δ, the penalty is at most Cδ, so any update that raises the surrogate above η(π_old) by more than Cδ is guaranteed to raise the true return. This guarantee is the strongest theoretical foundation for stable policy updates in deep RL. The cost: TRPO requires second-order optimization via conjugate gradient with Fisher-vector products, which Schulman et al. report adds approximately 2–3× computational overhead per iteration compared to first-order methods.
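To make the second-order step concrete, here is a minimal sketch of how a trust-region update direction can be computed from Fisher-vector products alone. It is an illustration under assumptions, not the paper's implementation: `fvp` stands in for a function returning F·v (in practice obtained by automatic differentiation), the 2×2 matrix is a toy Fisher matrix, and the backtracking line search that TRPO runs afterwards is omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x starting at zero
    p = list(g)  # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def trust_region_step(fvp, g, delta):
    """Scale the natural-gradient direction s = F^-1 g so that the
    quadratic KL estimate 0.5 * step^T F step equals delta."""
    s = conjugate_gradient(fvp, g)
    scale = math.sqrt(2.0 * delta / dot(s, fvp(s)))
    return [scale * si for si in s]

# Toy example (assumed values): a diagonal Fisher matrix and a gradient.
F = [[2.0, 0.0], [0.0, 0.5]]
toy_fvp = lambda v: [dot(row, v) for row in F]
step = trust_region_step(toy_fvp, g=[1.0, 1.0], delta=0.01)
kl_estimate = 0.5 * dot(step, toy_fvp(step))
```

The step is larger along the direction where the toy Fisher matrix has low curvature, which is the behavior the KL constraint is meant to produce.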

Proximal Policy Optimization (PPO) replaces the KL constraint with a clipped surrogate objective: L^CLIP(θ) = E_t[min(r_t(θ)Â_t, clip(r_t(θ), 1−ε, 1+ε)Â_t)]. Schulman et al. (2017) demonstrate that PPO matches or exceeds TRPO's performance on continuous control benchmarks while requiring only standard stochastic gradient ascent. The trade-off: PPO sacrifices the theoretical monotonic improvement guarantee. In practice, the clipping mechanism prevents destructive updates, but the paper shows that the clipping threshold ε is a sensitive hyperparameter—values outside [0.1, 0.3] degrade performance on the HalfCheetah-v1 benchmark by 20–40%.
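The clipped objective is easy to misread, so a small sketch may help. The function below evaluates L^CLIP on a batch of per-timestep log-probabilities and advantage estimates using plain Python lists; the function name and argument names are illustrative, and a real implementation would compute the same expression on tensors so it can be differentiated.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean over timesteps of min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the min keeps the unclipped, worse value.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

The asymmetry is the point of the design: the objective removes the incentive to move the ratio further once it would help, but never hides an update that makes things worse.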

When to choose TRPO over PPO: when you have the compute budget for second-order optimization and need the strongest stability guarantee, for example when training a policy that will be deployed without further tuning. When to choose PPO: when you need to iterate quickly, scale to distributed training, or use architectures with stochastic components (dropout, noise) that are incompatible with Fisher-vector products.

When Interaction Is Impossible: Offline RL

Now return to the warehouse robot scenario. You cannot afford online interaction. You have a static dataset. Standard off-policy algorithms like DQN or SAC will fail catastrophically.

Levine et al. (2020) diagnose the root cause: distributional shift. When the learned policy π selects actions not covered by the behavior policy π_β, the Q-function extrapolates erroneously. The paper shows that this error compounds during bootstrapping—a 10% overestimation at each step can grow to 300%+ over 50 steps in the D4RL benchmark.

Conservative Q-Learning (CQL) addresses this by regularizing the Q-function so that it underestimates rather than overestimates. On top of the standard Bellman error, the CQL objective adds a term that pushes Q-values down on out-of-distribution actions and up on actions in the dataset: L_CQL = α E_{s∼D}[log Σ_a exp(Q(s,a)) − E_{a∼π_β}[Q(s,a)]]. Kumar et al. (2020) prove that, for sufficiently large α, the policy's expected value under the learned Q-function lower-bounds its true value: V̂^π(s) ≤ V^π(s) for states in the data. The stricter pointwise bound Q^CQL(s,a) ≤ Q^π(s,a) is proved for the simpler variant that omits the dataset-action term. On the D4RL MuJoCo benchmark, CQL achieves 2–5× higher returns than naive SAC applied offline.
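For a discrete action space, the regularizer above can be sketched in a few lines. This is an illustrative version under assumptions: Q-values are given as one list per sampled state, the expectation over π_β is estimated with the single action stored in the dataset for that state, and the function names are hypothetical. In training, this penalty would be added to the usual Bellman error.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_penalty(q_values, data_actions, alpha=1.0):
    """alpha * mean over states of [logsumexp_a Q(s, a) - Q(s, a_data)].

    q_values: list of per-state lists, q_values[i][a] = Q(s_i, a)
    data_actions: index of the action the dataset took in each state
    """
    total = 0.0
    for q_row, a in zip(q_values, data_actions):
        total += logsumexp(q_row) - q_row[a]
    return alpha * total / len(q_values)

# A Q-function that rates an unseen action highly is penalized more
# than one that rates all actions equally.
flat = cql_penalty([[0.0, 0.0]], [0])
optimistic_ood = cql_penalty([[0.0, 10.0]], [0])
```

Because log-sum-exp is at least as large as the largest Q-value, the penalty is never negative, and it shrinks only when the dataset action carries most of the Q-mass.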

The critical failure mode: CQL assumes support overlap—the behavior policy must have placed positive probability on all actions the learned policy might select. This is untestable from data alone. If your dataset was collected by a conservative human operator who never attempted aggressive maneuvers, CQL will learn a conservative policy that also never attempts them. The lower-bound guarantee becomes a ceiling.

The Exploration-Exploitation Trade-off: Thompson Sampling

Whether online or offline, you face the fundamental tension between gathering information and maximizing reward. Thompson sampling provides a principled resolution.

Russo et al. (2018) show that Thompson sampling achieves asymptotically optimal regret in multi-armed bandits: for Bernoulli rewards, the expected regret grows as O(log T), matching the Lai-Robbins lower bound. The algorithm is remarkably simple: maintain a posterior distribution over reward parameters, sample from it, and act greedily with respect to that sample.

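Worked example: You're optimizing a news recommendation system with 100 articles. Each article has an unknown click-through rate, and you start from a Beta(1,1) prior on each. Thompson sampling concentrates pulls on the articles whose posteriors could still plausibly contain the best rate, and asymptotically pulls each suboptimal article only O(log T) times. With 100 articles and only 1,000 user visits, however, the posteriors are still broad: even ln(1000) ≈ 7 pulls for each of the 99 suboptimal articles would use roughly 680 visits, so do not expect the best article to dominate yet. The advantage over a naive ε-greedy strategy with ε=0.1 is where the exploration goes: ε-greedy spends about 100 of the 1,000 pulls on uniformly random articles regardless of information value, while Thompson sampling directs exploration toward articles that remain plausible candidates.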
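The sample-then-act-greedily loop can be sketched for the Beta-Bernoulli case with the standard library alone. The click-through rates passed in are simulated stand-ins, and the function name and seed argument are illustrative choices rather than anything from Russo et al.

```python
import random

def thompson_bernoulli(true_rates, n_rounds, seed=0):
    """Run Beta(1,1)-prior Thompson sampling on simulated Bernoulli arms.

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # One posterior sample per arm: Beta(1 + successes, 1 + failures).
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        # Act greedily with respect to the sampled rates.
        arm = max(range(k), key=samples.__getitem__)
        # Simulated click with the arm's (hidden) true rate.
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.05, 0.95], n_rounds=2000)
```

Running the same function with many arms and few rounds is a quick way to see how slowly pulls concentrate when the budget per arm is small.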

When to choose Thompson sampling over UCB: when the action space is large or structured. UCB requires computing a confidence bound for every action, which is O(|A|) per step. Thompson sampling requires one posterior sample of the model parameters followed by a single optimization over actions under that sample. For problems with combinatorial action spaces (e.g., shortest paths, product assortments), Russo et al. demonstrate that Thompson sampling remains tractable while UCB becomes computationally prohibitive.

The Most Common Failure Mode

Practitioners consistently conflate off-policy learning with offline RL. They take a working SAC implementation, swap the online replay buffer for a static dataset, and expect similar results. This fails because SAC's Q-function was never regularized against OOD actions. The resulting policy exploits spurious high-Q values for actions never taken in the dataset, achieving near-zero true return.

Levine et al. (2020) document this precisely: applying SAC to the D4RL halfcheetah-medium dataset yields a final return of approximately 500, compared to CQL's 4,000+ and the behavior policy's 3,000. The SAC policy is worse than the policy that generated the data—a catastrophic outcome for any deployment.

Practical Checklist

Before using RL in your project, verify:

  1. Can you interact with the environment? If yes, use on-policy methods (PPO or TRPO). If no, use offline RL (CQL). Do not use off-policy algorithms designed for growing replay buffers on static datasets.

  2. Does your dataset cover the actions your policy might take? For offline RL, estimate the behavior policy's probability of each action your learned policy selects, and take the minimum over those actions. If that minimum is near zero, the support-overlap assumption fails and CQL's lower-bound guarantee does not apply.

  3. Is your MDP stationary? Both TRPO and PPO assume that the dynamics and the reward function do not change during training. If your environment violates this (e.g., adversarial opponents that adapt), monitor the KL divergence between successive policies; sudden spikes are a warning sign that the objective has shifted under the policy.

  4. Do you have a proper Bayesian prior? Thompson sampling requires a proper prior. Russo et al. (2018) warn that improper priors lead to undefined posterior sampling. Use conjugate priors (Beta-Bernoulli for binary rewards, Normal-Normal for continuous) unless you have a strong reason to do otherwise.

  5. Is your advantage estimation reliable? PPO's clipping mechanism assumes bounded, reasonably accurate advantage estimates. Monitor the variance of your advantage estimates—if it exceeds 10× the mean absolute advantage, your value function is poorly fit and PPO will fail regardless of clipping.

  6. Can you afford the computational cost of trust regions? If your policy network has fewer than 100,000 parameters and you need guaranteed stability, use TRPO. For larger networks or faster iteration, use PPO with careful hyperparameter tuning (start with ε=0.2, anneal learning rate by 0.5 if KL divergence exceeds 0.02).
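The coverage check in item 2 can be approximated for small discrete problems by counting. The sketch below is a rough illustration under strong assumptions: states and actions are discrete and hashable, the behavior policy is estimated by empirical frequencies, and the `min_prob` threshold is an arbitrary placeholder. Continuous states or actions would need a fitted density or behavior-cloning model instead.

```python
from collections import Counter, defaultdict

def behavior_probabilities(dataset):
    """Empirical pi_beta(a | s) from an iterable of (state, action) pairs."""
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return {
        state: {a: n / sum(c.values()) for a, n in c.items()}
        for state, c in counts.items()
    }

def uncovered_actions(dataset, policy_actions, min_prob=0.01):
    """List (state, action, prob) for actions the learned policy selects
    but the dataset rarely or never contains.

    policy_actions: dict mapping each state to the learned policy's action.
    """
    probs = behavior_probabilities(dataset)
    flagged = []
    for state, action in policy_actions.items():
        p = probs.get(state, {}).get(action, 0.0)
        if p < min_prob:
            flagged.append((state, action, p))
    return flagged

# Toy log (assumed data): the operator almost always chose "slow".
log = [("shelf_a", "slow")] * 99 + [("shelf_a", "fast")]
flagged = uncovered_actions(log, {"shelf_a": "fast"}, min_prob=0.05)
```

An empty result does not prove coverage, since frequencies on a finite log are only estimates, but a non-empty one is a concrete reason to distrust the learned Q-values for those actions.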


Part of the DoOperator Research series on Reinforcement Learning. Browse the full paper corpus at dooperator.ai/research/reinforcement_learning.
