Evolution Strategies as a Scalable Alternative to Reinforcement Learning — DoOperator Research

Authors	Tim Salimans, Jonathan Ho, Xi Chen, I. Sutskever
Journal	arXiv.org
Year	2017

What Problem It Solves

Reinforcement learning (RL) methods based on Markov Decision Processes (MDPs)—such as Q-learning, policy gradients, and actor-critic algorithms—suffer from several practical limitations that hinder their scalability and robustness. These methods require careful tuning of discount factors, value function approximation, and temporal credit assignment, and they are notoriously sensitive to hyperparameters like learning rates and exploration schedules. Moreover, they often struggle with long horizons, delayed rewards, and high-dimensional action spaces. This paper demonstrates that Evolution Strategies (ES)—a class of black-box optimization algorithms—can serve as a scalable, robust alternative that sidesteps many of these issues. The core problem ES solves is: how to optimize a policy (parameterized by a vector θ) to maximize cumulative reward in a Markov decision process, without relying on gradient computation through the environment dynamics, value function estimation, or temporal difference learning. ES treats the entire return as a black-box function F(θ) and optimizes it using only function evaluations, making it particularly attractive for parallelization and for problems where gradient information is noisy, biased, or expensive to obtain.

What problem it solves

How it works

The core idea of Evolution Strategies is deceptively simple: instead of computing gradients through the environment (as in policy gradient methods) or through a value function (as in Q-learning), ES estimates the gradient of the expected return with respect to policy parameters using only evaluations of the return itself. This is achieved through a finite-difference approximation in random directions.

Intuition: Imagine you have a policy parameterized by a vector θ. You want to find θ that maximizes the expected cumulative reward R(θ). You cannot differentiate R with respect to θ because the environment is a black box. However, you can evaluate R(θ) by running a rollout. ES works by: (1) sampling a population of perturbed parameters θ + σ ε, where ε ~ N(0, I) and σ is the noise standard deviation; (2) evaluating the return for each perturbed parameter; (3) using the returns to estimate the gradient direction; (4) updating θ in that direction.

The gradient estimation step is the key insight. For a single perturbation ε, the finite-difference approximation of the directional derivative is:

∇_θ R(θ) · ε ≈ (R(θ + σ ε) - R(θ)) / σ

But this is noisy. To reduce variance, we average over many perturbations. The canonical ES gradient estimate (using antithetic sampling for variance reduction) is:

g = (1/(N σ)) Σ_{i=1}^{N} R(θ + σ ε_i) ε_i

where ε_i are standard normal vectors. This is an unbiased estimator of the gradient of the smoothed objective F_σ(θ) = E_ε[R(θ + σ ε)], which is a Gaussian-smoothed version of the true objective. As σ → 0, this approaches the true gradient, but with increasing variance.

The algorithm proceeds as follows:

Initialize policy parameters θ₀.
For each iteration t: a. Sample N random seeds (or use a shared random number generator). b. For each seed i, generate perturbation ε_i and compute perturbed parameters θ_i = θ_t + σ ε_i. c. In parallel across workers, evaluate R(θ_i) by running a full episode rollout. d. Compute the gradient estimate g_t using the formula above. e. Update θ_{t+1} = θ_t + α g_t, where α is the learning rate. f. Optionally, apply a rank-based transformation to the returns (e.g., using the rank instead of raw return) to make the algorithm invariant to monotonic transformations of the reward.

Key innovation for scalability: The paper introduces a communication-efficient strategy using common random numbers. Instead of generating perturbations independently on each worker (which would require communicating the full parameter vector), all workers share a synchronized random number generator. Each worker only needs to receive the current parameters θ_t and a seed to generate its perturbation. The worker then sends back only the scalar return R(θ_i). This reduces communication from O(d) (where d is the number of parameters) to O(1) per worker per iteration, enabling scaling to thousands of parallel workers.

Why this works for RL: ES treats the entire trajectory as a single function evaluation. This means it is naturally invariant to action frequency (the number of steps per episode), delayed rewards (the return is just a sum), and long horizons (no discounting needed). It does not require value function approximation, target networks, or experience replay. The price is that it ignores the temporal structure of the problem, which can make it sample-inefficient compared to methods that exploit the MDP structure.

When to use it

Prefer ES over policy gradients (e.g., PPO, A3C) when:

You have access to massive parallel compute (hundreds to thousands of CPUs) and want to minimize wall-clock time. ES scales linearly with the number of workers because communication is negligible.
The reward signal is sparse or delayed, and you cannot easily design a shaped reward. ES treats the entire episode return as a single scalar, so it does not need temporal credit assignment.
The action frequency is high or variable (e.g., continuous control at 1000 Hz vs 10 Hz). ES is invariant to the number of steps per episode.
You want to avoid the complexity of value function approximation, target networks, and hyperparameter tuning for discount factors and bootstrapping.
The environment is deterministic or has low stochasticity. ES performs best when the variance from parameter perturbations dominates environmental noise.

Prefer policy gradients over ES when:

You have limited computational resources (e.g., a single GPU). ES requires many function evaluations per update, making it sample-inefficient in terms of total environment interactions.
The environment is highly stochastic. ES gradient estimates become noisy when environmental randomness dominates, requiring larger populations or smaller σ.
You need fine-grained temporal credit assignment (e.g., which action in a long sequence caused a delayed reward). Policy gradients can use value functions to assign credit across time steps.
The policy parameter space is very high-dimensional (e.g., millions of parameters for deep networks). ES scales poorly with parameter dimension because the variance of the gradient estimate grows linearly with d.
You care about sample efficiency (total number of environment steps) rather than wall-clock time. ES typically requires 10-100x more environment interactions than PPO or SAC for the same performance.

Prefer ES over Q-learning (e.g., DQN, Rainbow) when:

The action space is continuous. Q-learning requires discretization or a separate optimization for continuous actions.
You want to avoid overestimation bias and target network instability. ES has no such issues.
The reward function is non-stationary or has complex structure. ES does not rely on Bellman optimality.

Prefer Q-learning over ES when:

The state space is high-dimensional but the action space is discrete and small. Q-learning can leverage function approximation efficiently.
You need off-policy learning from a fixed dataset. ES is inherently on-policy (it evaluates the current policy).
You have a limited budget for environment interactions. Q-learning can reuse past experiences.

Limitations and failure modes

Sample inefficiency: ES requires many environment interactions per update (N episodes per iteration). For a population of N=1000, each iteration consumes 1000 episodes. This makes ES impractical when environment interactions are expensive (e.g., robotics hardware,

Read full paper →PDF ↗More Evolutionary Methods →