| Authors | Tim Salimans, Jonathan Ho, Xi Chen, I. Sutskever |
| Journal | arXiv.org |
| Year | 2017 |
What Problem It Solves
Reinforcement learning (RL) methods based on Markov Decision Processes (MDPs)—such as Q-learning, policy gradients, and actor-critic algorithms—suffer from several practical limitations that hinder their scalability and robustness. These methods require careful tuning of discount factors, value function approximation, and temporal credit assignment, and they are notoriously sensitive to hyperparameters like learning rates and exploration schedules. Moreover, they often struggle with long horizons, delayed rewards, and high-dimensional action spaces. This paper demonstrates that Evolution Strategies (ES)—a class of black-box optimization algorithms—can serve as a scalable, robust alternative that sidesteps many of these issues. The core problem ES solves is: how to optimize a policy (parameterized by a vector θ) to maximize cumulative reward in a Markov decision process, without relying on gradient computation through the environment dynamics, value function estimation, or temporal difference learning. ES treats the entire return as a black-box function F(θ) and optimizes it using only function evaluations, making it particularly attractive for parallelization and for problems where gradient information is noisy, biased, or expensive to obtain.
Reinforcement learning (RL) methods based on Markov Decision Processes (MDPs)—such as Q-learning, policy gradients, and actor-critic algorithms—suffer from several practical limitations that hinder their scalability and robustness. These methods require careful tuning of discount factors, value function approximation, and temporal credit assignment, and they are notoriously sensitive to hyperparameters like learning rates and exploration schedules. Moreover, they often struggle with long horizons, delayed rewards, and high-dimensional action spaces. This paper demonstrates that Evolution Strategies (ES)—a class of black-box optimization algorithms—can serve as a scalable, robust alternative that sidesteps many of these issues. The core problem ES solves is: how to optimize a policy (parameterized by a vector θ) to maximize cumulative reward in a Markov decision process, without relying on gradient computation through the environment dynamics, value function estimation, or temporal difference learning. ES treats the entire return as a black-box function F(θ) and optimizes it using only function evaluations, making it particularly attractive for parallelization and for problems where gradient information is noisy, biased, or expensive to obtain.
The core idea of Evolution Strategies is deceptively simple: instead of computing gradients through the environment (as in policy gradient methods) or through a value function (as in Q-learning), ES estimates the gradient of the expected return with respect to policy parameters using only evaluations of the return itself. This is achieved through a finite-difference approximation in random directions.
Intuition: Imagine you have a policy parameterized by a vector θ. You want to find θ that maximizes the expected cumulative reward R(θ). You cannot differentiate R with respect to θ because the environment is a black box. However, you can evaluate R(θ) by running a rollout. ES works by: (1) sampling a population of perturbed parameters θ + σ ε, where ε ~ N(0, I) and σ is the noise standard deviation; (2) evaluating the return for each perturbed parameter; (3) using the returns to estimate the gradient direction; (4) updating θ in that direction.
The gradient estimation step is the key insight. For a single perturbation ε, the finite-difference approximation of the directional derivative is:
∇_θ R(θ) · ε ≈ (R(θ + σ ε) - R(θ)) / σ
But this is noisy. To reduce variance, we average over many perturbations. The canonical ES gradient estimate (using antithetic sampling for variance reduction) is:
g = (1/(N σ)) Σ_{i=1}^{N} R(θ + σ ε_i) ε_i
where ε_i are standard normal vectors. This is an unbiased estimator of the gradient of the smoothed objective F_σ(θ) = E_ε[R(θ + σ ε)], which is a Gaussian-smoothed version of the true objective. As σ → 0, this approaches the true gradient, but with increasing variance.
The algorithm proceeds as follows:
Key innovation for scalability: The paper introduces a communication-efficient strategy using common random numbers. Instead of generating perturbations independently on each worker (which would require communicating the full parameter vector), all workers share a synchronized random number generator. Each worker only needs to receive the current parameters θ_t and a seed to generate its perturbation. The worker then sends back only the scalar return R(θ_i). This reduces communication from O(d) (where d is the number of parameters) to O(1) per worker per iteration, enabling scaling to thousands of parallel workers.
Why this works for RL: ES treats the entire trajectory as a single function evaluation. This means it is naturally invariant to action frequency (the number of steps per episode), delayed rewards (the return is just a sum), and long horizons (no discounting needed). It does not require value function approximation, target networks, or experience replay. The price is that it ignores the temporal structure of the problem, which can make it sample-inefficient compared to methods that exploit the MDP structure.
Prefer ES over policy gradients (e.g., PPO, A3C) when:
Prefer policy gradients over ES when:
Prefer ES over Q-learning (e.g., DQN, Rainbow) when:
Prefer Q-learning over ES when:
Sample inefficiency: ES requires many environment interactions per update (N episodes per iteration). For a population of N=1000, each iteration consumes 1000 episodes. This makes ES impractical when environment interactions are expensive (e.g., robotics hardware,
Related papers
Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning
Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti +3 more · 2017
PaperThe Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities
J. Lehman, J. Clune, D. Misevic +50 more · 2018
PaperThe CMA Evolution Strategy: A Tutorial
Nikolaus Hansen · 2016
PaperPopulation Based Training of Neural Networks
Max Jaderberg, Valentin Dalibard, Simon Osindero +9 more · 2017