DoOperator Research · May 22, 2026
When your objective is non-differentiable, black-box, or noisy, gradient descent fails — evolutionary strategies don't. While backpropagation dominates modern machine learning, a growing class of problems remain stubbornly resistant to gradient-based methods. From reinforcement learning with sparse rewards to neural architecture search over discrete topologies, the assumption of smooth, differentiable loss landscapes simply doesn't hold. Evolutionary strategies (ES) offer a compelling alternative: they scale to thousands of parallel workers, handle arbitrary objective functions, and in some cases match or exceed the performance of policy gradients. This post unpacks four key algorithms, their tradeoffs, and when you should reach for evolution over backprop.
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is the workhorse of continuous black-box optimization. Unlike simple genetic algorithms that mutate each parameter independently, CMA-ES maintains a full covariance matrix over the search distribution — essentially learning the local geometry of the objective landscape.
How it works: At each generation, CMA-ES samples candidate solutions from a multivariate Gaussian with mean μ and covariance C. It evaluates these candidates, ranks them by fitness, then updates μ toward the weighted mean of the top solutions. Critically, it also updates C to capture the correlation structure: if moving parameter A up and parameter B down consistently improves fitness, the covariance matrix aligns to that direction.
When it shines: CMA-ES excels on problems with up to ~100 continuous parameters, where the objective is smooth-ish but non-convex, and where gradient information is unavailable or unreliable. It's the default choice for hyperparameter tuning in scikit-learn, robot motor control calibration, and benchmarking optimization algorithms. The catch? The O(n²) memory cost per iteration (storing the covariance matrix) limits scaling to high-dimensional spaces.
Natural Evolution Strategies (NES) bridges the gap between evolutionary methods and information geometry. Instead of maintaining an explicit covariance matrix like CMA-ES, NES frames the optimization problem as: find the parameters θ of a search distribution p(x|θ) that maximize expected fitness.
The key insight: NES computes the natural gradient of expected fitness with respect to the distribution parameters, using the Fisher information matrix to account for the geometry of the parameter space. This yields update rules that are invariant to reparameterization — exactly the same property that makes natural gradient descent powerful in deep learning.
Why it matters: NES provides a principled, scalable alternative to CMA-ES. By approximating the Fisher matrix (often with a diagonal or low-rank structure), NES can handle higher-dimensional problems while retaining the invariance properties that make ES robust. Variants like Exponential NES (xNES) and Separable NES (SNES) offer different tradeoffs between sample efficiency and computational cost.
In 2017, OpenAI demonstrated that a simple evolutionary strategy could match the performance of state-of-the-art policy gradient methods on MuJoCo and Atari tasks — while being trivially parallelizable to over 1,000 workers.
The algorithm: OpenAI ES samples perturbations to the policy parameters ε ~ N(0, σ²I), evaluates each perturbed policy on the environment, and updates the parameter vector using a rank-based fitness shaping: θ ← θ + α/(Nσ) Σ (R_i × ε_i), where R_i are the normalized rewards. That's it — no backprop, no value function, no experience replay.
Scaling to 1000+ workers: The beauty of this approach is its communication efficiency. Each worker only needs to send back a single scalar (the episode reward) per generation. With 1,000 workers, you can evaluate 1,000 perturbations in parallel, achieving wall-clock speedups that policy gradient methods (which need to share gradients and experience) struggle to match. OpenAI reported that ES on 1,440 CPUs achieved the same performance as A3C on 32 GPUs for certain Atari games.
The catch: Sample efficiency suffers. ES typically requires 2-3x more environment interactions than policy gradients to reach the same performance. But when you have massive compute parallelism (cloud clusters, CPU farms), wall-clock time becomes the bottleneck — and ES wins.
Most optimization algorithms seek a single global optimum. Quality-Diversity (QD) algorithms, pioneered by MAP-Elites (Multi-dimensional Archive of Phenotypic Elites), ask a different question: find a diverse set of high-performing solutions that cover different behavioral niches.
How MAP-Elites works: Define a behavior space (e.g., robot gait speed and height, or neural network architecture depth and width). Create a grid over this space, where each cell stores the best solution found so far for that behavioral region. At each iteration, sample a parent from the archive, mutate it, evaluate its behavior and fitness, and insert it into the appropriate cell if it outperforms the current occupant.
Why it matters: For sim-to-real transfer, you don't want a single brittle solution — you want a portfolio of diverse behaviors that can adapt to real-world variations. In robotics, MAP-Elites has produced controllers that transfer to damaged robots by finding alternative gaits. In game design, it generates levels that are both playable and aesthetically varied. In NAS, it discovers architectures with different tradeoffs between accuracy and latency.
Non-differentiable rewards: RL with sparse, binary, or structured rewards (e.g., "did the robot reach the goal?"). Gradients are zero almost everywhere; ES sees the same signal but explores globally.
Sim-to-real transfer: Policies optimized with gradients often overfit to simulation physics. ES, with its inherent noise and exploration, produces more robust solutions that transfer to real hardware.
Neural Architecture Search (NAS): Discrete architecture choices (number of layers, kernel sizes, skip connections) break differentiability. ES can optimize over mixed continuous-discrete spaces naturally.
Adversarial robustness: Finding worst-case perturbations (adversarial examples) is a non-differentiable max-min problem. ES can discover more diverse adversarial examples than gradient-based attacks.
| Aspect | Gradient-Based | Evolutionary Strategies |
|---|---|---|
| Sample efficiency | Higher (uses gradient info) | Lower (random search noise) |
| Parallelism | Limited (gradient sync) | Embarrassingly parallel |
| Objective constraints | Requires differentiability | Any black-box function |
| Local vs. global | Local convergence | Global exploration (with noise) |
| Hyperparameters | Learning rate, optimizer | Population size, mutation rate, σ |
Mutation rate tuning is the ES equivalent of learning rate scheduling. Too high → random walk; too low → premature convergence. Adaptive schemes (like CMA-ES's covariance update) reduce this burden but add complexity.
The boundaries between ES and gradient-based methods are blurring. Recent work shows that ES can be interpreted as a policy gradient with a Gaussian exploration distribution — the gradient estimate is exactly ∇_θ E[R] ≈ E[R ε] / σ². This connects ES to the REINFORCE algorithm.
Similarly, CMA-ES's covariance update is formally equivalent to a natural gradient step on the parameters of a Gaussian distribution, using the exact Fisher information matrix. This insight has led to hybrid algorithms that combine the sample efficiency of natural gradients with the parallelism of ES.
Choose gradient-based when:
Choose evolutionary strategies when:
Evolutionary strategies are not a silver bullet — they trade sample efficiency for parallelism and generality. But in an era of cheap compute and complex, non-differentiable objectives, they've earned a permanent place in the optimization toolbox. When gradients fail, evolution scales.