A Practitioner's Guide to Evolutionary Methods

You've spent three weeks tuning a 12-layer policy network for a continuous control task. The reward signal is sparse—your agent gets a non-zero reward maybe once every 200 steps. Policy gradients are producing gradient estimates with variance so high that your loss curve looks like a random walk. Your learning rate scheduler has its own scheduler. You're questioning your career choices.

This is the problem evolutionary methods solve: optimizing high-dimensional, non-differentiable, noisy objective functions where gradient information is either unavailable, unreliable, or computationally prohibitive to obtain. Not as a last resort—as a deliberate design choice with specific theoretical and empirical advantages.

What Evolution Strategies Actually Do

Evolution Strategies (ES) treat optimization as a black-box problem. You have parameters θ, a function F(θ) that returns a scalar reward, and you want to find θ* that maximizes F. The core idea is deceptively simple: sample perturbations around the current parameters, evaluate them, and move in the direction of the ones that worked.

The canonical form—used by Salimans et al. (2017) to train Atari-playing agents with 3.7 million parameters—estimates the gradient as:

∇_θ E_ε[F(θ + σε)] ≈ (1/(Nσ)) Σ_{i=1..N} F(θ + σε_i) · ε_i

where ε_i ~ N(0, I) and σ is the perturbation scale. This resembles a randomized finite-difference approximation, but with a critical property: it is an unbiased Monte Carlo estimate of the gradient of the Gaussian-smoothed objective E_ε[F(θ + σε)], which approximates ∇F(θ) when F is smooth and σ is small (Salimans et al., 2017, Section 3). The variance scales as O(1/N), so you can trade computation for precision by increasing the population size.

This isn't a heuristic. It's a Monte Carlo gradient estimator with known convergence properties under mild regularity conditions.
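A minimal sketch of this estimator in plain Python, assuming a deterministic toy objective. The names es_gradient, es_maximize, and toy_reward are hypothetical and chosen for illustration; the step size, σ, and population size are arbitrary picks, not values from Salimans et al.

```python
import random


def es_gradient(f, theta, sigma=0.1, n=200, rng=None):
    """Monte Carlo estimate of the gradient of E[f(theta + sigma * eps)]."""
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(n):
        # Sample one isotropic Gaussian perturbation and evaluate it
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        ret = f([t + sigma * e for t, e in zip(theta, eps)])
        # Accumulate return-weighted perturbations
        for j in range(dim):
            grad[j] += eps[j] * ret
    return [g / (n * sigma) for g in grad]


def es_maximize(f, theta, steps=200, lr=0.05, sigma=0.1, n=200, seed=0):
    """Plain gradient ascent on the ES estimate (no optimizer, no baseline)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = es_gradient(f, theta, sigma, n, rng)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def toy_reward(theta):
    # Hypothetical objective with its maximum at (1, -2)
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


theta_star = es_maximize(toy_reward, [0.0, 0.0])
```

In practice the raw estimator is usually paired with variance reduction such as antithetic (mirrored) sampling or rank-based return normalization; those are left out here to keep the code term-for-term comparable with the formula.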

When Gradient-Based Methods Fail

Consider the scenario that motivated the Deep Neuroevolution paper (Such et al., 2017): training neural networks for RL where the reward is sparse and the environment is stochastic. Policy gradient methods estimate gradients from per-step action log-probabilities weighted by returns along each trajectory, which means:

Temporal credit assignment becomes a separate optimization problem (discount factors, advantage estimation, value function fitting)
Gradient variance compounds over long horizons
Exploration must be engineered through entropy bonuses or noise injection

Evolutionary methods sidestep all three. They evaluate complete trajectories, assign credit based solely on final return, and explore through parameter-space noise. Such et al. (2017) showed that a simple genetic algorithm matched or exceeded DQN and A3C on 10 of 11 Atari games, despite using no gradient information whatsoever.
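To make "simple genetic algorithm" concrete, here is a rough sketch of a mutation-only GA with truncation selection and a single elite. It is an illustration under assumptions, not the authors' implementation: simple_ga and toy_fitness are hypothetical names, the fitness function is a deterministic toy, and all settings are arbitrary.

```python
import random


def toy_fitness(weights):
    # Hypothetical stand-in for an episode return; maximum at all 0.5
    return -sum((w - 0.5) ** 2 for w in weights)


def simple_ga(fitness, dim, pop_size=50, n_parents=10, sigma=0.1,
              generations=100, seed=0):
    """Mutation-only GA: truncation selection, Gaussian mutation, one elite."""
    rng = random.Random(seed)
    pop = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by fitness and keep only the top n_parents as parents
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[:n_parents]
        # Elitism: the best individual survives unchanged
        children = [list(ranked[0])]
        while len(children) < pop_size:
            parent = rng.choice(parents)
            # No crossover, no gradients: a child is a parent plus noise
            children.append([w + sigma * rng.gauss(0.0, 1.0) for w in parent])
        pop = children
    return max(pop, key=fitness)


best = simple_ga(toy_fitness, dim=5)
```

Selection uses only the ranking of returns, so no gradient enters the loop at any point.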

The catch? ES requires more function evaluations per update. But in distributed settings, this becomes a feature: Salimans et al. (2017) demonstrated near-linear speedup up to 1,440 workers, because only scalar returns need to be communicated—no gradients, no network activations.
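One way to see why scalar returns suffice: if every worker derives its perturbation from a seed, any node can rebuild all perturbations from (seed, return) pairs. The sketch below is a simplified stand-in under that assumption; perturbation, worker_evaluate, and apply_update are hypothetical names, and the workers are simulated sequentially rather than run on separate machines.

```python
import random


def perturbation(seed, dim):
    """Rebuild the Gaussian perturbation identified by a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def worker_evaluate(f, theta, sigma, seed):
    """One worker: perturb, evaluate, and report only (seed, return)."""
    eps = perturbation(seed, len(theta))
    return seed, f([t + sigma * e for t, e in zip(theta, eps)])


def apply_update(theta, results, sigma, lr):
    """Any node can recompute the full update from the reported pairs."""
    grad = [0.0] * len(theta)
    for seed, ret in results:
        eps = perturbation(seed, len(theta))
        for j, e in enumerate(eps):
            grad[j] += e * ret
    n = len(results)
    return [t + lr * g / (n * sigma) for t, g in zip(theta, grad)]


def toy_return(theta):
    # Hypothetical objective standing in for an episode return
    return -sum(x * x for x in theta)


theta = [1.0, -1.0, 0.5]
results = [worker_evaluate(toy_return, theta, 0.1, seed) for seed in range(8)]
new_theta = apply_update(theta, results, sigma=0.1, lr=0.01)
```

Each message is two numbers regardless of how many parameters the policy has, which is what keeps communication cost flat as the worker count grows.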

CMA-ES: When Isotropic Noise Isn't Enough

The simple ES above assumes all parameter dimensions are equally important and independent. This is almost never true. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) solves this by learning the correlation structure between parameters.

Hansen (2016) describes the key mechanism: CMA-ES maintains a multivariate normal distribution N(m, σ²C) over the search space. At each generation, it samples λ candidates, selects the top μ, and updates both the mean m and the covariance matrix C. The covariance update uses the empirical covariance of the selected steps, plus a rank-one update based on the evolution path—a weighted sum of consecutive mean shifts.
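The updates described above can be written out roughly as follows. This is a simplified sketch in the notation of Hansen's tutorial, assuming positive recombination weights that sum to one and omitting step-size control (the update of σ) and the stall indicator h_σ. Here y_{i:λ} denotes the i-th best of the λ sampled steps, c_c, c_1, c_μ are learning rates, and μ_eff = 1 / Σ w_i².

```latex
% One CMA-ES generation, step-size adaptation omitted.
% Assumes w_1 >= ... >= w_mu > 0 and sum_i w_i = 1.
\begin{align*}
x_i &= m + \sigma\, y_i, \qquad y_i \sim \mathcal{N}(0, C), \qquad i = 1, \dots, \lambda \\
m' &= m + \sigma \sum_{i=1}^{\mu} w_i\, y_{i:\lambda} \\
p_c &\leftarrow (1 - c_c)\, p_c
      + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \frac{m' - m}{\sigma} \\
C &\leftarrow (1 - c_1 - c_\mu)\, C
      + c_1\, p_c\, p_c^{\top}
      + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^{\top}
\end{align*}
```

The c_1 term is the rank-one update driven by the evolution path p_c, and the c_μ term is the weighted empirical covariance of the selected steps.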

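The theoretical guarantees are strong: CMA-ES is invariant to strictly increasing (order-preserving) transformations of the objective function and to rotations of the search space (Hansen, 2016, Section 2.2). This means it performs identically whether you optimize F or any strictly increasing function of F, and whether your problem is axis-aligned or rotated 45 degrees, provided the initial distribution is rotated with it. Because the covariance matrix adapts, it also becomes largely insensitive to whether a parameter is measured in meters or millimeters once it has had a few generations to adapt. Few gradient-based methods can make these claims without re-tuning.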

Concrete example: You're tuning 20 hyperparameters for a production recommendation system. The objective is offline AUC, which takes 45 minutes to evaluate. Grid search would require a number of evaluations exponential in the number of hyperparameters. Bayesian optimization with standard smooth kernels can struggle when 20 hyperparameters interact strongly. CMA-ES, with the default population of λ = 4 + ⌊3 ln(20)⌋ = 12 candidates per generation, will typically reach within 1% of the optimum in 20-30 generations (Hansen, 2016, Section 4.1). That's 240-360 evaluations: about 7.5-11 days of wall-clock time with a single worker, or under 24 hours with one worker per candidate (20-30 generations at 45 minutes each is 15-22.5 hours).
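The budget arithmetic is easy to get wrong by hand, so a small helper is worth having. For n = 20 the default formula evaluates to 12, because 3 ln 20 ≈ 8.99 floors to 8. The function names below are hypothetical, and the wall-clock model assumes every evaluation takes the same time and that workers are otherwise idle.

```python
import math


def default_popsize(n):
    """Default CMA-ES population size: 4 + floor(3 ln n)."""
    return 4 + math.floor(3 * math.log(n))


def budget(n, generations, minutes_per_eval, workers=1):
    """Return (population size, total evaluations, wall-clock hours)."""
    lam = default_popsize(n)
    evals = lam * generations
    # Each generation needs ceil(lam / workers) sequential evaluation rounds
    rounds = math.ceil(lam / workers) * generations
    hours = rounds * minutes_per_eval / 60
    return lam, evals, hours


single_worker = budget(20, 30, 45)            # 30 generations, one worker
fully_parallel = budget(20, 30, 45, workers=12)
```

With a single worker, 30 generations cost 360 evaluations and 270 hours; with 12 workers the same run takes 22.5 hours.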

Population-Based Training: Hyperparameters That Evolve

The standard approach to hyperparameter optimization—grid search, random search, Bayesian optimization—treats hyperparameters as static choices made before training. This is wrong. The optimal learning rate at epoch 1 is not the optimal learning rate at epoch 100.

Population Based Training (PBT), introduced by Jaderberg et al. (2017), solves this by optimizing model parameters and hyperparameters together. A population of models trains in parallel. At regular intervals, each underperforming model copies the weights and hyperparameters of a better-performing member (exploit), then perturbs the copied hyperparameters (explore).
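A minimal sketch of one exploit-and-explore step, assuming truncation selection (the bottom fraction copies from the top fraction) and multiplicative perturbation. The function pbt_step, the dict layout, and the 20% / 0.8 / 1.2 settings are illustrative assumptions; training and scoring between steps are left out.

```python
import copy
import random


def pbt_step(population, rng, frac=0.2, factors=(0.8, 1.2)):
    """Exploit-and-explore over a list of member dicts.

    Each member has 'weights', 'hparams' (dict of floats), and 'score'.
    Assumes at least two members and frac <= 0.5 so top and bottom differ.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cut = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:cut], ranked[-cut:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy weights and hyperparameters from a better member
        member["weights"] = copy.deepcopy(donor["weights"])
        member["hparams"] = dict(donor["hparams"])
        # Explore: perturb each copied hyperparameter
        for name in member["hparams"]:
            member["hparams"][name] *= rng.choice(factors)
    return population


rng = random.Random(0)
population = [
    {"weights": [float(i)], "hparams": {"lr": 0.01 * (i + 1)}, "score": float(i)}
    for i in range(5)
]
pbt_step(population, rng)
```

Members outside the bottom fraction keep training untouched, so the population as a whole traces out a hyperparameter schedule rather than a single fixed setting.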

The key result from Jaderberg et al. (2017): PBT achieved state-of-the-art results on Atari, StarCraft II, and machine translation tasks, using the same computational budget as standard training. The method automatically discovered learning rate schedules that decayed over time, entropy bonuses that increased with training, and regularization strengths that adapted to model capacity.

This isn't just convenience. PBT addresses a fundamental mismatch: hyperparameter optimization is a sequential decision problem, but standard methods treat it as a one-shot choice.

The Most Common Failure Mode

Practitioners consistently misuse evolutionary methods by ignoring the evaluation noise problem.

Evolutionary methods evaluate each candidate once (or a few times) and use that noisy estimate for selection. If the noise from environmental stochasticity dominates the signal from parameter perturbations, selection becomes random. The algorithm drifts.

Salimans et al. (2017) demonstrated this empirically: on the Humanoid-v1 MuJoCo task, using a single episode per evaluation produced a final reward of ~800, while averaging over 20 episodes per evaluation reached ~1,800—more than double. The gradient estimate variance from environmental noise was swamping the signal.

The fix: Before running any evolutionary optimization, measure the variance of F(θ) at a fixed θ over multiple random seeds. If the standard deviation exceeds 10% of the mean, you need to average over multiple evaluations per candidate. Averaging k independent evaluations shrinks the standard deviation by a factor of √k, so computational cost grows linearly in k while the noise falls only as its square root. It's non-negotiable.
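A small sketch of that check, assuming you have already collected repeated returns at one fixed θ. The helper names are hypothetical, and the estimate of how many evaluations to average relies on the evaluations being independent, so that the standard deviation of a k-sample mean is the single-sample value divided by √k.

```python
import math
import statistics


def noise_ratio(returns):
    """Sample standard deviation of repeated returns divided by |mean|.

    Undefined when the mean return is zero; rescale or shift F in that case.
    """
    mean = statistics.fmean(returns)
    return statistics.stdev(returns) / abs(mean)


def evals_needed(returns, target=0.1):
    """Evaluations to average so the std of the mean is <= target * |mean|."""
    ratio = noise_ratio(returns)
    return max(1, math.ceil((ratio / target) ** 2))


# Hypothetical returns from five seeds at the same parameters
returns = [90.0, 110.0, 100.0, 80.0, 120.0]
ratio = noise_ratio(returns)
k = evals_needed(returns)
```

For these five returns the ratio is about 0.16, so averaging three evaluations per candidate brings the noise under the 10% threshold.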

A secondary failure mode: using too small a population. Hansen (2016) recommends λ ≥ 4 + ⌊3 ln(n)⌋ for CMA-ES, where n is the dimensionality. Smaller populations converge prematurely to local optima. Larger populations explore more thoroughly but require more evaluations. There's no free lunch.

Practical Checklist

Before deploying evolutionary methods in your next study, verify:

Is the objective function sufficiently smooth? ES estimates the gradient of a Gaussian-smoothed version of F, which tracks the true gradient only when F varies gradually at the scale of σ. If your reward landscape is a step function (e.g., binary success/failure with no intermediate signal) and most perturbations land on the same plateau, every candidate returns the same value and ES gets no signal. Test by measuring how F(θ + σε) changes as σ varies.
Have you characterized the evaluation noise? Run F(θ) at 3-5 fixed parameter settings, 20+ times each. If the noise-to-signal ratio exceeds 0.1, you must average multiple evaluations per candidate. Document this ratio in your methods section.
Is your population size appropriate for the dimensionality? For CMA-ES, use Hansen's formula. For simple ES, start with N = 10·n and reduce if wall-clock time is prohibitive. Monitor population diversity—if it collapses before generation 10, increase population size.
Are you parallelizing correctly? Evolutionary methods are embarrassingly parallel, but only if evaluation is the bottleneck. Profile your pipeline: if communication or checkpointing dominates wall time, you're losing the advantage. Salimans et al. (2017) achieved near-linear speedup because communication was limited to scalar returns.
Have you checked for invariance violations? CMA-ES is rotationally invariant; simple ES is not. If your parameters have strong correlations (e.g., weights in adjacent neural network layers), use CMA-ES or a method that adapts the covariance structure. Simple ES with isotropic noise will waste evaluations.
Is the budget sufficient for convergence? Evolutionary methods typically require 10-100× more function evaluations than gradient-based methods to reach comparable solution quality. If your total evaluation budget is under 1,000, consider Bayesian optimization instead. If it's over 10,000, evolutionary methods will likely outperform it.
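For the smoothness item in the checklist, one rough probe is to count how often a perturbation changes the return at all, across several values of σ. The sketch below assumes F is deterministic (or evaluated with a fixed environment seed); signal_fraction and step_reward are hypothetical names, and the step objective is a toy.

```python
import random


def signal_fraction(f, theta, sigma, n=200, seed=0):
    """Fraction of Gaussian perturbations at scale sigma that change f."""
    rng = random.Random(seed)
    base = f(theta)
    changed = 0
    for _ in range(n):
        cand = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        if f(cand) != base:
            changed += 1
    return changed / n


def step_reward(theta):
    # Hypothetical binary success signal: a step function in parameter space
    return 1.0 if sum(theta) > 1.0 else 0.0


for sigma in (0.01, 0.1, 1.0, 2.0):
    frac = signal_fraction(step_reward, [0.0, 0.0], sigma)
    print(f"sigma={sigma}: fraction of informative perturbations = {frac:.2f}")
```

A fraction near zero at the σ you plan to use means nearly all candidates tie, so selection has nothing to rank; a larger σ or a shaped reward is needed before ES can make progress.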

Part of the DoOperator Research series on Evolutionary Methods. Browse the full paper corpus at dooperator.ai/research/evolutionary_methods.