Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning — DoOperator Research

Authors	Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, Jeff Clune
Journal	arXiv
Year	2017

What Problem It Solves

Training deep neural networks for reinforcement learning (RL) has been dominated by gradient-based methods—specifically backpropagation through time, policy gradients (e.g., REINFORCE, PPO, TRPO), and Q-learning variants (e.g., DQN). These methods suffer from several well-known limitations: they require differentiable policies, are sensitive to reward sparsity, can get trapped in local optima, and often need careful hyperparameter tuning of learning rates, exploration schedules, and network architectures. This paper challenges the assumption that gradient information is necessary for training deep neural networks in RL. It demonstrates that a classic, gradient-free evolutionary algorithm—specifically a simple genetic algorithm (GA)—can train neural networks with millions of parameters to solve challenging RL benchmarks, including Atari games and continuous control tasks, at performance levels competitive with or exceeding state-of-the-art deep RL methods. The core problem addressed is whether evolutionary methods, long considered impractical for high-dimensional neural network optimization, can scale to modern deep RL problems and offer a viable alternative when gradient-based methods struggle.

What problem it solves

How it works

The core idea is straightforward: instead of computing gradients to update neural network weights, the GA treats the entire weight vector as a genotype and evolves a population of such genotypes through selection, crossover, and mutation. The algorithm proceeds as follows:

Initialization: Create a population of (P) individuals, each represented as a vector of neural network weights. Weights are typically initialized randomly (e.g., from a uniform or normal distribution with small variance).
Fitness evaluation: For each individual, decode the weight vector into a neural network policy. Run the policy in the RL environment for one or more episodes. The total reward (or a function thereof) is the fitness score. This step is embarrassingly parallel—each individual can be evaluated on a separate CPU/GPU.
Selection: Select parents for the next generation based on fitness. The paper uses tournament selection: randomly sample (k) individuals from the population, pick the one with highest fitness, and repeat until enough parents are selected. Tournament size (k) controls selection pressure (larger (k) = more pressure).
Crossover (recombination): With probability (p_c), two parents produce offspring by crossing over their weight vectors. The paper uses uniform crossover: each weight is inherited from either parent with equal probability. With probability (1 - p_c), a parent is cloned directly.
Mutation: Each weight in the offspring's weight vector is perturbed with probability (p_m) by adding Gaussian noise (\mathcal{N}(0, \sigma^2)). The mutation rate (p_m) and mutation scale (\sigma) are key hyperparameters.
Elitism: The best-performing individual(s) from the current generation are copied unchanged into the next generation to prevent loss of the best solution.
Repeat: Steps 2–6 for (G) generations.

The key insight is that this simple procedure can optimize millions of parameters. The paper shows that with a population size of (P = 1000), tournament size (k = 25), crossover probability (p_c = 0.5), mutation probability (p_m = 0.1) per weight, and mutation scale (\sigma = 0.02), the GA achieves competitive performance on Atari games and continuous control tasks.

No equations are needed beyond the mutation update: (w_i' = w_i + \epsilon_i) where (\epsilon_i \sim \mathcal{N}(0, \sigma^2)) with probability (p_m), else (\epsilon_i = 0). The fitness is simply the cumulative episode reward.

When to use it

Prefer deep neuroevolution over gradient-based RL when:

The reward signal is sparse or delayed, making credit assignment difficult for policy gradients. Evolutionary methods evaluate entire trajectories and do not require temporal credit assignment.
The policy is non-differentiable (e.g., discrete action spaces with hard thresholds, or policies involving logical rules or look-up tables).
You have access to massive parallel compute (e.g., thousands of CPU cores) and want to trade sample efficiency for wall-clock speed. The GA is embarrassingly parallel, while gradient-based methods require sequential updates.
The environment is deterministic or has low stochasticity. The GA's fitness evaluation is noisy in stochastic environments, requiring multiple episodes per individual.
You want to avoid gradient-related pathologies: vanishing/exploding gradients, saddle points, or sensitivity to learning rate schedules.

Prefer gradient-based RL (PPO, SAC, DQN) over deep neuroevolution when:

Sample efficiency is critical. Gradient-based methods typically require fewer environment interactions to achieve good performance because they use gradient information to move in a more informed direction.
The environment is highly stochastic. The GA's fitness estimates become noisy, requiring averaging over many episodes, which increases computational cost.
You have limited parallel compute. The GA requires evaluating many individuals per generation, which can be slow on a single machine.
The policy is continuous and smooth, and gradient information is cheap to compute. Policy gradient methods can exploit this structure.
You need to handle off-policy learning or experience replay, which gradient-based methods can leverage but the GA cannot.

Prefer deep neuroevolution over other evolutionary methods (e.g., CMA-ES, OpenAI ES) when:

Simplicity and ease of implementation are priorities. The GA has fewer hyperparameters and no covariance matrix adaptation.
You want to scale to very large populations (millions of individuals). The GA's memory footprint is linear in population size, while CMA-ES is quadratic.
You want to leverage crossover, which can recombine good building blocks. CMA-ES and OpenAI ES are mutation-only.

Prefer CMA-ES or OpenAI ES over deep neuroevolution when:

The search space is relatively low-dimensional (e.g., < 10,000 parameters). CMA-ES adapts the mutation distribution and can be more sample-efficient.
You need adaptive step sizes. The GA uses fixed mutation scale, which may be suboptimal as evolution progresses.
You want a theoretically grounded update rule. OpenAI ES is a special case of natural evolution strategies with connections to policy gradients.

Limitations and failure modes

Sample inefficiency: The GA requires orders of magnitude more environment interactions than gradient-based methods. On Atari, the GA used 1–5 billion frames vs. DQN's 50–200 million. This is acceptable only when parallel compute is abundant.
No temporal credit assignment: The GA treats an entire episode as a single fitness value. It cannot distinguish which actions were good or bad within an episode. This makes it unsuitable for problems with long horizons and sparse rewards where only a few actions matter.
Fixed architecture: The GA optimizes weights for a fixed network architecture. It cannot discover new architectures or connectivity patterns (though neuroevolution methods like NEAT can).
Sensitivity to mutation parameters: The mutation scale (\sigma) and probability (p_m) are critical and problem-dependent. There is no automatic adaptation (unlike CMA-ES). Poor choices lead to slow progress or divergence.
No reuse of experience: The GA discards all experience after each generation. It cannot use experience replay or off-policy learning, which are key to sample efficiency in deep RL.
Stochastic environments: In highly stochastic environments, fitness estimates are noisy. The GA may converge to a policy that is lucky rather than good. Averaging over multiple episodes helps but increases computational cost.
Deceptive fitness landscapes: If the fitness landscape has many local optima or requires specific sequences of actions, the GA may struggle. Crossover can help but is not guaranteed.
Scalability to very large networks: The paper shows the GA works with millions of parameters, but the mutation operator becomes increasingly random as dimensionality grows. For networks with billions of parameters, the GA may be impractical.
No theoretical convergence guarantees: Unlike policy gradient methods, which have convergence guarantees under certain conditions, the GA has no such guarantees. Performance is empirical.
Common misapplication: Using the GA on problems where gradient-based methods work well and compute is limited. The GA is not a drop-in replacement for PPO or SAC; it is a complementary tool for specific scenarios.

Read full paper →PDF ↗More Evolutionary Methods →