| Authors | Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, O. Vinyals, Tim Green, Iain Dunning, K. Simonyan, Chrisantha Fernando, K. Kavukcuoglu |
| Journal | arXiv.org |
| Year | 2017 |
What Problem It Solves
Training neural networks requires selecting hyperparameters — learning rates, batch sizes, network architectures, regularisation coefficients — that are typically fixed at the start of training or tuned via grid search, random search, or Bayesian optimisation. These approaches treat hyperparameter selection as a static, pre-training problem: find the best fixed configuration, then train with it. This is fundamentally mismatched with the dynamics of neural network training, where the optimal hyperparameter values change as the model moves through parameter space. A learning rate that works well early in training may be too large for fine-tuning later; a regularisation strength that prevents overfitting in the first epoch may cripple capacity in the final one. Existing methods either waste compute on full training runs for each candidate configuration (grid/random/Bayesian search) or require expensive gradient-based meta-learning. Population Based Training (PBT) solves the joint problem of training a model and discovering a schedule of hyperparameters over the course of training, using a fixed computational budget and requiring only a small modification to standard distributed training frameworks.
PBT combines ideas from evolutionary algorithms and hyperparameter optimisation. The core insight is that instead of treating hyperparameter tuning as a separate pre-training step, we can interleave it with model training. A population of models (typically 16–64) is trained in parallel, each with its own hyperparameters. Periodically, the worst-performing models are replaced by copies of the best-performing models, with their hyperparameters perturbed. This creates a natural schedule: hyperparameters evolve over time as the population discovers what works at each stage of training.
The algorithm proceeds as follows:
Initialise population. Sample N sets of hyperparameters from a prior distribution (e.g., log-uniform for learning rate, uniform for entropy coefficient). Create N copies of the model with these hyperparameters, each assigned to a worker.
Train asynchronously. Each worker trains its model for a fixed number of steps (the "exploit/explore interval", typically 1–10% of total training). After each interval, the worker evaluates the model on a validation set and reports the metric.
Select, exploit, explore. For each worker, compare its validation metric to the rest of the population. If the worker is in the bottom 20% (the "tail"), it is replaced by a copy of a model sampled from the top 20% (the "head"). The replacement inherits both the weights and the hyperparameters of that top performer, and the copied hyperparameters are then perturbed: each one is either resampled from the prior or multiplied by a random factor (e.g., 1.2 or 0.8). The worker then continues training from the copied weights with the new hyperparameters.
Repeat. Continue until the total budget is exhausted.
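The four steps above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the "model" is a single scalar `theta` trained by gradient descent on `theta**2`, the only hyperparameter is a learning rate, and the loop is synchronous for readability even though PBT is described as asynchronous. The prior range, the clipping to that range, and all function names are assumptions made for this example.

```python
import random

LR_MIN, LR_MAX = 1e-4, 1e-1  # assumed prior range for the learning rate


def sample_lr(rng):
    """Draw a learning rate from a log-uniform prior (assumed range)."""
    return 10 ** rng.uniform(-4, -1)


def train_step(theta, lr):
    """One gradient-descent step on the toy loss theta**2 (gradient 2*theta)."""
    return theta - lr * 2.0 * theta


def evaluate(theta):
    """Toy validation metric; higher is better."""
    return -theta ** 2


def pbt(pop_size=10, rounds=20, steps_per_round=5, frac=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the population, one learning rate per member.
    population = [{"theta": 5.0, "lr": sample_lr(rng), "score": None}
                  for _ in range(pop_size)]
    n_cut = max(1, int(pop_size * frac))
    for _ in range(rounds):
        # Step 2: train every member for one interval, then evaluate it.
        for m in population:
            for _ in range(steps_per_round):
                m["theta"] = train_step(m["theta"], m["lr"])
            m["score"] = evaluate(m["theta"])
        # Step 3: rank the population and split off head and tail.
        ranked = sorted(population, key=lambda m: m["score"], reverse=True)
        head, tail = ranked[:n_cut], ranked[-n_cut:]
        for m in tail:
            parent = rng.choice(head)
            # Exploit: copy weights (and hyperparameters) from a head member.
            m["theta"] = parent["theta"]
            # Explore: perturb the copied learning rate by 0.8 or 1.2,
            # clipped to the prior range (clipping is an assumption here).
            lr = parent["lr"] * rng.choice([0.8, 1.2])
            m["lr"] = min(max(lr, LR_MIN), LR_MAX)
            m["score"] = evaluate(m["theta"])
        # Step 4: repeat until the round budget is exhausted.
    return max(population, key=lambda m: m["score"])


if __name__ == "__main__":
    best = pbt()
    print(best["theta"], best["lr"], best["score"])
```

In a real setup each member would be a full network on its own worker, and the copy in the exploit step would be a checkpoint load rather than a scalar assignment.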
The key design choices are:
Exploit/explore interval. Too short and the population doesn't have time to benefit from new hyperparameters; too long and the method converges too slowly. The paper uses intervals of 5–10% of total training steps.
Fraction truncated. Typically 20% of the population is replaced each round. This balances exploration (keeping diverse models) with exploitation (focusing compute on promising configurations).
Perturbation strategy. The paper uses two modes: "resample" (draw a new hyperparameter from the prior) and "perturb" (multiply by a random factor). The choice depends on the hyperparameter: continuous ones like learning rate are perturbed; discrete ones like architecture choices are resampled.
Asynchronous execution. Workers do not synchronise; each worker independently decides when to evaluate and potentially replace its model. This avoids the overhead of synchronous population updates and makes efficient use of heterogeneous hardware.
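The perturbation strategy described above can be sketched as a single explore function. This is an illustrative sketch only: the search space, the hyperparameter names `lr` and `batch_size`, and the `explore` helper are hypothetical, chosen to show one continuous hyperparameter being perturbed and one discrete hyperparameter being resampled.

```python
import random

# Hypothetical search space: name -> (kind, sampler drawing from the prior).
SPACE = {
    "lr": ("continuous", lambda rng: 10 ** rng.uniform(-5, -2)),
    "batch_size": ("discrete", lambda rng: rng.choice([16, 32, 64])),
}


def explore(hparams, rng, factors=(0.8, 1.2)):
    """Return new hyperparameters: perturb continuous ones, resample discrete ones."""
    new = {}
    for name, value in hparams.items():
        kind, sampler = SPACE[name]
        if kind == "continuous":
            # Perturb: multiply by a randomly chosen factor.
            new[name] = value * rng.choice(factors)
        else:
            # Resample: draw a fresh value from the prior.
            new[name] = sampler(rng)
    return new


if __name__ == "__main__":
    rng = random.Random(0)
    print(explore({"lr": 1e-3, "batch_size": 32}, rng))
```

Keeping the prior samplers next to the hyperparameter kind means the same table can drive both the initial sampling and the resample branch of exploration.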
The algorithm does not require a separate validation set for hyperparameter selection — the same validation metric used for early stopping or model selection can be used for PBT. This is a practical advantage: PBT reuses the evaluation that would happen anyway.
Prefer PBT over grid search or random search when:
Prefer Bayesian optimisation or Hyperband over PBT when:
Prefer PBT over evolutionary strategies (ES) when:
Prefer ES over PBT when:
Population collapse. If the validation metric is noisy or the population is too small, all models may converge to the same hyperparameter configuration. This eliminates the benefit of PBT, reducing it to standard training with random perturbations. Mitigation: increase population size, use a smoother validation metric, or add random exploration steps.
Overfitting to the validation metric. PBT optimises the validation metric directly, which can lead to overfitting if the metric is noisy or if the validation set is small. This is especially problematic in RL, where the reward signal is stochastic. Mitigation: use a large validation set, average over multiple evaluation episodes, or use a hold-out validation set for final reporting.
Instability from abrupt hyperparameter changes. Some hyperparameters (e.g., learning rate) can cause training to diverge if changed too abruptly. The perturbation factor (1.2 or 0.8) is designed to be conservative, but it may still cause instability in some models. Mitigation: clip hyperparameter changes to a safe range, or use a warm-up period after each perturbation.
Inefficiency for very large populations. PBT requires periodic evaluation of all models, which can become expensive if the population is large (hundreds of models). The evaluation cost scales linearly with population size. Mitigation: use a smaller validation subset for PBT, or evaluate only a random subset of models each round.
Not suitable for one-shot hyperparameter selection. PBT produces a trained model with a history of hyperparameter changes, not a single best configuration. If you need to deploy a model with a fixed hyperparameter schedule, you must extract the schedule from the population history, which may be noisy. Mitigation: run PBT multiple times and average the discovered schedules, or use the final hyperparameter values as a starting point for a separate training run.
Requires careful tuning of the exploit/explore interval. Too short: the population doesn't have time to benefit from new hyperparameters. Too long: the method converges too slowly. The optimal interval depends on the task and the hyperparameter space. Mitigation: start with 5% of total training steps and adjust based on population diversity diagnostics.
Open problem: theoretical guarantees. Unlike Bayesian optimisation, which has regret bounds, or random search, which has simple convergence properties, PBT has no theoretical guarantees. The empirical results are strong, but the method is heuristic. This is an open problem for future work.
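Two of the mitigations above lend themselves to a short sketch: monitoring population diversity to catch collapse, and extracting the hyperparameter schedule that produced the final model. The code below is a hedged illustration under an assumed logging format; `log_spread`, `extract_schedule`, and the shape of `history` are hypothetical and not taken from the paper.

```python
import math
import statistics


def log_spread(values):
    """Population std of log10(value); a value near 0 suggests collapse."""
    return statistics.pstdev([math.log10(v) for v in values])


def extract_schedule(history, final_best):
    """Trace the lineage of one final member back through the rounds.

    Assumed format: history[r][i] = (source, lr), meaning that in round r
    member i trained with `lr`, starting from the weights that member
    `source` held at the end of round r - 1 (source == i if not replaced).
    """
    schedule = []
    idx = final_best
    for round_log in reversed(history):
        source, lr = round_log[idx]
        schedule.append(lr)
        idx = source  # follow the weights back to where they came from
    schedule.reverse()
    return schedule


if __name__ == "__main__":
    history = [
        {0: (0, 0.1), 1: (1, 0.01)},    # round 0: independent starts
        {0: (0, 0.1), 1: (0, 0.08)},    # round 1: member 1 copied member 0
        {0: (1, 0.064), 1: (1, 0.08)},  # round 2: member 0 copied member 1
    ]
    print(extract_schedule(history, final_best=0))
    print(log_spread([lr for _, lr in history[-1].values()]))
```

The extracted schedule follows the weights rather than the worker slot, which is why the lineage hops between member indices whenever an exploit step occurred.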