Population Based Training of Neural Networks — DoOperator Research

Authors	Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, Koray Kavukcuoglu
Journal	arXiv.org
Year	2017

What problem it solves

Training neural networks requires selecting hyperparameters — learning rates, batch sizes, network architectures, regularisation coefficients — that are typically fixed at the start of training or tuned via grid search, random search, or Bayesian optimisation. These approaches treat hyperparameter selection as a static, pre-training problem: find the best fixed configuration, then train with it. This is fundamentally mismatched with the dynamics of neural network training, where the optimal hyperparameter values change as the model moves through parameter space. A learning rate that works well early in training may be too large for fine-tuning later; a regularisation strength that prevents overfitting in the first epoch may cripple capacity in the final one. Existing methods either waste compute on full training runs for each candidate configuration (grid/random/Bayesian search) or require expensive gradient-based meta-learning. Population Based Training (PBT) solves the joint problem of training a model and discovering a schedule of hyperparameters over the course of training, using a fixed computational budget and requiring only a small modification to standard distributed training frameworks.

How it works

PBT combines ideas from evolutionary algorithms and hyperparameter optimisation. The core insight is that instead of treating hyperparameter tuning as a separate pre-training step, we can interleave it with model training. A population of models (typically 16–64) is trained in parallel, each with its own hyperparameters. Periodically, the worst-performing models are replaced by copies of the best-performing models, with their hyperparameters perturbed. This creates a natural schedule: hyperparameters evolve over time as the population discovers what works at each stage of training.

The algorithm proceeds as follows:

Initialise population. Sample N sets of hyperparameters from a prior distribution (e.g., log-uniform for learning rate, uniform for entropy coefficient). Create N copies of the model with these hyperparameters, each assigned to a worker.
Train asynchronously. Each worker trains its model for a fixed number of steps (the "exploit/explore interval", typically 1–10% of total training). After each interval, the worker evaluates the model on a validation set and reports the metric.
Select, exploit, explore. For each worker, compare its validation metric to the rest of the population. If the worker is in the bottom 20% (the "tail"), it is replaced by a copy of a model sampled from the top 20% (the "head"). The replacement inherits the weights and hyperparameters of the top performer (exploit), and the hyperparameters are then perturbed (explore): each one is either resampled from the prior or multiplied by a factor chosen at random (e.g., 1.2 or 0.8). The worker then continues training from the copied weights with the new hyperparameters.
Repeat. Continue until the total budget is exhausted.
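The four steps above can be sketched on a toy problem. This is a minimal, synchronous simplification rather than the paper's asynchronous implementation: the quadratic objective, the function names (loss, train, pbt) and all default values are assumptions chosen for illustration, and the learning rate is the only hyperparameter.

```python
import random

TARGET = 3.0  # optimum of the toy objective (illustrative assumption)

def loss(w):
    # Toy stand-in for a validation metric; lower is better.
    return (w - TARGET) * (w - TARGET)

def train(w, lr, n_steps):
    # Plain gradient descent on the toy objective.
    for _ in range(n_steps):
        w -= lr * 2.0 * (w - TARGET)
    return w

def pbt(pop_size=20, rounds=30, steps_per_round=5, frac=0.2, seed=0):
    rng = random.Random(seed)
    # Initialise: log-uniform learning rates in [1e-4, 1e-1], identical start weights.
    population = [{"w": 0.0, "lr": 10 ** rng.uniform(-4, -1)}
                  for _ in range(pop_size)]
    n_cut = max(1, int(frac * pop_size))
    for _ in range(rounds):
        # Train every member for one interval, then evaluate it.
        for member in population:
            member["w"] = train(member["w"], member["lr"], steps_per_round)
            member["score"] = loss(member["w"])
        ranked = sorted(population, key=lambda m: m["score"])  # best first
        head, tail = ranked[:n_cut], ranked[-n_cut:]
        for member in tail:
            parent = rng.choice(head)
            member["w"] = parent["w"]  # exploit: copy weights
            # explore: copy the parent's learning rate, then perturb it
            member["lr"] = parent["lr"] * rng.choice([0.8, 1.2])
    # Re-evaluate, because tail members hold copied weights with stale scores.
    return min(population, key=lambda m: loss(m["w"]))

best = pbt()
print(best["lr"], loss(best["w"]))
```

In a real setup each member would be a separate worker that runs this cycle on its own clock, and the ranking would be read from a shared store of reported metrics instead of a local list.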

The key design choices are:

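Exploit/explore interval. Too short and a member is judged before its new hyperparameters have had time to show an effect; too long and poorly performing members keep consuming compute before they are replaced, so the population adapts slowly. The interval is a per-task setting; the 1–10% of total training mentioned above is a typical range.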
Fraction truncated. Typically 20% of the population is replaced each round. This balances exploration (keeping diverse models) with exploitation (focusing compute on promising configurations).
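Perturbation strategy. The paper uses two modes: "resample" (draw a new value from the prior) and "perturb" (multiply the current value by a random factor). The choice depends on the hyperparameter: continuous ones such as the learning rate are usually perturbed, while categorical ones, which cannot be scaled by a factor, are resampled. Choices that change the shape of the weights fit neither mode, because exploit copies weights between members.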
Asynchronous execution. Workers do not synchronise; each worker independently decides when to evaluate and potentially replace its model. This avoids the overhead of synchronous population updates and makes efficient use of heterogeneous hardware.
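The perturbation strategy described above could look roughly like the following. It is a sketch under stated assumptions: the SPACE dictionary, its hyperparameter names and ranges, and the helper names sample_prior and explore are hypothetical, and clipping perturbed values to the prior's range is an added safeguard, not something the paper prescribes.

```python
import math
import random

# Hypothetical search space; names, ranges and choices are illustrative only.
SPACE = {
    "lr": {"kind": "log_uniform", "low": 1e-5, "high": 1e-1},
    "entropy_coef": {"kind": "uniform", "low": 0.0, "high": 0.1},
    "augmentation": {"kind": "categorical", "choices": ["none", "flip", "crop"]},
}

def sample_prior(spec, rng):
    # Draw a fresh value from the prior for one hyperparameter.
    if spec["kind"] == "log_uniform":
        return math.exp(rng.uniform(math.log(spec["low"]), math.log(spec["high"])))
    if spec["kind"] == "uniform":
        return rng.uniform(spec["low"], spec["high"])
    return rng.choice(spec["choices"])

def explore(hparams, rng, factors=(0.8, 1.2)):
    # Return a new dict; the parent's hyperparameters are left untouched.
    child = {}
    for name, value in hparams.items():
        spec = SPACE[name]
        if spec["kind"] == "categorical":
            # "resample" mode: categorical values cannot be scaled.
            child[name] = sample_prior(spec, rng)
        else:
            # "perturb" mode, clipped to the prior's range (an assumption).
            scaled = value * rng.choice(factors)
            child[name] = min(max(scaled, spec["low"]), spec["high"])
    return child

rng = random.Random(0)
parent = {"lr": 1e-3, "entropy_coef": 0.01, "augmentation": "flip"}
print(explore(parent, rng))
```

Keeping the search space in one table means the perturb-or-resample decision is made once per hyperparameter, which makes it easy to audit which values can change and by how much.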

The algorithm does not require a separate validation set for hyperparameter selection — the same validation metric used for early stopping or model selection can be used for PBT. This is a practical advantage: PBT reuses the evaluation that would happen anyway.

When to use it

Prefer PBT over grid search or random search when:

You have a fixed computational budget and want to maximise final performance, not just find a good fixed hyperparameter configuration. PBT discovers schedules, which often outperform the best static configuration.
Training is expensive (hours to days per run) and you cannot afford to run hundreds of independent trials. PBT uses the same total compute as training N models, but gets the benefit of hyperparameter adaptation.
You suspect that optimal hyperparameters change during training (e.g., learning rate annealing, entropy regularisation in RL, GAN training dynamics). PBT automatically discovers these schedules.
You have a distributed training infrastructure and can run multiple workers asynchronously. PBT is designed for this setting.

Prefer Bayesian optimisation or Hyperband over PBT when:

You need to find a single, fixed hyperparameter configuration for deployment, not a schedule. PBT's output is a trained model with a history of hyperparameter changes, not a single best configuration.
The hyperparameter space is small (fewer than 5 dimensions) and cheap to evaluate. Bayesian optimisation with Gaussian processes can be more sample-efficient in this regime.
You cannot run parallel workers (e.g., limited hardware). PBT requires at least 8–16 workers to maintain population diversity.
The training procedure is not robust to mid-training hyperparameter changes (e.g., some architectures or loss functions become unstable when hyperparameters are perturbed).
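As the first point in this list notes, PBT's output is a trained model plus a history of hyperparameter changes. One way to turn that history into a schedule is to trace the lineage of the final model backwards through the exploit events. The sketch below assumes a hypothetical log format (one record per worker per interval, with step, worker, parent and hparams fields); it does not reflect any particular framework's logging.

```python
def extract_schedule(events, final_worker):
    """Trace the lineage of final_worker back through the log.

    Each event describes the start of one interval: `hparams` are the values
    the worker trains with from `step` onward, and `parent` is the worker whose
    weights it starts from (its own id when no exploit happened).
    """
    by_key = {(e["step"], e["worker"]): e for e in events}
    steps = sorted({e["step"] for e in events}, reverse=True)
    schedule = []
    worker = final_worker
    for step in steps:
        record = by_key[(step, worker)]
        schedule.append((step, record["hparams"]))
        worker = record["parent"]  # before this step the lineage lived here
    schedule.reverse()
    return schedule

# Hypothetical two-worker log: worker 1 copies worker 0 at step 100,
# then worker 0 copies worker 1 at step 200.
EVENTS = [
    {"step": 0, "worker": 0, "parent": 0, "hparams": {"lr": 0.1}},
    {"step": 0, "worker": 1, "parent": 1, "hparams": {"lr": 0.01}},
    {"step": 100, "worker": 0, "parent": 0, "hparams": {"lr": 0.1}},
    {"step": 100, "worker": 1, "parent": 0, "hparams": {"lr": 0.08}},
    {"step": 200, "worker": 0, "parent": 1, "hparams": {"lr": 0.064}},
    {"step": 200, "worker": 1, "parent": 1, "hparams": {"lr": 0.08}},
]

for step, hparams in extract_schedule(EVENTS, final_worker=0):
    print(step, hparams)
```

The recovered schedule follows the weights, not the worker slot, so it can pass through several workers before reaching the final one.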

Prefer PBT over evolutionary strategies (ES) when:

You want to optimise both model weights and hyperparameters jointly. ES typically optimises weights only, with hyperparameters fixed.
You have a pre-existing distributed training framework (e.g., TensorFlow, PyTorch with distributed data parallel). PBT is a small modification to this setup.

Prefer ES over PBT when:

The model architecture itself is being evolved (e.g., neural architecture search). PBT is designed for continuous and categorical hyperparameters, not architectural mutations.
You need a black-box optimiser that does not require gradients. PBT still uses gradients for weight updates; it only evolves hyperparameters.

Limitations and failure modes

Population collapse. If the validation metric is noisy or the population is too small, all models may converge to the same hyperparameter configuration. This eliminates the benefit of PBT, reducing it to standard training with random perturbations. Mitigation: increase population size, use a smoother validation metric, or add random exploration steps.
Overfitting to the validation metric. PBT optimises the validation metric directly, which can lead to overfitting if the metric is noisy or if the validation set is small. This is especially problematic in RL, where the reward signal is stochastic. Mitigation: use a large validation set, average over multiple evaluation episodes, or use a hold-out validation set for final reporting.
Instability from abrupt hyperparameter changes. Some hyperparameters (e.g., learning rate) can cause training to diverge if changed too abruptly. The perturbation factor (1.2 or 0.8) is designed to be conservative, but it may still cause instability in some models. Mitigation: clip hyperparameter changes to a safe range, or use a warm-up period after each perturbation.
Inefficiency for very large populations. PBT requires periodic evaluation of all models, which can become expensive if the population is large (hundreds of models). The evaluation cost scales linearly with population size. Mitigation: use a smaller validation subset for PBT, or evaluate only a random subset of models each round.
Not suitable for one-shot hyperparameter selection. PBT produces a trained model with a history of hyperparameter changes, not a single best configuration. If you need to deploy a model with a fixed hyperparameter schedule, you must extract the schedule from the population history, which may be noisy. Mitigation: run PBT multiple times and average the discovered schedules, or use the final hyperparameter values as a starting point for a separate training run.
Requires careful tuning of the exploit/explore interval. The trade-off described under the key design choices has no universal setting: the best interval depends on the task and the hyperparameter space. Mitigation: start with 5% of total training steps and adjust based on population diversity diagnostics.
Open problem: theoretical guarantees. Bayesian optimisation comes with regret bounds under suitable assumptions, and random search has simple convergence properties, but PBT is a heuristic with no comparable guarantees. Its support so far is empirical.
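Several mitigations in this list depend on watching population diversity, both to detect collapse and to tune the interval. One simple diagnostic is the spread of a hyperparameter across the population on a log scale. The function names and the 0.05 threshold below are assumptions for illustration, not values from the paper.

```python
import math
import statistics

def log_spread(values):
    # Population standard deviation of log10(values).
    # A value of 0.0 means every member holds the same setting.
    return statistics.pstdev(math.log10(v) for v in values)

def has_collapsed(values, min_spread=0.05):
    # min_spread is an illustrative threshold; tune it per hyperparameter.
    return log_spread(values) < min_spread

diverse = [1e-4, 1e-3, 1e-2, 1e-1]
collapsed = [1e-3] * 8
print(log_spread(diverse), has_collapsed(diverse))
print(log_spread(collapsed), has_collapsed(collapsed))
```

Logging this value at every exploit/explore round gives an early signal: a spread that drops to near zero within the first few rounds suggests the population is too small or the metric too noisy for the chosen interval.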
