Hyperparameter Sweeps

hpc-compose sweep turns one compose file with an embedded sweep block into many independent tracked Slurm jobs. Each trial is a normal sbatch submission with its own allocation, rendered script, job record, and scheduler state. The sweep manifest ties those jobs together for listing and aggregate status.

Quickstart

Start from a spec that can run with ordinary defaults, then add a top-level sweep block:

name: training-sweep

x-slurm:
  time: "00:20:00"
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full

services:
  trainer:
    image: python:3.11-slim
    environment:
      LR: "${lr:-0.001}"
      BATCH_SIZE: "${batch_size:-32}"
    command: ["python", "train.py"]

Preview the expansion first:

hpc-compose sweep submit -f examples/training-sweep.yaml --dry-run

Then submit the trials:

hpc-compose sweep submit -f examples/training-sweep.yaml
hpc-compose sweep status -f examples/training-sweep.yaml
hpc-compose sweep list -f examples/training-sweep.yaml

Matrix Modes

matrix: full expands the full Cartesian product over sorted parameter names, so the example above (three lr values times two batch_size values) produces six trials, labelled in stable t000, t001, … order.
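The expansion can be pictured with a minimal Python sketch. This is an illustration, not the tool's implementation: the function name expand_full is hypothetical, and the choice that the last sorted parameter varies fastest is an assumption, since only the sorting of parameter names and the t000-style labels are stated above.

```python
from itertools import product


def expand_full(parameters):
    """Expand a parameter mapping into labelled trials (full Cartesian product)."""
    names = sorted(parameters)  # parameter names are iterated in sorted order
    trials = []
    # Assumption: combinations are enumerated with the last sorted name varying fastest.
    for index, combo in enumerate(product(*(parameters[n] for n in names))):
        trials.append(
            {
                "label": f"t{index:03d}",  # zero-padded label such as t000
                "index": index,
                "vars": dict(zip(names, combo)),
            }
        )
    return trials


trials = expand_full({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]})
for trial in trials:
    print(trial["label"], trial["vars"])
```

Use sweep submit --dry-run to see the ordering the tool actually produces.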

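A random matrix samples the requested number of trials from the full Cartesian product without replacement, so no combination is selected twice: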

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix:
    random: 5
    seed: "paper-table-2"

With a seed, the selected trials are stable across machines. Without a seed, sweep submit derives one from the new sweep id and persists it in the manifest.
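Seeded sampling without replacement can be sketched in Python as follows. This only illustrates the idea: hpc-compose uses its own sampling procedure, so this sketch will not reproduce the trials the tool selects. The function name sample_trials, the error on oversized requests, and the re-sorting of picked indices are all assumptions.

```python
import random
from itertools import product


def sample_trials(parameters, count, seed):
    """Pick `count` distinct combinations from the full matrix, reproducibly."""
    names = sorted(parameters)
    combos = [dict(zip(names, c)) for c in product(*(parameters[n] for n in names))]
    if count > len(combos):
        # Assumption: asking for more trials than exist is treated as an error.
        raise ValueError(f"requested {count} trials but only {len(combos)} exist")
    rng = random.Random(seed)  # the same seed gives the same selection
    picked = rng.sample(range(len(combos)), count)  # sampling without replacement
    # Assumption: selected trials keep their relative order from the full matrix.
    return [combos[i] for i in sorted(picked)]


selection = sample_trials(
    {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]},
    count=5,
    seed="paper-table-2",
)
for variables in selection:
    print(variables)
```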

Interpolation Rules

Sweep parameter names are interpolation variable names. Values may be scalar strings, numbers, or booleans. For each trial, those variables override values from the environment and settings before planning, preparing, and rendering.

Reserved variables are also available:

Variable                         Value
HPC_COMPOSE_SWEEP_ID             The persisted sweep id.
HPC_COMPOSE_SWEEP_TRIAL          The stable trial label such as t000.
HPC_COMPOSE_SWEEP_TRIAL_INDEX    Zero-based trial index.

Normal commands still treat sweep as metadata. If plan, up, or render encounters ${lr} without a default, it fails unless lr is provided in the environment or settings. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, and use sweep submit --dry-run as the validation path for missing sweep-only variables.
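The precedence and default rules above can be modelled with a small Python sketch. It is a simplified stand-in, not the tool's interpolation engine: the interpolate function is hypothetical, it handles only the ${name} and ${name:-default} forms, it folds environment and settings into one mapping, and it assumes shell-like behaviour where an empty value falls back to the default.

```python
import re

# Matches ${name} and ${name:-default}; other forms are out of scope for this sketch.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def interpolate(text, sweep_vars, environment):
    """Resolve variables, letting sweep values override the environment."""
    # Sweep variables win over environment/settings values for a trial.
    # Note: str() is a simplification of how the tool renders scalar values.
    merged = {**environment, **{k: str(v) for k, v in sweep_vars.items()}}

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in merged and (merged[name] != "" or default is None):
            return merged[name]
        if default is not None:
            return default  # unset (or empty) value falls back to the default
        raise KeyError(f"missing variable without default: {name}")

    return _PATTERN.sub(replace, text)


# Base spec without sweep values: the default keeps it runnable.
print(interpolate("${lr:-0.001}", {}, {}))
# Inside a trial: the sweep value overrides both the environment and the default.
print(interpolate("${lr:-0.001}", {"lr": 0.1}, {"lr": "0.5"}))
```

A reference such as ${lr} with no default and no value raises in this sketch, mirroring the failure described for plan, up, and render.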

Fanout Guard

By default, submitted sweeps are capped at 100 trials. Larger matrices fail before calling sbatch:

hpc-compose sweep submit -f train.yaml

Raise the ceiling explicitly when the fanout is intentional:

hpc-compose sweep submit -f train.yaml --max-trials 500

The guard applies to real submissions. Dry runs can inspect any matrix size.
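To estimate whether a full matrix will trip the guard before running the CLI, the trial count can be computed as the product of the value-list lengths. The sketch below assumes a full matrix; check_fanout and DEFAULT_MAX_TRIALS are hypothetical names, and only the default of 100 and the dry-run exemption come from the text above.

```python
from math import prod

DEFAULT_MAX_TRIALS = 100  # default cap described above


def check_fanout(parameters, max_trials=DEFAULT_MAX_TRIALS, dry_run=False):
    """Return the full-matrix trial count, raising if a real submit would exceed the cap."""
    count = prod(len(values) for values in parameters.values())
    if not dry_run and count > max_trials:
        raise ValueError(
            f"sweep expands to {count} trials, above the cap of {max_trials}"
        )
    return count


big = {"lr": list(range(20)), "batch_size": list(range(10))}  # 20 * 10 = 200 trials
print(check_fanout(big, dry_run=True))    # dry runs may inspect any size
print(check_fanout(big, max_trials=500))  # an explicitly raised ceiling passes
```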

Status Output

sweep status loads the manifest, queries the tracked state for submitted jobs, and aggregates the trials into counts for each of these states:

  • completed
  • failed
  • running
  • pending
  • unknown
  • missing_tracking
  • submit_failed

Use JSON for notebooks, dashboards, or CI automation:

hpc-compose sweep submit -f train.yaml --format json
hpc-compose sweep status -f train.yaml --format json
hpc-compose sweep status -f train.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f train.yaml --format json

The JSON includes the sweep id, manifest path, matrix mode, persisted seed, trial variables, job ids, record paths, and per-trial status.
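A notebook or CI job might summarize the JSON along these lines. The exact schema is not documented here, so the field names in this sketch (sweep_id, trials, label, status) and the sample payload are hypothetical placeholders; inspect the real --format json output and adjust the keys before relying on it.

```python
import json
from collections import Counter

# Hypothetical payload shape, for illustration only; the real JSON keys may differ.
SAMPLE = """
{
  "sweep_id": "sweep-123",
  "trials": [
    {"label": "t000", "status": "completed"},
    {"label": "t001", "status": "completed"},
    {"label": "t002", "status": "running"},
    {"label": "t003", "status": "failed"}
  ]
}
"""


def summarize(payload_text):
    """Count trials per status from a sweep status JSON document."""
    data = json.loads(payload_text)
    # Trials without a status field are counted as "unknown" in this sketch.
    return Counter(trial.get("status", "unknown") for trial in data.get("trials", []))


counts = summarize(SAMPLE)
for status, number in sorted(counts.items()):
    print(f"{status}: {number}")
```

In practice the text passed to summarize would be the captured standard output of hpc-compose sweep status with --format json.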

Manifest Layout

Sweep state is stored beside normal tracked jobs:

.hpc-compose/
  sweeps/
    latest.json
    <sweep-id>/
      sweep.json
      t000.sbatch
      t001.sbatch
  jobs/
    <job-id>.json

Sweep-trial records have kind: sweep_trial and include sweep metadata. They do not update the normal latest.json or latest-run.json pointers, so status, watch, and logs for ordinary runs keep their existing meaning.

V1 Limitations

  • Sweeps must be embedded in the same compose file. sweep.spec is rejected in v1.
  • Each trial is a separate Slurm allocation. Sweeps are not Slurm arrays.
  • x-slurm.array is rejected during sweep submit.
  • Trials submit sequentially. If a submission fails, later trials are not submitted and the partial manifest is kept.
  • sweep status summarizes scheduler/tracking state only. It does not parse metric files or pick a best trial.
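The sequential-submission behaviour in the list above can be sketched as a loop that stops at the first failure and keeps what it has recorded so far. This is a model of the described behaviour, not the tool's code: submit_sequentially, the submit callable, the manifest entry fields, and the decision to record the failing trial as submit_failed are assumptions.

```python
def submit_sequentially(labels, submit):
    """Submit trials one at a time; stop at the first failure and keep the partial manifest."""
    manifest = []
    for label in labels:
        try:
            job_id = submit(label)  # hypothetical callable wrapping one sbatch submission
        except RuntimeError as err:
            # Assumption: the failing trial is recorded, and later trials are never attempted.
            manifest.append({"label": label, "status": "submit_failed", "error": str(err)})
            break
        manifest.append({"label": label, "job_id": job_id, "status": "pending"})
    return manifest


def fake_submit(label):
    """Stand-in for a real submission; fails on t001 to show the partial result."""
    if label == "t001":
        raise RuntimeError("scheduler rejected the job")
    return f"job-for-{label}"


partial = submit_sequentially(["t000", "t001", "t002"], fake_submit)
for entry in partial:
    print(entry)
```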