Hyperparameter Sweeps
hpc-compose sweep turns one compose file with an embedded sweep block into many independent tracked Slurm jobs. Each trial is a normal sbatch submission with its own allocation, rendered script, job record, and scheduler state. The sweep manifest ties those jobs together for listing and aggregate status.
Quickstart
Start from a spec that can run with ordinary defaults, then add a top-level sweep block:
name: training-sweep
x-slurm:
  time: "00:20:00"
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full
services:
  trainer:
    image: python:3.11-slim
    environment:
      LR: "${lr:-0.001}"
      BATCH_SIZE: "${batch_size:-32}"
    command: ["python", "train.py"]
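The spec passes each trial's values to the container as the LR and BATCH_SIZE environment variables. As a minimal sketch of the receiving side, a train.py could read them as follows. The helper name read_hyperparameters and the fallback values are illustrative assumptions, not part of hpc-compose.

```python
import os


def read_hyperparameters(env=os.environ):
    """Read the values the compose file exports as LR and BATCH_SIZE."""
    # Fallbacks mirror the ${lr:-0.001} and ${batch_size:-32} defaults in the spec,
    # so the script also runs outside of a sweep (assumption for illustration).
    lr = float(env.get("LR", "0.001"))
    batch_size = int(env.get("BATCH_SIZE", "32"))
    return lr, batch_size


if __name__ == "__main__":
    lr, batch_size = read_hyperparameters()
    print(f"training with lr={lr} batch_size={batch_size}")
```

Reading from the environment keeps the training script independent of the sweep mechanism: the same script works for a single ordinary run and for every trial.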
Preview the expansion first:
hpc-compose sweep submit -f examples/training-sweep.yaml --dry-run
Then submit the trials:
hpc-compose sweep submit -f examples/training-sweep.yaml
hpc-compose sweep status -f examples/training-sweep.yaml
hpc-compose sweep list -f examples/training-sweep.yaml
Matrix Modes
matrix: full expands the full Cartesian product over sorted parameter names, so the example above produces six trials in stable t000, t001, … order.
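The expansion can be illustrated with a short Python sketch. It is not the tool's implementation: the function name expand_full is hypothetical, and the choice of which parameter varies fastest (here, the last name in sorted order) is an assumption, since the text only states that parameter names are sorted.

```python
from itertools import product


def expand_full(parameters):
    """Return (label, variables) pairs for the full Cartesian product."""
    # Sort parameter names so the trial order does not depend on key order in the file.
    names = sorted(parameters)
    trials = []
    for index, values in enumerate(product(*(parameters[name] for name in names))):
        # Labels follow the t000, t001, ... pattern described above.
        trials.append((f"t{index:03d}", dict(zip(names, values))))
    return trials


if __name__ == "__main__":
    for label, variables in expand_full({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}):
        print(label, variables)
```

With three lr values and two batch_size values, the sketch yields 3 × 2 = 6 trials, matching the count stated above.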
Random sampling selects without replacement:
sweep:
parameters:
lr: [0.001, 0.01, 0.1]
batch_size: [32, 64]
matrix:
random: 5
seed: "paper-table-2"
With a seed, the selected trials are stable across machines. Without a seed, sweep submit derives one from the new sweep id and persists it in the manifest.
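Seeded sampling without replacement can be sketched in Python as below. This only illustrates the idea: hpc-compose's actual random number generator and selection order are not described here, so the sketch will not reproduce the tool's trial choices. The function name sample_trials is hypothetical.

```python
import random
from itertools import product


def sample_trials(parameters, count, seed):
    """Pick `count` distinct combinations from the full grid, reproducibly."""
    names = sorted(parameters)
    grid = [dict(zip(names, values))
            for values in product(*(parameters[name] for name in names))]
    # A dedicated Random instance seeded with a string gives the same picks on every run.
    rng = random.Random(seed)
    # sample() never returns the same grid entry twice; it raises ValueError
    # if count exceeds the grid size.
    return rng.sample(grid, count)


if __name__ == "__main__":
    params = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
    for variables in sample_trials(params, 5, "paper-table-2"):
        print(variables)
```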
Interpolation Rules
Sweep parameter names are interpolation variable names. Values may be scalar strings, numbers, or booleans. For each trial, those variables override values from the environment and settings before planning, preparing, and rendering.
Reserved variables are also available:
| Variable | Value |
|---|---|
| HPC_COMPOSE_SWEEP_ID | The persisted sweep id. |
| HPC_COMPOSE_SWEEP_TRIAL | The stable trial label such as t000. |
| HPC_COMPOSE_SWEEP_TRIAL_INDEX | Zero-based trial index. |
Normal commands still treat sweep as metadata. If plan, up, or render encounters ${lr} without a default, it fails unless lr is provided in the environment or settings. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, and use sweep submit --dry-run as the validation path for missing sweep-only variables.
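The precedence and default rules above can be modeled with a small Python sketch. It is an approximation under stated assumptions: the function name interpolate is hypothetical, only the ${name} and ${name:-default} forms are handled, an empty value is treated as unset when a default is given (as in Compose-style interpolation), and values are stringified with str.

```python
import re

# Matches ${name} and ${name:-default}.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def interpolate(text, sweep_vars, environment):
    """Substitute variables, letting sweep values override the environment."""
    # Later entries win, so sweep parameters take precedence for this trial.
    merged = {**environment, **{name: str(value) for name, value in sweep_vars.items()}}

    def replace(match):
        name, default = match.group(1), match.group(2)
        value = merged.get(name)
        if default is not None:
            # Assumption: an empty value falls back to the default, like an unset one.
            return value if value else default
        if value is None:
            # Mirrors the failure of plan, up, or render on ${lr} without a default.
            raise KeyError(f"missing variable: {name}")
        return value

    return _PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(interpolate("LR=${lr:-0.001}", {"lr": 0.1}, {"lr": "0.5"}))
    print(interpolate("LR=${lr:-0.001}", {}, {}))
```

In the first call the sweep value wins over the environment value. In the second, nothing defines lr, so the default keeps the base spec runnable.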
Fanout Guard
By default, submitted sweeps are capped at 100 trials. Larger matrices fail before calling sbatch:
hpc-compose sweep submit -f train.yaml
Raise the explicit ceiling when the fanout is intentional:
hpc-compose sweep submit -f train.yaml --max-trials 500
The guard applies to real submissions. Dry runs can inspect any matrix size.
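To estimate whether a full matrix will hit the guard before submitting, the trial count can be computed by hand or with a sketch like the following. The function name check_fanout is hypothetical. The default of 100 comes from the text above, and the calculation applies to matrix: full only.

```python
from math import prod

DEFAULT_MAX_TRIALS = 100  # default cap described above


def check_fanout(parameters, max_trials=DEFAULT_MAX_TRIALS):
    """Return the full-matrix trial count, or raise if it exceeds the cap."""
    # A full matrix has one trial per combination: the product of the list lengths.
    count = prod(len(values) for values in parameters.values())
    if count > max_trials:
        raise ValueError(f"{count} trials exceed the limit of {max_trials}")
    return count


if __name__ == "__main__":
    print(check_fanout({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}))
```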
Status Output
sweep status loads the manifest, queries the tracked state for submitted jobs, and aggregates:
- completed
- failed
- running
- pending
- unknown
- missing_tracking
- submit_failed
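The aggregation amounts to counting trials per state. A minimal Python sketch, assuming each trial resolves to exactly one of the state names above (the function name aggregate is hypothetical):

```python
from collections import Counter

# State names as listed in the status output.
STATES = (
    "completed",
    "failed",
    "running",
    "pending",
    "unknown",
    "missing_tracking",
    "submit_failed",
)


def aggregate(trial_states):
    """Count trials per state, reporting zero for states that do not occur."""
    counts = Counter(trial_states)
    unexpected = set(counts) - set(STATES)
    if unexpected:
        raise ValueError(f"unexpected states: {sorted(unexpected)}")
    return {state: counts.get(state, 0) for state in STATES}


if __name__ == "__main__":
    print(aggregate(["completed", "completed", "failed", "pending"]))
```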
Use JSON for notebooks, dashboards, or CI automation:
hpc-compose sweep submit -f train.yaml --format json
hpc-compose sweep status -f train.yaml --format json
hpc-compose sweep status -f train.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f train.yaml --format json
The JSON includes the sweep id, manifest path, matrix mode, persisted seed, trial variables, job ids, record paths, and per-trial status.
Manifest Layout
Sweep state is stored beside normal tracked jobs:
.hpc-compose/
  sweeps/
    latest.json
    <sweep-id>/
      sweep.json
      t000.sbatch
      t001.sbatch
  jobs/
    <job-id>.json
Sweep-trial records have kind: sweep_trial and include sweep metadata. They do not update the normal latest.json or latest-run.json pointers, so status, watch, and logs for ordinary runs keep their existing meaning.
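For scripts that need to locate sweep files, the layout can be expressed as path construction. This sketch assumes the tree nests as sweeps/<sweep-id>/ with one tNNN.sbatch script per trial. The function name sweep_paths is hypothetical, and the sweep id in the usage line is only a placeholder.

```python
from pathlib import PurePosixPath


def sweep_paths(root, sweep_id, trial_count):
    """Build the expected manifest and trial-script paths for one sweep."""
    sweep_dir = PurePosixPath(root) / "sweeps" / sweep_id
    return {
        "manifest": sweep_dir / "sweep.json",
        # One rendered script per trial, named after the trial label.
        "scripts": [sweep_dir / f"t{index:03d}.sbatch" for index in range(trial_count)],
    }


if __name__ == "__main__":
    paths = sweep_paths(".hpc-compose", "sweep-123", 2)
    print(paths["manifest"])
    for script in paths["scripts"]:
        print(script)
```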
V1 Limitations
- Sweeps must be embedded in the same compose file. sweep.spec is rejected in v1.
- Each trial is a separate Slurm allocation. Sweeps are not Slurm arrays. x-slurm.array is rejected during sweep submit.
- Trials submit sequentially. If a submission fails, later trials are not submitted and the partial manifest is kept.
- sweep status summarizes scheduler/tracking state only. It does not parse metric files or pick a best trial.
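The sequential-submission behavior in the list above can be pictured with a short Python sketch. The record shape, the "submitted" marker, and the choice to record the failing trial as submit_failed are assumptions made for illustration. The function name submit_sequentially and the submit callback are hypothetical stand-ins for the real sbatch call.

```python
def submit_sequentially(trial_labels, submit):
    """Submit trials in order and stop at the first failure, keeping partial results."""
    manifest = []
    for label in trial_labels:
        try:
            job_id = submit(label)
        except RuntimeError as error:
            # Assumption: the failing trial is recorded, and later trials are never attempted.
            manifest.append({"trial": label, "status": "submit_failed", "error": str(error)})
            break
        manifest.append({"trial": label, "status": "submitted", "job_id": job_id})
    return manifest


if __name__ == "__main__":
    def fake_submit(label):
        # Stand-in for sbatch: the second trial is rejected.
        if label == "t001":
            raise RuntimeError("sbatch rejected the job")
        return "1000"

    for record in submit_sequentially(["t000", "t001", "t002"], fake_submit):
        print(record)
```

In this model, t002 never appears in the result because submission stops after t001 fails.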