Runbook
This runbook describes the normal real-cluster flow for adapting an hpc-compose spec on a supported Linux Slurm submission host.
If you are new to Slurm, read Slurm And Container Basics first. If you are adapting to HAICORE@KIT, read HAICORE Guide alongside this runbook.
Commands below assume hpc-compose is on your PATH. If you are running from a local checkout, replace hpc-compose with target/release/hpc-compose.
Compose-aware commands accept -f / --file. When omitted, hpc-compose uses the active context compose file from .hpc-compose/settings.toml, then falls back to compose.yaml in the current directory. Global context flags are available everywhere:
- --profile <NAME> selects a profile from .hpc-compose/settings.toml.
- --settings-file <PATH> uses an explicit settings file instead of upward auto-discovery.
Read Slurm And Container Basics, Execution Model, Runtime Backends, and Support Matrix before adapting a workflow to a new cluster.
Before You Start
Make sure you have:
- a Linux submission host with srun and sbatch,
- the runtime backend selected by runtime.backend,
- scontrol when x-slurm.nodes > 1,
- Pyxis support in srun when runtime.backend: pyxis (srun --help should mention --container-image),
- shared storage for the resolved cache directory,
- local source trees or local .sqsh/.sif images in place,
- registry credentials when your cluster or registry requires them.
Backend-specific requirements are listed in Runtime Backends. Cluster profile generation and MPI smoke probes are covered in Cluster Profiles.
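The tool items in the checklist above can be verified by hand before the first run. The following is a minimal sketch and not part of hpc-compose: missing_tools and srun_has_pyxis are hypothetical helper names, and which tools you pass in depends on your backend and node count.

```shell
# Hypothetical helper (not part of hpc-compose): print each named tool
# that cannot be found on PATH, one per line. Prints nothing when all exist.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: succeed when `srun --help` mentions --container-image,
# which the checklist names as the sign of Pyxis support.
srun_has_pyxis() {
  srun --help 2>&1 | grep -q -e '--container-image'
}
```

For example, missing_tools srun sbatch scontrol prints nothing on a host that has all three, and srun_has_pyxis returns a non-zero status when Pyxis support is not visible. The preflight command remains the authoritative check.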
Normal Progression
For a new spec on a real cluster:
- Choose a starter from Examples, or run hpc-compose new --template <name> --name my-app --output compose.yaml.
- Run hpc-compose setup once if you want compose path, env files, env vars, and binary overrides stored in a project-local settings file.
- Run hpc-compose context --format json to verify resolved values and sources.
- Set or confirm the resolved cache directory, then adjust cluster-specific resource settings.
- Run hpc-compose plan -f compose.yaml and hpc-compose plan --verbose -f compose.yaml while adapting the file.
- Run hpc-compose up -f compose.yaml for the normal cluster run.
- If it fails, start with hpc-compose debug -f compose.yaml --preflight, then use Troubleshooting and break out preflight, prepare, render, status, ps, watch, stats, or logs separately.
For a minimal cluster smoke test from a checkout, set CACHE_DIR to shared storage and run scripts/cluster_smoke.sh. It validates, preflights, and renders by default; set HPC_COMPOSE_SMOKE_SUBMIT=1 only when you intentionally want it to launch the smoke job.
Project-Local Settings
hpc-compose can discover .hpc-compose/settings.toml by walking upward from the current directory. You can also pin a file with --settings-file.
Typical setup flow:
hpc-compose setup
hpc-compose context
hpc-compose --profile dev context --format json
Non-interactive setup is available for scripting:
hpc-compose setup --profile-name dev --compose-file compose.yaml --env-file .env --env-file .env.dev --cache-dir '<shared-cache-dir>' --default-profile dev --non-interactive
Settings file shape:
version = 1
default_profile = "dev"
[defaults]
compose_file = "compose.yaml"
env_files = [".env"]
[defaults.env]
CACHE_DIR = "/cluster/shared/hpc-compose-cache"
[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"
[profiles.dev]
compose_file = "compose.yaml"
env_files = [".env", ".env.dev"]
[profiles.dev.env]
RESUME_DIR = "/shared/$USER/runs/my-run"
MODEL_DIR = "$HOME/models"
[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"
[resource_profiles.cpu-small]
time = "00:30:00"
cpus_per_task = 4
mem = "16G"
[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"
Resolution precedence is fixed:
- CLI flags
- selected profile values
- shared settings defaults
- built-in CLI defaults
Use context whenever you want to inspect effective compose path, binaries, interpolation variables, runtime paths, and per-field sources.
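The precedence order amounts to "the first source that provides a value wins". A minimal sketch of that idea for the compose file path follows. It is not hpc-compose's resolver: first_set is a hypothetical helper, compose.dev.yaml is a made-up profile value, and the real resolution works per field and may treat empty values differently.

```shell
# Hypothetical helper (not hpc-compose code): print the first non-empty
# argument. Pass candidates from highest to lowest precedence.
first_set() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidates in the documented order: CLI flag, selected profile,
# shared defaults, built-in default. The empty string stands for
# "no -f flag given"; compose.dev.yaml is a hypothetical profile value.
cli_compose_file=""
profile_compose_file="compose.dev.yaml"
defaults_compose_file="compose.yaml"
builtin_compose_file="compose.yaml"

effective_compose_file="$(first_set "$cli_compose_file" "$profile_compose_file" \
  "$defaults_compose_file" "$builtin_compose_file")"
```

With these values the profile entry wins because no CLI flag was given. To see what hpc-compose actually resolved, and from which source, rely on hpc-compose context rather than on a sketch like this.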
Resource profiles are referenced from YAML with x-slurm.resources: gpu-small. They are Slurm resource defaults, not the same thing as the global --profile setting selector, and explicit x-slurm values in the spec override profile defaults.
Choose A Starting Example
The maintained selection guide is Examples. It includes:
- four promoted beginner paths,
- a novice ladder from authoring to distributed workloads,
- the full repository example matrix,
- companion notes for LLM worker examples,
- an adaptation checklist.
Keep docs/src/examples.md as the single source of truth for example selection. The embedded YAML source appendix is Example Source.
1. Choose A Cache Directory Early
Set the cache default to a path visible from both the login node and compute nodes:
[profiles.dev.cache]
dir = "/cluster/shared/hpc-compose-cache"
Or set x-slurm.cache_dir directly in the spec when the cache path should travel with that file:
x-slurm:
cache_dir: /cluster/shared/hpc-compose-cache
Quick recipe:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
Rules:
- Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm.
- If cache_dir is unset in the spec, resolution checks profile cache settings, then defaults cache settings, then $HOME/.cache/hpc-compose.
- The default may work on some clusters, but a shared project/work/scratch path is safer.
- Validation can accept unsafe local paths; preflight reports them as policy errors.
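The first rule can be checked in a script before preflight runs. This is a minimal sketch that only matches the four locations named in the rules; is_node_local_path is a hypothetical helper, and the real preflight policy may resolve symlinks or apply further checks.

```shell
# Hypothetical helper (not hpc-compose's actual check): succeed when the
# path is one of the node-local locations listed in the rules or lies
# below one of them.
is_node_local_path() {
  case "$1" in
    /tmp|/tmp/*|/var/tmp|/var/tmp/*|/private/tmp|/private/tmp/*|/dev/shm|/dev/shm/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

A wrapper script could call is_node_local_path "$CACHE_DIR" and stop early with its own message when the check succeeds.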
More cache details are in Cache Management.
2. Adapt The Example
Start with the nearest example and then change:
- image
- command / entrypoint
- volumes
- environment
- x-slurm resource settings
- x-runtime.prepare commands for dependencies or tooling
Recommended pattern:
- Put fast-changing application code in volumes.
- Put slower-changing dependency installation in x-runtime.prepare.commands.
- Add readiness only to services that other services truly depend on.
3. Validate The Spec
hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env
Use validate first when changing field names, dependency shape, command/entrypoint form, paths, x-slurm, x-runtime, or compatibility x-enroot blocks.
If validate fails, fix that before doing anything more expensive. Use --strict-env when missing interpolation variables should fail instead of consuming ${VAR:-default} or ${VAR-default} fallbacks.
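The two fallback forms named above differ in how they treat a variable that is set but empty. The sketch below shows the POSIX shell behavior with a hypothetical variable name; that hpc-compose interpolation follows the same distinction is an assumption, since this runbook only names the two forms.

```shell
# POSIX shell semantics, shown with a hypothetical variable name.
unset DEMO_VAR
unset_colon="${DEMO_VAR:-fallback}"   # unset: the fallback is used
unset_plain="${DEMO_VAR-fallback}"    # unset: the fallback is used

DEMO_VAR=""
empty_colon="${DEMO_VAR:-fallback}"   # set but empty: the fallback is used
empty_plain="${DEMO_VAR-fallback}"    # set but empty: the empty value is kept
```

In both forms a missing variable quietly takes the fallback, which is the case --strict-env turns into a failure.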
4. Plan The Run
hpc-compose plan -f compose.yaml
hpc-compose plan --verbose -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
Check:
- service order,
- allocation geometry and service step geometry,
- normalized image references,
- host-to-container mount mappings,
- resolved environment values,
- runtime artifact paths,
- cache hit/miss expectations.
plan is purely static: it parses, validates, builds the normalized runtime plan, and can print the generated script to stdout, but it does not run preflight, prepare images, call sbatch, or write hpc-compose.sbatch. Add --explain for planner hints about cache paths, missing artifacts, resume/artifact settings, and the next command. plan --verbose can print secrets from resolved environment values.
5. Normal Run: Use up
hpc-compose up -f compose.yaml
up is the preferred end-to-end cluster flow. It runs preflight unless disabled, prepares images unless skipped, renders the script, calls sbatch, records tracked job metadata, polls scheduler state, and streams logs.
It also uses a spec-scoped lock under .hpc-compose/locks/ so two concurrent up invocations against the same compose file do not race through prepare/render/submit.
Useful options:
- --script-out path/to/job.sbatch keeps a copy of the rendered script.
- --force-rebuild refreshes imported and prepared artifacts.
- --skip-prepare reuses existing prepared artifacts.
- --no-preflight skips the preflight phase.
- --detach submits or launches, records tracking metadata, and returns without watching.
- --format text|json is accepted with --detach or --dry-run.
- --watch-queue waits in line-oriented queue output until the Slurm job reaches RUNNING, then opens the normal watch view.
- --queue-warn-after <DURATION> warns once when --watch-queue stays PENDING longer than the threshold; the default is 10m, and 0 disables the warning.
- --watch-mode auto|tui|line selects the live output mode; --no-tui is a line-mode alias.
- --hold-on-exit never|failure|always controls whether the TUI stays open after the job reaches a terminal scheduler state.
- --resume-diff-only prints resume-sensitive config diffs without launching.
- --allow-resume-changes confirms intentional resume-coupled config drift.
up --local is Linux + Pyxis-only and single-host. See Runtime Backends.
Array jobs should be submitted with up --detach; use SLURM_ARRAY_TASK_ID in the service command and output patterns such as %A_%a for task-specific logs. Scheduler dependencies declared with x-slurm.after_job or x-slurm.dependency are passed to sbatch --dependency=... at submit time. Arrays and scheduler dependencies are not supported by up --local.
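Inside a service command, SLURM_ARRAY_TASK_ID is usually used to pick a task's share of the work. The fragment below is a minimal sketch of that pattern: shard_for_task and the data/shard-NNN.jsonl layout are hypothetical, and the fallback to 0 only exists so the fragment can be tried outside an array job.

```shell
# Hypothetical service command fragment: map the array task index to an
# input shard. The data/ path layout is an assumption for illustration.
shard_for_task() {
  printf 'data/shard-%03d.jsonl\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 when
# the variable is absent (for example, when testing by hand).
task_id="${SLURM_ARRAY_TASK_ID:-0}"
input_file="$(shard_for_task "$task_id")"
```

Combined with an output pattern such as %A_%a, each array task then reads its own shard and writes its own log.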
For conditional submission on a busy partition, use when:
hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345
hpc-compose when -f compose.yaml --between 22:00-06:00
when is a foreground monitor. Interrupt it with Ctrl-C to stop waiting before the job is submitted. It runs preflight, image preparation, and script rendering before the wait begins, so submission is immediate once the conditions match; use --skip-prepare only when the required runtime artifacts already exist. --detach applies after submission: it still waits in the foreground for conditions, then returns after tracking metadata is written instead of opening the watch view.
Idle-node checks are advisory, not reservations. Another user can still submit first, and Slurm may queue the job after when calls sbatch. Keep polling gentle on shared login nodes: the default 60s interval is a good starting point, and intervals below 30s should be reserved for short, intentional watches.
For interactive development inside one allocation, use alloc:
hpc-compose alloc -f compose.yaml
hpc-compose run app -- python -m pytest
Inside the allocation shell, run SERVICE -- CMD reuses the active allocation with srun instead of submitting a new sbatch job. alloc exports HPC_COMPOSE_* metadata for the compose file, cache directory, runtime backend, and allocated nodes.
6. Run Preflight When Debugging Cluster Readiness
hpc-compose preflight -f compose.yaml
hpc-compose preflight --verbose -f compose.yaml
hpc-compose preflight -f compose.yaml --strict
preflight checks selected-backend tools, Slurm tools, cache path policy, local mounts/images, registry credentials, cluster profile compatibility, distributed-readiness hazards, metrics collector tools, and resume path safety.
Generate a cluster capability profile on the target login node when you want validation and preflight to catch partition/backend/QOS/GPU/MPI mismatches earlier:
hpc-compose doctor cluster-report
See Cluster Profiles for generated profile details, site policy packs, and MPI smoke probes.
7. Prepare Images Separately When Needed
hpc-compose prepare -f compose.yaml
hpc-compose prepare -f compose.yaml --force
Use this when you want to build or refresh prepared images before submission, confirm cache reuse behavior, or debug preparation separately from job submission.
prepare needs the selected runtime backend tools, but it does not call sbatch.
8. Render The Batch Script
hpc-compose render -f compose.yaml --output /tmp/job.sbatch
This is useful when debugging generated srun arguments, mounts, environment passing, launch order, and readiness waits.
9. Inspect A Tracked Run
hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose status -f compose.yaml --array
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose stats -f compose.yaml --format jsonl
For a failed run, a practical investigation path is hpc-compose jobs list, then hpc-compose replay -f compose.yaml --job-id <job-id> to find the failure moment, then debug, logs, or stats for deeper evidence. Use Runtime Observability for tracked state, replay, logs, metrics, and machine-readable output. Use Artifacts and Resume for artifact bundles and resume-aware attempts.
10. Manage Cache And Old State
hpc-compose cache list
hpc-compose cache inspect -f compose.yaml
hpc-compose cache prune --all-unused -f compose.yaml
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'
hpc-compose clean -f compose.yaml --age 7 --dry-run
Use Cache Management for cache reuse and pruning. Use Troubleshooting before deleting tracked job directories.
What Changed And What Should I Run?
| If you changed… | Typical next step |
|---|---|
| YAML planning/runtime settings only | plan --verbose, then up |
| Base image, x-runtime.prepare.commands, or prepare env | up --force-rebuild, or prepare --force when debugging separately |
| Mounted runtime source under volumes | Usually just up |
| Cache entries this plan no longer references | cache prune --all-unused -f compose.yaml |
| hpc-compose itself | Expect cache misses on the next prepare or up, then optionally prune old entries |