hpc-compose
Compose-style multi-service workflows, compiled into one inspectable Slurm job.
hpc-compose gives research and HPC teams a small YAML authoring model for services, startup order, readiness checks, runtime backends, logs, artifacts, and follow-up commands.
services:
  app:
    image: python:3.12-slim
    command: python train.py
$ hpc-compose plan --show-script -f compose.yaml
spec is valid
service order: app
#SBATCH --job-name=my-app
Use hpc-compose when you want Docker Compose-style authoring on Slurm without adding Kubernetes, a long-running control plane, or custom cluster-side services.
Start with the Support Matrix before planning a real runtime workflow. Linux is the maintained runtime target; macOS is intended for authoring, validation, rendering, and inspection.
Safe First Path
These commands are safe from a laptop, workstation, or login node because new writes a local starter spec and plan is purely static:
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
For real cluster runs, configure a cache path visible from both the Slurm submission host and compute nodes through one of `x-slurm.cache_dir`, `hpc-compose setup --cache-dir`, or the `[defaults.cache]` / `[profiles.<name>.cache]` settings. From a source checkout, you can also inspect the checked-in examples with `hpc-compose plan -f examples/minimal-batch.yaml`.
Expected output includes:
spec is valid
service order: app
Rendered script:
Run hpc-compose up -f compose.yaml only on a supported Linux Slurm submission host with the runtime backend your spec selects. If it fails, start with hpc-compose debug -f compose.yaml --preflight.
Download the asciinema-style quickstart demo cast if you want the same flow as a terminal recording.
Terms To Know
| Term | Meaning |
|---|---|
| spec | The YAML file that describes services, runtime backend, and Slurm settings. |
| allocation | The Slurm job allocation where all planned services run. |
| runtime backend | The mechanism used to launch services: Pyxis/Enroot, Apptainer, Singularity, or host. |
| preflight | Checks that inspect local tools, paths, backend support, and optional cluster profiles before a run. |
| prepare | The login-node image import/customization phase used before compute-node runtime. |
| tracked job | Metadata under .hpc-compose/<job-id>/ that lets status, ps, watch, logs, stats, and artifacts reconnect later. |
| x-slurm | The spec section for Slurm settings and hpc-compose runtime extensions. |
What It Is For
- model serving plus helper services inside one Slurm allocation
- data and ETL pipelines with startup ordering or stage-completion dependencies
- training jobs with checkpoint export, artifact tracking, and resume-aware reruns
- explicit multi-node launch patterns that still fit inside one allocation
What It Is Not
hpc-compose is not a full Docker Compose runtime and is not a general cluster orchestrator.
Unsupported Compose features include:
- `build:`
- `ports`
- `networks` / `network_mode`
- Compose `restart` as a Docker key
- `deploy`
- dynamic node bin packing
For exact boundaries, read Execution Model, Supported Slurm Model, and Spec Reference.
Read Next
- Quickstart for the shortest safe path.
- Examples to choose a starting spec.
- Runtime Backends before changing `runtime.backend`.
- Runbook when adapting a real workload on a cluster.
- Troubleshooting when the first cluster run fails.
Reference
Support Matrix
This page separates what hpc-compose can build, what CI currently exercises, and what is officially supported for real workflows.
Support levels
| Level | Meaning |
|---|---|
| Officially supported | Maintained target for user-facing workflows and issue triage |
| CI-tested | Exercised in the repository’s automated checks today |
| Release-built | Prebuilt archive is published, but that is not a promise of full runtime support |
Officially supported
| Platform | Scope | Notes |
|---|---|---|
| Linux x86_64 | Full CLI and runtime workflows | Requires Slurm client tools plus at least one supported runtime backend: Pyxis/Enroot, Apptainer, Singularity, or host software modules |
| Linux arm64 | Full CLI and runtime workflows | Same cluster requirements as Linux x86_64 |
| macOS x86_64 | Authoring and local non-runtime commands | Suitable for project-local authoring flows such as new, setup, context, plan, validate, inspect, render, and completions; not for Slurm/Enroot runtime commands |
| macOS arm64 | Authoring and local non-runtime commands | Same scope as macOS x86_64 |
CI-tested
| Platform | What is tested today |
|---|---|
| Ubuntu 24.04 x86_64 | formatting, clippy, unit/integration tests, docs build, link checks, installer smoke tests, and coverage |
| macOS arm64 | authoring-focused tests, validate/render/schema smoke tests, installer smoke tests, and Homebrew smoke tests |
| macOS x86_64 | authoring-focused tests, validate/render/schema smoke tests, and Homebrew smoke tests |
Current CI validates full runtime-facing behavior on Ubuntu and authoring/distribution behavior on macOS. Other published builds should be treated as lower-confidence until corresponding CI coverage exists.
Release-built
| Platform | Status |
|---|---|
| Linux x86_64 | Release archive published |
| Linux arm64 | Release archive published |
| macOS x86_64 | Release archive published |
| macOS arm64 | Release archive published |
| Windows x86_64 | Release archive published, but runtime workflows are not officially supported |
Windows status
Windows archives are published so users can inspect the CLI surface or experiment with non-runtime commands, but Windows is currently release-built only:
- Slurm plus HPC runtime workflows are not an officially supported Windows target.
- Issues that are specific to Windows runtime execution may be closed as out of scope until the support policy changes.
Cluster assumptions for full support
For full runtime support on Linux, the target environment should provide:
- `sbatch`, `srun`, and related Slurm client tools on the submission host
- one supported runtime path:
  - Pyxis container support in `srun` plus Enroot on the submission host,
  - Apptainer on the submission host and compute nodes,
  - Singularity on the submission host and compute nodes,
  - or module/vendor software available on the host runtime path
- shared storage for the resolved cache directory
Use Runtime Backends, Runbook, and Execution Model before adapting a real workload to a cluster.
Installation
For normal use, install from a published GitHub Release. Build from source when you are developing the project or need to inspect a local checkout before using it on a cluster.
Install From A Published Release
Pick the release tag you want from the GitHub Releases page and pin it:
RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
| env HPC_COMPOSE_VERSION="${RELEASE_TAG}" sh
The installer downloads the matching archive for the current Linux or macOS machine, verifies the published .sha256 sidecar, installs hpc-compose into ~/.local/bin by default, and installs shipped Unix manpages when present.
After installation, make sure the install directory is on your shell PATH and verify the binary:
export PATH="$HOME/.local/bin:$PATH"
command -v hpc-compose
hpc-compose --version
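If you prefer not to prepend the install directory on every shell start, a small guard can check `PATH` first. This is a minimal sketch; `dir_on_path` is a hypothetical helper name, not part of hpc-compose or its installer.

```sh
# Hypothetical helper: succeeds when the given directory is already an
# exact entry in PATH (entries are separated by ':').
dir_on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: prepend the default install directory only when it is missing.
#   dir_on_path "$HOME/.local/bin" || export PATH="$HOME/.local/bin:$PATH"
```

Wrapping `PATH` in colons lets the pattern match the first and last entries without special cases.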
Useful overrides:
RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
| env HPC_COMPOSE_INSTALL_DIR=/usr/local/bin HPC_COMPOSE_VERSION="$RELEASE_TAG" sh
Installer availability does not imply full runtime support. Check the Support Matrix before assuming a platform can run submission, prepare, or watch workflows end to end.
About The main Installer Script
Fetching install.sh from main without HPC_COMPOSE_VERSION does not install unreleased main:
curl -fsSL https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/main/install.sh | sh
That command runs the moving script from main, but the script resolves the latest published GitHub Release and downloads from releases/download/<tag>/.... Use the version-pinned command above for reproducible installs. Use a source checkout when you want unreleased code.
Manual Release Download
Prebuilt archives are published on the release page. Pick the archive that matches your platform.
Example for Linux x86_64:
RELEASE_TAG=vX.Y.Z
curl -L "https://github.com/NicolasSchuler/hpc-compose/releases/download/${RELEASE_TAG}/hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.tar.gz" -o hpc-compose.tar.gz
tar -xzf hpc-compose.tar.gz
./hpc-compose --help
Linux x86_64 releases use a musl target to avoid common cluster glibc mismatches. Unix release archives also contain share/man/man1/.
Windows release archives are zip-only for inspection and checksum parity. The installer script and end-to-end Slurm runtime workflows target Unix-like systems; use Windows primarily through WSL or a remote Linux/macOS authoring environment.
Native Packages
Published Linux releases may include .deb and .rpm assets:
RELEASE_TAG=vX.Y.Z
sudo apt install "./hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.deb"
sudo dnf install "./hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.rpm"
Package availability does not change runtime support policy. Linux cluster workflows still need Slurm client tools, the selected runtime backend, and shared storage for the resolved cache directory.
Homebrew On macOS
The repository exposes a same-repo Homebrew tap:
brew install NicolasSchuler/hpc-compose/hpc-compose
The formula is refreshed by release automation when a Homebrew-published release is cut. Check brew info NicolasSchuler/hpc-compose/hpc-compose when you need to confirm the formula version before installing.
macOS support is for authoring and local non-runtime commands such as new, plan, validate, inspect, render, and completions; it is not a supported Slurm runtime target.
Verify A Release
Use GitHub-native verification as the primary trust path for published binaries.
- Verify the release:
RELEASE_TAG=vX.Y.Z
gh release verify "$RELEASE_TAG" -R NicolasSchuler/hpc-compose
- Verify a downloaded asset:
RELEASE_TAG=vX.Y.Z
ASSET="hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.tar.gz"
gh release download "$RELEASE_TAG" -R NicolasSchuler/hpc-compose -p "$ASSET"
gh release verify-asset "$RELEASE_TAG" "./$ASSET" -R NicolasSchuler/hpc-compose
- Verify the artifact attestation directly:
gh attestation verify "./$ASSET" \
--repo NicolasSchuler/hpc-compose \
--signer-workflow NicolasSchuler/hpc-compose/.github/workflows/release.yml
Published releases also ship SHA256SUMS and per-asset .sha256 files. Those checksums are primarily for installer compatibility, mirroring, and corruption checks; attestations are the stronger authenticity signal.
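For a manual corruption check against a per-asset sidecar, something like the following could work. It is a sketch under two assumptions: the `.sha256` file uses the `<digest>  <file name>` layout that GNU `sha256sum` emits, and the asset and its sidecar sit in the same directory. `verify_sidecar` is a hypothetical helper name; on macOS, `shasum -a 256 -c` is the usual substitute for `sha256sum -c`.

```sh
# Hypothetical helper, not part of hpc-compose.
# Assumption: "<asset>.sha256" is in sha256sum's "<digest>  <file name>" format.
verify_sidecar() {
  asset=$1
  # Run in a subshell so the caller's working directory is unchanged.
  # sha256sum -c resolves the listed file name relative to the current directory.
  (
    cd "$(dirname "$asset")" &&
      sha256sum -c "$(basename "$asset").sha256" >/dev/null 2>&1
  )
}

# Usage:
#   verify_sidecar "./$ASSET" && echo "checksum matches"
```

A matching checksum only rules out corruption or a mismatched mirror copy; it does not replace the attestation checks above.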
Internal Mirrors And Cluster-Admin Installs
For internal mirrors, preserve release filenames exactly, including:
- platform archives or native packages
- `SHA256SUMS`
- each per-asset `.sha256` sidecar
Then point the installer at the mirrored base URL and pin the matching version:
RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
| env HPC_COMPOSE_BASE_URL="https://mirror.example.org/hpc-compose/${RELEASE_TAG}" \
HPC_COMPOSE_VERSION="$RELEASE_TAG" sh
HPC_COMPOSE_VERSION is required when HPC_COMPOSE_BASE_URL is set so the installer, mirrored assets, and checksum files stay aligned.
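A wrapper script used by cluster admins could enforce that pairing before it ever invokes the installer. The function below is a hypothetical pre-check written for illustration; it is not taken from `install.sh`, and only the two variable names come from this page.

```sh
# Hypothetical pre-check, not part of install.sh: refuse a mirror URL
# that is not accompanied by a pinned version.
check_mirror_env() {
  if [ -n "${HPC_COMPOSE_BASE_URL:-}" ] && [ -z "${HPC_COMPOSE_VERSION:-}" ]; then
    echo "HPC_COMPOSE_VERSION must be set when HPC_COMPOSE_BASE_URL is set" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_mirror_env || exit 1
```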
Build From Source
Use this path for development, unreleased testing, or local inspection:
git clone https://github.com/NicolasSchuler/hpc-compose.git
cd hpc-compose
cargo build --release
./target/release/hpc-compose --help
Before using a local build on a cluster workflow, validate the binary and one example spec:
env CACHE_DIR=/cluster/shared/hpc-compose-cache \
target/release/hpc-compose validate -f examples/minimal-batch.yaml
env CACHE_DIR=/cluster/shared/hpc-compose-cache \
target/release/hpc-compose plan --verbose -f examples/minimal-batch.yaml
Local Docs Commands
The repo ships two documentation layers:
- `mdbook` for the user manual
- `cargo doc` for contributor-facing crate internals
Useful commands:
mdbook build docs
mdbook serve docs
cargo doc --no-deps
Regenerate checked-in manpages from a checkout with:
cargo run --locked --features manpage-bin --bin gen-manpages
cargo test --locked --test release_metadata
man -l man/man1/hpc-compose.1
Quickstart
This is the shortest safe path from an empty shell to a static plan, a first real Slurm run, and one-command failure triage.
If Slurm terms such as sbatch, srun, allocation, job step, Pyxis, or Enroot are unfamiliar, read Slurm And Container Basics before the first real cluster run.
1. Install The CLI
For normal use, install from the latest published GitHub Release and pin the tag you selected:
RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
| env HPC_COMPOSE_VERSION="${RELEASE_TAG}" sh
Replace vX.Y.Z with the published release tag shown on the release page.
The installer places hpc-compose in ~/.local/bin by default and verifies the release checksum sidecar before installing. Release verification, manual downloads, package-manager installs, and source-checkout builds are covered in Installation.
If your shell does not find the command immediately, add the default install directory to your PATH:
export PATH="$HOME/.local/bin:$PATH"
hpc-compose --version
2. Learn The Safe Authoring Path First
plan is the safe authoring command. It does not call sbatch, does not import images, and does not write a script file.
Create a starter spec first:
hpc-compose new \
--template minimal-batch \
--name my-app \
--output compose.yaml
If you want a guided learning path instead of a single starter template, run the Spec Metamorphosis tutorial:
hpc-compose evolve --output compose.yaml
Then inspect the static plan:
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
Expected output includes:
spec is valid
service order: app
This is the right first path on macOS, a laptop, or any machine where you want to evaluate the authoring model before touching a real cluster. The same flow is also available as an asciinema-style demo cast, but the snippets above are the accessible reference output.
The normal workflow to remember is:
hpc-compose plan -f compose.yaml
hpc-compose up -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
3. Choose A Starting Spec
Use the built-in starter templates when you want a fresh compose.yaml with your application name filled in:
hpc-compose new \
--template minimal-batch \
--name my-app \
--output compose.yaml
Add --cache-dir '<shared-cache-dir>' when you want the generated file to include an explicit x-slurm.cache_dir. Otherwise the plan uses the active settings cache default or $HOME/.cache/hpc-compose.
From a source checkout, you can also inspect a known-good repository example:
hpc-compose plan -f examples/minimal-batch.yaml
The Examples page is the single selection guide for beginner, LLM, training, distributed, and pipeline workflows.
Use Spec Metamorphosis when you want to learn those concepts progressively in one evolving valid spec.
4. Pick And Test A Cache Directory
cache_dir is optional in the spec, but real clusters usually need a site-specific shared path because image preparation happens before the job starts and compute nodes must later see those artifacts.
Ask your cluster documentation or support team for a project scratch, work, or shared filesystem path, then test it:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
Persist it in project settings when you want the same value every time:
hpc-compose setup --profile-name dev --cache-dir "$CACHE_DIR" --default-profile dev --non-interactive
Or keep using an environment-backed explicit spec value and persist it next to your copied spec:
printf 'CACHE_DIR=%s\n' "$CACHE_DIR" > .env
Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm for x-slurm.cache_dir. Validation may accept those strings, but preflight reports them as unsafe because prepare happens before runtime and compute nodes must later see the cached artifacts.
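Before running preflight, you can screen a candidate path locally. The sketch below only mirrors the prefixes listed above and a basic writability test; it is not hpc-compose's preflight logic, it cannot prove that a path is actually shared with compute nodes, and both function names are hypothetical.

```sh
# Hypothetical helpers, not part of hpc-compose.
# Succeeds when the path is one of the documented temporary locations
# or lives underneath one of them.
is_unsafe_cache_dir() {
  case "$1" in
    /tmp|/tmp/*|/var/tmp|/var/tmp/*|/private/tmp|/private/tmp/*|/dev/shm|/dev/shm/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Prints a one-line verdict and returns non-zero for unusable paths.
check_cache_dir() {
  dir=$1
  if is_unsafe_cache_dir "$dir"; then
    echo "unsafe: $dir is a temporary path that compute nodes may not see"
    return 1
  fi
  if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
    echo "unusable: $dir is missing or not writable"
    return 1
  fi
  echo "ok: $dir"
}

# Usage:
#   check_cache_dir "$CACHE_DIR"
```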
5. Before Your First Cluster Run
| Command category | Where to run it | Required tools | Notes |
|---|---|---|---|
| Authoring: new, plan, validate, inspect, render, config, schema | laptop, workstation, or login node | hpc-compose | plan is the recommended static pre-run check. |
| Prepare: prepare | Linux host with selected runtime backend | Pyxis needs Enroot; Apptainer needs apptainer; Singularity needs singularity; host backend needs no container runtime | Does not call sbatch, but needs runtime tools for image work. |
| Cluster checks: preflight, doctor cluster-report | Linux Slurm login node | Slurm client tools plus selected backend tools | Use preflight --strict when warnings should block launch. |
| Run: up, run | Linux Slurm login node | sbatch, srun, scheduler tools, selected backend tools | up is the normal cluster execution path. |
| Local launch: up --local | Linux host only | Enroot and runtime.backend: pyxis | Single-host only; not a distributed Slurm substitute. |
For Pyxis, srun --help should mention --container-image.
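That check is easy to script. The helper below is a small sketch with a hypothetical name; it reads help text on standard input so it can be exercised without Slurm, and it only looks for the flag string, which says nothing about whether Enroot itself is configured correctly.

```sh
# Hypothetical helper: succeeds when the help text on stdin mentions
# the --container-image flag that Pyxis adds to srun.
advertises_container_image() {
  # -e keeps grep from treating the leading dashes as its own options.
  grep -q -e '--container-image'
}

# Usage on a Slurm login node:
#   srun --help 2>&1 | advertises_container_image && echo "Pyxis flag found"
```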
6. Submit On A Real Cluster
When you move to a supported Linux submission host, the normal run is:
hpc-compose up -f compose.yaml
up runs preflight, prepares missing artifacts, renders the batch script, submits it through sbatch, then follows scheduler state and tracked logs. On an interactive TTY it opens the full-screen watch UI; otherwise it falls back to line-oriented output. Add --watch-queue when you want line-oriented queue polling until the Slurm job reaches RUNNING before the normal watch view opens; --queue-warn-after <DURATION> controls the one-time long-pending warning. The watch UI holds the final screen on failures by default; use --hold-on-exit never|failure|always to tune that behavior. Use hpc-compose up --detach -f compose.yaml when you want submit-and-return behavior.
Success looks like:
- the job is submitted or launched
- a tracked job id is recorded
- the watch UI or text follower shows scheduler progress
- `status`, `ps`, and `logs` can reconnect to the tracked run later
7. If The First Cluster Run Fails
| Symptom | Best next command | Why |
|---|---|---|
| Missing sbatch, srun, enroot, apptainer, or singularity | hpc-compose debug -f compose.yaml --preflight | Reruns prerequisite checks and keeps the latest tracked context in one report. |
| srun does not advertise --container-image | hpc-compose doctor cluster-report | Pyxis support is unavailable or not loaded on that node. |
| Job submitted but no service log appeared | hpc-compose debug -f compose.yaml | Shows scheduler state, batch log tail, service log hints, and the next command. |
| Cache path warning or error | hpc-compose debug -f compose.yaml --preflight | Confirms whether x-slurm.cache_dir is shared and writable. |
| Services start in the wrong order | hpc-compose plan --explain --verbose -f compose.yaml | Shows normalized dependencies, readiness gates, and planner hints before running. |
The longer symptom guide is Troubleshooting.
8. Revisit A Tracked Run Later
hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose stats -f compose.yaml
hpc-compose logs -f compose.yaml --follow
Use jobs list first when you need to rediscover tracked runs under the current repo tree. Use ps for a stable per-service snapshot, watch to reconnect to the live UI, and logs --follow for a text-only follower.
From A Source Checkout
If you are developing from a local checkout instead of an installed binary:
cargo build --release
target/release/hpc-compose validate -f examples/minimal-batch.yaml
target/release/hpc-compose plan -f examples/minimal-batch.yaml
target/release/hpc-compose plan --show-script -f examples/minimal-batch.yaml
Read Next
Examples
These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.
There are two starting points:
- built-in starter templates generated by `hpc-compose new`
- repository example files copied directly from `examples/`
Before launching anything, run the safe authoring path first:
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
If you are reading from a source checkout, you can run the same static checks directly against examples/minimal-batch.yaml.
Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the settings/builtin cache default. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
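The `${CACHE_DIR:-...}` form in those repository examples borrows the POSIX shell default-value syntax. Assuming the spec interpolation follows the same rule (an assumption; the Spec Reference is authoritative), the shell equivalent shows which value wins. `resolve_cache_dir` is a hypothetical name used only for this illustration.

```sh
# Illustration of POSIX ${VAR:-default} semantics, not hpc-compose code.
# Prints CACHE_DIR when it is set and non-empty, otherwise the example default.
resolve_cache_dir() {
  printf '%s\n' "${CACHE_DIR:-/cluster/shared/hpc-compose-cache}"
}

# Usage:
#   CACHE_DIR=/scratch/project/cache
#   resolve_cache_dir
```

Note that an empty `CACHE_DIR` counts as unset under this rule, so the default path is used in that case as well.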
Start Here: The Four Promoted Examples
These four examples are the recommended entry points.
minimal-batch.yaml
- Demonstrates: one service, no dependencies, no image prepare step
- Expected prerequisites: any machine for `plan`; a Linux Slurm login node plus the selected runtime backend for `up`
- Cluster run, Linux Slurm login node only: `hpc-compose up -f examples/minimal-batch.yaml`
- Success signal: the batch log prints `Hello from Slurm!`
app-redis-worker.yaml
- Demonstrates: multi-service startup ordering plus TCP readiness inside one allocation
- Expected prerequisites: a normal Slurm + Enroot submission host and shared `CACHE_DIR`
- Cluster run, Linux Slurm login node only: `hpc-compose up -f examples/app-redis-worker.yaml`
- Success signal: `worker.log` shows a successful Redis `PING` followed by repeated `INCR jobs` calls
llm-curl-workflow-workdir.yaml
- Demonstrates: one GPU-backed LLM service plus one client service in the same job
- Expected prerequisites: a GGUF model at `$HOME/models/model.gguf`, a GPU-capable Slurm target, and shared `CACHE_DIR`
- Cluster run, Linux Slurm login node only: `hpc-compose up -f examples/llm-curl-workflow-workdir.yaml`
- Success signal: `curl_client.log` contains a JSON response from `/v1/chat/completions`
training-resume.yaml
- Demonstrates: checkpoint export, resume-aware reruns, and attempt-aware training state
- Expected prerequisites: shared storage for `x-slurm.resume.path` plus shared `CACHE_DIR`
- Cluster run, Linux Slurm login node only: `hpc-compose up -f examples/training-resume.yaml`
- Success signal: `results/<job-id>/` contains exported checkpoints and later attempts resume from the previously saved epoch
Beginner Ladder
Use this ordering when you are new to the project:
For a guided version of the first five concepts, run hpc-compose evolve --output compose.yaml. The progressive-complexity lesson walks through minimal, second-service, readiness, failure-policy, and multi-node-placement as one evolving valid spec.
| Stage | Start here | Why |
|---|---|---|
| Authoring only | minimal-batch.yaml with plan and plan --show-script | Confirms the tool understands a spec without touching Slurm. |
| First cluster run | minimal-batch.yaml on a Linux Slurm login node | Smallest real submission and log-check path. |
| Single-node multi-service | app-redis-worker.yaml | Shows depends_on plus TCP readiness. |
| GPU or LLM serving | llm-curl-workflow-workdir.yaml, llama-app.yaml, or vllm-openai.yaml | Adds accelerator resources and service/client coordination. |
| Durable training | training-checkpoints.yaml or training-resume.yaml | Adds artifacts, checkpoints, and resume semantics. |
| Distributed launch | multi-node-mpi.yaml, multi-node-torchrun.yaml, or framework-specific examples below | Adds allocation-wide or explicitly placed multi-node services. |
Built-In Starter Templates
Use built-in templates when you want hpc-compose to write a fresh compose.yaml with your application name filled in for you.
hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml
If the workflow you want is not listed by --list-templates, copy the closest repository example directly from examples/.
Broader Example Matrix
The matrix below covers the broader set of runnable examples beyond the four promoted starts. “Built-in template” means hpc-compose new --template <name> can scaffold it; “repository file” means copy the YAML from examples/ directly.
| Example | Availability | What it demonstrates | When to start from it |
|---|---|---|---|
| dev-python-app.yaml | Built-in template | Mounted source code plus x-runtime.prepare.commands for dependencies | You want an iterative development workflow |
| dev-python-smoke.yaml | Repository file | Finite smoke-test variant of the source-mounted Python app | You want to test a development spec without a long-running process |
| llm-curl-workflow.yaml | Built-in template | Repo-local variant of the smallest concrete inference workflow | You want the same LLM stack but with models under the repository tree |
| llama-app.yaml | Built-in template | GPU-backed service, mounted model files, dependent app service | You need accelerator resources or a model-serving pattern |
| llama-uv-worker.yaml | Built-in template | llama.cpp serving plus a source-mounted Python worker executed through uv | You want the GGUF server plus mounted worker pattern |
| multi-node-mpi.yaml | Built-in template | First-class MPI launch, generated MPI hostfile, and one primary-node helper | You want a minimal multi-node MPI pattern without extra orchestration |
| mpi-pmix-v4-host-mpi.yaml | Built-in template | Versioned PMIx launch plus host MPI bind/env configuration | Your site requires a host MPI stack inside containers |
| multi-node-partitioned.yaml | Repository file | Disjoint node ranges, fractional node selection, and explicit co-location | You want multiple distributed roles inside one allocation |
| multi-node-torchrun.yaml | Built-in template | Allocation-wide torchrun launch using the primary node as rendezvous | You want a multi-node GPU training starting point |
| multi-node-deepspeed.yaml | Built-in template | DeepSpeed no-SSH launch using generated rendezvous and hostfile env | You want distributed fine-tuning without hand-written rendezvous setup |
| multi-node-accelerate.yaml | Built-in template | Hugging Face Accelerate multi-machine launch | You want an Accelerate-based training or fine-tuning starting point |
| multi-node-horovod.yaml | Built-in template | Horovod rank-per-GPU launch through Slurm MPI | You want Horovod without SSH fanout |
| multi-node-jax.yaml | Built-in template | JAX coordinator/process metadata for jax.distributed.initialize | You want a JAX distributed starting point |
| nccl-tests.yaml | Built-in template | MPI-backed NCCL all-reduce probe | You are debugging GPU fabric, NCCL, UCX, or OFI settings |
| ray-symmetric.yaml | Built-in template | Ray symmetric-run across one Slurm allocation | You want a modern Ray-on-Slurm starting point without an autoscaler |
| ray-head-workers.yaml | Built-in template | Ray head plus worker steps inside one allocation | You need explicit Ray head/worker control for an older or site-specific setup |
| dask-scheduler-workers.yaml | Built-in template | Dask scheduler on the primary node plus allocation workers | You want Dask CLI deployment inside one Slurm allocation |
| spark-standalone.yaml | Built-in template | Spark standalone master, workers, and app submission | You need a conservative Spark standalone pattern without external cluster management |
| flux-nested.yaml | Built-in template | Nested Flux launched through srun flux start | You want Flux scheduling inside an existing Slurm allocation |
| nextflow-bridge.yaml | Built-in template | Nextflow command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a workflow engine run without parsing Nextflow files |
| snakemake-bridge.yaml | Built-in template | Snakemake command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a Snakemake run without replacing Snakemake scheduling semantics |
| postgres-etl.yaml | Built-in template | PostgreSQL plus a Python data processing job | You need a database-backed batch pipeline |
| restart-policy.yaml | Built-in template | Per-service restart_on_failure with bounded retries and a rolling-window crash-loop guard | You need transient-failure retries without letting one service spin forever |
| training-checkpoints.yaml | Built-in template | GPU training with checkpoints exported to shared storage | You need durable checkpoint outputs but not automatic resume semantics |
| training-sweep.yaml | Repository file | Embedded sweep parameters with interpolation defaults for dry-run and normal render workflows | You want a small hyperparameter sweep starting point |
| vllm-openai.yaml | Built-in template | vLLM serving with an in-job Python client | You want vLLM-based inference instead of llama.cpp |
| vllm-uv-worker.yaml | Built-in template | vLLM serving plus a source-mounted Python worker executed through uv | You want a common LLM stack with mounted app code |
| mpi-hello.yaml | Built-in template | MPI hello world using service-level x-slurm.mpi | You need a small first-class MPI workload |
| multi-stage-pipeline.yaml | Built-in template | Two-stage pipeline coordinating through the shared job mount | You need file-based stage-to-stage handoff |
| pipeline-dag.yaml | Built-in template | One-shot preprocess -> train -> postprocess DAG using successful-completion dependencies | You need stage completion, not service readiness, to gate downstream work |
| fairseq-preprocess.yaml | Built-in template | CPU-heavy NLP data preprocessing with parallel workers | You need a CPU-bound data preprocessing pipeline |
| canary-right-size.yaml | Repository file | A deliberately over-requested training probe for hpc-compose germinate | You want to practice right-sizing recommendations before changing a real spec |
| rendezvous-model-server.yaml | Repository file | A provider job that registers a model-server endpoint in the shared cache | You want one Slurm allocation to publish a service for later jobs |
| rendezvous-client.yaml | Repository file | A separate client job that resolves HPC_COMPOSE_RDZV_MODEL_SERVER_URL | You want cross-job service discovery through shared storage |
Which Example Should I Start From?
- Start with `minimal-batch.yaml` if you are new to `hpc-compose` and want the smallest possible file.
- Start with `app-redis-worker.yaml` if your workload depends on multi-service startup ordering.
- Start with `llm-curl-workflow-workdir.yaml` if you want the smallest real-cluster inference workflow.
- Start with `training-resume.yaml` if you need resume-aware checkpoints on shared storage.
- Start with `multi-node-mpi.yaml` if you need one distributed step plus helper services on the primary node.
- Start with `multi-node-partitioned.yaml` if services need explicit node ranges or `share_with` co-location.
- Start with `multi-node-torchrun.yaml`, `multi-node-deepspeed.yaml`, `multi-node-accelerate.yaml`, or `multi-node-jax.yaml` if you need a launcher-style rendezvous pattern across multiple nodes.
- Start with `nccl-tests.yaml` when you need to debug NCCL/IB fabric before running a real training job.
- Start with `ray-symmetric.yaml`, `dask-scheduler-workers.yaml`, `spark-standalone.yaml`, or `flux-nested.yaml` if your distributed framework already fits inside one Slurm allocation.
- Start with `nextflow-bridge.yaml` or `snakemake-bridge.yaml` when you want hpc-compose submission, tracking, logs, and artifacts around a workflow-engine command. These bridge templates do not parse workflow files and do not replace the engines’ native cluster executors.
- Start with `dev-python-app.yaml` if you want a source-mounted development loop.
- Start with `dev-python-smoke.yaml` if you want a finite smoke-test companion for the source-mounted development example.
- Start with `training-sweep.yaml` when you want many independent trial allocations from one embedded `sweep` block.
- Start with `restart-policy.yaml` if you need a clear starting point for `restart_on_failure` tuning and `status`-visible retry budgets.
- Start with `canary-right-size.yaml` when your first question is whether a large GPU or memory request is justified.
- Start with `rendezvous-model-server.yaml` plus `rendezvous-client.yaml` when the provider and client should run as separate Slurm jobs.
Companion notes for the more involved examples live alongside the example assets:
- examples/llm-curl/README.md
- examples/llama-uv-worker/README.md
- examples/vllm-uv-worker/README.md
- examples/models/README.md
Development Workflow Recipe
examples/dev-python-app.yaml mounts examples/app/ and runs a long-lived Python process, so it is best for hot reload:
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach
examples/dev-python-smoke.yaml keeps the same mounted-source shape but uses a finite command, so it is suitable for smoke tests:
hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml
Adaptation Checklist
- Copy the closest repository example to your own compose.yaml, or run hpc-compose new --template <name> --name my-app --output compose.yaml when a matching built-in template exists.
- Configure a cache path visible from both the login node and compute nodes through hpc-compose setup --cache-dir, x-slurm.cache_dir, or [defaults.cache] / [profiles.<name>.cache].
- Override CACHE_DIR before running repository examples that use ${CACHE_DIR:-...}, or replace the default cache path in your copied file.
- Replace the example image, command, environment, and volumes with your workload.
- Keep active source in volumes and keep slower-changing dependency installation in x-runtime.prepare.commands.
- Add readiness to services that must be reachable before dependents continue.
- Adjust top-level or per-service x-slurm settings for your cluster.
- Run hpc-compose plan -f compose.yaml before the first run, and hpc-compose debug -f compose.yaml --preflight if that run fails.
- Run cluster up only from a supported Linux Slurm submission host with the selected runtime backend available.
Spec Metamorphosis
hpc-compose evolve is an interactive authoring tutorial. It starts from a minimal valid spec and progressively rewrites the same output file through increasingly realistic HPC workflow features.
The command is safe to run on a laptop or login node:
- it validates and plans candidate specs,
- it writes only the selected compose file,
- it does not prepare images,
- it does not call sbatch,
- it does not run preflight.
Canonical Lesson
V1 ships one lesson:
hpc-compose evolve --describe-lesson progressive-complexity
The progressive-complexity path contains five valid snapshots:
| Step id | What it teaches | Safe follow-up |
|---|---|---|
| minimal | One service and one single-node Slurm allocation | hpc-compose plan -f compose.yaml |
| second-service | A dependent service and startup ordering | hpc-compose plan -f compose.yaml |
| readiness | readiness plus depends_on.condition: service_healthy | hpc-compose plan --show-script -f compose.yaml |
| failure-policy | restart_on_failure with bounded retries and a rolling crash-loop window | hpc-compose inspect -f compose.yaml |
| multi-node-placement | A two-node allocation with explicit non-overlapping service placement | hpc-compose plan -f compose.yaml |
The final step can validate anywhere, but running it requires a Slurm target that can grant a two-node allocation and a runtime backend available on that cluster.
Interactive Flow
Start the tutorial:
hpc-compose evolve --output compose.yaml
At each step, the command prints:
- a short explanation,
- the concepts being introduced,
- a compact diff from the last accepted spec,
- and the validation summary for the candidate.
Controls:
- Enter, y, or a accepts the step and writes compose.yaml.
- s skips the current step.
- q quits after the last accepted valid spec.
- ? prints prompt help.
Transcript Example
$ hpc-compose evolve --output compose.yaml
Step 1/5: Minimal batch spec
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml
Step 2/5: Add a dependent service
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml
Step 3/5: Gate on readiness
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml
Inspect the accepted readiness-gated spec:
hpc-compose plan -f compose.yaml
Then continue the tutorial to failure policies and multi-node placement:
Accept this step? [Y/a/s/q/?]
For automation or docs examples, accept through a specific step noninteractively:
hpc-compose evolve --yes --until readiness --format json --output compose.yaml
Non-Goals
- V1 does not mutate arbitrary existing specs.
- V1 is not a full-screen TUI.
- V1 does not submit jobs.
For a fresh single-template scaffold, use hpc-compose new. For choosing among the broader runnable examples, use Examples.
Task Guide
Use this page when you know what you want to do, but not yet which command or example should be your starting point.
First run
- Read Quickstart.
- Run hpc-compose evolve --output compose.yaml if you want a guided progression from minimal through multi-node-placement.
- Run hpc-compose new --list-templates if you want to inspect the built-in starter templates before choosing one.
- Start from minimal-batch with hpc-compose new --template minimal-batch --name my-app --output compose.yaml.
- Before running on a cluster, configure a shared cache with hpc-compose setup --cache-dir '<shared-cache-dir>' or explicit x-slurm.cache_dir. If you copy a repository example that uses CACHE_DIR, override it for your cluster before running.
- Run hpc-compose plan -f compose.yaml before the first real run. Add --show-script when you want to inspect the generated launcher without writing a file.
- Run hpc-compose up -f compose.yaml only from a supported Linux Slurm submission host.
Remember directory/data/env settings once
- Run hpc-compose setup to create or update the project-local settings file (.hpc-compose/settings.toml).
- Use hpc-compose --profile dev up so compose path, env files, env vars, and binary paths come from the selected profile.
- Run hpc-compose context --format json to inspect resolved paths plus value sources. Interpolation variables are scoped to names referenced by the compose file and sensitive-looking values are redacted unless you add --show-values.
- Use --settings-file <PATH> when you need an explicit settings file instead of upward discovery.
Migrate from Docker Compose
- Read Docker Compose Migration.
- Replace build: with image: plus x-runtime.prepare.commands.
- Replace service-name networking with 127.0.0.1 or explicit allocation metadata where appropriate.
Single-node multi-service app
- Start from app-redis-worker.yaml.
- Add depends_on and readiness only where ordering really matters.
- Use Execution Model to confirm which services can rely on localhost.
Multi-node distributed training
- Start from multi-node-torchrun.yaml, multi-node-deepspeed.yaml, multi-node-accelerate.yaml, multi-node-jax.yaml, or multi-node-mpi.yaml.
- Use multi-node-horovod.yaml or nccl-tests.yaml when rank-per-GPU MPI launch is the right shape.
- Start from multi-node-partitioned.yaml when independent distributed roles need disjoint node ranges or explicit co-location.
- Start from ray-symmetric.yaml, dask-scheduler-workers.yaml, spark-standalone.yaml, or flux-nested.yaml when the framework can run inside one Slurm allocation without an autoscaler.
- Use generated distributed metadata such as HPC_COMPOSE_DIST_RDZV_ENDPOINT, HPC_COMPOSE_DIST_NODE_RANK, and HPC_COMPOSE_DIST_NPROC_PER_NODE instead of Docker-style service discovery.
- Put cluster-specific NCCL/UCX/OFI fabric variables in .hpc-compose/cluster.toml under [distributed.env] so specs stay portable.
Checkpoint and resume workflows
- Start from training-checkpoints.yaml when you only need artifact output.
- Start from training-resume.yaml when the run should resume from shared storage across retries or later submissions.
- Keep the canonical resume source in x-slurm.resume.path, not in exported artifact bundles.
LLM serving workflows
- Start from llm-curl-workflow.yaml, llm-curl-workflow-workdir.yaml, llama-uv-worker.yaml, or vllm-uv-worker.yaml.
- Use volumes for model directories and fast-changing code.
- Use x-runtime.prepare.commands for slower-changing dependencies.
Debug cluster readiness
- Run hpc-compose validate -f compose.yaml.
- Run hpc-compose validate -f compose.yaml --strict-env when default interpolation fallbacks should be treated as failures.
- Run hpc-compose plan --verbose -f compose.yaml.
- Run hpc-compose preflight -f compose.yaml.
- Run hpc-compose debug -f compose.yaml --preflight after a failed tracked run.
- Read Troubleshooting.
Cache and artifact management
- Use hpc-compose cache list to inspect imported/prepared artifacts.
- Use hpc-compose cache inspect -f compose.yaml to see per-service reuse expectations.
- Use hpc-compose --profile dev cache prune --age 14 when you want age-based cleanup to follow the active context cache dir.
- Use hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>' when you want a direct cache cleanup that does not depend on compose resolution.
- Use hpc-compose artifacts -f compose.yaml after a run to export tracked payloads.
Find and clean tracked runs
- Use hpc-compose jobs list to scan the current repo tree for tracked runs.
- Use hpc-compose ps -f compose.yaml when you want a one-shot per-service runtime table.
- Use hpc-compose watch -f compose.yaml to reconnect to the live watch UI for the latest tracked job.
- Use hpc-compose jobs list --disk-usage when you need a quick size estimate before deleting old state.
- Use hpc-compose clean -f compose.yaml --dry-run --age 7 to preview what a cleanup would remove.
- Use hpc-compose clean -f compose.yaml --all --format json when automation needs a stable cleanup report for one compose context, including effective latest IDs plus stale-pointer diagnostics.
Automation and scripting with JSON output
- Prefer --format json for machine-readable output on non-streaming commands such as new, plan, validate, render, prepare, preflight, config, inspect, debug, status, ps, stats, score, artifacts, down, cancel, setup, cache, clean, and context. For up, --format json requires --detach or --dry-run.
- Include context --format json when automation needs resolved compose path, binaries, referenced interpolation vars, and runtime path roots.
- Use hpc-compose stats --format jsonl or --format csv when downstream tooling wants row-oriented metrics.
- Treat --json as a compatibility alias on older machine-readable commands; new automation should prefer --format json. Streaming commands such as logs --follow, watch, and completions keep their native text or script output.
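The JSON guidance above can be wrapped in a small helper for scripts. This is a minimal sketch, not part of hpc-compose: run_json is a hypothetical helper name, and it makes no assumption about the fields in the returned document. The demo uses a stand-in Python command so the sketch runs on a machine without hpc-compose installed.

```python
import json
import subprocess
import sys
from typing import Any, List


def run_json(argv: List[str]) -> Any:
    """Run a command and parse its stdout as a single JSON document.

    Raises subprocess.CalledProcessError on a non-zero exit and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Stand-in command; swap in an hpc-compose invocation on a real system.
    demo = [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"]
    print(run_json(demo))
```

On a machine with hpc-compose installed, the same helper could be called as run_json(["hpc-compose", "context", "--format", "json"]). Inspect the returned structure before depending on specific keys.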
Migrating from Docker Compose
This guide helps you convert an existing docker-compose.yaml into an hpc-compose spec for Slurm clusters using Pyxis/Enroot, Apptainer, Singularity, or host runtimes.
At a glance
| Docker Compose feature | hpc-compose equivalent |
|---|---|
| image | image (same syntax, auto-prefixed with docker://) |
| command | command (string or list, same syntax) |
| entrypoint | entrypoint (string or list, same syntax) |
| environment | environment (map or list, same syntax) |
| volumes | volumes (host:container bind mounts, same syntax) |
| depends_on | depends_on (list or map with condition: service_started / service_healthy) |
| working_dir | working_dir (requires explicit command or entrypoint) |
| build | Not supported. Use image + x-runtime.prepare.commands instead. |
| ports | Not supported. Use host networking semantics instead. 127.0.0.1 works only when both sides run on the same node. |
| networks / network_mode | Not supported. There is no Docker-style overlay network or service-name DNS layer. |
| restart | Not supported as a Compose key. Use services.<name>.x-slurm.failure_policy. |
| deploy | Not supported. Use x-slurm for resource allocation. |
| healthcheck | Supported for a constrained TCP/HTTP subset and normalized into readiness; use explicit readiness for anything more complex. |
| Resource limits (cpus, mem_limit) | Use x-slurm.cpus_per_task, x-slurm.mem, x-slurm.gpus |
Side-by-side: web app + Redis
Docker Compose
version: "3.9"
services:
  redis:
    image: redis:7
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
  app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      redis:
        condition: service_healthy
    environment:
      REDIS_HOST: redis
    volumes:
      - ./app:/workspace
    working_dir: /workspace
    command: python -m main
hpc-compose
version: "1"
name: my-app
x-slurm:
  job_name: my-app
  time: "01:00:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: /cluster/shared/hpc-compose-cache
services:
  redis:
    image: redis:7
    command: redis-server --save "" --appendonly no
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 6379
      timeout_seconds: 30
  app:
    image: python:3.11-slim
    depends_on:
      redis:
        condition: service_healthy
    environment:
      REDIS_HOST: 127.0.0.1
    volumes:
      - ./app:/workspace
    working_dir: /workspace
    command: python -m main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir redis fastapi uvicorn
Key changes
- version: "3.9" → version: "1" or remove the field. hpc-compose uses this as its own spec schema version, not a Docker Compose compatibility version.
- build: . → image: python:3.11-slim + x-runtime.prepare.commands for dependencies.
- ports → Removed. Services communicate via 127.0.0.1 because they run on the same node.
- REDIS_HOST: redis → REDIS_HOST: 127.0.0.1. No DNS service names; use localhost.
- healthcheck → readiness with type: tcp.
- Added x-slurm block for Slurm resource allocation (time, memory, CPUs).
- Configured a shared cache for image storage, either through x-slurm.cache_dir as shown or project settings.
Key differences
Networking
Docker Compose creates isolated networks where services find each other by name. In hpc-compose, helper services on the same node share the host network directly, and multi-node distributed steps must use explicit rendezvous addresses. Replace service hostnames with 127.0.0.1 only when both sides intentionally stay on one node. For multi-node runs, derive the rendezvous host from /hpc-compose/job/allocation/primary_node or HPC_COMPOSE_PRIMARY_NODE.
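A service process can look up the rendezvous host from the two sources named above. The following is a minimal sketch under stated assumptions: rendezvous_host is a hypothetical helper, the file is assumed to hold a single hostname, and preferring the environment variable over the file is an illustrative choice rather than documented behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Path named in the text above; its exact content format is an assumption.
PRIMARY_NODE_FILE = "/hpc-compose/job/allocation/primary_node"


def rendezvous_host(
    env: Mapping[str, str] = os.environ,
    primary_node_file: str = PRIMARY_NODE_FILE,
) -> Optional[str]:
    """Return the primary node hostname, or None when neither source exists."""
    host = env.get("HPC_COMPOSE_PRIMARY_NODE", "").strip()
    if host:
        return host
    path = Path(primary_node_file)
    if path.is_file():
        # Assumed format: one hostname, optionally followed by a newline.
        return path.read_text().strip() or None
    return None


if __name__ == "__main__":
    print(rendezvous_host() or "no primary node metadata found")
```

Returning None instead of falling back to 127.0.0.1 keeps a multi-node service from silently connecting to the wrong node.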
Building images
Docker Compose uses build: to run a Dockerfile. hpc-compose uses x-runtime.prepare.commands instead:
# Docker Compose
app:
  build:
    context: .
    dockerfile: Dockerfile

# hpc-compose
app:
  image: python:3.11-slim
  x-runtime:
    prepare:
      commands:
        - pip install --no-cache-dir -r /tmp/requirements.txt
      mounts:
        - ./requirements.txt:/tmp/requirements.txt
Prefer volumes for fast-changing source code and x-runtime.prepare.commands for slower-changing dependencies. x-enroot.prepare remains accepted as a Pyxis/Enroot compatibility spelling, but new specs should use x-runtime.prepare.
Health checks vs readiness
Docker Compose uses healthcheck with a test command, interval, timeout, and retries. hpc-compose accepts a constrained healthcheck subset and normalizes it into readiness. Explicit readiness blocks look like this:
# TCP: wait for a port to accept connections
readiness:
  type: tcp
  host: 127.0.0.1
  port: 6379
  timeout_seconds: 30

# Log: wait for a pattern in service output
readiness:
  type: log
  pattern: "Server started"
  timeout_seconds: 60

# Sleep: fixed delay
readiness:
  type: sleep
  seconds: 5
Supported healthcheck migration patterns:
- ["CMD", "nc", "-z", HOST, PORT]
- ["CMD-SHELL", "nc -z HOST PORT"]
- recognized curl probes against http:// or https:// URLs
- recognized wget --spider probes against http:// or https:// URLs
Still unsupported in v1:
- arbitrary custom command probes
- interval
- retries
- start_period
Resource allocation
Docker Compose uses deploy.resources or top-level cpus/mem_limit. hpc-compose uses Slurm-native resource settings:
x-slurm:
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
services:
  app:
    x-slurm:
      cpus_per_task: 4
      gpus: 1
Restart policies
Docker Compose supports restart: always, on-failure, etc. hpc-compose does not accept the Compose restart: key, but it does support per-service restart behavior through services.<name>.x-slurm.failure_policy.
services:
  app:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3
restart_on_failure retries only on non-zero exits. It enforces both a lifetime restart cap and a rolling-window crash-loop cap during one live batch-script execution. If you omit the rolling-window fields, hpc-compose defaults to window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>. Use mode: fail_job (default) for fail-fast behavior, or mode: ignore for non-critical sidecars.
Practical mapping:
- Compose restart: "no" -> omit failure_policy or use mode: fail_job
- Compose restart: on-failure[:N] -> use mode: restart_on_failure with max_restarts: N when you want a similar lifetime retry budget
- Compose restart: always / unless-stopped -> no direct equivalent; hpc-compose intentionally keeps restart handling bounded within one batch job
The rolling-window fields have no direct Docker Compose equivalent. They exist to stop fast crash loops inside one Slurm allocation without giving up a larger lifetime retry budget for transient failures.
What to do about unsupported features
| Feature | Alternative |
|---|---|
| build | Use image + x-runtime.prepare.commands. Mount build context files with x-runtime.prepare.mounts if needed. |
| ports | Not needed. Services share 127.0.0.1 on one node. |
| networks / network_mode | Not needed. All services are on the same host network. |
| restart | Use services.<name>.x-slurm.failure_policy (fail_job, ignore, restart_on_failure). |
| deploy | Use x-slurm for resources. |
| Service DNS names | Use 127.0.0.1 for same-node helpers, or explicit host metadata such as HPC_COMPOSE_PRIMARY_NODE for distributed runs. |
| Named volumes | Use host-path bind mounts in volumes. |
| .env file | Supported. .env in the compose file directory is loaded automatically. |
Migration checklist
- Replace Compose version: with version: "1", or omit the field. Values like "3.9" are rejected by hpc-compose.
- Remove build: and replace it with image: pointing to a base image. Move dependency installation to x-runtime.prepare.commands.
- Remove ports: and use host-network semantics instead of container port publishing.
- Remove networks: / network_mode:. There is no Docker-style overlay network or service-name DNS layer.
- Remove Compose restart: and use services.<name>.x-slurm.failure_policy when you need per-service restart behavior.
- Remove deploy: and use x-slurm for resource allocation.
- Replace service hostnames: change any service-name references (e.g. redis, postgres) to 127.0.0.1 for same-node helpers, or to explicit allocation metadata for distributed runs.
- Replace healthcheck: by converting it to readiness: with type: tcp, type: log, or type: sleep.
- Add x-slurm: and set time, mem, cpus_per_task, and optionally gpus, partition, account.
- Set cache storage: point x-slurm.cache_dir or setup --cache-dir to shared storage visible from login and compute nodes.
- Validate: run hpc-compose validate -f compose.yaml to check the converted spec.
- Inspect: run hpc-compose inspect --verbose -f compose.yaml to confirm the planner understood your intent.
Slurm And Container Basics
This page is for users who know shell scripts, Python jobs, or Docker images, but are new to Slurm and HPC container runtimes.
It is not a Slurm administration guide. The goal is to explain the vocabulary you will see in generated hpc-compose scripts and in cluster error messages.
The Short Mental Model
compose.yaml
-> hpc-compose plan/render/up
-> generated sbatch script
-> sbatch creates one Slurm allocation
-> srun launches one or more service steps
-> Pyxis/Enroot, Apptainer, Singularity, or host software starts the process
The important point is that hpc-compose does not replace Slurm. It writes one inspectable Slurm batch script and uses Slurm to run the planned services inside one allocation.
Slurm Terms In Plain Language
| Term | Meaning for hpc-compose users |
|---|---|
| Login node | The machine where you edit files, run plan, run preflight, and submit jobs. Do not run long compute work here. |
| Compute node | A worker machine where Slurm runs your job after it starts. |
| Partition | A named queue or resource pool. Sites often use partitions to separate CPU, GPU, debug, and large jobs. |
| Job | A submitted unit of work managed by Slurm. hpc-compose up submits one job. |
| Allocation | The nodes, CPUs, memory, GPUs, and wall time reserved for a job. |
| Batch script | A shell script submitted with sbatch. It contains #SBATCH directives and normal shell commands. |
| Job step | A launched process group inside the allocation. hpc-compose launches services as srun steps. |
| Task | Usually one process or rank. More ntasks means more processes, not more CPU threads per process. |
| cpus_per_task | CPU threads requested for each task. This is common for threaded Python, OpenMP, or data-loader-heavy jobs. |
| gres | Slurm’s generic resource request field, commonly used for GPUs. |
If you only remember one distinction: sbatch gets the allocation; srun starts work inside it.
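The task versus cpus_per_task distinction shows up in the environment of each launched process. The sketch below reads the standard Slurm variables SLURM_PROCID, SLURM_NTASKS, and SLURM_CPUS_PER_TASK. Whether these variables are visible inside a given container runtime is an assumption to verify on your cluster, and task_layout is a hypothetical helper name.

```python
import os
from typing import Dict, Mapping


def _positive_int(raw: str, default: int) -> int:
    """Parse a positive integer, falling back to a default on bad input."""
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


def task_layout(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Describe this process: which task it is and how many threads it may use."""
    rank_raw = env.get("SLURM_PROCID", "")
    return {
        # Index of this task within the job step (0-based).
        "rank": int(rank_raw) if rank_raw.isdigit() else 0,
        # Number of tasks (processes) in the step: controlled by ntasks.
        "ntasks": _positive_int(env.get("SLURM_NTASKS", ""), 1),
        # CPU threads available to each task: controlled by cpus_per_task.
        "threads": _positive_int(env.get("SLURM_CPUS_PER_TASK", ""), 1),
    }


if __name__ == "__main__":
    layout = task_layout()
    print(f"task {layout['rank']} of {layout['ntasks']}, "
          f"{layout['threads']} thread(s) per task")
```

A threaded application would size its worker pool from the threads value, while a multi-process launcher would use rank and ntasks.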
A Minimal sbatch Script
A traditional Slurm script often looks like this:
#!/usr/bin/env bash
#SBATCH --job-name=hello-slurm
#SBATCH --partition=<partition>
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
set -euo pipefail
hostname
python -c 'print("hello from a Slurm job")'
Submit it from a Slurm login node:
sbatch hello.sbatch
sbatch returns a job id. The job may wait in the queue before it starts, and Slurm normally writes batch output to a file such as slurm-<job-id>.out unless the script or site policy sets another output path.
Where hpc-compose Fits
The equivalent hpc-compose starting point is a spec:
name: hello-slurm
x-slurm:
  job_name: hello-slurm
  partition: <partition>
  time: "00:10:00"
  cpus_per_task: 2
  mem: 4G
services:
  app:
    image: python:3.11-slim
    command: python -c "import socket; print('hello from', socket.gethostname())"
Preview the generated Slurm script before submitting:
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
Run it on a supported Slurm login node:
hpc-compose up -f compose.yaml
up runs preflight checks, prepares missing runtime artifacts, renders the batch script, calls sbatch, records tracked job metadata, and follows scheduler/log output.
How YAML Maps To Slurm
| In the spec | In Slurm | Why it matters |
|---|---|---|
| Top-level x-slurm.partition | #SBATCH --partition | Selects the site queue/resource pool. |
| Top-level x-slurm.time | #SBATCH --time | Sets the allocation wall-time limit. |
| Top-level x-slurm.nodes | #SBATCH --nodes | Reserves the allocation node count. |
| Top-level x-slurm.ntasks | #SBATCH --ntasks | Sets the default process/rank count for the allocation. |
| Top-level x-slurm.cpus_per_task | #SBATCH --cpus-per-task | Requests CPU threads per task. |
| Top-level x-slurm.mem | #SBATCH --mem | Requests memory for scheduling and enforcement. It is not disk space. |
| Top-level x-slurm.gres | #SBATCH --gres | Requests generic resources such as GPUs. |
| Service x-slurm.ntasks | srun --ntasks | Sets the process/rank count for that service step. |
| Service x-slurm.extra_srun_args | Raw srun arguments | Escape hatch for site-specific launch options. |
Prefer first-class fields from Spec Reference when they exist. Use raw submit_args or extra_srun_args only for site-specific options that hpc-compose does not model directly.
sbatch vs srun vs hpc-compose up
| Command | What it does |
|---|---|
| sbatch job.sbatch | Submits a batch script and creates a Slurm job when scheduled. |
| srun ... | Launches a job step. Inside an sbatch allocation, this starts work on allocated resources. |
| hpc-compose render -f compose.yaml --output job.sbatch | Writes the generated batch script without submitting it. |
| hpc-compose up -f compose.yaml | Runs the normal end-to-end flow and submits through sbatch. |
| hpc-compose status, ps, logs, watch | Reconnects to tracked jobs after submission. |
When debugging, inspect the generated script:
hpc-compose plan --show-script -f compose.yaml
If a job was submitted but failed before service logs appeared, inspect Slurm state and batch output through:
hpc-compose debug -f compose.yaml
Pyxis And Enroot Basics
Slurm itself is the scheduler. Container support depends on what the cluster installed.
For the default runtime.backend: pyxis path:
- Pyxis is the Slurm plugin that adds --container-* flags to srun.
- Enroot is the unprivileged container image/runtime layer used under Pyxis.
- An imported image is commonly represented as a cacheable SquashFS artifact such as .sqsh.
- hpc-compose maps service image, command, environment, working directory, and volumes into the generated srun --container-* launch.
Check Pyxis support on the target login node:
srun --help | grep container-image
hpc-compose preflight -f compose.yaml
If srun does not advertise --container-image, choose another backend or ask the site how Pyxis is enabled. Enroot being installed is not the same thing as Slurm supporting Pyxis flags.
Other supported runtime paths are covered in Runtime Backends.
Why Shared Storage Matters
hpc-compose prepare can run before the Slurm job starts, but services run later on compute nodes. That means the resolved runtime cache must be visible from both places. You can set it in project settings:
[profiles.dev.cache]
dir = "/cluster/shared/hpc-compose-cache"
Or directly in a spec:
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
Use a project, work, scratch, or workspace path that your site documents as shared. Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm for the resolved cache directory.
The same rule applies to host paths mounted through volumes: the compute node must be able to read the path when the service starts.
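A quick lexical check can catch the node-local prefixes listed above before a path ends up in a spec. This is an illustrative sketch, not the validation hpc-compose itself performs. The helper name is_node_local is hypothetical, it does not resolve symlinks, and passing the check does not prove that a path is actually shared across nodes.

```python
from pathlib import PurePosixPath

# Prefixes the text above says to avoid for the resolved cache directory.
NODE_LOCAL_ROOTS = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def is_node_local(path: str) -> bool:
    """Return True when path equals or sits under a known node-local root."""
    candidate = PurePosixPath(path)
    for root in NODE_LOCAL_ROOTS:
        root_path = PurePosixPath(root)
        # Compare whole path components so "/tmpfoo" is not mistaken for "/tmp".
        if candidate == root_path or root_path in candidate.parents:
            return True
    return False


if __name__ == "__main__":
    for example in ("/tmp/hpc-compose-cache", "/cluster/shared/hpc-compose-cache"):
        verdict = "node-local, avoid" if is_node_local(example) else "not a known node-local prefix"
        print(f"{example}: {verdict}")
```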
Small Checks That Explain A Lot
These commands are useful in tiny smoke tests:
hostname
env | grep '^SLURM_' | sort
python -c 'import socket; print(socket.gethostname())'
cat /etc/os-release
Inside a container, cat /etc/os-release should describe the container image. Outside the container, it describes the host. That simple distinction helps diagnose whether a command is running where you expect.
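The same checks can be bundled into one Python script that a smoke-test service runs. This is a minimal sketch: the helper names are hypothetical, and it assumes /etc/os-release carries the conventional PRETTY_NAME field.

```python
import os
import socket
from typing import Dict, Mapping


def slurm_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return only the SLURM_* variables, sorted by name."""
    return {key: env[key] for key in sorted(env) if key.startswith("SLURM_")}


def os_pretty_name(path: str = "/etc/os-release") -> str:
    """Read PRETTY_NAME from an os-release file, or 'unknown' if unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("PRETTY_NAME="):
                    return line.split("=", 1)[1].strip().strip('"')
    except OSError:
        pass
    return "unknown"


if __name__ == "__main__":
    print("host:", socket.gethostname())
    print("os:", os_pretty_name())
    for key, value in slurm_env().items():
        print(f"{key}={value}")
```

Comparing the os line from a containerized service with the same script run directly on the host shows at a glance which environment a command ran in.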
Common Beginner Mistakes
| Symptom | Likely misunderstanding | Next step |
|---|---|---|
| plan looks fine but up fails immediately | Static validation is not the same as cluster readiness. | Run hpc-compose debug -f compose.yaml --preflight on the login node. |
| srun does not accept --container-image | Pyxis is not available or not loaded in Slurm. | Read Runtime Backends and use the site-supported backend. |
| Cache warnings mention local paths | The cache path is not shared between login and compute nodes. | Configure x-slurm.cache_dir or setup --cache-dir with shared storage. |
| A GPU job waits longer than expected | The request may be larger than available idle resources. | Check site queue policy and start with the smallest useful request. |
| More CPUs were requested but only one process appears | cpus_per_task adds threads per task; it does not create more tasks. | Use ntasks for more processes/ranks, and make the application use them. |
| Docker Compose ports or service DNS do not work | This is one Slurm allocation, not a Docker Compose network. | Use host networking and Slurm/hpc-compose allocation metadata instead. |
Further Reading
- Slurm Quick Start User Guide
- Slurm sbatch reference
- Slurm job launch design notes
- Slurm containers guide
- NVIDIA Pyxis
- NVIDIA Enroot
Execution model
This page explains the few runtime rules that matter most when a Compose mental model meets Slurm and HPC runtime backends.
What runs where
| Stage | Where it runs | What happens |
|---|---|---|
| plan, validate, inspect, preflight | login node or local shell | Parse the spec, resolve paths, preview the runtime plan, and check prerequisites |
| prepare | login node or local shell with the selected runtime backend | Import base images and build prepared runtime artifacts |
| up | login node or local shell with Slurm access | Run preflight, prepare missing artifacts, render the batch script, call sbatch, and watch by default |
| Batch script and services | compute-node allocation | Launch the planned services through srun and the selected runtime backend |
| status, ps, watch, stats, logs, artifacts | login node or local shell | Read tracked metadata and job outputs after submission |
The main consequence is simple: image preparation and validation happen before the job starts, but the containers themselves run later inside the Slurm allocation.
Service failure policies inside one job
hpc-compose does not provide a separate long-running orchestrator. Service failure handling happens inside the rendered batch script for the current allocation.
- mode: fail_job keeps fail-fast behavior and stops the job on the first non-zero service exit.
- mode: ignore records the failure but allows the rest of the job to continue.
- mode: restart_on_failure only reacts to non-zero process exits. It does not restart on successful exits, and it does not use cross-attempt or cross-requeue history.
For restart_on_failure, the batch script enforces two limits during one live execution:
- a lifetime cap through max_restarts
- a rolling-window cap through max_restarts_in_window within window_seconds
If a service omits the rolling-window fields, hpc-compose still enables crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
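The interaction between the two caps can be illustrated with a small decision function. This is a sketch of the rule as described above, not the code the rendered batch script runs. The function name, the return labels, and the exact boundary handling (a restart counts as inside the window while its age is strictly less than window_seconds) are assumptions.

```python
from typing import List, Optional


def may_restart(
    restart_times: List[float],
    now: float,
    max_restarts: int,
    window_seconds: float = 60.0,
    max_restarts_in_window: Optional[int] = None,
) -> str:
    """Decide what to do after a non-zero exit.

    restart_times holds the timestamps (seconds) of restarts already granted.
    Returns "restart", "lifetime_exhausted", or "window_exhausted".
    """
    # Documented default: the window cap falls back to the resolved max_restarts.
    if max_restarts_in_window is None:
        max_restarts_in_window = max_restarts
    # Lifetime cap: total restarts granted during this batch-script execution.
    if len(restart_times) >= max_restarts:
        return "lifetime_exhausted"
    # Rolling-window cap: only restarts newer than window_seconds count.
    recent = [t for t in restart_times if now - t < window_seconds]
    if len(recent) >= max_restarts_in_window:
        return "window_exhausted"
    return "restart"


if __name__ == "__main__":
    # Three fast crashes within 60 s trip the window cap before the lifetime cap.
    print(may_restart([0.0, 1.0, 2.0], now=3.0, max_restarts=5,
                      max_restarts_in_window=3))
```

With a larger lifetime budget than window budget, a service survives occasional transient failures spread over hours but stops quickly when it crash-loops.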
Use status to inspect the tracked policy state after submission. The text view reports:
state service 'worker': failure_policy=restart_on_failure restarts=1/5 window=1/3@60s last_exit=42
Use logs to inspect the corresponding restart messages from the batch script when you need to distinguish lifetime-cap exhaustion from rolling-window exhaustion.
Use per-service x-slurm.hooks when you want host-side notifications around those policy transitions. on: restart runs before a granted relaunch; on: window_exhausted runs when the rolling-window guard blocks another restart. These hooks are best-effort and do not change the service policy outcome.
Which paths must be shared
- The resolved cache directory must be visible from both the login node and the compute nodes. It may come from x-slurm.cache_dir, project settings, or the builtin $HOME/.cache/hpc-compose fallback.
- Relative host paths in volumes, local image paths, and x-runtime.prepare.mounts resolve against the compose file directory.
- Each submitted job writes tracked state under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID} on the host.
- That per-job directory is mounted into every container at /hpc-compose/job.
- Multi-node jobs also populate /hpc-compose/job/allocation/{primary_node,nodes.txt} and export allocation-wide HPC_COMPOSE_NODE... variables plus service-scoped HPC_COMPOSE_SERVICE_NODE... variables.
Use /hpc-compose/job for small shared state inside the allocation, such as ready files, request payloads, logs, metrics, or teardown signals.
Enroot runtime paths
The generated batch script sets three Enroot runtime paths scoped per job under the resolved cache directory:
| Variable | Value | Purpose |
|---|---|---|
| ENROOT_CACHE_PATH | $CACHE_ROOT/runtime/$SLURM_JOB_ID/cache | Enroot image cache for the current job |
| ENROOT_DATA_PATH | $CACHE_ROOT/runtime/$SLURM_JOB_ID/data | Enroot data directory for the current job |
| ENROOT_TEMP_PATH | $CACHE_ROOT/runtime/$SLURM_JOB_ID/tmp | Enroot temp directory for the current job |
These paths are created at batch startup and are available inside the batch script and to tooling that reads Enroot environment variables. They are not injected into service containers.
Warning
Do not put the resolved cache directory under /tmp, /var/tmp, /private/tmp, or /dev/shm. Those paths are not safe for login-node prepare plus compute-node reuse.
Networking inside the allocation
- Single-node services share the host network on one node.
- In a multi-node job, helper services stay on the allocation’s primary node by default.
- A distributed service may span the full allocation, or services may use x-slurm.placement to select explicit allocation node subsets.
- Partitioned services should use service-scoped metadata such as HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT, HPC_COMPOSE_SERVICE_NODELIST, and HPC_COMPOSE_SERVICE_NODELIST_FILE.
- ports, custom Docker networks, and service-name DNS are not part of the model.
- Use depends_on plus readiness when a dependent service must wait for real availability rather than process start.
- Use depends_on with condition: service_completed_successfully when a dependent service should wait for a one-shot stage to exit successfully.
Use 127.0.0.1 only when both sides are intentionally on the same node. For multi-node distributed or partitioned runs, derive rendezvous addresses from allocation or service metadata files and environment variables instead of relying on localhost.
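A minimal sketch of deriving a rendezvous host from that metadata follows. HPC_COMPOSE_SERVICE_PRIMARY_NODE and the allocation/primary_node file are named on this page; the helper name, the fallback order, and the assumption that the file holds a single hostname on its first line are illustrative.

```sh
#!/bin/sh
# Hypothetical helper: print the host other ranks should connect to.
# Prefers the service-scoped variable, then falls back to the
# allocation metadata file (assumed to contain one hostname).
rendezvous_host() {
  alloc_dir=${1:-/hpc-compose/job/allocation}
  if [ -n "${HPC_COMPOSE_SERVICE_PRIMARY_NODE:-}" ]; then
    printf '%s\n' "$HPC_COMPOSE_SERVICE_PRIMARY_NODE"
  elif [ -r "$alloc_dir/primary_node" ]; then
    # Take the first line and strip surrounding whitespace.
    head -n 1 "$alloc_dir/primary_node" | tr -d '[:space:]'
    echo
  else
    echo "no primary node metadata found" >&2
    return 1
  fi
}

# Example (MASTER_ADDR is a PyTorch convention, not an hpc-compose name):
#   export MASTER_ADDR="$(rendezvous_host)"
```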
If a service binds its TCP port before it is actually ready, prefer HTTP or log-based readiness over plain TCP readiness.
volumes vs x-runtime.prepare
| Mechanism | Use it for | When it is applied | Reuse behavior |
|---|---|---|---|
| volumes | fast-changing source code, model directories, input data, checkpoint paths | at runtime inside the allocation | reads live host content every normal run |
| x-runtime.prepare.commands | slower-changing dependencies, tools, and image customization | before submission on the login node | cached until the prepared artifact changes |
Recommended default:
- keep active source trees in volumes
- keep slower-changing dependency installation in x-runtime.prepare.commands
- use prepare.mounts only when the prepare step truly needs host files
Warning
If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.
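One way to spot such links on the host before submitting is to resolve each path and compare it with the mounted directory. The helper below is a hypothetical sketch that assumes a realpath utility (GNU coreutils or BusyBox) is available. Note that an absolute symlink can still break inside the container when the container path differs from the host path, even if this check passes.

```sh
#!/bin/sh
# Hypothetical helper: succeed only when PATH ($2) resolves to a location
# inside MOUNT_ROOT ($1). Returns 2 when either path cannot be resolved.
resolves_inside_mount() {
  mount_root=$(cd "$1" && pwd -P) || return 2
  resolved=$(realpath "$2") || return 2
  case "$resolved" in
    "$mount_root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: list symlinks under ./app whose targets leave the directory.
#   find ./app -type l | while read -r link; do
#     resolves_inside_mount ./app "$link" || echo "escapes mount: $link"
#   done
```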
Command vocabulary
- The normal run is hpc-compose up -f compose.yaml. See Quickstart for the full end-to-end description.
- The tracked follow-up tools are status for scheduler/log summaries, ps for a stable per-service snapshot, and watch when you want to reconnect to the live TUI later.
- The debugging flow is validate, inspect, preflight, and prepare run separately when you need more visibility.
Read Runtime Backends before changing runtime.backend, Runbook for the operational workflow, Examples for starting points, and Spec reference for exact field behavior.
Supported Slurm Model
This page makes the hpc-compose Slurm boundary explicit. It is a tool for compiling one Compose-like application into one Slurm allocation with one or more srun steps. Those steps can use Pyxis/Enroot, Apptainer, Singularity, or host runtime software. It is not a general frontend for the full Slurm command surface.
First-class support
These capabilities are modeled, validated, and intentionally supported by the planner, renderer, and tracked-job workflow.
| Area | Support |
|---|---|
| Allocation model | One Slurm allocation per application |
| Submission flow | new, plan, validate, config, inspect, preflight, prepare, render, up, when, alloc, run, debug |
| Tracked job workflow | status, ps, watch, stats, score, logs, down, cancel, artifacts, clean, cache inspection/pruning |
| Top-level Slurm fields | job_name, partition, account, qos, time, nodes, ntasks, ntasks_per_node, cpus_per_task, mem, gres, gpus, GPU/CPU binding fields, constraint, output, error, chdir |
| Service step fields | nodes, placement, ntasks, ntasks_per_node, cpus_per_task, gres, gpus, GPU/CPU binding fields, mpi |
| Multi-node model | Single-node jobs, full-allocation distributed steps, and explicit node-index partitioning within one allocation |
| Runtime orchestration | depends_on, readiness checks, one-shot completion dependencies, service failure policies, primary-node helper placement, explicit co-location through placement.share_with |
| Service hooks | Per-service prologue and epilogue lifecycle hooks, plus host-side restart and window_exhausted event hooks |
| Runtime workflow | Pyxis/Enroot .sqsh, Apptainer/Singularity .sif, host runtime commands, x-runtime.prepare, shared cache handling |
| Scratch and staging | x-slurm.scratch, stage_in, stage_out, per-service scratch opt-out, raw #BB/#DW burst-buffer directives |
| Job tracking | Scheduler state via squeue/sacct, step stats via sstat, tracked logs, runtime state, metrics, artifacts, resume metadata |
| Advisory cluster weather | weather summarizes current node and queue conditions from read-only Slurm probes without reserving resources or changing submission behavior |
| Conditional submission | when actively monitors typed conditions, then submits one normal hpc-compose allocation |
| Canary right-sizing | germinate submits one short canary, writes latest-canary.json, and recommends resource settings without rewriting the spec |
| Hyperparameter sweeps | sweep submit expands one embedded sweep into many independent single-allocation jobs, then sweep status aggregates their tracked state |
| Cross-job rendezvous | Provider/client discovery through shared-cache JSON records under one cluster-visible cache directory |
Raw pass-through
These capabilities are usable, but hpc-compose does not model or validate their semantics beyond passing them through to Slurm.
| Mechanism | What it allows |
|---|---|
| x-slurm.submit_args | Raw #SBATCH ... lines for site-specific flags such as mail settings, reservations, or other submit-time options |
| services.<name>.x-slurm.extra_srun_args | Raw srun arguments for site-specific launch flags such as exclusivity settings |
| Existing reservations | Joining an already-created reservation through raw submit args is supported as pass-through |
Pass-through is appropriate when a site-specific flag is useful but does not justify a first-class schema field. hpc-compose rejects line breaks and null bytes in raw #SBATCH entries so one list entry cannot emit multiple directives, but it does not validate the Slurm semantics of those flags.
Unsupported or out of scope
These capabilities are intentionally outside the product seam.
| Area | Status |
|---|---|
| Admin-plane Slurm management | Out of scope |
| sacctmgr account administration | Out of scope |
| Reservation creation or lifecycle management | Out of scope |
| Federation / multi-cluster control | Out of scope |
| Cross-cluster service discovery | Out of scope; rendezvous is same-cluster shared-storage coordination only |
| Generic scontrol mutation | Out of scope |
| Broad cluster inspection tools such as a full sinfo / sprio / sreport frontend | Out of scope; weather is limited to a compact advisory snapshot |
| Background submit daemons or reservations | Out of scope; when is a foreground advisory monitor and does not reserve resources |
| Dynamic scheduling or bin packing across nodes | Not supported; use explicit x-slurm.placement selectors |
| Heterogeneous jobs and job arrays as first-class workflow concepts | Not supported in v1; sweeps deliberately submit many normal allocations instead of Slurm arrays |
| Compose build, ports, custom networks, restart, deploy | Not supported |
Non-goals
hpc-compose should not grow into a generic Slurm administration layer. In particular, it will not broaden into sacctmgr, reservation management, federation control, or generic scontrol mutation. Those are real Slurm features, but they do not fit the “one application, one allocation, tracked runtime workflow” seam this tool is built around.
Runtime Backends
runtime.backend selects how each service is launched inside the Slurm step. The default is pyxis.
For a beginner explanation of Slurm steps, Pyxis, Enroot, and shared runtime caches, start with Slurm And Container Basics.
runtime:
  backend: pyxis
Backend Summary
| Backend | Launch shape | Required tools | Image/artifact shape | Notes |
|---|---|---|---|---|
| pyxis | srun --container-* | Slurm with Pyxis support plus Enroot on the submission host | remote images or local .sqsh / .squashfs | Default path and the only backend supported by local development workflows. |
| apptainer | srun plus apptainer exec/run | apptainer on submission and compute nodes | remote images prepared or reused as .sif; local .sif accepted | Use when the site standardizes on Apptainer instead of Pyxis. |
| singularity | srun plus singularity exec/run | singularity on submission and compute nodes | remote images prepared or reused as .sif; local .sif accepted | Similar to Apptainer for sites that still use Singularity. |
| host | direct srun command | Slurm client tools and host software/modules | no container image | Services must set command or entrypoint; image prepare and container bind mounts are not applied. |
For Pyxis, check support with:
srun --help | grep container-image
For all backends, preflight checks the selected backend tools:
hpc-compose preflight -f compose.yaml
Local Mode
up --local, test --local, dev, and tmux are intentionally narrow:
- Linux only
- runtime.backend: pyxis only
- Pyxis-compatible Enroot tooling on the host
- single-host specs only
- no distributed or partitioned placement
- no service-level MPI
- no Slurm arrays or scheduler dependencies
Use local mode to inspect and debug a Pyxis/Enroot single-host launch path. dev adds file-change restart requests to the local supervisor, and tmux tails tracked local service logs in panes. Neither command changes the process-supervision model, and local mode is not a replacement for Slurm distributed execution.
Host Runtime Notes
runtime.backend: host runs service commands directly under srun. It is useful for module-based workflows or nested schedulers that already manage their own software environment.
Because there is no container:
- image is optional
- service volumes are rejected
- x-runtime.prepare and x-enroot.prepare are rejected
- x-slurm.mpi.host_mpi.bind_paths is not meaningful
Use top-level or service-level x-env for host modules, Spack views, and environment variables.
Running Compose-Style Multi-Service Workflows on Slurm
This is the canonical explainer for hpc-compose.
hpc-compose exists because two common approaches leave a gap:
- plain sbatch scripts give you control, but multi-service coordination, startup ordering, and repeatability stay ad hoc
- Docker Compose is familiar, but its networking and orchestration assumptions do not map cleanly to one Slurm allocation
hpc-compose takes the narrow path between them: a Compose-like authoring model that still produces one inspectable Slurm job.
The Pain in Current Slurm Workflows
Once a job stops being a single process, the friction climbs quickly:
- helper services need explicit startup ordering
- cluster-specific environment setup gets mixed into hand-written shell
- debugging starts from generated state you never inspected beforehand
- repeated workflows drift because the real behavior lives across scripts, notes, and local conventions
This is especially common in research ML and HPC-adjacent work where one job may need:
- a serving process plus a client
- a database plus a worker
- a training step plus checkpoint export and resume handling
Why Docker Compose Does Not Fit Slurm Directly
Docker Compose is good at expressing a small multi-service application on one machine. Slurm solves a different problem: scheduling one batch allocation onto shared cluster resources.
That mismatch shows up in exactly the features hpc-compose leaves out:
- ports
- custom networks
- Compose restart
- deploy
- broad runtime compatibility with arbitrary Compose features
Those omissions are deliberate. The point is not to emulate all of Compose on a cluster. The point is to keep a familiar authoring shape for the subset that maps cleanly to one Slurm job.
The Narrow Execution Model
hpc-compose keeps the execution model explicit:
compose-like spec
  |
  +--> plan / validate / render on the submission host
  |
  +--> one generated batch script
         |
         v
       one Slurm allocation
         |
         +--> primary-node helper services
         +--> optional allocation-wide distributed service
         +--> optional explicitly partitioned service steps
         +--> shared /hpc-compose/job scratch for coordination
This gives you a few important properties:
- one inspectable unit of submission
- one obvious place to look when the job fails
- one explicit product boundary instead of hidden orchestration behavior
One Real Example
app-redis-worker.yaml is a good example of the intended shape:
- one Redis service
- one dependent worker service
- TCP readiness gating before the worker starts
- both services living inside the same allocation
That is awkward to hand-roll repeatedly with cluster scripts alone, but it does not justify a full orchestrator. This is the exact middle ground hpc-compose targets.
If you want the smallest possible first run, start with minimal-batch.yaml. If you want the smallest concrete inference flow, start with llm-curl-workflow-workdir.yaml.
Why the Inspectable Path Matters
The authoring flow is designed to answer the practical questions before you launch:
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
That lets you confirm:
- whether the spec is valid
- what service order will run
- what image and cache behavior the planner inferred
- what batch script you are actually handing to Slurm
For a Slurm-first tool, that inspectability matters more than feature breadth.
When Not To Use hpc-compose
Do not use hpc-compose when you need:
- custom container networking
- broad Docker Compose compatibility
- a long-running orchestration control plane
- dynamic cross-node scheduling instead of explicit x-slurm.placement node selectors
If that list rules out your workload, that is not a failure of the tool. It is the intended product boundary.
Runbook
This runbook describes the normal real-cluster flow for adapting an hpc-compose spec on a supported Linux Slurm submission host.
If you are new to Slurm, read Slurm And Container Basics first. If you are adapting to HAICORE@KIT, read HAICORE Guide alongside this runbook.
Commands below assume hpc-compose is on your PATH. If you are running from a local checkout, replace hpc-compose with target/release/hpc-compose.
Compose-aware commands accept -f / --file. When omitted, hpc-compose uses the active context compose file from .hpc-compose/settings.toml, then falls back to compose.yaml in the current directory. Global context flags are available everywhere:
- --profile <NAME> selects a profile from .hpc-compose/settings.toml.
- --settings-file <PATH> uses an explicit settings file instead of upward auto-discovery.
Read Slurm And Container Basics, Execution Model, Runtime Backends, and Support Matrix before adapting a workflow to a new cluster.
Before You Start
Make sure you have:
- a Linux submission host with srun and sbatch,
- the runtime backend selected by runtime.backend,
- scontrol when x-slurm.nodes > 1,
- Pyxis support in srun when runtime.backend: pyxis (srun --help should mention --container-image),
- shared storage for the resolved cache directory,
- local source trees or local .sqsh/.sif images in place,
- registry credentials when your cluster or registry requires them.
Backend-specific requirements are listed in Runtime Backends. Cluster profile generation and MPI smoke probes are covered in Cluster Profiles.
Normal Progression
For a new spec on a real cluster:
- Choose a starter from Examples, or run hpc-compose new --template <name> --name my-app --output compose.yaml.
- Run hpc-compose setup once if you want compose path, env files, env vars, and binary overrides stored in a project-local settings file.
- Run hpc-compose context --format json to verify resolved values and sources.
- Set or confirm the resolved cache directory, then adjust cluster-specific resource settings.
- Run hpc-compose plan -f compose.yaml and hpc-compose plan --verbose -f compose.yaml while adapting the file.
- Run hpc-compose up -f compose.yaml for the normal cluster run.
- If it fails, start with hpc-compose debug -f compose.yaml --preflight, then use Troubleshooting and break out preflight, prepare, render, status, ps, watch, stats, or logs separately.
For a minimal cluster smoke test from a checkout, set CACHE_DIR to shared storage and run scripts/cluster_smoke.sh. It validates, preflights, and renders by default; set HPC_COMPOSE_SMOKE_SUBMIT=1 only when you intentionally want it to launch the smoke job.
Project-Local Settings
hpc-compose can discover .hpc-compose/settings.toml by walking upward from the current directory. You can also pin a file with --settings-file.
Typical setup flow:
hpc-compose setup
hpc-compose context
hpc-compose --profile dev context --format json
Non-interactive setup is available for scripting:
hpc-compose setup --profile-name dev --compose-file compose.yaml --env-file .env --env-file .env.dev --cache-dir '<shared-cache-dir>' --default-profile dev --non-interactive
Settings file shape:
version = 1
default_profile = "dev"
[defaults]
compose_file = "compose.yaml"
env_files = [".env"]
[defaults.env]
CACHE_DIR = "/cluster/shared/hpc-compose-cache"
[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"
[profiles.dev]
compose_file = "compose.yaml"
env_files = [".env", ".env.dev"]
[profiles.dev.env]
RESUME_DIR = "/shared/$USER/runs/my-run"
MODEL_DIR = "$HOME/models"
[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"
[resource_profiles.cpu-small]
time = "00:30:00"
cpus_per_task = 4
mem = "16G"
[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"
Resolution precedence is fixed:
- CLI flags
- selected profile values
- shared settings defaults
- built-in CLI defaults
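The precedence order can be pictured as "first non-empty value wins". The sketch below is only a mental model: the first_set helper is hypothetical, and treating an empty string as "not set" is an assumption rather than documented behavior.

```sh
#!/bin/sh
# Hypothetical helper: print the first non-empty argument.
# Pass candidates in precedence order: CLI flag, selected profile,
# shared defaults, built-in default.
first_set() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# No CLI flag given, profile "dev" selected (values from the settings
# file shape shown earlier):
first_set "" \
  "/cluster/shared/dev-hpc-compose-cache" \
  "/cluster/shared/hpc-compose-cache" \
  "$HOME/.cache/hpc-compose"
# prints /cluster/shared/dev-hpc-compose-cache
```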
Use context whenever you want to inspect effective compose path, binaries, interpolation variables, runtime paths, and per-field sources.
Resource profiles are referenced from YAML with x-slurm.resources: gpu-small. They are Slurm resource defaults, not the same thing as the global --profile setting selector, and explicit x-slurm values in the spec override profile defaults.
Choose A Starting Example
The maintained selection guide is Examples. It includes:
- four promoted beginner paths,
- a novice ladder from authoring to distributed workloads,
- the full repository example matrix,
- companion notes for LLM worker examples,
- an adaptation checklist.
Treat docs/src/examples.md as the single source of truth for example selection. The embedded YAML source appendix is Example Source.
1. Choose A Cache Directory Early
Set the cache default to a path visible from both the login node and compute nodes:
[profiles.dev.cache]
dir = "/cluster/shared/hpc-compose-cache"
Or set x-slurm.cache_dir directly in the spec when the cache path should travel with that file:
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
Quick recipe:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
Rules:
- Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm.
- If cache_dir is unset in the spec, resolution checks profile cache settings, then defaults cache settings, then $HOME/.cache/hpc-compose.
- The default may work on some clusters, but a shared project/work/scratch path is safer.
- Validation can accept unsafe local paths; preflight reports them as policy errors.
More cache details are in Cache Management.
2. Adapt The Example
Start with the nearest example and then change:
- image
- command / entrypoint
- volumes
- environment
- x-slurm resource settings
- x-runtime.prepare commands for dependencies or tooling
Recommended pattern:
- Put fast-changing application code in volumes.
- Put slower-changing dependency installation in x-runtime.prepare.commands.
- Add readiness only to services that other services truly depend on.
3. Validate The Spec
hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env
Use validate first when changing field names, dependency shape, command/entrypoint form, paths, x-slurm, x-runtime, or compatibility x-enroot blocks.
If validate fails, fix that before doing anything more expensive. Use --strict-env when missing interpolation variables should fail instead of consuming ${VAR:-default} or ${VAR-default} fallbacks.
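The two fallback forms differ in POSIX shell, and the demonstration below shows that difference. Whether hpc-compose interpolation matches these shell semantics exactly is an assumption here; the variable name and default path are illustrative.

```sh
#!/bin/sh
# ${VAR:-default} falls back when VAR is unset OR empty.
# ${VAR-default}  falls back only when VAR is unset.
unset DATA_DIR
echo "unset, colon form: ${DATA_DIR:-/data/default}"   # /data/default
echo "unset, plain form: ${DATA_DIR-/data/default}"    # /data/default

DATA_DIR=""
echo "empty, colon form: ${DATA_DIR:-/data/default}"   # /data/default
echo "empty, plain form: ${DATA_DIR-/data/default}"    # (empty string)
```

With --strict-env, a spec that would silently pick up either fallback should instead fail, which makes a forgotten export visible early.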
4. Plan The Run
hpc-compose plan -f compose.yaml
hpc-compose plan --verbose -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
Check:
- service order,
- allocation geometry and service step geometry,
- normalized image references,
- host-to-container mount mappings,
- resolved environment values,
- runtime artifact paths,
- cache hit/miss expectations.
plan is purely static: it parses, validates, builds the normalized runtime plan, and can print the generated script to stdout, but it does not run preflight, prepare images, call sbatch, or write hpc-compose.sbatch. Add --explain for planner hints about cache paths, missing artifacts, resume/artifact settings, and the next command. plan --verbose can print secrets from resolved environment values.
5. Normal Run: Use up
hpc-compose up -f compose.yaml
up is the preferred end-to-end cluster flow. It runs preflight unless disabled, prepares images unless skipped, renders the script, calls sbatch, records tracked job metadata, polls scheduler state, and streams logs.
It also uses a spec-scoped lock under .hpc-compose/locks/ so two concurrent up invocations against the same compose file do not race through prepare/render/submit.
Useful options:
- --script-out path/to/job.sbatch keeps a copy of the rendered script.
- --force-rebuild refreshes imported and prepared artifacts.
- --skip-prepare reuses existing prepared artifacts.
- --no-preflight skips the preflight phase.
- --detach submits or launches, records tracking metadata, and returns without watching.
- --format text|json is accepted with --detach or --dry-run.
- --watch-queue waits in line-oriented queue output until the Slurm job reaches RUNNING, then opens the normal watch view.
- --queue-warn-after <DURATION> warns once when --watch-queue stays PENDING longer than the threshold; the default is 10m, and 0 disables the warning.
- --watch-mode auto|tui|line selects the live output mode; --no-tui is a line-mode alias.
- --hold-on-exit never|failure|always controls whether the TUI stays open after the job reaches a terminal scheduler state.
- --resume-diff-only prints resume-sensitive config diffs without launching.
- --allow-resume-changes confirms intentional resume-coupled config drift.
up --local is Linux + Pyxis-only and single-host. See Runtime Backends.
Array jobs should be submitted with up --detach; use SLURM_ARRAY_TASK_ID in the service command and output patterns such as %A_%a for task-specific logs. Scheduler dependencies declared with x-slurm.after_job or x-slurm.dependency are passed to sbatch --dependency=... at submit time. Arrays and scheduler dependencies are not supported by up --local.
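Inside a service command, the array index is typically mapped to a per-task parameter. The sketch below assumes a small wrapper script; the lr_for_task helper, the learning-rate values, and the train.py --lr flag are hypothetical, and only SLURM_ARRAY_TASK_ID comes from Slurm.

```sh
#!/bin/sh
# Hypothetical helper: map a 0-based array task index to one value
# from a fixed list.
lr_for_task() {
  idx=$1
  # Load the candidate values as positional parameters.
  set -- 0.1 0.01 0.001 0.0001
  if [ "$idx" -lt 0 ] || [ "$idx" -ge "$#" ]; then
    echo "task index $idx out of range" >&2
    return 1
  fi
  # Drop the first idx values so $1 is the selected one.
  shift "$idx"
  printf '%s\n' "$1"
}

# Example service command (illustrative):
#   python train.py --lr "$(lr_for_task "${SLURM_ARRAY_TASK_ID:-0}")"
```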
For conditional submission on a busy partition, use when:
hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345
hpc-compose when -f compose.yaml --between 22:00-06:00
when is a foreground monitor. Interrupt it with Ctrl-C to stop waiting before the job is submitted. It runs preflight, image preparation, and script rendering before the wait begins, so submission is immediate once the conditions match; use --skip-prepare only when the required runtime artifacts already exist. --detach applies after submission: it still waits in the foreground for conditions, then returns after tracking metadata is written instead of opening the watch view.
Idle-node checks are advisory, not reservations. Another user can still submit first, and Slurm may queue the job after when calls sbatch. Keep polling gentle on shared login nodes: the default 60s interval is a good starting point, and intervals below 30s should be reserved for short, intentional watches.
For interactive development inside one allocation, use alloc:
hpc-compose alloc -f compose.yaml
hpc-compose run app -- python -m pytest
Inside the allocation shell, run SERVICE -- CMD reuses the active allocation with srun instead of submitting a new sbatch job. alloc exports HPC_COMPOSE_* metadata for the compose file, cache directory, runtime backend, and allocated nodes.
6. Run Preflight When Debugging Cluster Readiness
hpc-compose preflight -f compose.yaml
hpc-compose preflight --verbose -f compose.yaml
hpc-compose preflight -f compose.yaml --strict
preflight checks selected-backend tools, Slurm tools, cache path policy, local mounts/images, registry credentials, cluster profile compatibility, distributed-readiness hazards, metrics collector tools, and resume path safety.
Generate a cluster capability profile on the target login node when you want validation and preflight to catch partition/backend/QOS/GPU/MPI mismatches earlier:
hpc-compose doctor cluster-report
See Cluster Profiles for generated profile details, site policy packs, and MPI smoke probes.
7. Prepare Images Separately When Needed
hpc-compose prepare -f compose.yaml
hpc-compose prepare -f compose.yaml --force
Use this when you want to build or refresh prepared images before submission, confirm cache reuse behavior, or debug preparation separately from job submission.
prepare needs the selected runtime backend tools, but it does not call sbatch.
8. Render The Batch Script
hpc-compose render -f compose.yaml --output /tmp/job.sbatch
This is useful when debugging generated srun arguments, mounts, environment passing, launch order, and readiness waits.
9. Inspect A Tracked Run
hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose status -f compose.yaml --array
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose stats -f compose.yaml --format jsonl
For a failed run, a practical investigation path is hpc-compose jobs list, then hpc-compose replay -f compose.yaml --job-id <job-id> to find the failure moment, then debug, logs, or stats for deeper evidence. Use Runtime Observability for tracked state, replay, logs, metrics, and machine-readable output. Use Artifacts and Resume for artifact bundles and resume-aware attempts.
10. Manage Cache And Old State
hpc-compose cache list
hpc-compose cache inspect -f compose.yaml
hpc-compose cache prune --all-unused -f compose.yaml
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'
hpc-compose clean -f compose.yaml --age 7 --dry-run
Use Cache Management for cache reuse and pruning. Use Troubleshooting before deleting tracked job directories.
What Changed And What Should I Run?
| If you changed… | Typical next step |
|---|---|
| YAML planning/runtime settings only | plan --verbose, then up |
| Base image, x-runtime.prepare.commands, or prepare env | up --force-rebuild, or prepare --force when debugging separately |
| Mounted runtime source under volumes | Usually just up |
| Cache entries this plan no longer references | cache prune --all-unused -f compose.yaml |
| hpc-compose itself | Expect cache misses on the next prepare or up, then optionally prune old entries |
Related Docs
- Quickstart
- Examples
- Runtime Backends
- Troubleshooting
- Cluster Profiles
- Runtime Observability
- Cache Management
- Artifacts and Resume
- Spec Reference
Development Workflow
test, dev, and tmux are the v1 local-development layer. They reuse the same prepare, render, local supervisor, runtime state, and tracking paths as up, so a run started by one command remains visible to status, ps, logs, stats, watch, and debug.
Smoke-Test Specs
Use test for finite specs that prove a workflow starts, satisfies readiness gates, and exits cleanly:
hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 --timeout 180s -f compose.smoke.yaml
hpc-compose test --submit --format json -f compose.smoke.yaml
test requires exactly one execution mode:
- --local runs the rendered local supervisor on the current host.
- --submit calls sbatch; it defaults to --time 00:01:00 and --timeout 180s.
A smoke test passes only when every service:
- appears in tracked runtime state,
- launched at least once,
- passed readiness when readiness is configured,
- completed successfully.
Services with failure_policy.mode: ignore still have to complete successfully for test to pass. That makes smoke tests stricter than production runs by design: ignored sidecars are useful operationally, but they should not silently hide a broken spec test.
Making Long-Running Specs Finite
Production services often run forever. For smoke tests, create a finite variant of the spec or override the service command in a copied file:
services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -c
      - "import main; print('smoke ok', flush=True)"
Keep the same image, mounts, environment, dependencies, and readiness where possible. Change only the command or entrypoint needed to prove startup and exit. If a dependent service uses condition: service_healthy, keep the upstream readiness probe real enough to catch wiring mistakes.
Hot Reload
dev is local-only:
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose dev -f compose.yaml --watch-path ./src --debounce-ms 500
It infers watch roots from host directories mounted through service volumes. File mounts, container-only paths, cache paths, missing paths, and non-directory paths are ignored. --watch-path adds an explicit directory and restarts every service when it changes.
File changes write restart requests into the tracked run’s dev control directory. The local supervisor handles those requests as development restarts, so readiness and completion state reset for the affected service without consuming failure_policy.restart_on_failure counters.
By default, Ctrl-C stops the local supervisor. Add --keep-running when you want to leave the tracked local run alive after exiting the watch loop.
Tmux Dashboard
tmux is a log dashboard, not a process supervisor:
hpc-compose tmux -f compose.yaml
hpc-compose tmux -f compose.yaml --job-id local-123
hpc-compose tmux -f compose.yaml --session demo --no-attach
Without --job-id, it launches a new local run. With --job-id, it attaches to an existing tracked local run. Each pane tails one service log with tail -F, and pane titles use service names. Use --no-attach when running from a non-interactive terminal or CI smoke check.
Shared Local Constraints
up --local, test --local, dev, and tmux share the same current constraints:
- Linux hosts only
- runtime.backend: pyxis only
- Pyxis-compatible Enroot tooling on the host
- single-host specs only
- no distributed or partitioned placement
- no service-level MPI
- no Slurm arrays or scheduler dependencies
Use these commands to author and debug single-host launch behavior. Use test --submit or up on a Slurm login node for real scheduler behavior.
Example Recipe
The source-mounted app in examples/dev-python-app.yaml is intentionally long-running, so it is a good dev target:
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach
The companion examples/dev-python-smoke.yaml keeps the same mounted source pattern but uses a finite command:
hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml
Troubleshooting
Use this page when the safe authoring path worked but the first real cluster run failed.
For background on Slurm allocations, sbatch, srun, Pyxis, and Enroot, see Slurm And Container Basics. For HAICORE-specific storage and runtime checks, see HAICORE Guide.
First Triage
hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env
hpc-compose plan --verbose -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
plan --verbose can print resolved environment values and final mount mappings. Treat its output as sensitive when the spec contains secrets. debug is read-only unless --preflight is passed; with --preflight, it reruns prerequisite checks and includes those findings in the triage report.
Common Symptoms
| Symptom | Likely cause | Next step |
|---|---|---|
| required binary '...' was not found | Selected backend or Slurm client tool is not on PATH. | Run debug --preflight; pass --enroot-bin, --apptainer-bin, --singularity-bin, --srun-bin, or --sbatch-bin as needed. |
| srun does not advertise --container-image | Pyxis support is unavailable or not loaded. | Move to a supported login node, load the site module, or choose another backend. |
| Cache directory warning/error | The resolved cache directory is not shared, writable, or policy-safe. | Choose a shared project/work/scratch path through x-slurm.cache_dir or setup --cache-dir, then rerun debug --preflight. |
| Missing local mount or image path | Relative paths are resolved from the compose file directory. | Check paths relative to the copied compose.yaml. |
| Mounted symlink exists on the host but fails in the container | The symlink target is outside the mounted directory. | Copy the real file into the mounted directory or mount the target directory. |
| Anonymous pull or registry warning | Registry credentials are missing or rate limits apply. | Configure credentials before relying on private or rate-limited images. |
| Services start in the wrong order | Dependency condition or readiness is too weak. | Use service_healthy with readiness, or service_completed_successfully for DAG stages. |
| No service logs exist | The batch script failed before launching a service. | Use debug to see scheduler state, the tracked top-level batch log tail, and missing-log hints. |
| dev reports no watchable source directories | Services only mount files, missing paths, cache paths, or container-only paths. | Mount the source as a host directory or pass hpc-compose dev --watch-path ./src -f compose.yaml. |
| Readiness never passes | Probe target, pattern, host, or dependency timing does not match the real service. | Inspect the service log with logs --service <name> and try a finite hpc-compose test --local or short test --submit spec. |
| Smoke test times out | The spec is long-running, readiness blocks forever, or the scheduler job never reaches terminal state. | Make the smoke spec finite, lower service readiness timeouts, and use --format json to inspect the failed phase and service reason. |
| tmux is unavailable or attach fails | tmux is not installed or the shell is non-interactive. | Install tmux, pass --tmux-bin <PATH>, or create the dashboard with --no-attach. |
| Local mode is unsupported | Local workflows require a Linux host with Pyxis-compatible Enroot behavior. | Use authoring commands on non-Linux hosts, then run test --submit or up on a supported Slurm login node. |
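The symlink symptom in the table can be checked on the host before submitting. The following is a minimal Python sketch, assuming the mount source is a host directory; the helper name symlinks_escaping_mount is hypothetical and not part of hpc-compose.

```python
import os
from pathlib import Path


def symlinks_escaping_mount(mount_dir: str) -> list[str]:
    """Return symlinks under mount_dir whose resolved target lies outside it."""
    root = Path(mount_dir).resolve()
    escaping = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            target = path.resolve()
            # A target outside the mounted tree will not exist in the container.
            if target != root and root not in target.parents:
                escaping.append(str(path))
    return sorted(escaping)
```

Any path it returns is a candidate for copying the real file into the mounted directory or mounting the target directory as well.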
Readiness Issues
Use depends_on with condition: service_healthy when a dependent must wait for a dependency’s readiness probe. Plain list form means service_started.
Use condition: service_completed_successfully for one-shot DAG stages where the next service should start only after the previous stage exits with status 0, such as preprocess -> train -> postprocess.
When a TCP port opens before the service is fully usable, prefer HTTP or log-based readiness over TCP readiness.
For hpc-compose test, readiness failures are terminal smoke-test failures. A service with configured readiness must become healthy and then complete successfully; ignored sidecars are still expected to pass in a smoke spec.
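The three dependency conditions can be read as a small decision rule. The sketch below models that rule in Python for illustration only; DependencyState and may_start are hypothetical names, and the real launcher logic may track more state than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DependencyState:
    started: bool = False
    healthy: bool = False            # readiness probe has passed
    exit_code: Optional[int] = None  # None while the service is still running


def may_start(condition: str, dep: DependencyState) -> bool:
    """Decide whether a dependent may launch, given one dependency's state."""
    if condition == "service_started":
        return dep.started
    if condition == "service_healthy":
        return dep.started and dep.healthy
    if condition == "service_completed_successfully":
        return dep.exit_code == 0
    raise ValueError(f"unknown condition: {condition}")
```

Under this model, a dependent declared with the plain list form starts as soon as its dependency has launched, which is why services that need a usable endpoint should use service_healthy.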
Preview A Run
Use plan for the static preview. It never prepares images, runs preflight, calls sbatch, or writes hpc-compose.sbatch:
hpc-compose plan --show-script -f compose.yaml
Use up --dry-run only when you intentionally want to exercise preflight, prepare, and render without calling sbatch:
hpc-compose up --dry-run -f compose.yaml
Clean Old Tracked Runs
Tracked job metadata and logs accumulate in .hpc-compose/. Preview cleanup before deleting:
hpc-compose jobs list --disk-usage
hpc-compose clean -f compose.yaml --age 7 --dry-run
hpc-compose clean -f compose.yaml --age 7
Cluster Profiles
Cluster profiles let validate and preflight compare a spec against site-specific Slurm, runtime, MPI, storage, and policy hints.
For HAICORE-specific resource, workspace, and container notes, see HAICORE Guide.
Generate a best-effort profile on the target login node:
hpc-compose doctor cluster-report
This writes .hpc-compose/cluster.toml by default. Use --out - to print TOML instead.
For a live advisory snapshot of current conditions, use:
hpc-compose weather
weather reads stable labels and hints from the discovered cluster profile when present, but live node, queue, fairshare, and priority data come from one-shot Slurm probes and are not persisted in .hpc-compose/cluster.toml.
What Gets Discovered
The profile generator uses available local tools and environment hints:
- sinfo, scontrol, and srun --mpi=list
- selected runtime binaries
- shared-path environment hints
- loaded MPI stack hints from PATH, MPI_HOME, MPI_DIR, I_MPI_ROOT, EBROOTOPENMPI, and EBROOTMPICH
- editable distributed defaults such as rendezvous port and [distributed.env]
It does not run module avail. Module-only MPI installations can be added manually to the generated mpi_installations list.
Site Policy Packs
Support teams can edit optional sections such as:
- [site]
- [[software.modules]]
- [[filesystems]]
- [gpu]
- [network]
- [containers]
- [slurm.defaults]
- [slurm.required]
Policy sections warn and suggest snippets. They do not silently add modules, bind mounts, environment variables, or SBATCH directives to user specs.
MPI Smoke Probe
For MPI services, render a small rank-count probe against the service’s real runtime path:
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --script-out mpi-smoke.sbatch
Submit it only when you intentionally want to consume a Slurm allocation:
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --submit
The smoke plan keeps allocation and MPI launch settings but strips application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.
Fabric Smoke Probe
For distributed GPU or fabric-sensitive services, render a broader smoke probe:
hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto --script-out fabric-smoke.sbatch
--checks auto always includes the MPI rank probe, adds NCCL when the selected service requests GPU resources, and collects UCX, OFI, and InfiniBand diagnostics when the corresponding tools are available. Use an explicit list such as --checks mpi,nccl when a missing tool should fail the probe instead of being reported as skipped.
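The selection rule for --checks auto can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the check names ucx, ofi, and infiniband and the binaries ucx_info, fi_info, and ibv_devinfo are assumed here, and hpc-compose may look for different tools.

```python
import shutil

# Assumed diagnostic binaries; the ones hpc-compose actually probes may differ.
OPTIONAL_TOOLS = {"ucx": "ucx_info", "ofi": "fi_info", "infiniband": "ibv_devinfo"}


def select_auto_checks(requests_gpu: bool, which=shutil.which) -> list[str]:
    """Return the checks an 'auto' selection would run under these assumptions."""
    checks = ["mpi"]                  # the MPI rank probe is always included
    if requests_gpu:
        checks.append("nccl")         # added only for GPU-requesting services
    for check, binary in OPTIONAL_TOOLS.items():
        if which(binary) is not None:
            checks.append(check)      # collected only when the tool is on PATH
    return checks
```

With an explicit list such as --checks mpi,nccl there is no availability filter, which is why a missing tool fails the probe instead of being skipped.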
HAICORE Guide
This page collects hpc-compose configuration notes for HAICORE@KIT. It is a practical starting point, not a replacement for the official NHR@KIT HAICORE documentation.
Before long or expensive runs, re-check current HAICORE policy pages for partitions, quotas, GPU limits, container requirements, and filesystem lifetime rules.
Where Commands Run
HAICORE is accessed through the login host documented by NHR@KIT:
ssh <username>@haicore.scc.kit.edu
Use the login node for editing, Git operations, hpc-compose plan, hpc-compose preflight, image preparation, and Slurm job management. Run compute work through Slurm with hpc-compose up, sbatch, or site-approved interactive Slurm commands.
Do not treat the login node as a place for long Python training, GPU work, data conversion, or large preprocessing jobs. Those belong inside a Slurm allocation.
HAICORE Slurm Settings To Know
The current HAICORE batch-system documentation describes Slurm partitions named normal and advanced. The normal partition is the general starting point; advanced requires special permission and allows larger jobs.
Common settings you will map into hpc-compose:
| HAICORE / Slurm setting | hpc-compose field | Notes |
|---|---|---|
| Partition | x-slurm.partition | Usually start with the site-documented general partition. |
| Account/project | x-slurm.account | Use the account string assigned by the site or project. |
| Wall time | x-slurm.time | Keep smoke tests short; request only what the run needs. |
| Nodes | x-slurm.nodes | normal is documented for single-node jobs; confirm before multi-node runs. |
| Tasks | x-slurm.ntasks, service x-slurm.ntasks | Process/rank count. |
| CPUs per task | x-slurm.cpus_per_task, service x-slurm.cpus_per_task | CPU threads per process/rank. |
| Memory | x-slurm.mem | Scheduler/runtime memory request, not storage. |
| Full GPUs | x-slurm.gres or service x-slurm.gres | HAICORE examples use gpu:full:N style requests. |
| MIG GPUs | x-slurm.gres or service x-slurm.gres | HAICORE documents MIG profiles such as gpu:1g.5gb:1; confirm current names. |
| Constraints | x-slurm.constraint or x-slurm.submit_args | HAICORE documents constraints such as LSDF and BEEOND. |
Example single-node GPU starting point:
name: haicore-smoke
x-slurm:
  job_name: haicore-smoke
  partition: normal
  account: <account>
  time: "00:10:00"
  nodes: 1
  cpus_per_task: 4
  mem: 16G
  gres: gpu:full:1
  cache_dir: <workspace-path>/hpc-compose-cache
services:
  app:
    image: python:3.11-slim
    command: python -c "import os, socket; print(socket.gethostname()); print(os.environ.get('SLURM_JOB_ID'))"
Preview before submitting:
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
hpc-compose preflight -f compose.yaml
Workspaces And Storage
HAICORE documents several storage types. For hpc-compose, the most important distinction is shared persistent-enough storage versus job-local temporary storage.
| Storage | Use with hpc-compose | Avoid using it for |
|---|---|---|
| $HOME | Small configuration, source code, shell setup, credentials handled under site policy. | Large image caches, datasets, checkpoints, or logs from many jobs. |
| Workspace | x-slurm.cache_dir, Enroot data/cache, datasets, model files, run logs, artifacts, checkpoints. | Data that must be backed up elsewhere; workspaces are documented as not backed up and time-limited. |
| $TMPDIR | Fast node-local temporary files created and consumed within one job. | x-slurm.cache_dir or anything needed by login-node prepare and later compute-node runtime. |
| BeeOND | Job-local shared scratch across nodes when explicitly requested. | Long-term cache, persistent checkpoints, or files needed after the job unless copied out. |
Create and locate a workspace with HAICORE’s workspace tools:
ws_allocate <workspace-name> <duration>
ws_find <workspace-name>
ws_list
ws_extend <workspace-name> <duration>
Use the path from ws_find for the cache:
export CACHE_DIR=<workspace-path>/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
Then set it in your spec:
x-slurm:
  cache_dir: ${CACHE_DIR}
The official HAICORE filesystem page documents workspace lifetime, extension limits, quotas, and backup policy. Treat workspace expiration as an operational risk: long-running projects should routinely check ws_list and copy durable results to the correct long-term location.
Containers On HAICORE
The official HAICORE container documentation says native Docker and rootless Docker are not supported on the HPC systems. The relevant paths are site-supported HPC runtimes, including Enroot/Pyxis and Apptainer.
For the default hpc-compose backend:
runtime:
  backend: pyxis
Validate Pyxis support on the login node:
srun --help | grep container-image
hpc-compose preflight -f compose.yaml
HAICORE documents Pyxis as the Slurm integration for Enroot and lists container options such as --container-image, --container-name, --container-mounts, --container-mount-home, --container-writable, and --container-remap-root.
The HAICORE docs also list site-required Pyxis mounts for Slurm integration. Because mount paths are site policy and can change, inspect the current HAICORE container page before copying them into a spec. When needed, pass site-specific Pyxis flags through service-level extra_srun_args:
services:
  app:
    image: python:3.11-slim
    command: python -c "print('hello from HAICORE')"
    x-slurm:
      extra_srun_args:
        - "--container-mounts=<site-required-mounts>"
If the cluster recommends Apptainer for your workflow or Pyxis is not available in srun, choose the corresponding backend:
runtime:
  backend: apptainer
See Runtime Backends for the backend behavior and required tools.
Enroot Cache Placement
HAICORE documents Enroot as available by default, with default data paths under the user’s home directory. For repeated container jobs, large images, or quota-sensitive projects, place runtime cache/data under a workspace-backed x-slurm.cache_dir.
hpc-compose sets per-job Enroot runtime paths below the configured cache directory. That keeps image runtime state close to the job and avoids filling $HOME accidentally.
BeeOND And Job-Local Scratch
HAICORE documents BeeOND as a job-local filesystem requested through a Slurm constraint:
x-slurm:
  constraint: BEEOND
Use BeeOND for temporary high-throughput working data inside a job, then copy durable results back to a workspace or other approved persistent location. Do not put x-slurm.cache_dir on BeeOND because the cache must exist before the job and be reusable by later jobs.
Software Modules
HAICORE software is exposed through Lmod environment modules. For host-runtime or MPI workflows, keep module setup explicit in x-slurm.setup:
x-slurm:
  setup:
    - module purge
    - module avail
    - module load <module-name>
Do not leave module avail in production scripts if it produces too much output; it is useful while discovering the environment. Use module list in smoke tests when you need the batch log to record the active software stack.
Suggested First HAICORE Checklist
Run these on the HAICORE login node before the first real job:
ws_find <workspace-name>
sinfo
srun --help | grep container-image
hpc-compose plan --show-script -f compose.yaml
hpc-compose preflight -f compose.yaml
hpc-compose doctor cluster-report --out .hpc-compose/haicore-cluster.toml
Check the rendered script for:
- the intended #SBATCH --partition,
- the intended account/project,
- a short wall time for smoke tests,
- a workspace-backed cache_dir,
- expected GPU or MIG request,
- expected srun --container-* options when using Pyxis.
Submit only after the static plan and preflight output are understandable:
hpc-compose up --detach -f compose.yaml
hpc-compose status -f compose.yaml
hpc-compose logs -f compose.yaml --follow
Common HAICORE Failure Modes
| Symptom | Likely cause | What to check |
|---|---|---|
| Workspace path is missing | Workspace expired or wrong name/path was used. | ws_list and ws_find <workspace-name>. |
| Cache path fails preflight | Path is not shared, writable, or policy-safe. | Move x-slurm.cache_dir to a workspace path. |
| --container-image is unknown | Pyxis is not active in the current Slurm environment. | srun --help \| grep container-image |
| Job is rejected for partition/account | Site policy or project/account mismatch. | HAICORE batch docs, sacctmgr/support guidance, rendered #SBATCH lines. |
| GPU request is rejected | Wrong gres name, too many GPUs, or partition limit. | HAICORE batch docs and a tiny smoke job. |
| Job starts but cannot see data | Data is on node-local storage or an unmounted path. | Use workspace paths or explicit volumes. |
| Workspace fills or expires | Container cache, datasets, checkpoints, or logs accumulated. | ws_list, quota tools, cache cleanup, artifact retention policy. |
Official HAICORE References
- HAICORE overview
- Interactive login
- Hardware overview
- Batch system
- File systems and workspaces
- Software modules
- Containers
- BeeOND
Runtime Observability
After a submission, hpc-compose records tracked metadata under:
${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/
That directory lets follow-up commands reconnect without resubmitting.
Common Commands
hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose watch -f compose.yaml --hold-on-exit always
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --no-tui
hpc-compose logs -f compose.yaml --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose stats -f compose.yaml
hpc-compose stats -f compose.yaml --accounting
hpc-compose inspect -f compose.yaml --rightsize
hpc-compose score 12345
hpc-compose germinate -f compose.yaml
hpc-compose sweep status -f compose.yaml
hpc-compose sweep list -f compose.yaml
hpc-compose diff 12345 12346 -f compose.yaml
| Command | Use it for |
|---|---|
| status | Scheduler state, batch log path, runtime paths, and failure-policy state. |
| ps | Stable per-service snapshot with readiness, status, restart counters, and log path. |
| watch | Live terminal UI; falls back to line-oriented output on non-interactive terminals. |
| replay | Best-effort DVR for a tracked run, reconstructed from existing runtime artifacts. |
| logs | Text log output, optionally focused, searched, or coarsely time-filtered. |
| stats | Tracked metrics, Slurm step statistics, and optional accounting rollups. |
| inspect --rightsize | Post-run request-versus-usage recommendations for memory, CPUs, GPUs, and walltime. |
| score | 0-100 post-run efficiency score with GPU, memory, compute-time, and kWh components. |
| germinate | One-minute canary submission that writes latest-canary.json and recommends resource settings from fresh metrics. |
| sweep status | Aggregate persisted sweep trials into completed, failed, running, pending, unknown, missing-tracking, and submit-failed counts. |
| sweep list | List prior sweep manifests without querying the scheduler. |
| diff | Compact comparison between two tracked submissions. |
Use --format json on non-streaming commands when automation needs stable fields. stats also supports --format csv and --format jsonl.
Watch UI
On an interactive terminal, watch and the default up follow mode open a live view with service state on the left and log output on the right. The UI automatically switches to a compact single-column view on narrow or short terminals. It keeps a detailed status view while the job runs and, by default, holds the final screen on failures so the failing service, final scheduler state, and next diagnostic commands stay visible.
Keybindings:
| Key | Action |
|---|---|
| j, Down, Tab | Move to the next service. |
| k, Up | Move to the previous service. |
| g / G | Jump to the first or last service. |
| / | Filter services by name; press Enter to apply or Esc to cancel. |
| Space | Pause or resume log following. |
| PgUp / PgDn | Scroll the visible log pane while paused. |
| End | Return to live-follow mode at the newest log lines. |
| a | Toggle between the selected service log and all tracked service logs. |
| ? | Toggle in-UI help. |
| q | Leave the watch view without cancelling the job. |
Use --hold-on-exit never|failure|always on up or watch to control whether the final TUI stays open after a terminal scheduler state. When the view is held, press d, l, or s to print the exact debug, logs, or stats command after leaving the alternate screen.
Use hpc-compose up --watch-queue when you want explicit queue polling before the watch view opens. It prints queue state changes, pending reason, and expected start time when Slurm exposes them; --queue-warn-after <DURATION> controls the one-time long-pending warning.
Use --watch-mode line or --no-tui when you are recording output, using a screen reader, running in CI, or working in a terminal where alternate-screen UIs are inconvenient. Line mode preserves detailed scheduler and log updates without alternate-screen control codes.
Replay
hpc-compose replay reconstructs a best-effort execution timeline after the run. It reuses the watch-style view, but reads only artifacts that already exist under the tracked job directory. This makes it useful for rewinding to the time a service failed, comparing the nearest prior metrics sample, or sharing a deterministic text/JSON summary without querying Slurm again.
hpc-compose replay -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --job-id 12345 --service trainer
hpc-compose replay -f compose.yaml --no-tui
hpc-compose replay -f compose.yaml --format json
Replay controls:
| Key | Action |
|---|---|
| Space | Pause or play the replay. |
| + / - | Move between speed presets such as 1x, 10x, and 100x. |
| Left / Right | Seek backward or forward by five seconds. |
| [ / ] | Jump to the previous or next reconstructed event. |
| Home / End | Jump to the first or final replay frame. |
| /, a, PgUp, PgDn, q | Same filter, log-pane, scroll, and quit behavior as watch. |
Replay data sources:
| Source | What replay uses | Fidelity notes |
|---|---|---|
| state.json | Final per-service state, start/finish times, exit code fallback, placement metadata | This file is overwritten during the run, so intermediate readiness and scheduler transitions are not exact. |
| service-exits/*.jsonl | Append-only service exit markers and restart evidence | Multiple exits reconstruct failure/restart sequences, but accepted restart relaunch time is inferred. |
| metrics/*.jsonl | Historical GPU and Slurm sampler rows | Replay shows the latest metrics sample at or before the cursor and never displays future metrics as current. |
| logs/*.log | Service log tails in the replay UI | Service logs do not include guaranteed per-line timestamps, so log panes are contextual tails, not exact log-time scrubbing. |
| Scheduler commands | Not queried during replay | Historical queue state, pending reason changes, and accounting gaps are not reconstructed. |
Use --no-tui for a static summary that exits immediately. Use --format json when notebooks, dashboards, or experiment records need the reconstructed events, frame summaries, artifact paths, and fidelity notes.
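The "latest sample at or before the cursor" rule for metrics rows is a standard sorted lookup. A minimal Python sketch follows, assuming each JSONL row carries a numeric time field; the field name timestamp and the helper latest_sample_at are hypothetical, since the sampler's actual schema is not described here.

```python
import bisect
import json


def latest_sample_at(jsonl_text: str, cursor: float, time_key: str = "timestamp"):
    """Return the newest sample with time <= cursor, or None if none exists yet."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    rows.sort(key=lambda row: row[time_key])
    times = [row[time_key] for row in rows]
    # bisect_right counts the samples at or before the cursor.
    index = bisect.bisect_right(times, cursor)
    return rows[index - 1] if index else None
```

Returning None before the first sample is what keeps a replay from showing future metrics as current.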
Logs
Runtime logs live under:
${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/logs/<service>.log
Slurm may also write a top-level batch log such as slurm-<jobid>.out, or to the path configured with x-slurm.output. Check the batch log first when a job fails before any service log appears.
Service names containing non-alphanumeric characters are encoded in log filenames. Prefer [a-zA-Z0-9_-] in service names for readability.
Use --grep <pattern> to print only matching raw log lines across selected service logs. Use --since <duration> for coarse time-bounded initial output, for example 30s, 15m, 2h, 1d, or 1h30m. Because service logs do not include line timestamps, --since filters by each log file’s modification time rather than by individual line time. Follow mode still starts from the current end of each selected log and applies --grep to appended lines.
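To make the coarse --since behavior concrete, the sketch below parses the duration forms listed above and applies them to a file modification time. It is an illustrative Python model with hypothetical helper names, not the parser hpc-compose uses.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert strings such as '30s', '15m', or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", text)
    # Reject input with anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def log_is_recent(mtime: float, now: float, since: str) -> bool:
    """Whole-file filter: keep a log only if it was modified within the window."""
    return now - mtime <= parse_duration(since)
```

Because the decision is made per file, a log touched one minute ago passes --since 30m in full, including lines written hours earlier.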
Event Hooks
Per-service x-slurm.hooks can run host-side observability scripts when restart_on_failure accepts a restart or when the rolling restart window blocks a crash loop. Hook stdout/stderr is appended to that service’s log, and non-zero hook exits are logged without changing the restart or failure outcome.
Use on: restart for retry notifications and on: window_exhausted for crash-loop alerts. Event hooks receive service identity, exit code, Slurm attempt, and restart-window counters through HPC_COMPOSE_* environment variables; see Spec reference for the full list.
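A hook script does not need to know the variable names in advance to log its context. The following Python sketch prints whatever HPC_COMPOSE_* variables it receives; it assumes only the prefix stated above, and the function name hook_context is hypothetical.

```python
import os


def hook_context(environ=os.environ) -> dict[str, str]:
    """Collect every HPC_COMPOSE_* variable passed to the hook."""
    return {k: v for k, v in sorted(environ.items()) if k.startswith("HPC_COMPOSE_")}


if __name__ == "__main__":
    for key, value in hook_context().items():
        # Hook stdout is appended to the service log, so keep lines short.
        print(f"{key}={value}")
```

Since non-zero hook exits are only logged, a script like this cannot change the restart or failure outcome.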
Metrics
When x-slurm.metrics is enabled, sampler files are written under:
${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/metrics/
  meta.json
  gpu.jsonl
  gpu_processes.jsonl
  slurm.jsonl
  diagnostics/
The sampler can collect GPU snapshots through nvidia-smi and job-step CPU/memory snapshots through sstat. Collector failures are best-effort: missing nvidia-smi, missing sstat, or unsupported queries do not fail the batch job itself.
Add --accounting to stats when you need post-run sacct rollups for reporting. The accounting summary includes allocated CPU-hours, total CPU-hours when available, allocated GPU-hours, allocation-based memory byte-seconds, and observed maximum RSS. Memory byte-seconds are labeled as allocation-based because Slurm’s standard accounting fields do not reliably provide true per-line memory-seconds across all clusters.
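The allocation-based quantities are products of the request and the elapsed time. A small Python sketch of that arithmetic, assuming the inputs have already been extracted from sacct; the helper names are hypothetical and this is not hpc-compose's own code.

```python
def allocated_hours(allocated_units: int, elapsed_seconds: float) -> float:
    """Allocated CPU-hours or GPU-hours: units held multiplied by hours held."""
    return allocated_units * elapsed_seconds / 3600.0


def allocated_memory_byte_seconds(allocated_bytes: int, elapsed_seconds: float) -> float:
    """Allocation-based memory byte-seconds: the request, not measured usage."""
    return allocated_bytes * elapsed_seconds
```

For example, 8 CPUs held for 30 minutes count as 4 allocated CPU-hours regardless of how busy the cores were, which is why observed maximum RSS is reported separately.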
Use hpc-compose inspect --rightsize -f compose.yaml after a tracked Slurm run to convert those observations into conservative resource suggestions. The assistant requires tracked submission metadata and compares explicit requests such as x-slurm.mem, x-slurm.time, x-slurm.gpus, and service x-slurm.cpus_per_task against sacct, sstat, and nvidia-smi sampler evidence. It only reports suggestions; it does not rewrite the compose file.
Use hpc-compose score <job-id> after a tracked Slurm run when you want a compact efficiency grade. The score reuses sampler history, sacct, sstat, and right-sizing recommendations, then reports GPU utilization, memory utilization, active compute-time versus requested walltime, and a best-effort kWh estimate. Energy uses sampled GPU power when available, otherwise falls back to power limits or configured TDP assumptions through --gpu-tdp-w, --cpu-watts-per-core, and --pue; it does not claim carbon intensity or emissions.
Use hpc-compose germinate -f compose.yaml before a full run when you want a short canary to gather fresh evidence. Canary runs write .hpc-compose/latest-canary.json so normal up metadata remains the latest production submission.
Sweep Manifests
hpc-compose sweep submit stores sweep state under .hpc-compose/sweeps/<sweep-id>/sweep.json and refreshes .hpc-compose/sweeps/latest.json. The manifest records the matrix mode, persisted random seed, trial ids, trial variables, rendered script paths, job ids, per-trial job record paths, submit times, and any submit error.
Each submitted trial also writes a normal job record under .hpc-compose/jobs/<job-id>.json with kind: sweep_trial and a sweep metadata block. Sweep-trial records deliberately do not replace normal latest.json or latest-run.json, so hpc-compose status, watch, and logs continue to target ordinary runs unless you pass an explicit job id.
hpc-compose sweep status -f compose.yaml --format json loads the manifest and queries the same scheduler/tracking snapshot code used for ordinary jobs. It reports per-trial state plus aggregate counts for completed, failed, running, pending, unknown, missing_tracking, and submit_failed. V1 does not parse metric files or infer the best trial; keep metric summaries in your training output or external experiment tracker.
Diffing Runs
Use hpc-compose diff <job-id-1> <job-id-2> to compare two tracked submissions. The compact text view highlights outcome, resource, and config changes; --format json returns the full uncapped diff for notebooks or experiment records. Older tracked jobs without config snapshots still compare outcome metadata and report a note that config comparison is unavailable.
Hyperparameter Sweeps
hpc-compose sweep turns one compose file with an embedded sweep block into many independent tracked Slurm jobs. Each trial is a normal sbatch submission with its own allocation, rendered script, job record, and scheduler state. The sweep manifest ties those jobs together for listing and aggregate status.
Quickstart
Start from a spec that can run with ordinary defaults, then add a top-level sweep block:
name: training-sweep
x-slurm:
  time: "00:20:00"
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full
services:
  trainer:
    image: python:3.11-slim
    environment:
      LR: "${lr:-0.001}"
      BATCH_SIZE: "${batch_size:-32}"
    command: ["python", "train.py"]
Preview the expansion first:
hpc-compose sweep submit -f examples/training-sweep.yaml --dry-run
Then submit the trials:
hpc-compose sweep submit -f examples/training-sweep.yaml
hpc-compose sweep status -f examples/training-sweep.yaml
hpc-compose sweep list -f examples/training-sweep.yaml
Matrix Modes
matrix: full expands the full Cartesian product over sorted parameter names, so the example above produces six trials in stable t000, t001, … order.
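The expansion can be reproduced offline to see why the quickstart grid yields six trials. This Python sketch iterates the sorted parameter names and assigns t000-style labels; which parameter varies fastest inside the product is an assumption here, and expand_full is a hypothetical name.

```python
from itertools import product


def expand_full(parameters: dict[str, list]) -> list[tuple[str, dict]]:
    """Expand a parameter grid into labeled trials, iterating sorted names."""
    names = sorted(parameters)
    combos = product(*(parameters[name] for name in names))
    return [(f"t{i:03d}", dict(zip(names, combo))) for i, combo in enumerate(combos)]


trials = expand_full({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]})
for label, variables in trials:
    print(label, variables)
```

Sorting the names first is what makes the labels independent of the key order in the YAML file.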
Random sampling selects without replacement:
sweep:
parameters:
lr: [0.001, 0.01, 0.1]
batch_size: [32, 64]
matrix:
random: 5
seed: "paper-table-2"
With a seed, the selected trials are stable across machines. Without a seed, sweep submit derives one from the new sweep id and persists it in the manifest.
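Seeded sampling without replacement can be illustrated with Python's standard library. This is a sketch of the idea only: Python's generator is not the one hpc-compose uses, so the concrete trials it picks will differ from a real sweep, and sample_trials is a hypothetical name.

```python
import random
from itertools import product


def sample_trials(parameters: dict[str, list], count: int, seed: str) -> list[dict]:
    """Pick `count` distinct grid points, reproducibly for a given seed."""
    names = sorted(parameters)
    grid = [dict(zip(names, combo))
            for combo in product(*(parameters[name] for name in names))]
    rng = random.Random(seed)        # a string seed gives a deterministic stream
    return rng.sample(grid, count)   # sampling without replacement
```

Asking for more trials than the grid contains raises ValueError in this sketch, which mirrors the fact that sampling without replacement cannot exceed the full matrix.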
Interpolation Rules
Sweep parameter names are interpolation variable names. Values may be scalar strings, numbers, or booleans. For each trial, those variables override values from the environment and settings before planning, preparing, and rendering.
Reserved variables are also available:
| Variable | Value |
|---|---|
| HPC_COMPOSE_SWEEP_ID | The persisted sweep id. |
| HPC_COMPOSE_SWEEP_TRIAL | The stable trial label such as t000. |
| HPC_COMPOSE_SWEEP_TRIAL_INDEX | Zero-based trial index. |
Normal commands still treat sweep as metadata. If plan, up, or render encounters ${lr} without a default, it fails unless lr is provided in the environment or settings. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, and use sweep submit --dry-run as the validation path for missing sweep-only variables.
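The difference between ${lr} and ${lr:-0.001} can be modeled in a few lines. The Python sketch below is an approximation with a hypothetical interpolate helper; it treats an empty value as set, which may not match how hpc-compose handles empty variables.

```python
import re

_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def interpolate(text: str, variables: dict[str, str]) -> str:
    """Expand ${name} and ${name:-default}; fail on a missing name with no default."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in variables:
            return str(variables[name])   # assumption: empty values count as set
        if default is not None:
            return default
        raise KeyError(f"missing variable: {name}")
    return _PATTERN.sub(replace, text)
```

With no lr supplied, the defaulted form keeps the base spec runnable while the bare form fails, which is the behavior to expect from plan, up, and render outside a sweep.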
Fanout Guard
By default, submitted sweeps are capped at 100 trials. Larger matrices fail before calling sbatch:
hpc-compose sweep submit -f train.yaml
Raise the explicit ceiling when the fanout is intentional:
hpc-compose sweep submit -f train.yaml --max-trials 500
The guard applies to real submissions. Dry runs can inspect any matrix size.
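For a full matrix, the trial count is the product of the value-list lengths, so the guard can be anticipated before submitting. A minimal Python sketch with a hypothetical check_fanout helper; the default of 100 comes from the text above, everything else is illustrative.

```python
import math

DEFAULT_MAX_TRIALS = 100  # default cap described above


def check_fanout(parameters: dict[str, list], max_trials: int = DEFAULT_MAX_TRIALS) -> int:
    """Return the full-matrix trial count, or raise if it exceeds the cap."""
    total = math.prod(len(values) for values in parameters.values())
    if total > max_trials:
        raise ValueError(f"{total} trials exceeds the cap of {max_trials}")
    return total
```

A grid with 11 and 10 values per parameter gives 110 trials, which would need an explicit ceiling such as --max-trials 500.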
Status Output
sweep status loads the manifest, queries the tracked state for submitted jobs, and aggregates:
- completed
- failed
- running
- pending
- unknown
- missing_tracking
- submit_failed
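The aggregation amounts to counting per-trial states into those seven buckets. A short Python sketch, assuming the per-trial states are already available as a list of strings; how they are laid out in the JSON output is not shown here, and aggregate is a hypothetical name.

```python
from collections import Counter

STATES = ["completed", "failed", "running", "pending", "unknown",
          "missing_tracking", "submit_failed"]


def aggregate(trial_states: list[str]) -> dict[str, int]:
    """Count trials per state, reporting zero for states that did not occur."""
    counts = Counter(trial_states)
    unexpected = set(counts) - set(STATES)
    if unexpected:
        raise ValueError(f"unexpected states: {sorted(unexpected)}")
    return {state: counts.get(state, 0) for state in STATES}
```

Emitting every bucket, including zeros, keeps downstream dashboards from having to special-case missing keys.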
Use JSON for notebooks, dashboards, or CI automation:
hpc-compose sweep submit -f train.yaml --format json
hpc-compose sweep status -f train.yaml --format json
hpc-compose sweep status -f train.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f train.yaml --format json
The JSON includes the sweep id, manifest path, matrix mode, persisted seed, trial variables, job ids, record paths, and per-trial status.
Manifest Layout
Sweep state is stored beside normal tracked jobs:
.hpc-compose/
  sweeps/
    latest.json
    <sweep-id>/
      sweep.json
      t000.sbatch
      t001.sbatch
  jobs/
    <job-id>.json
Sweep-trial records have kind: sweep_trial and include sweep metadata. They do not update the normal latest.json or latest-run.json pointers, so status, watch, and logs for ordinary runs keep their existing meaning.
V1 Limitations
- Sweeps must be embedded in the same compose file. sweep.spec is rejected in v1.
- Each trial is a separate Slurm allocation. Sweeps are not Slurm arrays. x-slurm.array is rejected during sweep submit.
- Trials submit sequentially. If a submission fails, later trials are not submitted and the partial manifest is kept.
- sweep status summarizes scheduler/tracking state only. It does not parse metric files or pick a best trial.
Right-Sizing With Canary Runs
hpc-compose germinate submits a short Slurm canary for an existing compose spec, forces runtime metrics on, waits for the canary to finish, and prints conservative resource recommendations for the original spec.
Canaries are short probes, not benchmark truth. They are useful for catching obvious over-requests such as asking for many GPUs when only one device is touched, or requesting far more memory than the process ever approaches during startup. They are not a substitute for full-run profiling when a workload has long warmup, data-dependent memory, lazy model loading, or late training phases.
Basic Workflow
hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --format json
hpc-compose germinate -f compose.yaml --canary-time 00:01:00 --metrics-interval 5
The canary keeps partition, account, QoS, constraints, cache, runtime backend, and service topology from the original plan. It minimizes CPU, memory, and GPU requests in memory only, writes latest-canary.json, and leaves normal latest.json untouched.
Dry-run the canary script without submitting:
hpc-compose germinate -f compose.yaml --dry-run --script-out canary.sbatch
Output
Text output includes the canary job id, the standard right-sizing observations, and a YAML patch you can apply manually:
x-slurm:
mem: 16G
services:
trainer:
x-slurm:
cpus_per_task: 4
JSON output includes the same patch plus the full right-sizing report:
hpc-compose germinate -f compose.yaml --format json
Recommendation Rules
- CPU recommendations use observed CPU demand with conservative headroom and round up.
- Memory recommendations use the strongest available evidence from sampler rows, sstat, and sacct, then round to Slurm-friendly units.
- GPU recommendations shrink only when GPU sampler evidence shows fewer active devices.
- Walltime is observed but not down-sized from a one-minute canary.
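As a rough illustration of the CPU and memory rules, the sketch below applies headroom and rounds up. The 1.25 headroom factor, the whole-GiB rounding, and the reading of "strongest evidence" as the largest observed peak are assumptions made for this example, not documented hpc-compose behavior.

```python
import math


def recommend_cpus(peak_cpu_cores: float, headroom: float = 1.25) -> int:
    """Round observed CPU demand up after applying headroom (assumed factor)."""
    return max(1, math.ceil(peak_cpu_cores * headroom))


def recommend_mem(evidence_bytes: dict, headroom: float = 1.25) -> str:
    """Take the largest available peak, add headroom, round up to whole GiB.

    Treating the largest value as the "strongest" evidence is an assumption.
    """
    available = [value for value in evidence_bytes.values() if value is not None]
    if not available:
        raise ValueError("no memory evidence available")
    gib = math.ceil(max(available) * headroom / 2**30)
    return f"{gib}G"


cpus = recommend_cpus(3.1)
mem = recommend_mem({"sampler": 11 * 2**30, "sstat": int(12.5 * 2**30), "sacct": None})
```

With these inputs the sketch yields 4 CPUs and `16G`, the same shape as the YAML patch shown under Output.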
Caveats
- Warmup-heavy jobs can look smaller than steady-state jobs.
- Data-dependent memory may peak after the canary exits.
- Lazy model loading can under-report memory and GPU use if no real request hits the model.
- Distributed training may need full topology even when a canary only exercises startup.
- Failed, OOM-like, time-limit, malformed-metrics, and missing-metrics cases are reported as diagnostics rather than YAML rewrites.
Start from examples/canary-right-size.yaml when you want a small, explicit spec to practice the workflow.
Cache Management
The resolved cache directory stores imported and prepared runtime artifacts. It comes from explicit x-slurm.cache_dir, then profile/default settings, then $HOME/.cache/hpc-compose. For real cluster runs, it must be visible from both the submission host and compute nodes.
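The precedence described above can be sketched as a simple fallback chain. The function name and parameters are hypothetical; only the order (explicit spec value, then profile/default settings, then the home-directory default) comes from the text.

```python
import posixpath
from typing import Optional


def resolve_cache_dir(
    spec_cache_dir: Optional[str],
    settings_cache_dir: Optional[str],
    home: str,
) -> str:
    """Return the first configured cache directory in precedence order."""
    if spec_cache_dir:          # explicit x-slurm.cache_dir
        return spec_cache_dir
    if settings_cache_dir:      # profile or default settings
        return settings_cache_dir
    return posixpath.join(home, ".cache", "hpc-compose")  # $HOME fallback


explicit = resolve_cache_dir("/cluster/shared/cache", "/work/cache", "/home/alice")
from_settings = resolve_cache_dir(None, "/work/cache", "/home/alice")
fallback = resolve_cache_dir(None, None, "/home/alice")
```

Note that the home-directory fallback is only suitable for real cluster runs when that path is visible from compute nodes.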
Choose A Cache Path
Use a project scratch, work, or shared filesystem path:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
You can record that path in project settings instead of every compose file:
hpc-compose setup --profile-name dev --cache-dir "$CACHE_DIR" --default-profile dev --non-interactive
Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm. Validation may accept those strings, but preflight reports them as unsafe because compute nodes must reuse artifacts prepared before submission.
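A wrapper script can apply a similar guard before calling setup. The sketch below is an assumption about how such a check might look; in particular, treating subdirectories of the listed roots as unsafe is this example's choice, not a statement about what preflight does.

```python
from pathlib import PurePosixPath

# Roots named in the text above as unsuitable for a shared cache.
UNSAFE_ROOTS = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def is_node_local_path(path: str) -> bool:
    """Return True when path is one of the unsafe roots or lives under one."""
    candidate = PurePosixPath(path)
    for root in UNSAFE_ROOTS:
        root_path = PurePosixPath(root)
        if candidate == root_path or root_path in candidate.parents:
            return True
    return False
```

Comparing path components rather than string prefixes avoids flagging unrelated directories such as `/tmpfs/cache`.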
Inspect Cache State
hpc-compose cache list
hpc-compose cache inspect -f compose.yaml
hpc-compose cache inspect -f compose.yaml --service app
Use cache inspect to answer:
- which artifact is being reused
- whether a prepared image came from a cached manifest
- whether a service rebuilds on every prepare because prepare mounts are present
Prune Cache Entries
Prune old entries by age:
hpc-compose --profile dev cache prune --age 14
Prune artifacts not referenced by the current plan:
hpc-compose cache prune --all-unused -f compose.yaml
Prune one cache directory directly:
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'
--age and --all-unused are mutually exclusive.
Rendezvous Records
Cross-job rendezvous uses the same shared cache root:
<cache_dir>/rendezvous/<name>/latest.json
These records are small endpoint descriptors, not runtime images. They are pruned separately:
hpc-compose rendezvous list --cache-dir "$CACHE_DIR"
hpc-compose rendezvous prune --cache-dir "$CACHE_DIR"
Provider cleanup removes latest.json only when the finishing job still owns it, so an older provider cannot erase a newer provider’s record.
After Upgrading
Cache keys include the tool version, so upgrading hpc-compose invalidates existing cached artifacts. Expect a full rebuild on the next prepare or up, then optionally prune old entries:
hpc-compose cache prune --age 0
Cross-Job Rendezvous
hpc-compose rendezvous lets independent Slurm jobs coordinate through the shared cache directory. A provider job registers an address under <cache_dir>/rendezvous/<name>/latest.json; a later client job resolves that record and receives stable HPC_COMPOSE_RDZV_* environment variables.
This is same-cluster shared-storage discovery. It does not create DNS, tunnels, authentication, authorization, or a service mesh. Use it only inside a same-user or trusted shared-project cache boundary.
Provider
name: model-server
x-slurm:
cache_dir: ${CACHE_DIR}
services:
model:
image: python:3.12-slim
command: python -m http.server 8000
readiness:
type: tcp
port: 8000
x-slurm:
rendezvous:
register:
name: model-server
port: 8000
protocol: http
path: /
ttl_seconds: 3600
Provider registration is declarative. If readiness is configured, the rendered script registers after the readiness check succeeds. On cleanup, it removes latest.json only when the current job still owns the latest record.
Client
name: model-client
x-slurm:
cache_dir: ${CACHE_DIR}
rendezvous: model-server
services:
client:
image: curlimages/curl:8.10.1
command: curl -fsS "$HPC_COMPOSE_RDZV_MODEL_SERVER_URL"
Clients receive generic variables such as HPC_COMPOSE_RDZV_URL, plus name-scoped variables such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL, HPC_COMPOSE_RDZV_MODEL_SERVER_HOST, and HPC_COMPOSE_RDZV_MODEL_SERVER_PORT.
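A client process might look up its endpoint as sketched below. The mapping from a rendezvous name to a variable prefix (uppercase, with non-alphanumeric characters replaced by underscores) is inferred from the single model-server example above and should be treated as an assumption; `rdzv_env_prefix` and `client_endpoint` are hypothetical helpers.

```python
import re
from typing import Mapping


def rdzv_env_prefix(name: str) -> str:
    """Derive the name-scoped prefix, e.g. model-server -> ..._MODEL_SERVER."""
    return "HPC_COMPOSE_RDZV_" + re.sub(r"[^A-Za-z0-9]", "_", name).upper()


def client_endpoint(env: Mapping[str, str], name: str) -> str:
    """Prefer the name-scoped URL and fall back to the generic variable."""
    url = env.get(rdzv_env_prefix(name) + "_URL") or env.get("HPC_COMPOSE_RDZV_URL")
    if not url:
        raise KeyError(f"no rendezvous URL found for {name!r}")
    return url


demo_env = {"HPC_COMPOSE_RDZV_MODEL_SERVER_URL": "http://node01:8000/"}
url = client_endpoint(demo_env, "model-server")
```

In a real service, pass `os.environ` instead of the demo mapping.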
Debugging CLI
hpc-compose rendezvous list --cache-dir "$CACHE_DIR"
hpc-compose rendezvous resolve model-server --cache-dir "$CACHE_DIR"
hpc-compose rendezvous register model-server --host node01 --port 8000 --job-id 12345 --cache-dir "$CACHE_DIR"
hpc-compose rendezvous prune --cache-dir "$CACHE_DIR"
register is mainly for debugging and custom workflows. Normal provider jobs should use services.<name>.x-slurm.rendezvous.register.
TTL and Staleness
Records have a TTL. Resolution ignores expired records, and prune removes expired latest and historical JSON files. If the provider job exits cleanly, cleanup removes the latest pointer only if it still points at that job, so a newer provider is not deregistered by an older job finishing later.
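The TTL and ownership rules can be modeled in a few lines. The record fields (`host`, `port`, `job_id`, `expires_at`) and the helper names are hypothetical; the real record schema is not shown in this document. The sketch only demonstrates that expired records do not resolve and that an older job cannot remove a newer provider's latest.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def latest_path(root: Path, name: str) -> Path:
    return root / "rendezvous" / name / "latest.json"


def register(root: Path, name: str, host: str, port: int,
             job_id: str, ttl_seconds: int, now: float) -> None:
    """Write a latest record with an expiry time (hypothetical schema)."""
    path = latest_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"host": host, "port": port, "job_id": job_id,
              "expires_at": now + ttl_seconds}
    path.write_text(json.dumps(record))


def resolve(root: Path, name: str, now: float) -> Optional[dict]:
    """Return the latest record, ignoring missing or expired ones."""
    path = latest_path(root, name)
    if not path.exists():
        return None
    record = json.loads(path.read_text())
    return record if record["expires_at"] > now else None


def cleanup(root: Path, name: str, job_id: str) -> bool:
    """Remove latest.json only when the finishing job still owns it."""
    path = latest_path(root, name)
    if not path.exists() or json.loads(path.read_text())["job_id"] != job_id:
        return False
    path.unlink()
    return True


root = Path(tempfile.mkdtemp())
register(root, "model-server", "node01", 8000, "100", 3600, now=0.0)
register(root, "model-server", "node02", 8000, "200", 3600, now=10.0)
old_removed = cleanup(root, "model-server", "100")   # older job finishes late
live = resolve(root, "model-server", now=20.0)
expired = resolve(root, "model-server", now=4000.0)
```

Here the late cleanup from job 100 leaves the record owned by job 200 in place, and the same record stops resolving once its TTL has passed.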
Requirements
- x-slurm.cache_dir must point to storage visible from the login node and compute nodes.
- Provider and client jobs must use the same cache directory.
- Names are single safe path components: ASCII letters, digits, ., _, and -.
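The naming rule can be checked before submitting a spec with a sketch like the following. The character set comes from the requirement above; rejecting the special components `.` and `..` is an added assumption that follows from the "safe path component" wording.

```python
import re

# ASCII letters, digits, '.', '_' and '-' only.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_rendezvous_name(name: str) -> bool:
    """Accept a single path component built from the allowed characters."""
    if name in {".", ".."}:  # assumed to be rejected as unsafe components
        return False
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, `model-server` and `llm_v2.1` pass, while `team/model` and an empty string do not.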
See examples/rendezvous-model-server.yaml and examples/rendezvous-client.yaml for a runnable pair.
Artifacts And Resume
Artifacts are collected after a run for export and provenance. Resume state is the canonical live state a later attempt should load. Keep those roles separate.
Artifact Export
When x-slurm.artifacts is enabled, teardown collection writes:
${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/artifacts/
manifest.json
payload/...
Export collected payloads after the job finishes:
hpc-compose artifacts -f compose.yaml
hpc-compose artifacts -f compose.yaml --bundle checkpoints --tarball
export_dir is resolved relative to the compose file and expands ${SLURM_JOB_ID} from tracked metadata. Named bundles are written under <export_dir>/bundles/<bundle>/, and provenance JSON is written under <export_dir>/_hpc-compose/bundles/<bundle>.json.
The bundle name default is reserved for top-level x-slurm.artifacts.paths.
Resume-Aware Runs
When x-slurm.resume is enabled, hpc-compose:
- mounts the shared resume path into every service at /hpc-compose/resume
- injects HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME
- writes attempt-specific runtime outputs under .hpc-compose/<jobid>/attempts/<attempt>/
- keeps .hpc-compose/<jobid>/{logs,metrics,artifacts,state.json} pointed at the latest attempt for compatibility
Use the shared resume directory for the canonical checkpoint a restarted run should load next. Treat exported artifacts as retrieval and provenance output after the attempt finishes, not as the primary live resume source.
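A service can use the resume directory along these lines. The checkpoint file name and JSON layout are hypothetical, and the sketch decides whether to resume by checking for the checkpoint file rather than by parsing HPC_COMPOSE_IS_RESUME, because the value format of that variable is not described here.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def resume_dir(env: Mapping[str, str]) -> Path:
    # Fall back to the documented mount point when the variable is absent.
    return Path(env.get("HPC_COMPOSE_RESUME_DIR", "/hpc-compose/resume"))


def load_start_step(env: Mapping[str, str]) -> int:
    """Return the step to start from: 0 on a fresh run, last step + 1 otherwise."""
    checkpoint = resume_dir(env) / CHECKPOINT_NAME
    if not checkpoint.exists():
        return 0
    return json.loads(checkpoint.read_text())["step"] + 1


def save_checkpoint(env: Mapping[str, str], step: int) -> None:
    """Write to a temporary file, then rename, so readers never see a partial file."""
    target = resume_dir(env) / CHECKPOINT_NAME
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "attempt": env.get("HPC_COMPOSE_ATTEMPT")}))
    os.replace(tmp, target)


demo_env = {"HPC_COMPOSE_RESUME_DIR": tempfile.mkdtemp(), "HPC_COMPOSE_ATTEMPT": "1"}
first_start = load_start_step(demo_env)
save_checkpoint(demo_env, 41)
second_start = load_start_step(demo_env)
```

In a real service, pass `os.environ` instead of `demo_env`; the temporary directory only stands in for the shared resume mount.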
Useful Commands
hpc-compose up --resume-diff-only -f compose.yaml
hpc-compose up --allow-resume-changes -f compose.yaml
hpc-compose artifacts -f compose.yaml
CLI Reference
This page maps the public hpc-compose CLI by workflow. Use Quickstart for the shortest install-and-run path, Runbook for real-cluster operations, and Spec Reference for YAML field behavior.
Common Flags
| Flag | Use it for | Notes |
|---|---|---|
--profile <NAME> | Select a profile from the project-local settings file | Applies to every command. |
--settings-file <PATH> | Use an explicit settings file | Bypasses upward discovery of .hpc-compose/settings.toml. |
-f, --file <FILE> | Select the compose file on compose-aware commands | When omitted, hpc-compose uses the active context compose file or falls back to compose.yaml. |
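--color auto\|always\|never | Control colored terminal output | Use never for plain output when terminal styling makes automation or assistive tooling harder. |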
--quiet | Suppress non-essential progress labels | Useful when a wrapper only needs command output and errors. |
--format json | Machine-readable output | Preferred on non-streaming commands. --json remains available only as a compatibility alias on older machine-readable commands. |
Authoring and Setup
| Command | Use it for | Notes |
|---|---|---|
new (alias: init) | Generate a starter compose file from a built-in template | Use --list-templates and --describe-template <name> to inspect templates before writing a file. --cache-dir is optional and writes an explicit x-slurm.cache_dir. |
evolve | Learn spec features through a progressive valid-spec tutorial | Use --list-lessons, --describe-lesson <id>, and --until <step> to inspect or stop at a lesson step. --format json requires --yes. |
setup | Create or update the project-local settings file | Records compose path, env files, env vars, binary overrides, and an optional profile cache default. |
context | Print the resolved execution context | Shows the selected profile, binaries, interpolation vars, runtime paths, and value sources. |
completions | Generate shell completion scripts | Supports Bash, Zsh, Fish, PowerShell, and Elvish through Clap’s completion generator. |
hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml
hpc-compose evolve --list-lessons
hpc-compose evolve --describe-lesson progressive-complexity
hpc-compose evolve --output compose.yaml --name my-app
hpc-compose evolve --yes --until readiness --format json
hpc-compose setup
hpc-compose setup --profile-name dev --cache-dir '<shared-cache-dir>' --default-profile dev --non-interactive
hpc-compose context --format json
hpc-compose context --show-values --format json
hpc-compose completions zsh
evolve Options
evolve is authoring-only: it validates and writes candidate specs but does not prepare images, run preflight, or submit jobs. The default lesson is progressive-complexity, with steps minimal, second-service, readiness, failure-policy, and multi-node-placement.
- --list-lessons prints shipped lessons.
- --describe-lesson <LESSON> prints lesson steps and concepts.
- --lesson <LESSON> selects the lesson to run.
- --until <STEP> stops after a step id such as readiness.
- --yes accepts steps noninteractively.
- --format json is available for list/describe and for --yes runs.
- --force allows overwriting the output file.
Plan and Run
| Command | Use it for | Notes |
|---|---|---|
plan | Validate and preview the static runtime plan | Recommended before every first run. --show-script prints the generated launcher to stdout without writing a file; --explain adds actionable cache, resume, preflight, and next-command hints. |
validate | Check YAML shape and field validation | Add --strict-env when interpolation fallbacks should fail. |
lint | Run stricter opinionated static checks | Flags risky-but-valid specs such as weak dependency readiness, unusual memory/CPU ratios, and ignored services that can write shared paths. Warnings fail by default; add --allow-warnings to make warning-only results successful. |
config | Show the fully interpolated effective config | Use --format json when you need stable machine-readable snapshots or resume diffs. config --variables reports only interpolation variables referenced by the compose file and redacts sensitive-looking names unless --show-values is passed. |
schema | Print the checked-in JSON Schema | Use it for editor integration and authoring tools. The same schema is published with the docs site for YAML Language Server and SchemaStore consumption. Rust validation remains the semantic source of truth. |
inspect | View the normalized runtime plan | --verbose can reveal resolved secrets and final mount mappings. Add --dependencies for a service DAG in text, DOT, or JSON form. |
preflight | Check host and cluster prerequisites | Use --strict when warnings should block a later run. |
doctor cluster-report | Generate a best-effort cluster capability profile | Writes .hpc-compose/cluster.toml by default; use --out - to print the TOML profile. |
doctor mpi-smoke | Render or run a small MPI probe for one service | Reports requested/advertised MPI types, MPI profile metadata, discovered MPI installs, host MPI binds/env, and rendered srun; add --submit to consume a Slurm allocation. |
doctor fabric-smoke | Render or run MPI/NCCL/UCX/OFI smoke probes for one MPI service | Use --checks auto or a comma-separated list such as mpi,nccl; render-only by default, --submit consumes a Slurm allocation. |
weather | Show advisory live cluster conditions | One-shot dashboard from sinfo, squeue, optional sshare, and optional sprio; does not reserve resources or change submission behavior. |
prepare | Import images and build prepared runtime artifacts | Use --force when the base image or prepare inputs changed. |
render | Write the generated launcher script without submitting | Good for reviewing the final batch script. |
up | Run the one-command launch/watch/logs workflow | Preferred normal run on a real cluster. Uses a spec-scoped .hpc-compose/locks/*.up.lock to prevent concurrent up races. |
test | Smoke-test a finite spec end to end | Requires explicit --local or --submit; every service must start, pass configured readiness, and complete successfully. |
dev | Run local hot-reload mode | Watches bind-mounted source directories and restarts affected services through the local supervisor. |
tmux | Open a multi-pane local service log dashboard | Tails one tracked local service log per pane; tmux does not own service processes. |
germinate | Submit a one-minute canary and recommend resource settings | Writes latest-canary.json, keeps normal latest.json untouched, and prints a manual YAML patch. |
sweep submit | Submit many independent trials from a top-level sweep block | Each trial is a tracked Slurm allocation. Use --dry-run first and --max-trials for intentional fanout above 100. |
when | Submit after cluster conditions are met | Prepares and renders now, then monitors typed conditions such as idle nodes, prior job completion, or a local time window before calling sbatch. |
alloc | Open an interactive Slurm allocation for iterative service runs | Uses top-level x-slurm allocation settings, exports HPC_COMPOSE_*, and lets run SERVICE -- CMD reuse the active allocation. |
run | Launch a one-off command | Service mode uses an existing compose service. Image mode uses --image IMAGE -- CMD and builds an ephemeral one-service plan. |
shell | Open an interactive Pyxis shell | Thin wrapper around srun --pty --container-image=<image> bash -l. |
hpc-compose plan -f compose.yaml
hpc-compose plan --explain -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
hpc-compose validate -f compose.yaml
hpc-compose lint -f compose.yaml
hpc-compose lint -f compose.yaml --allow-warnings
hpc-compose lint -f compose.yaml --format json
hpc-compose config -f compose.yaml
hpc-compose config -f compose.yaml --variables
hpc-compose schema > hpc-compose.schema.json
hpc-compose inspect --verbose -f compose.yaml
hpc-compose inspect --dependencies -f compose.yaml
hpc-compose inspect --dependencies --dependencies-format dot -f compose.yaml
hpc-compose preflight -f compose.yaml
hpc-compose doctor cluster-report
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --script-out mpi-smoke.sbatch
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --submit
hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto --script-out fabric-smoke.sbatch
hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks mpi,nccl --submit
hpc-compose weather
hpc-compose weather --format json
hpc-compose prepare -f compose.yaml
hpc-compose render -f compose.yaml --output job.sbatch
hpc-compose up -f compose.yaml
hpc-compose up --hold-on-exit always -f compose.yaml
hpc-compose up --watch-queue --queue-warn-after 15m -f compose.yaml
hpc-compose up --detach --format json -f compose.yaml
hpc-compose test --local -f compose.yaml
hpc-compose test --submit --time 00:01:00 -f compose.yaml
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach
hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --format json
hpc-compose germinate -f compose.yaml --dry-run --script-out canary.sbatch
hpc-compose sweep submit -f compose.yaml --dry-run
hpc-compose sweep submit -f compose.yaml --max-trials 200
hpc-compose sweep status -f compose.yaml --format json
hpc-compose sweep list -f compose.yaml
hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345
hpc-compose when -f compose.yaml --between 22:00-06:00
hpc-compose when --detach --format json -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose alloc -f compose.yaml
hpc-compose run app -- python -m smoke_test
hpc-compose run --image docker://python:3.12 --resources cpu-small -- python -V
hpc-compose shell --image docker://ubuntu:24.04
Editor Schema
The checked-in schema is draft-07 JSON Schema and is published with the docs site at /schema/hpc-compose.schema.json. SchemaStore should associate it only with hpc-compose-specific filenames: hpc-compose.yaml, hpc-compose.yml, *.hpc-compose.yaml, and *.hpc-compose.yml. Generic compose.yaml remains a supported input file, but it is intentionally not claimed for zero-config editor association.
up Options
Useful workflow flags:
- --local runs a Pyxis/Enroot plan on the current Linux host instead of calling sbatch.
- --detach submits or launches and returns after tracking metadata is written.
- --format text|json is accepted with --detach or --dry-run.
- --watch-queue waits in line-oriented queue output until the Slurm job reaches RUNNING, then opens the normal watch view.
- --queue-warn-after <DURATION> warns once when --watch-queue stays PENDING longer than the threshold; the default is 10m, and 0 disables the warning.
- --watch-mode auto|tui|line selects the live output mode; --no-tui is a line-mode alias.
- --hold-on-exit never|failure|always controls whether the TUI stays open after the job reaches a terminal scheduler state.
- --allow-resume-changes acknowledges an intentional change to resume-coupled config between tracked runs.
- --resume-diff-only prints the resume-sensitive config diff without submitting.
- --script-out <PATH> keeps a copy of the rendered batch script.
- --force-rebuild refreshes imported and prepared artifacts before launch.
- --skip-prepare skips image import and prepare reuse checks.
- --keep-failed-prep leaves the failed Enroot rootfs behind for inspection.
- Array jobs (x-slurm.array) require --detach because live watch/log fan-out is not array-aware yet.
- Scheduler dependencies from x-slurm.after_job and x-slurm.dependency are passed as sbatch --dependency=....
germinate Canary Runs
germinate is the conservative right-sizing workflow:
hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --canary-time 00:01:00 --metrics-interval 5
hpc-compose germinate -f compose.yaml --pending-timeout 30m --format json
Useful options:
- --canary-time <TIME> defaults to 00:01:00.
- --metrics-interval <SECONDS> defaults to 5 and is forced on in the canary plan.
- --pending-timeout <DURATION> defaults to 30m.
- --min-cpus <N>, --min-mem <MEM>, and --min-gpus <N> set canary floors.
- --dry-run renders the canary script without calling sbatch.
- --skip-prepare, --force-rebuild, --keep-failed-prep, --no-preflight, and --script-out match the normal preparation flags.
The command rejects x-slurm.array in v1 and never rewrites your compose file automatically. See Right-Sizing With Canary Runs.
sweep Hyperparameter Sweeps
sweep expands the top-level sweep block in a compose file. Each generated trial is rendered and submitted as an independent tracked Slurm job; sweep status and sweep list read the persisted manifest under .hpc-compose/sweeps/.
hpc-compose sweep submit -f train.yaml --dry-run
hpc-compose sweep submit -f train.yaml --max-trials 200
hpc-compose sweep submit -f train.yaml --format json
hpc-compose sweep status -f train.yaml
hpc-compose sweep status -f train.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f train.yaml --format json
sweep submit options:
| Option | Use it for |
|---|---|
-f, --file <FILE> | Select the compose file containing the embedded sweep block. |
--dry-run | Expand and validate all trials without writing manifests, scripts, or job records. |
--max-trials <N> | Permit real submissions above the default 100-trial fanout guard. |
--skip-prepare | Reuse existing prepared artifacts and skip image preparation. |
--force-rebuild | Refresh imported/prepared artifacts for each submitted trial. |
--no-preflight | Skip preflight checks before trial submission. |
--format text\|json | Select text or JSON output. |
sweep status options:
| Option | Use it for |
|---|---|
-f, --file <FILE> | Select the compose file whose sweep manifests should be read. |
--sweep-id <ID> | Inspect a specific sweep instead of .hpc-compose/sweeps/latest.json. |
--format text\|json | Select text or JSON output. |
sweep list options:
| Option | Use it for |
|---|---|
-f, --file <FILE> | Select the compose file whose sweep directory should be scanned. |
--format text\|json | Select text or JSON output. |
See Hyperparameter Sweeps for the sweep spec shape, interpolation rules, status categories, and v1 limitations.
when Conditional Submission
when is a foreground monitor for constrained partitions and off-hour workflows. It runs the normal pre-submit work first, then polls until every supplied condition is true:
hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345 --after-job-condition afterok
hpc-compose when -f compose.yaml --between 22:00-06:00
Conditions are ANDed. --free-nodes counts only idle rows from sinfo -h -p <partition> -o "%T|%D" and requires --partition to match x-slurm.partition. --after-job polls squeue first and then sacct; afterok and afternotok fail immediately when the prior job reaches a terminal state that can never satisfy the requested condition. --between uses local login-node wall-clock time and supports wraparound windows such as 22:00-06:00.
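Two of these conditions are easy to reason about with a small model. The sketch below is illustrative only: the helper names are hypothetical, the window is treated as start-inclusive and end-exclusive by assumption, and only rows whose state is exactly `idle` are counted.

```python
from datetime import time


def parse_window(spec: str) -> tuple:
    """Parse 'HH:MM-HH:MM' into a (start, end) pair of datetime.time values."""
    start_text, end_text = spec.split("-")
    start_h, start_m = (int(part) for part in start_text.split(":"))
    end_h, end_m = (int(part) for part in end_text.split(":"))
    return time(start_h, start_m), time(end_h, end_m)


def in_window(now: time, spec: str) -> bool:
    """Check a local wall-clock time against a window that may wrap past midnight."""
    start, end = parse_window(spec)
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # wraparound, e.g. 22:00-06:00


def count_idle_nodes(sinfo_output: str) -> int:
    """Sum node counts from 'STATE|COUNT' rows whose state is exactly idle."""
    total = 0
    for line in sinfo_output.splitlines():
        if not line.strip():
            continue
        state, count = line.strip().split("|")
        if state == "idle":
            total += int(count)
    return total


sample = "idle|3\nmixed|2\nallocated|5\nidle|1\n"
idle_nodes = count_idle_nodes(sample)
ready = idle_nodes >= 4 and in_window(time(23, 30), "22:00-06:00")
```

Because conditions are ANDed, `ready` is true only when both the idle-node count and the time window are satisfied.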
Useful options:
- --poll-interval <DURATION> defaults to 60s; the minimum is 5s.
- --timeout <DURATION> gives up if conditions are not met; 0s performs one check.
- --detach returns after submission and tracking metadata are written.
- --format json is accepted with --detach and returns the condition summaries plus normal submission metadata.
- --skip-prepare, --force-rebuild, --keep-failed-prep, --no-preflight, and --script-out match the corresponding up preparation flags.
Example JSON automation:
hpc-compose when --detach --format json -f compose.yaml --partition gpu8 --free-nodes 4
V1 has no x-when YAML field. Conditional submission is intentionally a CLI workflow layered over the normal compose spec.
up --local
up --local launches a Pyxis/Enroot plan on the current host instead of calling sbatch. It is useful for local authoring and script inspection, not for distributed Slurm execution.
hpc-compose up --local --dry-run -f compose.yaml
Current constraints:
- Linux hosts only
- runtime.backend: pyxis only
- single-host specs only
- no distributed or partitioned placement
- no services.<name>.x-slurm.extra_srun_args
- no services.<name>.x-slurm.mpi
- no x-slurm.array
- no scheduler dependencies from x-slurm.after_job or x-slurm.dependency
- reservation-related x-slurm.submit_args are ignored
- x-slurm.error is ignored, and local batch stderr is written into the tracked local batch log
up --local follows the tracked local launch immediately, just like up does for a submitted job. Add --detach when you want to launch and return.
In local mode the batch script also exports HPC_COMPOSE_BACKEND_OVERRIDE=local, HPC_COMPOSE_LOCAL_ENROOT_BIN pointing to the resolved enroot binary, and HPC_COMPOSE_LOCAL_BIN_DIR containing a generated srun shim. These variables are internal to hpc-compose and not intended for direct use in compose specs.
Development Workflow
test, dev, and tmux are intentionally small workflows layered over the same render/prepare/tracking machinery as up. See Development Workflow for the smoke-test guide, hot-reload behavior, and local-mode constraints.
test is for finite smoke specs:
hpc-compose test --local -f compose.yaml
hpc-compose test --submit --time 00:01:00 --timeout 180s -f compose.yaml
hpc-compose test --submit --format json -f compose.yaml
Success means all tracked services appear in runtime state, launched at least once, passed readiness when readiness is configured, and completed successfully. Long-running application specs should use a smoke-test variant of the command or service entrypoint that exits after proving the workflow.
Useful test options:
| Option | Use it for |
|---|---|
--local | Run the finite smoke spec through the local supervisor. |
--submit | Submit the finite smoke spec to Slurm; required before any scheduler submission happens. |
--time <TIME> | Override Slurm wall time for --submit; defaults to 00:01:00. |
--timeout <DURATION> | Stop waiting and best-effort cancel/cleanup after the timeout; defaults to 180s. |
--format json | Emit phase status, job id, script path, per-service results, and failure reason for automation. |
dev is local-only and watches host directories from service volumes:
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose dev -f compose.yaml --watch-path ./src --debounce-ms 500
Directory bind mounts are mapped back to affected services. File mounts, missing paths, container-only paths, cache paths, and non-directory paths are ignored. --watch-path adds an explicit directory and restarts all services when it changes. By default, leaving dev stops the local supervisor; use --keep-running when you want the tracked local job to continue.
Useful dev options:
| Option | Use it for |
|---|---|
--watch-path <PATH> | Add an explicit watch root when mounted source directories cannot be inferred. |
--debounce-ms <N> | Coalesce rapid file changes before requesting a restart. |
--keep-running | Leave the local supervisor alive when the watch loop exits. |
tmux opens a log dashboard for local runs:
hpc-compose tmux -f compose.yaml
hpc-compose tmux -f compose.yaml --job-id local-123
hpc-compose tmux -f compose.yaml --session demo --no-attach
When --job-id is omitted, tmux launches a new local run first. Each pane runs tail -F against one tracked service log and uses the service name as the pane title.
Useful tmux options:
| Option | Use it for |
|---|---|
--job-id <ID> | Attach the dashboard to an existing tracked local run. |
--session <NAME> | Choose the tmux session name instead of hpc-compose-<job-id>. |
--no-attach | Create/update the dashboard without requiring an interactive terminal. |
--lines <N> | Set the initial tail -n history for each pane. |
run and shell
run has two forms:
hpc-compose run [-f compose.yaml] SERVICE -- CMD [ARGS...]
hpc-compose run --image IMAGE [--resources NAME] [--time T] [--mem M] [--cpus-per-task N] [--gpus N] [--partition P] [--env K=V] [--local] -- CMD [ARGS...]
Service mode reuses the named service’s image, environment, mounts, working directory, and prepare rules, clears depends_on, and submits a fresh tracked run job. When launched inside hpc-compose alloc, service mode detects HPC_COMPOSE_ALLOCATION=1 and SLURM_JOB_ID, prints the active allocation id, runs the one-service launcher inside the allocation with srun, and records the latest run metadata against the allocation job id. Image mode creates an ephemeral one-service plan from CLI flags, then follows the normal render/prepare/submit path. --resources refers to [resource_profiles.<name>] in settings; it is not the global --profile selector.
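Wrapper scripts can apply the same detection that service mode is described as using. This is a minimal sketch with a hypothetical helper name; it only checks the two variables named above.

```python
from typing import Mapping, Optional


def active_allocation(env: Mapping[str, str]) -> Optional[str]:
    """Return the Slurm job id when running inside an hpc-compose allocation."""
    if env.get("HPC_COMPOSE_ALLOCATION") == "1" and env.get("SLURM_JOB_ID"):
        return env["SLURM_JOB_ID"]
    return None


inside = active_allocation({"HPC_COMPOSE_ALLOCATION": "1", "SLURM_JOB_ID": "12345"})
outside = active_allocation({"SLURM_JOB_ID": "12345"})
```

In a real script, pass `os.environ` as the mapping.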
alloc requests an interactive allocation through salloc:
hpc-compose alloc -f compose.yaml
hpc-compose alloc -f compose.yaml -- bash -lc 'hpc-compose run app -- python -m pytest'
It runs preflight and image preparation by default, accepts the matching up preparation flags (--no-preflight, --skip-prepare, --force-rebuild, and --keep-failed-prep), rejects x-slurm.array, and exports allocation metadata such as HPC_COMPOSE_COMPOSE_FILE, HPC_COMPOSE_CACHE_DIR, HPC_COMPOSE_NODELIST_FILE, and HPC_COMPOSE_PRIMARY_NODE.
shell is intentionally thinner:
hpc-compose shell --image IMAGE [--resources NAME] [--time T] [--mem M] [--cpus-per-task N] [--gpus N] [--partition P] [--env K=V]
It calls srun --pty directly with Pyxis --container-image and defaults to bash -l. It does not render an sbatch script or create tracked job metadata.
Accessible and Automation-Friendly Output
Use plain or structured output when terminal styling, progress labels, or alternate-screen interfaces make automation or assistive tooling harder:
hpc-compose --color never plan -f compose.yaml
hpc-compose --quiet validate -f compose.yaml
hpc-compose watch -f compose.yaml --watch-mode line
hpc-compose replay -f compose.yaml --no-tui
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose status -f compose.yaml --format json
context and config --variables intentionally scope interpolation variables to names referenced by the compose file. Values whose names look secret-bearing, such as TOKEN, PASSWORD, SECRET, API_KEY, or PRIVATE_KEY, are shown as <redacted> by default; add --show-values only in trusted local diagnostics.
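The redaction behavior can be approximated as follows. The marker list comes from the paragraph above; matching by case-insensitive substring and the `redact` helper itself are assumptions for illustration, not the tool's actual rule.

```python
from typing import Dict, Mapping

SENSITIVE_MARKERS = ("TOKEN", "PASSWORD", "SECRET", "API_KEY", "PRIVATE_KEY")


def redact(variables: Mapping[str, str], show_values: bool = False) -> Dict[str, str]:
    """Replace values of secret-looking names unless show_values is set."""
    result = {}
    for name, value in variables.items():
        looks_secret = any(marker in name.upper() for marker in SENSITIVE_MARKERS)
        result[name] = "<redacted>" if looks_secret and not show_values else value
    return result


shown = redact({"CACHE_DIR": "/cluster/shared/cache", "HF_TOKEN": "abc123"})
```

Here `CACHE_DIR` keeps its value while `HF_TOKEN` is shown as `<redacted>`; passing `show_values=True` mirrors the effect of --show-values.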
Tracked Runtime
| Command | Use it for | Notes |
|---|---|---|
debug | Diagnose the latest tracked run | Shows scheduler state, per-service state, batch and service log tails, missing-log hints, and a recommended next command. Add --preflight to rerun prerequisite checks. |
| status | Summarize scheduler state, the top-level batch log, per-service outcomes, and failure-policy state | Prefer --format json for automation. Add --array to include merged squeue --array and sacct --array task rows. |
| ps | Show a stable per-service runtime snapshot | Useful when you want a point-in-time view instead of the live TUI. |
| watch | Reconnect to the live watch UI | Falls back to line-oriented output on non-interactive terminals. |
| replay | Reanimate a tracked job timeline from existing artifacts | Best-effort DVR view built from final state, service-exit markers, metrics JSONL, and logs. Use --speed, --no-tui, or --format json as needed. |
| logs | Print tracked service logs | Add --follow, --grep <pattern>, or coarse --since <duration> as needed. |
| inspect --rightsize | Suggest conservative resource request reductions after a tracked run | Uses tracked sacct, sstat, and sampler evidence; supports --job-id and --format json. |
| stats | Report tracked runtime metrics, step stats, and optional accounting | Supports --accounting, --format json, --format jsonl, and --format csv. |
| score | Score post-run resource efficiency | Supports positional job ids, --format json, --pue, --gpu-tdp-w, and --cpu-watts-per-core. |
| diff | Compare two tracked job submissions | Compact text by default; use --format json for full detail. |
| artifacts | Export tracked artifact bundles after a run | Use --bundle <name> and --tarball when needed. |
| cancel | Cancel the latest tracked job or an explicit job id | Uses tracked metadata instead of making you retype paths. |
| down | Cancel a tracked job and clean tracked state | Supports --purge-cache when the tracked snapshot names concrete cache artifacts. |
| jobs list | Scan the current repo tree for tracked runs | Start here when you need to rediscover an older run. |
| clean | Remove old tracked job directories for one compose context | Use --dry-run first when you are unsure. |
| rendezvous list | List live shared-cache service records | Defaults to the resolved cache dir; --cache-dir inspects a specific cache. |
| rendezvous resolve NAME | Resolve one provider record | Prints endpoint fields or JSON for automation. |
| rendezvous register NAME | Manually register a provider record | Intended for debugging and custom workflows; declarative specs usually register providers. |
| rendezvous prune | Remove expired provider records | Cleans stale latest and historical rendezvous JSON files. |
hpc-compose debug -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
hpc-compose jobs list
hpc-compose status -f compose.yaml --format json
hpc-compose status -f compose.yaml --array
hpc-compose status -f compose.yaml --job-id 12345_7 --array
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml --watch-mode line
hpc-compose watch -f compose.yaml --hold-on-exit always
hpc-compose replay -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --job-id 12345 --service app
hpc-compose replay -f compose.yaml --no-tui
hpc-compose replay -f compose.yaml --format json
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose inspect -f compose.yaml --rightsize
hpc-compose stats -f compose.yaml --format jsonl
hpc-compose stats -f compose.yaml --accounting --format csv
hpc-compose score 12345
hpc-compose diff 12345 12346 -f compose.yaml
hpc-compose artifacts -f compose.yaml --bundle checkpoints --tarball
hpc-compose down -f compose.yaml
hpc-compose cancel -f compose.yaml
hpc-compose clean -f compose.yaml --age 7 --dry-run
hpc-compose rendezvous list
hpc-compose rendezvous resolve model-server
hpc-compose rendezvous register model-server --host node01 --port 8000 --job-id 12345
hpc-compose rendezvous prune
Cache Maintenance
| Command | Use it for | Notes |
|---|---|---|
| cache list | Inspect cached imported and prepared image artifacts | Works without a compose file. |
| cache inspect | Show cache reuse expectations for the current plan | Supports --service <name> for one service. |
| cache prune | Remove old or unused cache entries | --age and --all-unused are mutually exclusive. |
hpc-compose cache list
hpc-compose cache inspect -f compose.yaml --service app
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'
hpc-compose cache prune --all-unused -f compose.yaml
Related Docs
- Examples
- Execution Model
- Runbook
- Spec Reference
- Hyperparameter Sweeps
- Right-Sizing With Canary Runs
- Cross-Job Rendezvous
Spec reference
This page describes the Compose subset that hpc-compose accepts today. Unknown or unsupported fields are rejected unless this page explicitly says otherwise.
How To Use This Reference
This page is intentionally complete. If you are new, start with Quickstart, Examples, and Runtime Backends, then use the table below to jump into the field group you need.
| Need | Section |
|---|---|
| Overall YAML shape | Top-level shape and Top-level fields |
| Shared templates and overrides | extends |
| Runtime backend choice | runtime and Runtime Backends |
| Slurm allocation settings | x-slurm |
| Hyperparameter sweeps | sweep and Hyperparameter Sweeps |
| Service command, image, env, and mounts | Service fields, Image rules, command and entrypoint, environment, volumes |
| Startup ordering | depends_on, readiness, and healthcheck |
| Multi-node placement and MPI | Multi-node placement rules, services.<name>.x-slurm.placement, and services.<name>.x-slurm.mpi |
| Prepared images | x-runtime.prepare and x-enroot.prepare |
| Metrics, artifacts, and resume | x-slurm.metrics, x-slurm.artifacts, and x-slurm.resume |
| Unsupported Compose features | Unsupported Compose keys |
Top-level shape
name: demo
version: "1"
runtime:
  backend: pyxis
x-slurm:
  time: "00:30:00"
services:
  app:
    image: python:3.11-slim
    command: python -m main
Top-level fields
| Field | Shape | Default | Notes |
|---|---|---|---|
| extends | string | omitted | Top-level authoring-only path to a base spec. The base is resolved before interpolation, validation, planning, and config output. |
| name | string | omitted | Used as the Slurm job name when x-slurm.job_name is not set. |
| version | string "1" or integer 1 | 1 | hpc-compose spec schema version. Omit for v1 or set explicitly to "1"; Docker Compose values such as "3.9" are rejected after migration. |
| runtime | mapping | backend: pyxis | Selects the service runtime backend and GPU passthrough policy. |
| services | mapping | required | Must contain at least one service. |
| steps | mapping | alias for services | Use either services or steps, not both. |
| modules | list of strings | omitted | List-only shorthand for top-level x-env.modules.load; cannot be combined with x-env.modules. |
| x-env | mapping | omitted | Structured host-side module, Spack view, and environment setup shared by all services. |
| x-slurm | mapping | omitted | Top-level Slurm settings and shared runtime defaults. |
| sweep | mapping | omitted | Embedded hyperparameter sweep metadata consumed by hpc-compose sweep submit/status/list. Normal commands treat it as metadata. |
extends
extends is an authoring feature for sharing base specs and service templates without copying large cluster-specific blocks. It is resolved before interpolation, validation, planning, rendering, tracked metadata, and hpc-compose config; the effective config no longer contains any extends keys.
Top-level extends points at a base YAML file:
extends: cluster-base.yaml
x-slurm:
  time: "02:00:00"
services:
  trainer:
    command: python train.py
Service-level extends supports three forms:
services:
  api:
    extends: base-service
  worker:
    extends: service-templates.yaml
  trainer:
    extends:
      file: ml-templates.yaml
      service: gpu-worker
Rules:
- Top-level extends must be a file path string.
- A service string that looks like a YAML file path, such as base.yaml, ../base.yml, or a path with a separator, uses the same service name from that file. Other strings refer to a service in the same file.
- A service mapping can select { file, service }; omit file to select a service from the same file.
- Extends references are recursive and cycles are rejected.
- Maps merge recursively. Sequences append base-first. Child scalars replace base scalars.
- Service volumes merge by container target, so a child mount for /data replaces the base mount for /data while unrelated base mounts are kept.
- Relative host paths in the final plan still resolve against the leaf compose file passed with -f.
- There is no delete or unset syntax in this version.
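The generic merge rules can be read as a small recursive function. The Python sketch below is illustrative only: merge_extends is a hypothetical helper, not part of hpc-compose. It does not model the volumes merge-by-container-target rule, and it assumes the child value wins when base and child have different types.

```python
from copy import deepcopy


def merge_extends(base, child):
    """Merge a child fragment over a base fragment (hypothetical helper)."""
    if isinstance(base, dict) and isinstance(child, dict):
        merged = deepcopy(base)
        for key, value in child.items():
            if key in base:
                # Maps merge recursively.
                merged[key] = merge_extends(base[key], value)
            else:
                merged[key] = deepcopy(value)
        return merged
    if isinstance(base, list) and isinstance(child, list):
        # Sequences append base-first.
        return deepcopy(base) + deepcopy(child)
    # Child scalars replace base scalars (assumed for type mismatches too).
    return deepcopy(child)


base = {"x-slurm": {"partition": "gpu", "time": "01:00:00", "setup": ["module load enroot"]}}
child = {"x-slurm": {"time": "02:00:00", "setup": ["source /shared/env.sh"]}}
merged = merge_extends(base, child)
```

In this example the child keeps the base partition, overrides time, and ends up with both setup lines in base-first order.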
sweep
sweep defines trial variables for hpc-compose sweep submit. It is a top-level metadata block; every generated trial is still planned, rendered, submitted, and tracked as a normal one-allocation job.
Full Cartesian product:
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full
Random sample without replacement:
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix:
    random: 5
    seed: "optional-stable-seed"
Rules:
- parameters must contain at least one key, and every value list must contain at least one scalar.
- Parameter keys must be valid interpolation variable names: [A-Za-z_][A-Za-z0-9_]*.
- Parameter keys must not use the reserved HPC_COMPOSE_SWEEP_ prefix.
- Parameter values may be strings, numbers, or booleans. They are passed to interpolation as strings.
- matrix: full expands the Cartesian product deterministically over sorted parameter names.
- matrix.random must be at least 1 and cannot exceed the total number of combinations.
- matrix.seed is optional. If omitted, sweep submit derives a seed from the new sweep id and persists it.
- sweep.spec is rejected in v1; embed the sweep in the same compose file.
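As a rough mental model of the two matrix modes, the Python sketch below expands a parameters mapping. The helper names expand_full and expand_random are hypothetical, values are stringified with str(), and the sampling uses Python's random module; the real tool may order and sample trials differently.

```python
import itertools
import random


def expand_full(parameters):
    """Cartesian product over sorted parameter names, values passed as strings."""
    names = sorted(parameters)
    combos = itertools.product(*(parameters[name] for name in names))
    return [dict(zip(names, (str(value) for value in combo))) for combo in combos]


def expand_random(parameters, count, seed):
    """Sample `count` distinct trials; count must be between 1 and the total."""
    trials = expand_full(parameters)
    if not 1 <= count <= len(trials):
        raise ValueError(f"random must be between 1 and {len(trials)}")
    # A fixed seed makes the sample reproducible across calls.
    return random.Random(seed).sample(trials, count)


parameters = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
trials = expand_full(parameters)
sampled = expand_random(parameters, 5, "optional-stable-seed")
```

With the parameters from the examples above, the full matrix has six trials and the random form picks five of them without replacement.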
For each trial, sweep variables override existing interpolation variables from .env, environment, settings, or --env. These reserved variables are also available:
| Variable | Meaning |
|---|---|
| HPC_COMPOSE_SWEEP_ID | Persisted sweep id. |
| HPC_COMPOSE_SWEEP_TRIAL | Trial label such as t000. |
| HPC_COMPOSE_SWEEP_TRIAL_INDEX | Zero-based trial index. |
Normal commands do not expand the sweep matrix. If the runnable spec contains ${lr} with no default, ordinary plan, up, and render still fail unless lr is provided. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, or use hpc-compose sweep submit --dry-run to validate sweep-only variables.
hpc-compose sweep submit rejects x-slurm.array, because every sweep trial is already its own allocation. See Hyperparameter Sweeps for manifests, status aggregation, and examples.
x-env
x-env is structured host-side software setup. It is available at the top level and under services.<name>.
x-env:
  modules:
    - cuda/12.4
    - openmpi/5
  spack:
    view: /shared/spack/views/ml
  env:
    HDF5_USE_FILE_LOCKING: "FALSE"
services:
  app:
    image: python:3.11-slim
    x-env:
      modules:
        purge: false
        load:
          - netcdf/4.9
      env:
        OMP_NUM_THREADS: "8"
Supported forms:
- modules: [name, ...]
- modules: { purge: bool, load: [name, ...] }
- spack: { view: /path/to/view }
- env: { KEY: VALUE }
Rules:
- Top-level x-env renders before x-slurm.setup.
- Service-level x-env renders immediately before that service’s srun.
- env entries are exported on the host and forwarded into Pyxis containers.
- Service-level x-env.env overrides top-level x-env.env when the same variable is set.
- Top-level modules: [...] and service-level modules: [...] are shorthand for the matching x-env.modules.load list. The shorthand is list-only and cannot be combined with x-env.modules at the same scope.
- spack.view prepends bin, lib, lib64, and Python site-package paths only when those directories exist.
- Modules and Spack views are host-side setup. Container filesystem visibility still requires explicit volumes, x-slurm.mpi.host_mpi.bind_paths, or other site-specific binds.
Settings-aware command table
Use these commands and global flags when you want the project-local settings file (.hpc-compose/settings.toml) to remember compose path, env files, env vars, and binary overrides.
| Command or flag | Purpose | Notes |
|---|---|---|
| --profile <NAME> | Select the profile from settings | Global flag; applies to every subcommand. |
| --settings-file <PATH> | Use an explicit settings file | Global flag; bypasses upward auto-discovery of .hpc-compose/settings.toml. |
| hpc-compose setup | Create or update the project-local settings file | Interactive by default; supports --non-interactive with --profile-name, --compose-file, --env-file, --env, --binary, --cache-dir, and --default-profile. |
| hpc-compose context | Print fully resolved execution context | Shows selected settings/profile, compose path, binaries, referenced interpolation vars, runtime paths, and value sources; supports --format json. Sensitive-looking interpolation values are redacted unless --show-values is passed. |
| hpc-compose validate --strict-env | Fail when interpolation fell back to defaults | Detects when ${VAR:-...} or ${VAR-...} consumed fallback values because VAR was missing. |
| hpc-compose lint | Run opinionated authoring checks | Builds on validation and planning, then reports stable finding codes for risky dependency, memory, and shared-write patterns. |
| hpc-compose schema | Print the checked-in JSON Schema | Useful for editor integration and authoring tools. Rust validation remains the semantic source of truth. |
x-slurm
These fields live under the top-level x-slurm block.
| Field | Shape | Default | Notes |
|---|---|---|---|
| resources | string | omitted | Name of a [resource_profiles.<name>] entry in .hpc-compose/settings.toml. Profile values are defaults only; explicit x-slurm fields win. |
| job_name | string | name when present | Rendered as #SBATCH --job-name. |
| partition | string | omitted | Passed through to #SBATCH --partition. |
| account | string | omitted | Passed through to #SBATCH --account. |
| qos | string | omitted | Passed through to #SBATCH --qos. |
| time | string | omitted | Passed through to #SBATCH --time. |
| nodes | positive integer | omitted | Slurm allocation node count. Defaults to 1 when omitted. |
| ntasks | positive integer | omitted | Passed through to #SBATCH --ntasks. |
| ntasks_per_node | positive integer | omitted | Passed through to #SBATCH --ntasks-per-node. |
| cpus_per_task | positive integer | omitted | Top-level Slurm CPU request. |
| mem | string | omitted | Passed through to #SBATCH --mem. |
| gres | string | omitted | Passed through to #SBATCH --gres. |
| gpus | positive integer | omitted | Used only when gres is not set. |
| gpus_per_node | positive integer | omitted | Passed through to #SBATCH --gpus-per-node. |
| gpus_per_task | positive integer | omitted | Passed through to #SBATCH --gpus-per-task. |
| cpus_per_gpu | positive integer | omitted | Passed through to #SBATCH --cpus-per-gpu. |
| mem_per_gpu | string | omitted | Passed through to #SBATCH --mem-per-gpu. |
| gpu_bind | string | omitted | Passed through to #SBATCH --gpu-bind. |
| cpu_bind | string | omitted | Passed through to #SBATCH --cpu-bind. |
| mem_bind | string | omitted | Passed through to #SBATCH --mem-bind. |
| distribution | string | omitted | Passed through to #SBATCH --distribution. |
| hint | string | omitted | Passed through to #SBATCH --hint. |
| constraint | string | omitted | Passed through to #SBATCH --constraint. |
| output | string | omitted | Passed through to #SBATCH --output. |
| error | string | omitted | Passed through to #SBATCH --error. |
| chdir | string | omitted | Passed through to #SBATCH --chdir. |
| array | string | omitted | Slurm array spec such as 0, 1-10, 1-10:2, 0,3,8-12, or 0-99%10. Rendered as #SBATCH --array. |
| after_job | string or mapping | omitted | Scheduler dependency on a prior job id. String shorthand means afterany:<id>; mapping supports { id, condition }. |
| dependency | string | omitted | Currently supports singleton, combined with after_job when both are set. |
| cache_dir | string | settings profile, settings defaults, then $HOME/.cache/hpc-compose | Must resolve to shared storage visible from the login node and the compute nodes. |
| scratch | mapping | omitted | Optional scratch path mounted into services and exposed as HPC_COMPOSE_SCRATCH_DIR. |
| stage_in | list of mappings | omitted | Copy or rsync host paths before services launch. |
| stage_out | list of mappings | omitted | Copy or rsync paths during teardown, optionally by outcome. |
| burst_buffer | mapping | omitted | Raw #BB / #DW directives for site-specific burst-buffer systems. |
| metrics | mapping | omitted | Enables runtime metrics sampling. |
| artifacts | mapping | omitted | Enables tracked artifact collection and export metadata. |
| resume | mapping | omitted | Enables checkpoint-aware resume semantics with a shared host path mounted into every service. |
| notify | mapping | omitted | First-class Slurm email notification settings. |
| setup | list of strings | omitted | Raw shell lines inserted into the generated batch script before service launches. |
| submit_args | list of strings | omitted | Extra raw Slurm arguments appended as #SBATCH ... lines. |
| rendezvous | string, list, or mapping | omitted | Resolve cross-job service records from the shared cache and inject HPC_COMPOSE_RDZV_* env vars. |
Resource profiles
Resource profiles are reusable settings defaults, distinct from the global --profile setting selector. Define them in .hpc-compose/settings.toml:
[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"
Reference one from the spec:
x-slurm:
  resources: gpu-small
  mem: 64G
The profile fills only omitted resource fields. In the example above, partition, time, gpus, and cpus_per_task come from the profile, while the explicit mem: 64G wins. Profiles intentionally exclude behavior such as job_name, cache_dir, arrays, dependencies, submit_args, setup hooks, scratch/staging, artifacts, resume, notify, and metrics.
Allowed profile fields are: partition, account, qos, time, nodes, ntasks, ntasks_per_node, cpus_per_task, mem, gres, gpus, gpus_per_node, gpus_per_task, cpus_per_gpu, mem_per_gpu, gpu_bind, cpu_bind, mem_bind, distribution, hint, and constraint.
x-slurm.array
x-slurm:
  array: 0-99%10
  output: logs/%A_%a.out
services:
  worker:
    image: python:3.12-slim
    command: python worker.py
array accepts Slurm list, range, step, and concurrency forms such as 0, 1-10, 1-10:2, 0,3,8-12, and 0-99%10. Values with spaces, null bytes, malformed ranges, negative numbers, zero step, or zero concurrency are rejected.
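The accepted and rejected forms can be approximated with a small validator. This Python sketch is an assumption about how such a check could look, not the tool's actual parser; is_valid_array_spec is a hypothetical name, and treating a range whose end is below its start as malformed is a guess.

```python
import re

# One element is N, N-M, or N-M:S, using ASCII digits only.
_ELEMENT = re.compile(r"([0-9]+)(?:-([0-9]+)(?::([0-9]+))?)?")


def is_valid_array_spec(spec):
    """Return True for list, range, step, and %concurrency forms."""
    body, has_limit, limit = spec.partition("%")
    if has_limit and (re.fullmatch(r"[0-9]+", limit) is None or int(limit) == 0):
        return False  # zero or malformed concurrency
    for element in body.split(","):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            return False  # spaces, signs, null bytes, or malformed ranges
        start, end, step = match.groups()
        if end is not None and int(end) < int(start):
            return False  # assumed: descending ranges count as malformed
        if step is not None and int(step) == 0:
            return False  # zero step
    return True


accepted = ["0", "1-10", "1-10:2", "0,3,8-12", "0-99%10"]
rejected = ["1 - 10", "-1", "1-10:0", "0-99%0", "1-", "0\x00"]
```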
Array jobs currently require hpc-compose up --detach; live watch/log fan-out for per-task array elements is future work. --local rejects array specs. Slurm provides SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT, SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN, and SLURM_ARRAY_TASK_STEP; for Pyxis jobs, hpc-compose forwards these names into the container when x-slurm.array is set. Prefer output patterns such as %A_%a so task logs do not overwrite each other.
x-slurm.after_job and x-slurm.dependency
x-slurm:
  after_job:
    id: "12345"
    condition: afterok
  dependency: singleton
after_job: "12345" is shorthand for afterany:12345. Mapping form accepts id plus condition, where condition is afterany, afterok, or afternotok. Job ids must be numeric Slurm ids such as 12345, or array elements such as 12345_7.
dependency: singleton is separate because Slurm’s singleton dependency does not take a job id. When both fields are set, hpc-compose submits one command-line dependency string such as --dependency=afterok:12345,singleton.
Dependencies are passed to sbatch as CLI arguments, not rendered as #SBATCH lines, because dependency job ids are commonly dynamic. --local rejects scheduler dependencies.
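To make the combination rule concrete, the Python sketch below builds the single --dependency argument from the two fields. dependency_arg is a hypothetical helper; defaulting a mapping without condition to afterany is an assumption, since the reference only states that default for the string shorthand.

```python
import re

CONDITIONS = ("afterany", "afterok", "afternotok")
JOB_ID = re.compile(r"[0-9]+(_[0-9]+)?")  # 12345 or array element 12345_7


def dependency_arg(after_job=None, dependency=None):
    """Combine after_job and dependency into one sbatch CLI argument."""
    parts = []
    if after_job is not None:
        if isinstance(after_job, str):
            job_id, condition = after_job, "afterany"  # string shorthand
        else:
            job_id = after_job["id"]
            condition = after_job.get("condition", "afterany")  # assumed default
        if condition not in CONDITIONS or JOB_ID.fullmatch(job_id) is None:
            raise ValueError(f"invalid after_job: {after_job!r}")
        parts.append(f"{condition}:{job_id}")
    if dependency is not None:
        if dependency != "singleton":
            raise ValueError("only singleton is supported")
        parts.append("singleton")
    return "--dependency=" + ",".join(parts) if parts else None


combined = dependency_arg({"id": "12345", "condition": "afterok"}, "singleton")
```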
x-slurm.setup
x-slurm:
  setup:
    - module load enroot
    - source /shared/env.sh
- Shape: list of strings
- Default: omitted
- Notes:
  - Each line is emitted verbatim into the generated bash script.
  - The script runs under set -euo pipefail.
  - Shell quoting and escaping are the user’s responsibility.
x-slurm.submit_args
x-slurm:
  submit_args:
    - "--mail-type=END"
    - "--mail-user=user@example.com"
    - "--reservation=gpu-reservation"
- Shape: list of strings
- Default: omitted
- Notes:
  - Each entry is emitted as #SBATCH {arg}.
  - Entries are rejected if they contain line breaks or null bytes.
  - Entries are not validated against Slurm option syntax.
  - First-class fields reject conflicting raw entries for the same option. Use x-slurm.array, x-slurm.after_job, or x-slurm.dependency instead of raw --array or --dependency.
x-slurm.notify
x-slurm:
  notify:
    email:
      to: user@example.com
      on: [end, fail]
| Field | Shape | Default | Notes |
|---|---|---|---|
| notify.email | mapping | omitted | Required when notify is present. |
| notify.email.to | string | required | Rendered as #SBATCH --mail-user. |
| notify.email.on | list of events | [end, fail] | Rendered as #SBATCH --mail-type. |
Supported events:
| Event | Slurm mail type |
|---|---|
| start | BEGIN |
| end | END |
| fail | FAIL |
| all | ALL |
Rules:
- When on is omitted or empty, defaults to [end, fail].
- If all is present, it replaces all other events.
- Cannot be combined with raw --mail-type or --mail-user in x-slurm.submit_args.
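The event rules amount to a short mapping step. The Python sketch below is a hypothetical illustration (mail_types is not an hpc-compose function) that returns the Slurm mail types for a given on list; how the tool joins them into the final #SBATCH --mail-type line is not shown here.

```python
MAIL_TYPES = {"start": "BEGIN", "end": "END", "fail": "FAIL", "all": "ALL"}


def mail_types(on=None):
    """Map notify.email.on events to Slurm mail types."""
    events = on or ["end", "fail"]  # omitted or empty falls back to the default
    unknown = [event for event in events if event not in MAIL_TYPES]
    if unknown:
        raise ValueError(f"unsupported events: {unknown}")
    if "all" in events:
        return ["ALL"]  # all replaces every other event
    return [MAIL_TYPES[event] for event in events]
```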
x-slurm.cache_dir
- Shape: string
- Default precedence: explicit x-slurm.cache_dir, then [profiles.<name>.cache].dir, then [defaults.cache].dir, then $HOME/.cache/hpc-compose.
- Notes:
  - Relative paths and environment variables are resolved against the compose file directory.
  - Settings cache paths are resolved against the settings base directory.
  - Paths under /tmp, /var/tmp, /private/tmp, and /dev/shm are accepted by parsing and planning, but preflight reports them as unsafe because they are not valid shared-cache locations for login-node prepare plus compute-node reuse.
  - The path must be visible from both the login node and the compute nodes.
Settings example:
[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"
[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"
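The precedence chain and the unsafe-prefix check can be sketched as follows. Both function names are hypothetical, path resolution against the compose or settings directory is left out, and the prefix check is only an approximation of what preflight reports.

```python
import posixpath

UNSAFE_PREFIXES = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def resolve_cache_dir(spec_dir=None, profile_dir=None, defaults_dir=None, home="/home/user"):
    """Pick the first configured cache dir, else $HOME/.cache/hpc-compose."""
    for candidate in (spec_dir, profile_dir, defaults_dir):
        if candidate:
            return candidate
    return posixpath.join(home, ".cache", "hpc-compose")


def looks_unsafe(path):
    """True when the path is one of, or sits under, the node-local prefixes."""
    return any(path == prefix or path.startswith(prefix + "/") for prefix in UNSAFE_PREFIXES)


chosen = resolve_cache_dir(
    profile_dir="/cluster/shared/dev-hpc-compose-cache",
    defaults_dir="/cluster/shared/hpc-compose-cache",
)
```

With the settings example above and the dev profile selected, the profile path wins over the defaults path because the spec sets no cache_dir.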
runtime
runtime:
  backend: apptainer
  gpu: auto
| Field | Shape | Default | Notes |
|---|---|---|---|
| backend | pyxis, apptainer, singularity, or host | pyxis | Selects the runtime used inside Slurm steps. |
| gpu | auto, none, or nvidia | auto | For Apptainer/Singularity, controls --nv; auto enables it when Slurm GPU resources are requested. |
Backend notes:
- pyxis uses srun --container-* flags and Enroot .sqsh artifacts.
- apptainer and singularity build or reuse .sif artifacts and launch them through apptainer exec/run or singularity exec/run inside srun.
- host runs commands directly under srun; services must set command or entrypoint, and image prepare blocks, service volumes, and x-slurm.mpi.host_mpi.bind_paths are not allowed because no container bind mount is applied.
- x-enroot.prepare is a Pyxis/Enroot compatibility spelling. Prefer x-runtime.prepare for new specs, especially with Apptainer/Singularity.
x-slurm.scratch, stage_in, stage_out, and burst_buffer
x-slurm:
  scratch:
    scope: shared
    base: /scratch/$USER/jobs
    mount: /scratch
    cleanup: on_success
  stage_in:
    - from: /shared/input
      to: /scratch/input
      mode: rsync
  stage_out:
    - from: /scratch/output
      to: /shared/results/${SLURM_JOB_ID}
      when: always
      mode: copy
  burst_buffer:
    directives:
      - "#BB create_persistent name=data capacity=100G"
- scratch.base is a host path.
- scratch.mount is the container-visible mount point.
- scratch.scope is node_local or shared; cluster profiles can warn when a shared scratch path does not look shared.
- scratch.cleanup is always, on_success, or never.
- stage_in runs before services launch; stage_out runs during teardown.
- mode is rsync or copy; rsync falls back to cp -R when rsync is unavailable.
- stage_out.when is always, on_success, or on_failure.
- ${SLURM_JOB_ID} is preserved in scratch and staging paths for runtime expansion.
- burst_buffer.directives entries are emitted as raw batch-script directives and must start with #BB or #DW.
Multi-node placement rules
- x-slurm.nodes > 1 reserves a multi-node allocation.
- Helper services remain single-node steps and are pinned to the allocation’s primary node.
- When a multi-node job has exactly one service, that service defaults to the distributed full-allocation step.
- Services may use services.<name>.x-slurm.placement to select explicit allocation node indices.
- Overlapping explicit placements are rejected unless one side sets allow_overlap: true or uses share_with.
- Any service spanning more than one node may use readiness.type: sleep or readiness.type: log, or TCP/HTTP readiness only with an explicit non-local host or URL.
x-slurm.metrics
x-slurm:
  metrics:
    interval_seconds: 5
    collectors: [gpu, slurm]
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables runtime metrics sampling.
  - If the block is present and enabled is omitted, metrics sampling is enabled.
  - interval_seconds defaults to 5 and must be at least 1.
  - collectors defaults to [gpu, slurm].
  - Supported collectors:
    - gpu samples device and process telemetry through nvidia-smi
    - slurm samples job-step CPU and memory data through sstat
  - In multi-node jobs, gpu sampling launches one best-effort sampler task per allocated node and writes node metadata into GPU rows; legacy rows without node remain readable as primary-node samples.
  - Sampler files are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/metrics on the host and are also visible inside containers at /hpc-compose/job/metrics.
  - Diagnostics are written under metrics/diagnostics/ when available, including nvidia-smi topo -m, nvidia-smi -q, selected fabric/GPU environment variables, and best-effort ibstat, ibv_devinfo, ucx_info -v, and fi_info output.
  - Collector failures are best-effort and do not fail the batch job.
x-slurm.rendezvous
Client-side cross-job discovery resolves records from <cache_dir>/rendezvous/<name>/latest.json before launching services:
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  rendezvous: model-server
The mapping form supports multiple names and a timeout:
x-slurm:
  rendezvous:
    discover:
      - model-server
      - tokenizer
    timeout_seconds: 60
    require: true
Resolved records become generic variables such as HPC_COMPOSE_RDZV_URL and name-scoped variables such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL.
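For rendezvous discovery, the record location and the name-scoped variable naming can be sketched in Python. Both helpers are hypothetical. The path layout follows the <cache_dir>/rendezvous/<name>/latest.json pattern from the reference, while the variable-name normalization (uppercase, non-alphanumerics to underscores) is only inferred from the single HPC_COMPOSE_RDZV_MODEL_SERVER_URL example and may differ from the real rule.

```python
import posixpath
import re


def record_path(cache_dir, name):
    """Location of the latest provider record for one rendezvous name."""
    return posixpath.join(cache_dir, "rendezvous", name, "latest.json")


def scoped_variable(name, field):
    """Assumed naming: model-server + url -> HPC_COMPOSE_RDZV_MODEL_SERVER_URL."""
    token = re.sub(r"[^A-Za-z0-9]", "_", name).upper()
    return f"HPC_COMPOSE_RDZV_{token}_{field.upper()}"


path = record_path("/cluster/shared/hpc-compose-cache", "model-server")
variable = scoped_variable("model-server", "url")
```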
x-slurm.artifacts
x-slurm:
  artifacts:
    collect: always
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/metrics/**
    bundles:
      checkpoints:
        paths:
          - /hpc-compose/job/checkpoints/*.pt
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables tracked artifact collection.
  - collect defaults to always. Supported values are always, on_success, and on_failure.
  - export_dir is required and is resolved relative to the compose file directory when hpc-compose artifacts runs.
  - ${SLURM_JOB_ID} is preserved in export_dir until hpc-compose artifacts expands it from tracked metadata.
  - paths remains supported as the implicit default bundle.
  - bundles is optional. Bundle names must match [A-Za-z0-9_-]+, and default is reserved for top-level paths.
  - At least one source path must be present in paths or bundles.
  - Every source path must be an absolute container-visible path rooted at /hpc-compose/job.
  - Paths under /hpc-compose/job/artifacts are rejected.
  - Collection happens during batch teardown and is best-effort.
  - Collected payloads and manifest.json are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/artifacts/.
  - hpc-compose artifacts --bundle <name> exports only the selected bundle or bundles.
  - hpc-compose artifacts --tarball also writes one <bundle>.tar.gz archive per exported bundle.
  - Export writes per-bundle provenance metadata under <export_dir>/_hpc-compose/bundles/<bundle>.json.
x-slurm.resume
x-slurm:
  resume:
    path: /shared/$USER/runs/my-run
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables resume semantics.
  - path is required and must be an absolute host path.
  - /hpc-compose/... paths are rejected because path must point at shared host storage, not a container-visible path.
  - /tmp and /var/tmp technically validate, but preflight warns because those paths are not reliable resume storage.
  - When enabled, hpc-compose mounts path into every service at /hpc-compose/resume.
  - Services also receive HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME.
  - The canonical resume source is the shared path, not exported artifact bundles.
  - Attempt-specific runtime state moves under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/attempts/<attempt>/, and the top-level logs, metrics, artifacts, and state.json paths continue to point at the latest attempt for compatibility.
Allocation metadata inside services
Every service receives:
- HPC_COMPOSE_PRIMARY_NODE
- HPC_COMPOSE_NODE_COUNT
- HPC_COMPOSE_NODELIST
- HPC_COMPOSE_NODELIST_FILE
- HPC_COMPOSE_SERVICE_PRIMARY_NODE
- HPC_COMPOSE_SERVICE_NODE_COUNT
- HPC_COMPOSE_SERVICE_NODELIST
- HPC_COMPOSE_SERVICE_NODELIST_FILE
The allocation-wide data is also written under /hpc-compose/job/allocation/primary_node and /hpc-compose/job/allocation/nodes.txt. Service-scoped node lists are written under /hpc-compose/job/allocation/service-nodelists/.
Multi-node services also receive distributed launch helpers:
- HPC_COMPOSE_DIST_MASTER_ADDR
- HPC_COMPOSE_DIST_MASTER_PORT
- HPC_COMPOSE_DIST_RDZV_ENDPOINT
- HPC_COMPOSE_DIST_NNODES
- HPC_COMPOSE_DIST_NODE_RANK
- HPC_COMPOSE_DIST_LOCAL_RANK
- HPC_COMPOSE_DIST_GLOBAL_RANK
- HPC_COMPOSE_DIST_NPROC_PER_NODE
- HPC_COMPOSE_DIST_WORLD_SIZE
- HPC_COMPOSE_DIST_HOSTFILE
HPC_COMPOSE_DIST_NPROC_PER_NODE is derived from a service environment override, GPU requests, ntasks_per_node, then 1. The distributed hostfile is written under /hpc-compose/job/allocation/distributed-hostfiles/. When a discovered .hpc-compose/cluster.toml contains [distributed.env], those profile variables are injected only for multi-node services; explicit service environment values win on name conflicts and are still the durable config source.
Services that configure services.<name>.x-slurm.mpi also receive:
- HPC_COMPOSE_MPI_TYPE
- HPC_COMPOSE_MPI_PROFILE when x-slurm.mpi.profile is set
- HPC_COMPOSE_MPI_IMPLEMENTATION when x-slurm.mpi.implementation is set or implied by x-slurm.mpi.profile
- HPC_COMPOSE_MPI_HOSTFILE
The MPI hostfile is written under /hpc-compose/job/allocation/mpi-hostfiles/ and contains the service’s effective node list. When ntasks_per_node is known, each host line includes slots=<ntasks_per_node>. For a single-node service with ntasks but no ntasks_per_node, the hostfile uses slots=<ntasks>. Otherwise it emits one node per line without slots.
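The slot rules for the MPI hostfile can be sketched as below. mpi_hostfile_lines is a hypothetical helper, and the "<host> slots=<n>" line layout is an assumption borrowed from common MPI hostfile conventions; the reference only states that each host line includes slots=<n>.

```python
def mpi_hostfile_lines(nodes, ntasks_per_node=None, ntasks=None):
    """Return hostfile lines for a service's effective node list."""
    if ntasks_per_node is not None:
        slots = ntasks_per_node
    elif len(nodes) == 1 and ntasks is not None:
        slots = ntasks  # single-node service with ntasks only
    else:
        slots = None  # one node per line, no slots
    if slots is None:
        return list(nodes)
    return [f"{node} slots={slots}" for node in nodes]


multi = mpi_hostfile_lines(["node01", "node02"], ntasks_per_node=4)
single = mpi_hostfile_lines(["node01"], ntasks=8)
plain = mpi_hostfile_lines(["node01", "node02"], ntasks=8)
```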
MPI services also forward common PMI, PMIx, and Slurm rank variables into the container through Pyxis --container-env, including PMI_RANK, PMI_SIZE, PMIX_RANK, PMIX_NAMESPACE, SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS, and SLURM_TASKS_PER_NODE.
gres and gpus
When both gres and gpus are set at the same level, gres takes priority and gpus is ignored.
Service fields
| Field | Shape | Default | Notes |
|---|---|---|---|
| extends | string or mapping | omitted | Authoring-only service template reference. See extends. |
| image | string | required unless runtime.backend: host | Can be a remote image reference, a local .sqsh / .squashfs path for Pyxis, or a local .sif path for Apptainer/Singularity. |
| command | string or list of strings | omitted | Shell form or exec form. |
| entrypoint | string or list of strings | omitted | Must use the same form as command when both are present. |
| script | string | omitted | Multi-line shell script sugar for command: ["/bin/sh", "-lc", script]; mutually exclusive with command and entrypoint. |
| environment | mapping or list of KEY=VALUE strings | omitted | Both forms normalize to key/value pairs. |
| modules | list of strings | omitted | List-only shorthand for service x-env.modules.load; cannot be combined with service x-env.modules. |
| volumes | list of host_path:container_path strings | omitted | Runtime bind mounts. Host paths resolve against the compose file directory. |
| working_dir | string | omitted | Valid only when the service also has an explicit command or entrypoint. |
| depends_on | list or mapping | omitted | Dependency list with service_started or service_healthy conditions. |
| readiness | mapping | omitted | Post-launch readiness gate. |
| healthcheck | mapping | omitted | Compose-compatible sugar for a subset of readiness. Mutually exclusive with readiness. |
| assert | mapping | omitted | Post-run service contract checked during batch cleanup and surfaced in status. |
| x-env | mapping | omitted | Structured host-side module, Spack view, and environment setup for this service. |
| x-slurm | mapping | omitted | Per-service Slurm overrides. |
| x-runtime | mapping | omitted | Backend-neutral image preparation rules. |
| x-enroot | mapping | omitted | Pyxis/Enroot preparation compatibility alias. |
Image rules
Remote images
- Any image reference without an explicit :// scheme is prefixed with docker://.
- Explicit schemes are allowed only for docker://, dockerd://, and podman://.
- Other schemes are rejected.
- Shell variables in the image string are expanded at plan time.
- Unset variables expand to empty strings.
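The scheme rules for remote references can be sketched as a small normalizer. normalize_remote_image is a hypothetical name; the sketch skips shell-variable expansion and assumes the string has already been classified as a remote reference rather than a local image path.

```python
ALLOWED_SCHEMES = ("docker", "dockerd", "podman")


def normalize_remote_image(image):
    """Prefix scheme-less references with docker:// and reject other schemes."""
    if "://" not in image:
        return "docker://" + image
    scheme = image.split("://", 1)[0]
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported image scheme: {scheme}://")
    return image


default_scheme = normalize_remote_image("python:3.12-slim")
```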
Local images
- Pyxis local image paths must point to .sqsh or .squashfs files.
- Apptainer/Singularity local image paths must point to .sif files.
- Relative paths are resolved against the compose file directory.
- Paths that look like build contexts are rejected.
command, entrypoint, and script
Both command and entrypoint accept either:
- a string, interpreted as shell form
- a list of strings, interpreted as exec form
Rules:
- If both fields are present, they must use the same form.
- Mixed string/array combinations are rejected.
- If neither field is present, the image default entrypoint and command are used.
- If working_dir is set, at least one of command or entrypoint must also be set.
- A multi-line string-form command is automatically normalized to ["/bin/sh", "-lc", command] so YAML block scalars run as one shell script.
- Single-line string-form command remains shell form.
- script is a convenience field for multi-line shell snippets and normalizes to command: ["/bin/sh", "-lc", script].
- script cannot be combined with command or entrypoint.
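The command and script normalization can be sketched in a few lines of Python. normalize_command is a hypothetical helper that ignores entrypoint, and treating a string as multi-line only when a newline remains after stripping surrounding whitespace is an assumption about how block scalars with a trailing newline are classified.

```python
SHELL = ["/bin/sh", "-lc"]


def normalize_command(command=None, script=None):
    """Return the effective command for a service (entrypoint not modeled)."""
    if script is not None:
        if command is not None:
            raise ValueError("script cannot be combined with command")
        return SHELL + [script]
    if isinstance(command, str) and "\n" in command.strip():
        return SHELL + [command]  # multi-line string runs as one shell script
    return command  # single-line shell form or exec form stays as written


multi_line = normalize_command("cd /workspace\npython train.py\n")
```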
environment
Accepted forms:
environment:
  APP_ENV: prod
  LOG_LEVEL: info

environment:
  - APP_ENV=prod
  - LOG_LEVEL=info
Rules:
- List items must use KEY=VALUE syntax.
- .env from the compose file directory is loaded automatically when present.
- Shell environment variables override .env; .env fills only missing variables.
- environment, x-runtime.prepare.env, and compatibility x-enroot.prepare.env values support $VAR, ${VAR}, ${VAR:-default}, and ${VAR-default} interpolation.
- Missing variables without defaults are errors.
- Use $$ for a literal dollar sign in interpolated fields.
- String-form shell snippets are still literal. For example, $PATH inside a string-form command is not expanded at plan time.
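The precedence and interpolation rules above can be illustrated with a small Python sketch. The helper names `resolve_variables` and `interpolate` are hypothetical, and the distinction between `:-` (default when unset or empty) and `-` (default only when unset) is an assumption borrowed from POSIX shell and Compose conventions rather than something this reference states.

```python
import re

# Matches $$, ${NAME}, ${NAME:-default}, ${NAME-default}, and $NAME.
_TOKEN = re.compile(
    r"\$(?:"
    r"(?P<escaped>\$)"
    r"|\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)(?:(?P<op>:?-)(?P<default>[^}]*))?\}"
    r"|(?P<plain>[A-Za-z_][A-Za-z0-9_]*)"
    r")"
)


def resolve_variables(dotenv: dict, shell_env: dict) -> dict:
    """Shell variables override .env; .env fills only missing names."""
    return {**dotenv, **shell_env}


def interpolate(value: str, variables: dict) -> str:
    """Expand variable references in one interpolated field."""

    def replace(match):
        if match.group("escaped"):
            return "$"  # $$ is a literal dollar sign
        name = match.group("braced") or match.group("plain")
        op = match.group("op")
        current = variables.get(name)
        if op == ":-":
            # Assumed semantics: default applies when unset or empty.
            return current if current else match.group("default")
        if op == "-":
            # Assumed semantics: default applies only when unset.
            return current if current is not None else match.group("default")
        if current is None:
            raise KeyError(f"missing variable without default: {name}")
        return current

    return _TOKEN.sub(replace, value)
```

With `.env` providing `APP_ENV=dev` and the shell exporting `APP_ENV=prod`, the merged value is `prod`, and `interpolate("cost: $$5", {})` returns `cost: $5`.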
volumes
Accepted form:
volumes:
  - ./app:/workspace
  - /shared/data:/data
  - /shared/reference:/reference:ro
Rules:
- Host paths are resolved against the compose file directory.
- Runtime mounts accept host_path:container_path and host_path:container_path:ro|rw.
- Pyxis mounts are passed through srun --container-mounts=...; Apptainer/Singularity mounts are passed as --bind.
- Every service also gets an automatic shared mount at /hpc-compose/job, backed by ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID} on the host.
- /hpc-compose/job is reserved and cannot be used as an explicit volume destination.
Warning
If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.
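As a rough illustration of the mount syntax and the reserved destination rule, the following Python sketch parses one volume entry. `parse_mount` is a hypothetical name, and treating an omitted mode as `rw` is an assumption, not documented behavior.

```python
RESERVED_DESTINATION = "/hpc-compose/job"


def parse_mount(spec: str) -> tuple:
    """Split host_path:container_path[:ro|rw] into (host, container, mode)."""
    parts = spec.split(":")
    if len(parts) == 2:
        host, container = parts
        mode = "rw"  # assumption: read-write when no mode is given
    elif len(parts) == 3 and parts[2] in ("ro", "rw"):
        host, container, mode = parts
    else:
        raise ValueError(f"invalid volume entry: {spec!r}")
    if container.rstrip("/") == RESERVED_DESTINATION:
        raise ValueError(f"{RESERVED_DESTINATION} is reserved")
    return host, container, mode
```

Here `parse_mount("/shared/reference:/reference:ro")` returns `("/shared/reference", "/reference", "ro")`, and any entry targeting `/hpc-compose/job` is rejected.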
depends_on
Accepted forms:
depends_on:
  - redis

depends_on:
  redis:
    condition: service_started

depends_on:
  redis:
    condition: service_healthy
Rules:
- List form means condition: service_started.
- Map form accepts condition: service_started, condition: service_healthy, and condition: service_completed_successfully.
- service_healthy requires the dependency service to define readiness.
- service_started waits only for the dependency process to be launched and still alive.
- service_healthy waits for the dependency readiness check to succeed.
- service_completed_successfully waits for the dependency to exit with status 0 before launching the dependent service, which is useful for one-shot DAG stages such as preprocess -> train -> postprocess.
readiness
Supported types:
Sleep
readiness:
  type: sleep
  seconds: 5
- seconds is required.
TCP
readiness:
  type: tcp
  host: 127.0.0.1
  port: 6379
  timeout_seconds: 30
- host defaults to 127.0.0.1.
- timeout_seconds defaults to 60.
Log
readiness:
  type: log
  pattern: "Server started"
  timeout_seconds: 60
- timeout_seconds defaults to 60.
HTTP
readiness:
  type: http
  url: http://127.0.0.1:8080/health
  status_code: 200
  timeout_seconds: 30
- status_code defaults to 200.
- timeout_seconds defaults to 60.
- The readiness check polls the URL through curl.
healthcheck
healthcheck is accepted as migration sugar and is normalized into the readiness model.
services:
  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "nc", "-z", "127.0.0.1", "6379"]
      timeout: 30s
Rules:
- healthcheck and readiness are mutually exclusive.
- Supported probe forms are a constrained subset:
  - ["CMD", "nc", "-z", HOST, PORT]
  - ["CMD-SHELL", "nc -z HOST PORT"]
  - recognized curl probes against http:// or https:// URLs
  - recognized wget --spider probes against http:// or https:// URLs
- timeout maps to timeout_seconds.
- disable: true disables readiness for that service.
- interval, retries, and start_period are parsed but rejected in v1.
- HTTP-style healthchecks normalize to readiness.type: http with status_code: 200.
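To make the normalization concrete, here is a hedged Python sketch that maps the two nc probe forms onto a TCP readiness block. `normalize_healthcheck` is a hypothetical function; it covers only the nc forms (not curl or wget), and it assumes timeout values are whole seconds written like 30s.

```python
import re
import shlex


def _timeout_seconds(value: str) -> int:
    # Assumption: only whole-second durations such as "30s" are handled.
    match = re.fullmatch(r"(\d+)s", value)
    if not match:
        raise ValueError(f"unsupported timeout: {value!r}")
    return int(match.group(1))


def normalize_healthcheck(healthcheck: dict):
    """Return an equivalent readiness mapping, or None when disabled."""
    if healthcheck.get("disable") is True:
        return None
    for key in ("interval", "retries", "start_period"):
        if key in healthcheck:
            raise ValueError(f"{key} is rejected in v1")
    test = healthcheck["test"]
    if test[0] == "CMD":
        argv = list(test[1:])
    elif test[0] == "CMD-SHELL" and len(test) == 2:
        argv = shlex.split(test[1])
    else:
        raise ValueError("unsupported healthcheck test form")
    if len(argv) != 4 or argv[:2] != ["nc", "-z"]:
        raise ValueError("this sketch only covers nc -z HOST PORT probes")
    readiness = {"type": "tcp", "host": argv[2], "port": int(argv[3])}
    if "timeout" in healthcheck:
        readiness["timeout_seconds"] = _timeout_seconds(healthcheck["timeout"])
    return readiness
```

Applied to the redis example above, the sketch produces a tcp readiness check on 127.0.0.1:6379 with timeout_seconds: 30.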
assert
assert defines post-run contracts for a service. Checks run in the rendered script’s cleanup() after services are reaped and before artifact collection or stage-out. Any failed assertion marks the job failed, even when the service uses x-slurm.failure_policy.mode: ignore.
services:
  train:
    image: trainer:latest
    command: python train.py
    assert:
      exit_code: 0
      artifacts_contain: "model/*.pt"
      max_duration_seconds: 7200
| Field | Shape | Notes |
|---|---|---|
| exit_code | integer 0..255 | Expected final service exit code. |
| artifacts_contain | string | Glob that must match at least one path. Relative patterns resolve under /hpc-compose/job; absolute patterns must stay under /hpc-compose/job. |
| max_duration_seconds | positive integer | Maximum wall-clock seconds from first service launch to terminal service exit, including restart time. |
At least one assertion field is required. Assertion results are written into runtime state.json; hpc-compose status --format json includes them under each service’s assertions object.
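The three assertion fields can be pictured as simple predicates over the finished service. The Python sketch below is illustrative only: `evaluate_assertions` and its parameters are hypothetical, the real checks run as shell inside the rendered script, and treating the duration limit as inclusive is an assumption.

```python
import glob
import os


def evaluate_assertions(spec: dict, exit_code: int, duration_seconds: float,
                        job_dir: str = "/hpc-compose/job") -> dict:
    """Return a pass/fail result for each assertion field present in spec."""
    results = {}
    if "exit_code" in spec:
        results["exit_code"] = exit_code == spec["exit_code"]
    if "artifacts_contain" in spec:
        pattern = spec["artifacts_contain"]
        if os.path.isabs(pattern):
            # Absolute patterns must stay under the shared job directory.
            if not pattern.startswith(job_dir.rstrip("/") + "/"):
                raise ValueError("absolute pattern escapes the job directory")
        else:
            pattern = os.path.join(job_dir, pattern)
        results["artifacts_contain"] = bool(glob.glob(pattern))
    if "max_duration_seconds" in spec:
        # Assumption: a run that takes exactly the limit still passes.
        results["max_duration_seconds"] = duration_seconds <= spec["max_duration_seconds"]
    if not results:
        raise ValueError("at least one assertion field is required")
    return results
```

A job would be marked failed in this model whenever any value in the returned mapping is False.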
Service-level x-slurm
These fields live under services.<name>.x-slurm.
| Field | Shape | Default | Notes |
|---|---|---|---|
| nodes | positive integer | omitted | Legacy shorthand: 1 for a helper step, or the full top-level allocation node count for a full-allocation distributed service. Partial multi-node counts require placement.node_count. |
| placement | mapping | omitted | Explicit node-index placement inside the allocation. |
| ntasks | positive integer | omitted | Adds --ntasks to that service’s srun. |
| ntasks_per_node | positive integer | omitted | Adds --ntasks-per-node to that service’s srun. |
| cpus_per_task | positive integer | omitted | Adds --cpus-per-task to that service’s srun. |
| gpus | positive integer | omitted | Adds --gpus when gres is not set. |
| gres | string | omitted | Adds --gres to that service’s srun. Takes priority over gpus. |
| gpus_per_node | positive integer | omitted | Adds --gpus-per-node to that service’s srun. |
| gpus_per_task | positive integer | omitted | Adds --gpus-per-task to that service’s srun. |
| cpus_per_gpu | positive integer | omitted | Adds --cpus-per-gpu to that service’s srun. |
| mem_per_gpu | string | omitted | Adds --mem-per-gpu to that service’s srun. |
| gpu_bind | string | omitted | Adds --gpu-bind to that service’s srun. |
| cpu_bind | string | omitted | Adds --cpu-bind to that service’s srun. |
| mem_bind | string | omitted | Adds --mem-bind to that service’s srun. |
| distribution | string | omitted | Adds --distribution to that service’s srun. |
| hint | string | omitted | Adds --hint to that service’s srun. |
| time_limit | string | omitted | Advisory per-service time limit. Validated against Slurm time formats but not passed to srun. inspect surfaces warnings when the limit exceeds allocation time or conflicts with dependencies. Accepted formats: MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS. |
| extra_srun_args | list of strings | omitted | Appended directly to the service’s srun command. |
| mpi | mapping | omitted | Adds first-class MPI launch metadata and srun --mpi=<type>. |
| failure_policy | mapping | omitted | Per-service failure handling (fail_job, ignore, restart_on_failure). |
| prologue | string or mapping | omitted | Per-service shell hook run before each launch attempt. String shorthand runs on the host. |
| epilogue | string or mapping | omitted | Per-service shell hook run after each service exit attempt. String shorthand runs on the host. |
| hooks | list of mappings | omitted | Host-side event hooks for failure-policy transitions such as accepted restarts and crash-loop window exhaustion. |
| rendezvous | mapping | omitted | Provider registration config for cross-job service discovery. |
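The accepted time_limit formats listed in the table can be checked and converted to seconds with a short Python sketch. `time_limit_seconds` is a hypothetical helper for illustration; it follows the six listed Slurm formats and does not claim to mirror the tool's own validator.

```python
import re

_WITH_DAYS = re.compile(r"^(\d+)-(\d+)(?::(\d+))?(?::(\d+))?$")  # D-HH[:MM[:SS]]
_NO_DAYS = re.compile(r"^(\d+)(?::(\d+))?(?::(\d+))?$")          # MM, MM:SS, HH:MM:SS


def time_limit_seconds(value: str) -> int:
    """Convert a Slurm-style time limit string to seconds."""
    match = _WITH_DAYS.match(value)
    if match:
        days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
        return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    match = _NO_DAYS.match(value)
    if match:
        fields = [int(g) for g in match.groups() if g is not None]
        if len(fields) == 1:      # MM
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:    # MM:SS
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:                     # HH:MM:SS
            hours, minutes, seconds = fields
        return (hours * 60 + minutes) * 60 + seconds
    raise ValueError(f"invalid time limit: {value!r}")
```

An advisory check in this spirit could compare `time_limit_seconds(service_limit)` with `time_limit_seconds(allocation_time)` and warn when the former is larger.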
services.<name>.x-slurm.rendezvous
Provider-side registration writes an atomic shared-cache record after readiness succeeds when readiness is configured:
services:
  model:
    image: python:3.12-slim
    command: python -m http.server 8000
    readiness:
      type: tcp
      port: 8000
    x-slurm:
      rendezvous:
        register:
          name: model-server
          port: 8000
          protocol: http
          path: /
          ttl_seconds: 3600
Names are single safe path components using ASCII letters, digits, ., _, and -. Rendezvous is same-cluster shared-storage coordination only; it does not provide DNS, tunneling, or authentication.
services.<name>.x-slurm.prologue / epilogue
services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      prologue: |
        module load cuda/12.1
        nvidia-smi
      epilogue:
        context: container
        script: |
          tar czf /shared/logs-${SLURM_JOB_ID}.tar.gz /hpc-compose/job/logs
- Shape: either a block string, or a mapping with script and optional context.
- context: host (default) or container.
- Hook scripts are emitted as trusted shell and are not Compose-interpolated, so runtime variables such as ${SLURM_JOB_ID} are preserved.
- Hooks run once per service launch attempt, including restart_on_failure retries.
- Host hooks run in the generated batch supervisor on the allocation’s primary execution context. Container hooks wrap the service command inside the container and can use /hpc-compose/job.
- Hook stdout/stderr is written to the service log.
- Container hooks require an explicit command or entrypoint; image-default services cannot be wrapped.
services.<name>.x-slurm.hooks
services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      failure_policy:
        mode: restart_on_failure
      hooks:
        - on: restart
          context: host
          script: |
            echo "Service $HPC_COMPOSE_SERVICE_NAME restarted (attempt $HPC_COMPOSE_ATTEMPT)" >> /shared/restart.log
        - on: window_exhausted
          script: |
            curl -X POST "$WEBHOOK_URL" -d '{"alert": "crash loop detected"}'
- Shape: list of mappings with on, script, and optional context.
- on: restart or window_exhausted.
- context: host only. Omitted context defaults to host; container is rejected for event hooks.
- restart runs after a non-zero exit has passed the lifetime and rolling-window guards, after restart counters are recorded, and before backoff/relaunch.
- window_exhausted runs only when the rolling-window guard blocks another restart. It does not run for lifetime max_restarts exhaustion.
- Event hooks are best-effort observability hooks. A non-zero hook exit is logged to the service log and does not change the restart or failure-policy outcome.
- Event hook scripts are emitted as trusted shell and are not Compose-interpolated.
- Event hooks receive HPC_COMPOSE_HOOK_PHASE, HPC_COMPOSE_SERVICE_NAME, HPC_COMPOSE_SERVICE_LOG, HPC_COMPOSE_SERVICE_EXIT_CODE, HPC_COMPOSE_ATTEMPT, HPC_COMPOSE_RESTART_COUNT, HPC_COMPOSE_MAX_RESTARTS, HPC_COMPOSE_WINDOW_SECONDS, HPC_COMPOSE_MAX_RESTARTS_IN_WINDOW, and HPC_COMPOSE_RESTART_FAILURES_IN_WINDOW.
services.<name>.x-slurm.placement
services:
  a:
    image: app:a
    x-slurm:
      placement: { node_range: "0-3" }
  b:
    image: app:b
    x-slurm:
      placement: { node_range: "4-7" }
  ps:
    image: app:b
    x-slurm:
      placement: { share_with: b }
Exactly one selector is required:
| Field | Shape | Notes |
|---|---|---|
| node_range | string | Zero-based inclusive allocation indices, for example "0-3" or "0-3,6". |
| node_count | integer | Selects this many eligible nodes starting at start_index, default 0. |
| node_percent | integer 1..100 | Selects ceil(percent * eligible_nodes / 100), minimum one node. |
| share_with | string | Reuses another service’s resolved node set for explicit co-location. |
Optional fields:
- start_index: applies to node_count and node_percent.
- exclude: zero-based allocation indices removed from the eligible set and passed to srun --exclude.
- allow_overlap: permits intentional overlap with another explicit placement.
Node indices are resolved against the Slurm allocation order from scontrol show hostnames "$SLURM_JOB_NODELIST". At runtime, containers receive both allocation-wide metadata (HPC_COMPOSE_NODELIST) and service-scoped metadata (HPC_COMPOSE_SERVICE_NODELIST, HPC_COMPOSE_SERVICE_NODELIST_FILE, HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT).
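The selector arithmetic can be sketched in Python as follows. All function names are hypothetical, and interpreting start_index as a position within the eligible list (after exclude is applied) is an assumption; the reference does not spell that detail out.

```python
def parse_node_range(spec: str) -> list:
    """Expand a range string such as "0-3,6" into sorted node indices."""
    indices = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            indices.update(range(int(start), int(end) + 1))  # inclusive end
        else:
            indices.add(int(part))
    return sorted(indices)


def eligible_nodes(allocation_size: int, exclude=()) -> list:
    """Allocation indices that remain after removing excluded ones."""
    excluded = set(exclude)
    return [i for i in range(allocation_size) if i not in excluded]


def select_by_count(eligible: list, node_count: int, start_index: int = 0) -> list:
    selected = eligible[start_index:start_index + node_count]
    if len(selected) < node_count:
        raise ValueError("not enough eligible nodes")
    return selected


def select_by_percent(eligible: list, node_percent: int, start_index: int = 0) -> list:
    # Integer ceiling of percent * eligible_nodes / 100, minimum one node.
    count = max(1, (node_percent * len(eligible) + 99) // 100)
    return select_by_count(eligible, count, start_index)
```

On an eight-node allocation, node_percent: 25 selects two nodes in this model, and a node_range of "0-3,6" expands to indices 0, 1, 2, 3, and 6.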
services.<name>.x-slurm.mpi
services:
  trainer:
    image: mpi-image:latest
    command: /usr/local/bin/train
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      mpi:
        type: pmix_v4
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 8
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_DIR: /opt/site/openmpi
- Shape: mapping
- Default: omitted
- type is an exact srun --mpi=<type> plugin token. Common values include pmix, pmix_v4, pmi2, pmi1, and openmpi; use srun --mpi=list or hpc-compose doctor cluster-report on the target cluster to discover site-specific values.
- Notes:
  - Rendered as --mpi=<type> on the service’s srun command.
  - profile is optional compatibility metadata used for validation, cluster-profile diagnostics, and doctor mpi-smoke output. Supported values are openmpi, mpich, and intel_mpi.
  - profile does not auto-select or rewrite type; use the exact token that your cluster reports through srun --mpi=list.
  - launcher defaults to srun; v1 rejects other launchers.
  - implementation is optional metadata for diagnostics. Supported values are openmpi, mpich, intel_mpi, mvapich2, cray_mpi, hpe_mpi, and unknown.
  - When both profile and implementation are set, they must describe the same MPI family.
  - expected_ranks, when set, must match the resolved Slurm task geometry.
  - host_mpi.bind_paths uses host_path:container_path[:ro|rw] syntax, is validated like service volumes, and is automatically mounted into the service.
  - host_mpi.env is injected into the service environment after normal service environment entries.
  - Cannot be combined with raw --mpi... entries in extra_srun_args.
  - MPI services receive HPC_COMPOSE_MPI_TYPE and HPC_COMPOSE_MPI_HOSTFILE.
  - MPI services also receive HPC_COMPOSE_MPI_PROFILE when profile is set and HPC_COMPOSE_MPI_IMPLEMENTATION when implementation is set or implied by profile.
  - hpc-compose doctor mpi-smoke -f compose.yaml --service trainer renders a smoke probe for the service; add --submit to run it through Slurm.
  - hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto extends the same pattern with NCCL, UCX, OFI, and InfiniBand diagnostics when available. Smoke plans keep allocation and MPI launch settings, but strip application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.
Profile-specific compatibility checks are intentionally conservative:
- profile: openmpi expects a PMIx-capable type such as pmix or pmix_v*, with pmi2 accepted as a fallback.
- profile: mpich expects pmi2 or a PMIx-capable setup.
- profile: intel_mpi expects pmi2; preflight and doctor warn when no I_MPI_PMI_LIBRARY or cluster-profile PMI2 library is visible.
services.<name>.x-slurm.failure_policy
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3
| Field | Shape | Default | Notes |
|---|---|---|---|
| mode | fail_job \| ignore \| restart_on_failure | fail_job | fail_job keeps fail-fast behavior. ignore keeps the job running after non-zero exits. restart_on_failure restarts on non-zero exits only. |
| max_restarts | integer | 3 when mode=restart_on_failure | Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| backoff_seconds | integer | 5 when mode=restart_on_failure | Fixed delay between restart attempts. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| window_seconds | integer | 60 when mode=restart_on_failure | Rolling window for counting restart-triggering exits. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| max_restarts_in_window | integer | resolved max_restarts when mode=restart_on_failure | Maximum restart-triggering exits allowed within window_seconds. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
Rules:
- In a multi-node allocation, implicit helper services are pinned to HPC_COMPOSE_PRIMARY_NODE.
- Explicit service placements may not overlap unless one side sets placement.allow_overlap: true or uses placement.share_with.
- max_restarts, backoff_seconds, window_seconds, and max_restarts_in_window are rejected unless mode: restart_on_failure.
- Restart attempts count relaunches after the initial launch.
- Restarts trigger only for non-zero exits.
- restart_on_failure enforces both a lifetime cap (max_restarts) and a rolling-window cap (max_restarts_in_window within window_seconds) during one live batch-script execution.
- If you omit the rolling-window fields, restart_on_failure still enables default crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
- Services configured with mode: ignore cannot be used as dependencies in depends_on.
Examples:
Use the defaults when you only need bounded retries:
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
That resolves to:
- max_restarts: 3
- backoff_seconds: 5
- window_seconds: 60
- max_restarts_in_window: 3
Use explicit fields when you need a larger lifetime budget but still want a tighter crash-loop guard:
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 8
        backoff_seconds: 10
        window_seconds: 60
        max_restarts_in_window: 3
Semantics:
- The initial launch does not count as a restart.
- restart_count counts granted relaunches after the initial launch.
- max_restarts_in_window counts restart-triggering non-zero exits whose timestamps still satisfy now - event < window_seconds.
- If a non-zero exit would exceed the rolling-window cap, the job fails immediately and that blocked exit is not recorded as a consumed restart.
- Successful exits do not trigger restarts and do not add entries to the rolling window.
- The rolling window is attempt-local to one live batch-script execution. It is not hydrated from prior state.json, resume metadata, or Slurm requeue history.
- x-slurm.hooks can observe accepted restart events and blocked window_exhausted events without changing the policy decision.
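The interaction between the lifetime cap and the rolling window is easier to follow as a small simulation. The Python class below is a sketch under stated assumptions: `RestartGuard` and its return labels are hypothetical, the real logic lives in the rendered batch script, and checking the lifetime cap before the window cap is an assumed ordering.

```python
class RestartGuard:
    """Toy model of restart_on_failure accounting for one service."""

    def __init__(self, max_restarts=3, window_seconds=60, max_restarts_in_window=None):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        # Default mirrors the reference: resolved max_restarts.
        self.max_restarts_in_window = (
            max_restarts if max_restarts_in_window is None else max_restarts_in_window
        )
        self.restart_count = 0   # granted relaunches after the initial launch
        self.window_events = []  # timestamps of restart-triggering exits

    def on_exit(self, exit_code: int, now: float) -> str:
        if exit_code == 0:
            # Successful exits never restart and never enter the window.
            return "completed"
        # Keep only events that satisfy now - event < window_seconds.
        self.window_events = [t for t in self.window_events
                              if now - t < self.window_seconds]
        if self.restart_count >= self.max_restarts:
            return "max_restarts_exhausted"
        if len(self.window_events) >= self.max_restarts_in_window:
            # Blocked exit is not recorded as a consumed restart.
            return "window_exhausted"
        self.window_events.append(now)
        self.restart_count += 1
        return "restart"
```

With max_restarts: 8, window_seconds: 60, and max_restarts_in_window: 3, failures at 0 s, 10 s, and 20 s are each restarted, while a fourth failure at 30 s is blocked by the window guard. If that fourth failure arrives at 61 s instead, the first event has aged out and the restart is granted.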
Tracked state:
- status --format json includes failure_policy_mode, restart_count, max_restarts, window_seconds, max_restarts_in_window, restart_failures_in_window, and last_exit_code for each tracked service.
- Text status renders the live rolling-window budget as window=<current>/<max>@<seconds>s.
Unknown keys under top-level x-slurm or per-service x-slurm cause hard errors.
x-runtime.prepare and x-enroot.prepare
x-runtime.prepare lets a service build a prepared runtime image from its base image before submission. x-enroot.prepare remains accepted as a Pyxis-only compatibility spelling.
services:
  app:
    image: python:3.11-slim
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir numpy pandas
        mounts:
          - ./requirements.txt:/tmp/requirements.txt
        env:
          PIP_CACHE_DIR: /tmp/pip-cache
        root: true
| Field | Shape | Default | Notes |
|---|---|---|---|
| commands | list of strings | required when prepare is present | Each command runs through the selected backend’s writable prepare flow. |
| mounts | list of host_path:container_path strings | omitted | Visible only during prepare. Relative host paths resolve against the compose file directory. |
| env | mapping or list of KEY=VALUE strings | omitted | Passed only during prepare. Values support the same interpolation rules as environment. |
| root | boolean | true | Controls whether prepare commands request root/fakeroot behavior where the backend supports it. |
Rules:
- If x-runtime.prepare or x-enroot.prepare is present, commands cannot be empty.
- A service may not set both spellings.
- x-enroot.prepare is rejected when runtime.backend is not pyxis.
- If prepare.mounts is non-empty, the service rebuilds on every prepare or up.
- Remote base images are imported under cache_dir/base.
- Prepared images are exported under cache_dir/prepared.
- Unknown keys under x-runtime, x-enroot, or prepare cause hard errors.
Unsupported Compose keys
These keys are rejected with explicit messages:
- build
- ports
- networks
- network_mode
- Compose restart (use services.<name>.x-slurm.failure_policy)
- deploy
Any other unknown key at the service level is also rejected.
Migration to Spec v2
This page is reserved for the first breaking hpc-compose spec release. Current hpc-compose builds support spec version 1; use version: "1" or omit the field for v1 specs.
Known v2 migration hint:
- steps was renamed to services in v2. Rename top-level steps: to services: before validating with a v2-aware hpc-compose build.
Example Source
This appendix embeds the runnable repository example YAML files directly from examples/.
Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the settings/builtin cache default. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
App Redis Worker
Source: examples/app-redis-worker.yaml
name: redis-demo
x-slurm:
  job_name: redis-demo
  time: "00:15:00"
  mem: 8G
  cpus_per_task: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  redis:
    image: redis:7
    command: redis-server --save "" --appendonly no
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 6379
      timeout_seconds: 30
    x-slurm:
      cpus_per_task: 1
  worker:
    image: redis:7
    depends_on:
      redis:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        redis-cli -h 127.0.0.1 ping
        while true; do
          redis-cli -h 127.0.0.1 incr jobs
          sleep 2
        done
    x-slurm:
      cpus_per_task: 1
Canary Right Size
Source: examples/canary-right-size.yaml
name: canary-right-size
x-slurm:
  job_name: canary-right-size
  partition: gpu
  time: "04:00:00"
  mem: 64G
  gpus: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  metrics:
    enabled: true
    interval_seconds: 10
services:
  trainer:
    image: python:3.12-slim
    command:
      - /bin/sh
      - -lc
      - |
        python - <<'PY'
        import time
        data = bytearray(512 * 1024 * 1024)
        print(f"allocated {len(data)} bytes")
        time.sleep(20)
        PY
    x-slurm:
      cpus_per_task: 8
Dev Python App
Source: examples/dev-python-app.yaml
name: dev-python-app
x-slurm:
  job_name: dev-python-app
  time: "00:30:00"
  mem: 8G
  cpus_per_task: 2
services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -m
      - main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir fastapi uvicorn openai
Dev Python Smoke
Source: examples/dev-python-smoke.yaml
name: dev-python-smoke
x-slurm:
  job_name: dev-python-smoke
  time: "00:01:00"
  mem: 2G
  cpus_per_task: 1
services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -c
      - "import main; print('smoke ok', flush=True)"
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir fastapi uvicorn openai
Fairseq Preprocess
Source: examples/fairseq-preprocess.yaml
name: fairseq-preprocess
x-slurm:
  job_name: fairseq-preprocess
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  preprocess:
    image: python:3.11-slim
    volumes:
      - /shared/$USER/data/raw:/data/raw
      - /shared/$USER/data/processed:/data/processed
    environment:
      INPUT_DIR: /data/raw
      OUTPUT_DIR: /data/processed
      NUM_WORKERS: "8"
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import os, json, hashlib, multiprocessing
        from pathlib import Path
        from concurrent.futures import ProcessPoolExecutor
        input_dir = Path(os.environ['INPUT_DIR'])
        output_dir = Path(os.environ['OUTPUT_DIR'])
        num_workers = int(os.environ['NUM_WORKERS'])
        output_dir.mkdir(parents=True, exist_ok=True)
        files = sorted(input_dir.glob('*.txt'))
        if not files:
            print(f'No .txt files found in {input_dir}')
            exit(1)
        print(f'Found {len(files)} input files')
        def process_file(path):
            text = path.read_text(encoding='utf-8', errors='replace')
            lines = [l.strip() for l in text.splitlines() if l.strip()]
            tokens = []
            for line in lines:
                tokens.extend(line.lower().split())
            out = output_dir / f'{path.stem}.jsonl'
            with open(out, 'w') as f:
                for i, line in enumerate(lines):
                    record = {
                        'id': f'{path.stem}_{i}',
                        'text': line,
                        'tokens': len(line.split()),
                    }
                    f.write(json.dumps(record) + '\n')
            return path.name, len(lines), len(tokens)
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            results = list(pool.map(process_file, files))
        total_lines = sum(r[1] for r in results)
        total_tokens = sum(r[2] for r in results)
        for name, lines, tokens in results:
            print(f' {name}: {lines} lines, {tokens} tokens')
        print(f'Total: {total_lines} lines, {total_tokens} tokens across {len(files)} files')
        manifest = {
            'files': len(files),
            'total_lines': total_lines,
            'total_tokens': total_tokens,
        }
        (output_dir / 'manifest.json').write_text(json.dumps(manifest, indent=2))
        print('Preprocessing complete')
        "
    x-slurm:
      cpus_per_task: 8
Llama App
Source: examples/llama-app.yaml
name: llama-stack
x-slurm:
  job_name: llama-stack
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - exec /app/llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 8080
      timeout_seconds: 60
    x-slurm:
      gpus: 1
      cpus_per_task: 4
  app:
    image: python:3.11-slim
    depends_on:
      llama:
        condition: service_healthy
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080/v1
    command:
      - python
      - -m
      - main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir openai fastapi uvicorn
    x-slurm:
      cpus_per_task: 2
Llama UV Worker
Source: examples/llama-uv-worker.yaml
name: llama-uv-worker
x-slurm:
  job_name: llama-uv-worker
  time: "01:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    environment:
      GGUF_MODEL_PATH: /models/model.gguf
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m "$$GGUF_MODEL_PATH" --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4
  worker:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./llama-uv-worker:/workspace
    depends_on:
      llama:
        condition: service_healthy
    environment:
      OPENAI_BASE_URL: http://127.0.0.1:8080/v1
      MODEL_NAME: local-model
      REQUEST_DONE_PATH: /hpc-compose/job/request.done
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        UV_CACHE_DIR=/hpc-compose/job/.uv-cache uv run worker.py
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir uv
    x-slurm:
      cpus_per_task: 2
LLM Curl Workflow
Source: examples/llm-curl-workflow.yaml
name: llm-curl-workflow
x-slurm:
  job_name: llm-curl-workflow
  time: "00:30:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4
  curl_client:
    image: debian:bookworm-slim
    depends_on:
      llm:
        condition: service_healthy
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        cat >/tmp/request.json <<'JSON'
        {
          "model": "local-model",
          "messages": [
            {
              "role": "system",
              "content": "You are a concise assistant."
            },
            {
              "role": "user",
              "content": "Explain what readiness checks do in one sentence."
            }
          ],
          "temperature": 0.2,
          "max_tokens": 64
        }
        JSON
        echo "Sending test request to $$LLM_BASE_URL/v1/chat/completions"
        curl --fail --show-error --silent \
          -H 'Content-Type: application/json' \
          --data @/tmp/request.json \
          "$$LLM_BASE_URL/v1/chat/completions"
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends bash ca-certificates curl
          - rm -rf /var/lib/apt/lists/*
    x-slurm:
      cpus_per_task: 1
LLM Curl Workflow Workdir
Source: examples/llm-curl-workflow-workdir.yaml
name: llm-curl-workflow
x-slurm:
  job_name: llm-curl-workflow
  time: "00:30:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  # Uncomment if your cluster requires them.
  # partition: gpu
  # account: my-project
  # Set CACHE_DIR to a path visible from the submission host and compute nodes.
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    environment:
      MODEL_FILE: model.gguf
    volumes:
      - $HOME/models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m /models/$$MODEL_FILE --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4
  curl_client:
    image: debian:bookworm-slim
    depends_on:
      llm:
        condition: service_healthy
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        cat >/tmp/request.json <<'JSON'
        {
          "model": "local-model",
          "messages": [
            {
              "role": "system",
              "content": "You are a concise assistant."
            },
            {
              "role": "user",
              "content": "Explain what readiness checks do in one sentence."
            }
          ],
          "temperature": 0.2,
          "max_tokens": 64
        }
        JSON
        echo "Sending test request to $$LLM_BASE_URL/v1/chat/completions"
        curl --fail --show-error --silent \
          -H 'Content-Type: application/json' \
          --data @/tmp/request.json \
          "$$LLM_BASE_URL/v1/chat/completions"
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends bash ca-certificates curl
          - rm -rf /var/lib/apt/lists/*
    x-slurm:
      cpus_per_task: 1
Minimal Batch
Source: examples/minimal-batch.yaml
name: minimal-batch
x-slurm:
  job_name: minimal-batch
  time: "00:10:00"
  mem: 4G
  cpus_per_task: 2
services:
  app:
    image: python:3.11-slim
    command: python -c "print('Hello from Slurm!')"
MPI Hello
Source: examples/mpi-hello.yaml
name: mpi-hello
x-slurm:
  job_name: mpi-hello
  time: "00:15:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  mpi:
    image: debian:bookworm-slim
    command:
      - /bin/sh
      - -lc
      - /usr/local/bin/mpi_hello
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends libopenmpi-dev openmpi-bin gcc
          - |
            cat > /tmp/hello.c << 'EOF'
            #include <mpi.h>
            #include <stdio.h>
            int main(int argc, char **argv) {
                MPI_Init(&argc, &argv);
                int rank, size;
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
                printf("Hello from rank %d of %d\n", rank, size);
                MPI_Finalize();
                return 0;
            }
            EOF
            mpicc /tmp/hello.c -o /usr/local/bin/mpi_hello
          - rm -rf /var/lib/apt/lists/* /tmp/hello.c
    x-slurm:
      ntasks: 4
      cpus_per_task: 4
      mpi:
        type: pmix
        profile: openmpi
        implementation: openmpi
MPI PMIx v4 Host MPI
Source: examples/mpi-pmix-v4-host-mpi.yaml
name: mpi-pmix-v4-host-mpi
runtime:
  backend: pyxis
x-slurm:
  job_name: mpi-pmix-v4-host-mpi
  time: "00:20:00"
  nodes: 2
  ntasks_per_node: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  mpi:
    image: debian:bookworm-slim
    command:
      - /bin/sh
      - -lc
      - |
        echo "mpi_type=$$HPC_COMPOSE_MPI_TYPE"
        echo "hostfile=$$HPC_COMPOSE_MPI_HOSTFILE"
        cat "$$HPC_COMPOSE_MPI_HOSTFILE"
        /opt/site/openmpi/bin/mpirun --version || true
    x-slurm:
      nodes: 2
      ntasks_per_node: 2
      mpi:
        type: pmix_v4
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 4
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_HOME: /opt/site/openmpi
Multi Node MPI
Source: examples/multi-node-mpi.yaml
name: multi-node-mpi
x-slurm:
  job_name: multi-node-mpi
  time: "00:20:00"
  nodes: 2
  ntasks_per_node: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  bootstrap:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        echo "primary=$(cat /hpc-compose/job/allocation/primary_node)"
        sleep 30
    readiness:
      type: sleep
      seconds: 1
    x-slurm:
      nodes: 1
  mpi:
    image: python:3.11-slim
    depends_on:
      bootstrap:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        echo "primary=$(cat /hpc-compose/job/allocation/primary_node)"
        echo "nodes=$(tr '\n' ' ' < /hpc-compose/job/allocation/nodes.txt)"
        echo "mpi_hostfile=$$HPC_COMPOSE_MPI_HOSTFILE"
        cat "$$HPC_COMPOSE_MPI_HOSTFILE"
        python - <<'PY'
        import os
        print("mpi placeholder")
        print("node_count", os.environ["HPC_COMPOSE_NODE_COUNT"])
        print("mpi_type", os.environ["HPC_COMPOSE_MPI_TYPE"])
        PY
    readiness:
      type: sleep
      seconds: 2
    x-slurm:
      nodes: 2
      ntasks_per_node: 2
      mpi:
        type: pmix
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 4
Multi Node Partitioned
Source: examples/multi-node-partitioned.yaml
name: multi-node-partitioned
x-slurm:
job_name: multi-node-partitioned
time: "00:20:00"
nodes: 8
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
service-a:
image: alpine:3.20
command:
- /bin/sh
- -lc
- |
echo "service-a nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
sleep 30
readiness:
type: sleep
seconds: 1
x-slurm:
placement:
node_range: "0-3"
service-b:
image: alpine:3.20
command:
- /bin/sh
- -lc
- |
echo "service-b nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
sleep 30
readiness:
type: sleep
seconds: 1
x-slurm:
placement:
node_range: "4-7"
parameter-server:
image: alpine:3.20
depends_on:
service-b:
condition: service_healthy
command:
- /bin/sh
- -lc
- |
echo "co-located with service-b on $$HPC_COMPOSE_SERVICE_NODELIST"
sleep 30
readiness:
type: sleep
seconds: 1
x-slurm:
placement:
share_with: service-b
monitor:
image: alpine:3.20
command:
- /bin/sh
- -lc
- |
echo "monitor nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
sleep 30
x-slurm:
placement:
node_percent: 25
allow_overlap: true
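The placement fields in this example can be read as selections over zero-based node indices within the allocation. The sketch below is only an illustration of that reading, not hpc-compose's resolver: `parse_node_range` and `nodes_for_percent` are hypothetical helpers, and rounding `node_percent` up to a whole node is an assumption the example does not state.

```python
import math


def parse_node_range(spec: str) -> list[int]:
    """Expand an inclusive range such as "0-3" (or a single index such as "1")."""
    if "-" in spec:
        start_text, end_text = spec.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        start = end = int(spec)
    if start > end:
        raise ValueError(f"descending node range: {spec!r}")
    return list(range(start, end + 1))


def nodes_for_percent(percent: float, total_nodes: int) -> int:
    """Number of nodes covered by a percentage of the allocation.

    Assumption: fractions round up and at least one node is selected.
    """
    return max(1, math.ceil(total_nodes * percent / 100))


# Values taken from the multi-node-partitioned example (nodes: 8).
service_a = parse_node_range("0-3")
service_b = parse_node_range("4-7")
monitor_count = nodes_for_percent(25, 8)
```

Under these assumptions service-a and service-b occupy disjoint halves of the eight-node allocation, and monitor covers two nodes that may overlap with either half because `allow_overlap` is set.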
Multi Node Torchrun
Source: examples/multi-node-torchrun.yaml
name: multi-node-torchrun
x-slurm:
job_name: multi-node-torchrun
time: "04:00:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: pytorch/pytorch:2.4.1-cuda12.1-cudnn9-runtime
command:
- /bin/sh
- -lc
- |
echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
echo "node_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
torchrun \
--nnodes="$$HPC_COMPOSE_DIST_NNODES" \
--nproc-per-node="$$HPC_COMPOSE_DIST_NPROC_PER_NODE" \
--node-rank="$$HPC_COMPOSE_DIST_NODE_RANK" \
--rdzv-backend=c10d \
--rdzv-endpoint="$$HPC_COMPOSE_DIST_RDZV_ENDPOINT" \
train.py
readiness:
type: sleep
seconds: 5
x-slurm:
nodes: 2
ntasks_per_node: 1
gpus_per_node: 4
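The example launches `train.py` without showing it. As a minimal sketch of what such a script might read first: torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process, so the script itself does not need the `HPC_COMPOSE_DIST_*` variables. `DistInfo`, `read_torchrun_env`, and `expected_world_size` are hypothetical names, and the assumption that `HPC_COMPOSE_DIST_NPROC_PER_NODE` resolves to 4 for this spec is not confirmed by the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node, commonly used to pick a GPU
    world_size: int  # total number of worker processes


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read the per-worker variables that torchrun sets."""
    return DistInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def expected_world_size(nnodes: int, nproc_per_node: int) -> int:
    """World size implied by the --nnodes and --nproc-per-node flags."""
    return nnodes * nproc_per_node
```

With two nodes and four processes per node, a worker would see a world size of 8, which is a cheap sanity check to log before initializing the process group.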
Multi Node Deepspeed
Source: examples/multi-node-deepspeed.yaml
name: multi-node-deepspeed
x-slurm:
job_name: multi-node-deepspeed
time: "04:00:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: pytorch/pytorch:2.4.1-cuda12.1-cudnn9-runtime
command:
- /bin/sh
- -lc
- |
echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
echo "node_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
deepspeed \
--no_ssh \
--hostfile "$$HPC_COMPOSE_DIST_HOSTFILE" \
--num_nodes "$$HPC_COMPOSE_DIST_NNODES" \
--num_gpus "$$HPC_COMPOSE_DIST_NPROC_PER_NODE" \
--node_rank "$$HPC_COMPOSE_DIST_NODE_RANK" \
--master_addr "$$HPC_COMPOSE_DIST_MASTER_ADDR" \
--master_port "$$HPC_COMPOSE_DIST_MASTER_PORT" \
train.py
readiness:
type: sleep
seconds: 5
x-slurm:
nodes: 2
ntasks_per_node: 1
gpus_per_node: 4
Multi Node Accelerate
Source: examples/multi-node-accelerate.yaml
name: multi-node-accelerate
x-slurm:
job_name: multi-node-accelerate
time: "04:00:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: pytorch/pytorch:2.4.1-cuda12.1-cudnn9-runtime
command:
- /bin/sh
- -lc
- |
echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
echo "machine_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
accelerate launch \
--multi_gpu \
--num_machines "$$HPC_COMPOSE_DIST_NNODES" \
--num_processes "$$HPC_COMPOSE_DIST_WORLD_SIZE" \
--machine_rank "$$HPC_COMPOSE_DIST_NODE_RANK" \
--main_process_ip "$$HPC_COMPOSE_DIST_MASTER_ADDR" \
--main_process_port "$$HPC_COMPOSE_DIST_MASTER_PORT" \
train.py
readiness:
type: sleep
seconds: 5
x-slurm:
nodes: 2
ntasks_per_node: 1
gpus_per_node: 4
Multi Node Horovod
Source: examples/multi-node-horovod.yaml
name: multi-node-horovod
x-slurm:
job_name: multi-node-horovod
time: "04:00:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: horovod/horovod:latest
command:
- /bin/sh
- -lc
- |
echo "rank=$$SLURM_PROCID local_rank=$$SLURM_LOCALID world=$$SLURM_NTASKS"
python train_horovod.py
readiness:
type: sleep
seconds: 5
x-slurm:
nodes: 2
ntasks_per_node: 4
gpus_per_node: 4
mpi:
type: pmix
profile: openmpi
expected_ranks: 8
Multi Node Jax
Source: examples/multi-node-jax.yaml
name: multi-node-jax
x-slurm:
job_name: multi-node-jax
time: "04:00:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: jaxai/jax:latest
command:
- /bin/sh
- -lc
- |
echo "coordinator=$$HPC_COMPOSE_DIST_RDZV_ENDPOINT"
echo "process_id=$$HPC_COMPOSE_DIST_NODE_RANK processes=$$HPC_COMPOSE_DIST_NNODES"
python train_jax.py
readiness:
type: sleep
seconds: 5
x-slurm:
nodes: 2
ntasks_per_node: 1
gpus_per_node: 4
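`train_jax.py` is not shown in the example. One way it might consume the variables echoed above is to pass them to `jax.distributed.initialize`, which accepts `coordinator_address`, `num_processes`, and `process_id`. This is a sketch under two assumptions: that `HPC_COMPOSE_DIST_RDZV_ENDPOINT` has the `host:port` form JAX expects, and that one JAX process runs per node (as `ntasks_per_node: 1` suggests). The helper names are hypothetical.

```python
import os
from typing import Mapping


def jax_distributed_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Map the hpc-compose variables from the example onto JAX's arguments."""
    return {
        "coordinator_address": env["HPC_COMPOSE_DIST_RDZV_ENDPOINT"],
        "num_processes": int(env["HPC_COMPOSE_DIST_NNODES"]),
        "process_id": int(env["HPC_COMPOSE_DIST_NODE_RANK"]),
    }


def init_jax_distributed(env: Mapping[str, str] = os.environ) -> None:
    """Initialize multi-process JAX; call once, before any other JAX work."""
    import jax  # imported lazily so the mapping above can be checked without JAX

    jax.distributed.initialize(**jax_distributed_kwargs(env))
```

Calling `init_jax_distributed()` at the top of the training script keeps the cluster wiring in one place and leaves the rest of the script unaware of Slurm.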
Nccl Tests
Source: examples/nccl-tests.yaml
name: nccl-tests
x-slurm:
job_name: nccl-tests
time: "00:30:00"
nodes: 2
gpus_per_node: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
all-reduce:
image: nvcr.io/nvidia/pytorch:24.08-py3
command:
- /bin/sh
- -lc
- |
echo "rank=$$SLURM_PROCID local_rank=$$SLURM_LOCALID world=$$SLURM_NTASKS"
if command -v all_reduce_perf >/dev/null 2>&1; then
all_reduce_perf -b 8 -e 4G -f 2 -g 1
elif [ -x /workspace/nccl-tests/build/all_reduce_perf ]; then
/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
else
echo "all_reduce_perf not found; use an image with nccl-tests installed" >&2
exit 127
fi
readiness:
type: sleep
seconds: 2
x-slurm:
nodes: 2
ntasks_per_node: 4
gpus_per_node: 4
mpi:
type: pmix
profile: openmpi
expected_ranks: 8
Ray Symmetric
Source: examples/ray-symmetric.yaml
name: ray-symmetric
x-slurm:
job_name: ray-symmetric
time: "02:00:00"
nodes: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
ray:
image: rayproject/ray:2.49.0-py310
command:
- /bin/sh
- -lc
- |
ray symmetric-run \
--address "$$HPC_COMPOSE_DIST_RDZV_ENDPOINT" \
--min-nodes "$$HPC_COMPOSE_DIST_NNODES" \
-- \
python app.py
readiness:
type: sleep
seconds: 10
x-slurm:
nodes: 2
ntasks_per_node: 1
Rendezvous Client
Source: examples/rendezvous-client.yaml
name: rendezvous-client
x-slurm:
job_name: model-client
time: "00:10:00"
mem: 2G
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
rendezvous: model-server
services:
client:
image: curlimages/curl:8.10.1
command:
- /bin/sh
- -lc
- |
curl -fsS "$${HPC_COMPOSE_RDZV_MODEL_SERVER_URL}"
Rendezvous Model Server
Source: examples/rendezvous-model-server.yaml
name: rendezvous-model-server
x-slurm:
job_name: model-server
partition: gpu
time: "02:00:00"
mem: 32G
gpus: 1
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
model:
image: python:3.12-slim
command:
- /bin/sh
- -lc
- |
python -m http.server 8000
readiness:
type: tcp
port: 8000
timeout_seconds: 60
x-slurm:
rendezvous:
register:
name: model-server
port: 8000
protocol: http
path: /
ttl_seconds: 3600
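The two rendezvous examples are linked by name: the server registers `model-server`, and the client reads `HPC_COMPOSE_RDZV_MODEL_SERVER_URL`. The sketch below shows the naming rule those two examples imply (uppercase, non-alphanumeric characters replaced by underscores). The rule is inferred from this single pair, and the URL shape built from `protocol`, `port`, and `path` is likewise an assumption; `rendezvous_url_var` and `rendezvous_url` are hypothetical helpers.

```python
import re


def rendezvous_url_var(name: str) -> str:
    """Derive the client-side variable name for a registered rendezvous name.

    Assumption: the name is uppercased and runs of non-alphanumeric
    characters become single underscores.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").upper()
    return f"HPC_COMPOSE_RDZV_{slug}_URL"


def rendezvous_url(host: str, port: int, protocol: str = "http", path: str = "/") -> str:
    """Assumed URL shape for a registration with protocol, port, and path."""
    if not path.startswith("/"):
        path = "/" + path
    return f"{protocol}://{host}:{port}{path}"
```

If the client fails with an empty URL, comparing the variable name it reads against the registered name is a quick first check.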
Ray Head Workers
Source: examples/ray-head-workers.yaml
name: ray-head-workers
x-slurm:
job_name: ray-head-workers
time: "02:00:00"
nodes: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
head:
image: rayproject/ray:2.49.0-py310
command:
- /bin/sh
- -lc
- |
ray start --head --node-ip-address="$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port=6379 --block
readiness:
type: sleep
seconds: 10
x-slurm:
nodes: 1
worker:
image: rayproject/ray:2.49.0-py310
command:
- /bin/sh
- -lc
- |
ray start --address="$$HPC_COMPOSE_PRIMARY_NODE:6379" --block
depends_on:
head:
condition: service_healthy
x-slurm:
nodes: 1
placement:
node_range: "1"
Dask Scheduler Workers
Source: examples/dask-scheduler-workers.yaml
name: dask-scheduler-workers
x-slurm:
job_name: dask-scheduler-workers
time: "02:00:00"
nodes: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
scheduler:
image: ghcr.io/dask/dask:latest
command:
- /bin/sh
- -lc
- |
dask scheduler --host "$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port 8786
readiness:
type: tcp
host: 127.0.0.1
port: 8786
timeout_seconds: 60
x-slurm:
nodes: 1
workers:
image: ghcr.io/dask/dask:latest
command:
- /bin/sh
- -lc
- |
dask worker "tcp://$$HPC_COMPOSE_PRIMARY_NODE:8786"
depends_on:
scheduler:
condition: service_healthy
x-slurm:
nodes: 2
ntasks_per_node: 1
Spark Standalone
Source: examples/spark-standalone.yaml
name: spark-standalone
x-slurm:
job_name: spark-standalone
time: "02:00:00"
nodes: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
master:
image: apache/spark:3.5.3
command:
- /bin/sh
- -lc
- |
/opt/spark/sbin/start-master.sh --host "$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port 7077
tail -f /opt/spark/logs/*
readiness:
type: tcp
host: 127.0.0.1
port: 7077
timeout_seconds: 60
x-slurm:
nodes: 1
workers:
image: apache/spark:3.5.3
command:
- /bin/sh
- -lc
- |
/opt/spark/sbin/start-worker.sh "spark://$$HPC_COMPOSE_PRIMARY_NODE:7077"
tail -f /opt/spark/logs/*
depends_on:
master:
condition: service_healthy
x-slurm:
nodes: 2
ntasks_per_node: 1
app:
image: apache/spark:3.5.3
command:
- /bin/sh
- -lc
- |
spark-submit --master "spark://$$HPC_COMPOSE_PRIMARY_NODE:7077" app.py
depends_on:
master:
condition: service_healthy
x-slurm:
nodes: 1
Flux Nested
Source: examples/flux-nested.yaml
name: flux-nested
runtime:
backend: host
x-slurm:
job_name: flux-nested
time: "01:00:00"
nodes: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
flux:
command:
- /bin/sh
- -lc
- |
flux start bash -lc 'flux run --label-io -N "$$HPC_COMPOSE_DIST_NNODES" hostname'
x-slurm:
nodes: 2
ntasks_per_node: 1
Nextflow Bridge
Source: examples/nextflow-bridge.yaml
name: nextflow-bridge
runtime:
backend: host
x-slurm:
job_name: nextflow-bridge
time: "02:00:00"
nodes: 1
cpus_per_task: 8
mem: 16G
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
artifacts:
export_dir: ./results/${SLURM_JOB_ID}
paths:
- /hpc-compose/job/nextflow-work/**
- /hpc-compose/job/reports/**
- /hpc-compose/job/logs/**
services:
nextflow:
command:
- /bin/sh
- -lc
- |
mkdir -p /hpc-compose/job/nextflow-work /hpc-compose/job/reports
nextflow run "$${NEXTFLOW_PIPELINE:-main.nf}" \
-work-dir /hpc-compose/job/nextflow-work \
-with-report /hpc-compose/job/reports/report.html \
-with-trace /hpc-compose/job/reports/trace.txt \
$${NEXTFLOW_ARGS:-}
environment:
NEXTFLOW_PIPELINE: main.nf
NEXTFLOW_ARGS: ""
x-slurm:
ntasks: 1
Snakemake Bridge
Source: examples/snakemake-bridge.yaml
name: snakemake-bridge
runtime:
backend: host
x-slurm:
job_name: snakemake-bridge
time: "02:00:00"
nodes: 1
cpus_per_task: 8
mem: 16G
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
artifacts:
export_dir: ./results/${SLURM_JOB_ID}
paths:
- /hpc-compose/job/snakemake-work/**
- /hpc-compose/job/reports/**
- /hpc-compose/job/logs/**
services:
snakemake:
command:
- /bin/sh
- -lc
- |
mkdir -p /hpc-compose/job/snakemake-work /hpc-compose/job/reports
snakemake \
--snakefile "$${SNAKEMAKE_FILE:-Snakefile}" \
--cores "$${SNAKEMAKE_CORES:-$${SLURM_CPUS_PER_TASK:-1}}" \
--directory "$${SNAKEMAKE_WORKDIR:-.}" \
--printshellcmds \
$${SNAKEMAKE_ARGS:-}
environment:
SNAKEMAKE_FILE: Snakefile
SNAKEMAKE_WORKDIR: "."
SNAKEMAKE_ARGS: ""
x-slurm:
ntasks: 1
Multi Stage Pipeline
Source: examples/multi-stage-pipeline.yaml
name: multi-stage-pipeline
x-slurm:
  job_name: multi-stage-pipeline
  time: "00:30:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  producer:
    image: python:3.11-slim
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import csv, random, os
        output = '/hpc-compose/job/output.csv'
        with open(output, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['id', 'value', 'category'])
            for i in range(1000):
                writer.writerow([i, round(random.gauss(50, 15), 2), random.choice(['A', 'B', 'C'])])
        print(f'Wrote 1000 rows to {output}')
        print('producer complete')
        "
    readiness:
      type: log
      pattern: "producer complete"
      timeout_seconds: 60
    x-slurm:
      cpus_per_task: 1
  consumer:
    image: python:3.11-slim
    depends_on:
      producer:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import csv, collections
        with open('/hpc-compose/job/output.csv') as f:
            reader = csv.DictReader(f)
            rows = list(reader)
        by_cat = collections.defaultdict(list)
        for row in rows:
            by_cat[row['category']].append(float(row['value']))
        print(f'Read {len(rows)} rows')
        for cat in sorted(by_cat):
            vals = by_cat[cat]
            print(f' {cat}: count={len(vals)}, mean={sum(vals)/len(vals):.2f}')
        print('consumer complete')
        "
    x-slurm:
      cpus_per_task: 1
Pipeline DAG
Source: examples/pipeline-dag.yaml
name: pipeline-dag
x-slurm:
  job_name: pipeline-dag
  time: "00:20:00"
  mem: 4G
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  preprocess:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        mkdir -p /hpc-compose/job/pipeline
        printf 'records=3\n' > /hpc-compose/job/pipeline/prepared.txt
  train:
    image: alpine:3.20
    depends_on:
      preprocess:
        condition: service_completed_successfully
    command:
      - /bin/sh
      - -lc
      - |
        cat /hpc-compose/job/pipeline/prepared.txt
        printf 'accuracy=0.91\n' > /hpc-compose/job/pipeline/model.txt
  postprocess:
    image: alpine:3.20
    depends_on:
      train:
        condition: service_completed_successfully
    command:
      - /bin/sh
      - -lc
      - |
        cat /hpc-compose/job/pipeline/model.txt
        printf 'done\n' > /hpc-compose/job/pipeline/report.txt
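The `depends_on` edges in this example form a simple chain, and the start order they imply can be reproduced with a topological sort. The sketch below uses Python's standard `graphlib` module as an illustration only; it is not how the hpc-compose planner is implemented, and `service_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Each service maps to the set of services it depends on,
# copied from the pipeline-dag example.
DEPENDS_ON = {
    "preprocess": set(),
    "train": {"preprocess"},
    "postprocess": {"train"},
}


def service_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one start order in which every dependency precedes its dependents."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


order = service_order(DEPENDS_ON)
```

For this chain there is only one valid order. With wider graphs several orders are valid, so a tool that wants reproducible output needs an explicit tie-breaking rule.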
Postgres ETL
Source: examples/postgres-etl.yaml
name: postgres-etl
x-slurm:
job_name: postgres-etl
time: "01:00:00"
mem: 16G
cpus_per_task: 4
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: etl
POSTGRES_PASSWORD: etl
POSTGRES_DB: pipeline
readiness:
type: tcp
host: 127.0.0.1
port: 5432
timeout_seconds: 30
x-slurm:
cpus_per_task: 2
etl:
image: python:3.11-slim
depends_on:
postgres:
condition: service_healthy
environment:
DATABASE_URL: postgresql://etl:etl@127.0.0.1:5432/pipeline
command:
- /bin/sh
- -lc
- |
python -c "
import psycopg2, os
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS results (id SERIAL, value FLOAT)')
for i in range(100):
cur.execute('INSERT INTO results (value) VALUES (%s)', (i * 1.5,))
conn.commit()
cur.execute('SELECT count(*), avg(value) FROM results')
count, avg = cur.fetchone()
print(f'Inserted {count} rows, average value: {avg:.2f}')
conn.close()
"
x-runtime:
prepare:
commands:
- pip install --no-cache-dir psycopg2-binary
x-slurm:
cpus_per_task: 2
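The `tcp` readiness check used for the postgres service can be pictured as a connect-and-retry loop with a deadline. The sketch below is a generic illustration of that idea, not hpc-compose's implementation; `wait_for_tcp` and its `interval` parameter are hypothetical.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_seconds: float, interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False if the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # A successful connect only proves that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An accepted connection shows that the port is open, not that the database is ready to run queries, so a dependent service such as etl may still benefit from its own short retry around the first connection.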
Restart Policy
Source: examples/restart-policy.yaml
name: restart-policy
x-slurm:
  job_name: restart-policy
  time: "00:10:00"
  mem: 4G
  cpus_per_task: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
  flaky-worker:
    image: python:3.11-slim
    command:
      - /bin/sh
      - -lc
      - |
        python - <<'PY'
        import pathlib
        import sys
        import time
        state_dir = pathlib.Path("/hpc-compose/job/restart-policy")
        counter_path = state_dir / "attempts.txt"
        state_dir.mkdir(parents=True, exist_ok=True)
        attempts = int(counter_path.read_text()) if counter_path.exists() else 0
        attempts += 1
        counter_path.write_text(f"{attempts}\n")
        print(f"attempt {attempts}")
        if attempts <= 2:
            print("simulating transient failure")
            sys.exit(42)
        print("work completed after transient failures")
        time.sleep(1)
        PY
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 5
        backoff_seconds: 2
        window_seconds: 60
        max_restarts_in_window: 3
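One plausible reading of the `failure_policy` fields is a lifetime cap (`max_restarts`) combined with a rolling-window cap (`max_restarts_in_window` within `window_seconds`). The sketch below models that reading only; the actual semantics are not spelled out in this example, and `RestartBudget` is a hypothetical class rather than part of hpc-compose.

```python
from collections import deque


class RestartBudget:
    """Track restarts against a lifetime cap and a rolling-window cap."""

    def __init__(self, max_restarts: int, window_seconds: float, max_restarts_in_window: int):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.max_restarts_in_window = max_restarts_in_window
        self.total = 0
        self._recent: deque[float] = deque()  # timestamps of restarts inside the window

    def allow(self, now: float) -> bool:
        """Record and permit a restart at time `now`, or refuse if a cap is hit."""
        # Forget restarts that have aged out of the rolling window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if self.total >= self.max_restarts:
            return False
        if len(self._recent) >= self.max_restarts_in_window:
            return False
        self.total += 1
        self._recent.append(now)
        return True


# Values from the restart-policy example.
budget = RestartBudget(max_restarts=5, window_seconds=60, max_restarts_in_window=3)
```

Under this reading the flaky-worker service, which fails twice before succeeding, stays well inside both caps.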
Training Checkpoints
Source: examples/training-checkpoints.yaml
name: training-checkpoints
x-slurm:
job_name: training-checkpoints
time: "04:00:00"
mem: 64G
cpus_per_task: 8
gpus: 1
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
trainer:
image: pytorch/pytorch:2.3.1-cuda12.1-cudnn9-runtime
volumes:
- /shared/$USER/checkpoints:/checkpoints
environment:
CHECKPOINT_DIR: /checkpoints
NUM_EPOCHS: "10"
command:
- /bin/sh
- -lc
- |
python -c "
import os, torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Training on {device}')
ckpt_dir = os.environ['CHECKPOINT_DIR']
os.makedirs(ckpt_dir, exist_ok=True)
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.Adam(model.parameters())
data = torch.randn(256, 128, device=device)
for epoch in range(int(os.environ['NUM_EPOCHS'])):
out = model(data)
loss = out.sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
path = os.path.join(ckpt_dir, f'checkpoint_epoch_{epoch}.pt')
torch.save({'epoch': epoch, 'model': model.state_dict()}, path)
print(f'Epoch {epoch}: loss={loss.item():.4f}, saved {path}')
print('Training complete')
"
x-slurm:
gpus: 1
cpus_per_task: 4
Training Resume
Source: examples/training-resume.yaml
name: training-resume
x-slurm:
job_name: training-resume
time: "04:00:00"
mem: 64G
cpus_per_task: 8
gpus: 1
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
resume:
path: /shared/$USER/runs/training-resume
artifacts:
export_dir: ./results/${SLURM_JOB_ID}
paths:
- /hpc-compose/job/checkpoints/**
services:
trainer:
image: pytorch/pytorch:2.3.1-cuda12.1-cudnn9-runtime
environment:
NUM_EPOCHS: "10"
command:
- /bin/sh
- -lc
- |
python - <<'PY'
import json
import os
import pathlib
import time
resume_dir = pathlib.Path(os.environ["HPC_COMPOSE_RESUME_DIR"])
attempt = os.environ["HPC_COMPOSE_ATTEMPT"]
is_resume = os.environ["HPC_COMPOSE_IS_RESUME"] == "1"
checkpoint_dir = pathlib.Path("/hpc-compose/job/checkpoints")
latest_state_path = resume_dir / "latest.json"
resume_dir.mkdir(parents=True, exist_ok=True)
checkpoint_dir.mkdir(parents=True, exist_ok=True)
start_epoch = 0
if latest_state_path.exists():
state = json.loads(latest_state_path.read_text())
start_epoch = state["next_epoch"]
print(f"Resuming run at epoch {start_epoch} (attempt {attempt})")
else:
print(f"Starting fresh run (attempt {attempt})")
for epoch in range(start_epoch, int(os.environ["NUM_EPOCHS"])):
state = {
"completed_epoch": epoch,
"next_epoch": epoch + 1,
"attempt": int(attempt),
"is_resume": is_resume,
}
latest_state_path.write_text(json.dumps(state, indent=2) + "\n")
artifact_path = checkpoint_dir / f"checkpoint_epoch_{epoch}.json"
artifact_path.write_text(json.dumps(state, indent=2) + "\n")
print(f"Epoch {epoch}: wrote {artifact_path}")
time.sleep(1)
PY
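The example writes `latest.json` in place, which is fine for a demonstration. For a real run that can be preempted mid-write, a common hardening step is to write to a temporary file in the same directory and rename it over the target. The sketch below shows that pattern with the standard library; `write_json_atomic` is a hypothetical helper, not something the example or hpc-compose provides.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(json.dumps(payload, indent=2) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

In the training loop this would replace the direct `latest_state_path.write_text(...)` call, so a resumed attempt never parses a half-written state file.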
Training Sweep
Source: examples/training-sweep.yaml
name: training-sweep
x-slurm:
job_name: training-sweep
time: "00:20:00"
mem: 8G
cpus_per_task: 2
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
sweep:
parameters:
lr: [0.001, 0.01, 0.1]
batch_size: [32, 64]
matrix: full
services:
trainer:
image: python:3.11-slim
environment:
LR: "${lr:-0.001}"
BATCH_SIZE: "${batch_size:-32}"
SWEEP_ID: "${HPC_COMPOSE_SWEEP_ID:-manual}"
TRIAL_ID: "${HPC_COMPOSE_SWEEP_TRIAL:-manual}"
command:
- python
- -c
- |
import os
import random
lr = float(os.environ["LR"])
batch_size = int(os.environ["BATCH_SIZE"])
random.seed(f"{lr}:{batch_size}")
score = 0.8 + random.random() * 0.05
print(f"sweep={os.environ['SWEEP_ID']} trial={os.environ['TRIAL_ID']}")
print(f"lr={lr} batch_size={batch_size} score={score:.4f}")
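With `matrix: full`, the two parameter lists above describe a Cartesian product of 3 × 2 = 6 trials. The sketch below enumerates that product with `itertools.product`. `expand_full_matrix` is a hypothetical helper, and the enumeration order (first parameter varying slowest) is an assumption, not documented hpc-compose behavior.

```python
from itertools import product


def expand_full_matrix(parameters: dict[str, list]) -> list[dict]:
    """Return one dict per trial covering every combination of parameter values."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Parameter lists copied from the training-sweep example.
trials = expand_full_matrix({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]})
for index, trial in enumerate(trials):
    print(index, trial)
```

Counting the combinations before submitting is a cheap way to confirm that a sweep will not launch more trials than intended.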
vLLM OpenAI
Source: examples/vllm-openai.yaml
name: vllm-openai
x-slurm:
job_name: vllm-openai
time: "01:00:00"
mem: 64G
cpus_per_task: 8
gpus: 1
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
vllm:
image: vllm/vllm-openai:latest
environment:
MODEL_NAME: facebook/opt-125m
command:
- /bin/sh
- -lc
- |
set -eu
rm -f /hpc-compose/job/request.done
python -m vllm.entrypoints.openai.api_server \
--model "$$MODEL_NAME" \
--host 0.0.0.0 \
--port 8000 &
server_pid=$$!
while [ ! -f /hpc-compose/job/request.done ]; do
if ! kill -0 "$$server_pid" 2>/dev/null; then
wait "$$server_pid"
exit $$?
fi
sleep 1
done
kill "$$server_pid" 2>/dev/null || true
wait "$$server_pid" || true
readiness:
type: log
pattern: "Uvicorn running on"
timeout_seconds: 300
x-slurm:
gpus: 1
cpus_per_task: 4
client:
image: python:3.11-slim
depends_on:
vllm:
condition: service_healthy
environment:
OPENAI_BASE_URL: http://127.0.0.1:8000/v1
MODEL_NAME: facebook/opt-125m
command:
- /bin/sh
- -lc
- |
set -eu
python -c "
import openai, os
client = openai.OpenAI(
base_url=os.environ['OPENAI_BASE_URL'],
api_key='unused',
)
response = client.chat.completions.create(
model=os.environ['MODEL_NAME'],
messages=[
{'role': 'system', 'content': 'You are a concise assistant.'},
{'role': 'user', 'content': 'What is HPC in one sentence?'},
],
max_tokens=64,
temperature=0.2,
)
print(response.choices[0].message.content)
"
touch /hpc-compose/job/request.done
x-runtime:
prepare:
commands:
- pip install --no-cache-dir openai
x-slurm:
cpus_per_task: 2
vLLM UV Worker
Source: examples/vllm-uv-worker.yaml
name: vllm-uv-worker
x-slurm:
job_name: vllm-uv-worker
time: "01:00:00"
mem: 64G
cpus_per_task: 8
gpus: 1
cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
services:
vllm:
image: vllm/vllm-openai:latest
environment:
MODEL_NAME: facebook/opt-125m
command:
- /bin/sh
- -lc
- |
set -eu
rm -f /hpc-compose/job/request.done
python -m vllm.entrypoints.openai.api_server \
--model "$$MODEL_NAME" \
--host 0.0.0.0 \
--port 8000 &
server_pid=$$!
while [ ! -f /hpc-compose/job/request.done ]; do
if ! kill -0 "$$server_pid" 2>/dev/null; then
wait "$$server_pid"
exit $$?
fi
sleep 1
done
kill "$$server_pid" 2>/dev/null || true
wait "$$server_pid" || true
readiness:
type: log
pattern: "Uvicorn running on"
timeout_seconds: 300
x-slurm:
gpus: 1
cpus_per_task: 4
worker:
image: python:3.11-slim
working_dir: /workspace
volumes:
- ./vllm-uv-worker:/workspace
depends_on:
vllm:
condition: service_healthy
environment:
OPENAI_BASE_URL: http://127.0.0.1:8000/v1
MODEL_NAME: facebook/opt-125m
REQUEST_DONE_PATH: /hpc-compose/job/request.done
command:
- /bin/sh
- -lc
- |
set -eu
UV_CACHE_DIR=/hpc-compose/job/.uv-cache uv run worker.py
x-runtime:
prepare:
commands:
- pip install --no-cache-dir uv
x-slurm:
cpus_per_task: 2
Codex Skill
This repository ships a Codex skill at skills/hpc-compose/ for agents that need to help users set up, adapt, validate, and troubleshoot hpc-compose workflows.
Use it when a user asks for tasks such as:
- make my repository work with hpc-compose
- migrate this Docker Compose or Slurm workflow to hpc-compose
- prepare this project for HAICORE or another Slurm cluster
- debug hpc-compose validation, preflight, or run failures
What It Contains
The skill keeps the main trigger and workflow in SKILL.md, then uses progressively loaded references for details:
| Path | Purpose |
|---|---|
| skills/hpc-compose/SKILL.md | Trigger description, core workflow, adaptation rules, and output expectations. |
| skills/hpc-compose/references/hpc-compose-workflow.md | hpc-compose command path, Docker Compose migration, backend selection, verification, and troubleshooting. |
| skills/hpc-compose/references/haicore-kit.md | HAICORE/NHR@KIT Slurm, GPU, filesystem, cache, Pyxis/Enroot, and verification guidance. |
| skills/hpc-compose/references/cluster-adaptation.md | General Slurm cluster reconnaissance and portable adaptation guidance. |
| skills/hpc-compose/scripts/hpc_compose_repo_probe.py | Heuristic repository probe for migration clues. |
Using The Skill
Install or copy skills/hpc-compose/ into the Codex skills directory, typically $CODEX_HOME/skills/hpc-compose or ~/.codex/skills/hpc-compose, then start a fresh Codex session so skill discovery can reload.
Example prompt:
Use $hpc-compose to make this repository run with hpc-compose on HAICORE.
For local reconnaissance, run:
python3 skills/hpc-compose/scripts/hpc_compose_repo_probe.py .
The probe is intentionally heuristic. Treat its output as an inventory and hypothesis generator, then verify with repository files, cluster documentation, and hpc-compose static checks.
Agent Expectations
Agents using this skill should:
- inspect the target repository before proposing a spec
- check current cluster documentation for site-specific details
- prefer hpc-compose static checks before real Slurm submissions
- ask before commands that submit or cancel jobs or consume allocation quota
- report observations, hypotheses, recommendations, and open questions when cluster facts remain uncertain
Roadmap
This roadmap is intentionally short. hpc-compose is not trying to become a general-purpose orchestrator.
Authoring Ergonomics
- make the supported Compose subset easier to discover from examples and docs
- keep validate, inspect, config, and render as the fast path for authoring confidence
- improve starter templates and example selection before adding more surface area
Runtime Visibility
- make tracked jobs easier to reconnect to and reason about
- keep improving status, ps, watch, stats, and artifact export for real cluster debugging
- prefer inspectable generated state over hidden orchestration behavior
Cluster Compatibility
- expand confidence on more Linux cluster environments before broadening scope
- keep support policy explicit through the support matrix
- improve docs and examples around shared storage, Pyxis, and Enroot expectations
If your workflow falls outside this roadmap, that is useful feedback. Open an adoption feedback issue with your cluster type, workload type, and main friction point.
Architecture for Contributors
The library crate owns the core staged pipeline. The binary entrypoint delegates to command-family modules under src/commands/, while presentation lives under src/output/. Reusable planning, prepare, render, tracking, cache, context, and template logic stays in the library modules.
Module map
- spec: parse, interpolate, and validate the supported Compose subset
- planner: normalize the parsed spec into a deterministic plan
- lint: run opinionated static checks over validated plans
- context: resolve .hpc-compose/settings.toml, profiles, env files, interpolation variables, and binary overrides
- cluster: generate and apply best-effort cluster capability profiles from doctor cluster-report
- preflight: check login-node prerequisites and cluster policy issues
- prepare: import base images and rebuild prepared runtime artifacts
- render: generate the final sbatch script and service launch commands
- job: track submissions, logs, metrics, replay, status, and artifact export
- tracked_paths: centralize the .hpc-compose/ layout used by render and job tracking
- cache: persist cache manifests for imported and prepared images
- init: expose the shipped example templates for hpc-compose new plus the legacy init alias
- schema and manpages: expose the checked-in JSON Schema and generated section-1 manpage flow
- commands/spec: binary-only handlers for plan, validate, lint, render, prepare, preflight, config, and inspect
- commands/runtime: binary-only handlers for up, debug, run, status, ps, watch, replay, stats, artifacts, logs, down, cancel, and clean
- commands/cache: binary-only handlers for cache inspection and pruning
- commands/init: binary-only handlers for new/init, setup, context, and completions
- watch_ui: terminal UI controller and renderer for up, watch, and replay playback
- output: binary-only text, JSON, CSV, and JSONL formatting helpers
Execution flow
1. ComposeSpec::load parses YAML, resolves authoring extends, validates supported keys, interpolates variables, and applies semantic validation.
2. planner::build_plan resolves paths, command shapes, dependencies, and prepare blocks into a normalized plan.
3. prepare::build_runtime_plan computes concrete cache artifact locations.
4. context and optional cluster profiles provide resolved paths, binaries, env, and compatibility warnings.
5. preflight::run checks cluster prerequisites before submission.
6. prepare::prepare_runtime_plan imports or rebuilds artifacts when needed.
7. render::render_script emits the batch script consumed by sbatch.
8. job persists tracked metadata under .hpc-compose/ and powers status, ps, watch, replay, stats, logs, cancel, and artifact export.
9. job::replay reconstructs a best-effort timeline from existing state, service-exit, metrics, and log artifacts while reusing the watch renderer for playback.
10. commands/* turns CLI variants into library calls, and output formats the final presentation.
Tracked Runtime Layout
tracked_paths is the single source of truth for the tracked-job layout shared by render and job.
- Compose-level metadata lives under .hpc-compose/ next to the compose file.
- Per-job runtime state lives under ${SLURM_SUBMIT_DIR}/.hpc-compose/<job-id>/.
- Root-level logs/, metrics/, artifacts/, and state.json are the latest-view paths used by status and export commands.
- Resume-aware runs still write attempt-specific state under attempts/<attempt>/....
- The batch script updates root-level latest symlinks so contributor-facing tooling can read the most recent attempt without reconstructing shell logic independently.
Contributor commands
cargo test
cargo test --test cli_runtime
cargo test --test release_metadata
cargo doc --no-deps
mdbook build docs
cargo run --features manpage-bin --bin gen-manpages -- --check
Documentation split
- Use this mdBook for user-facing workflows, examples, and reference material.
- Use rustdoc for contributor-facing internals and the library module map.
- Keep README short and point readers into the book instead of duplicating long-form guidance.