hpc-compose

Compose-style multi-service workflows, compiled into one inspectable Slurm job.

One allocation · one script · Slurm-native runtime.

hpc-compose gives research and HPC teams a small YAML authoring model for services, startup order, readiness checks, runtime backends, logs, artifacts, and follow-up commands.

services:
  app:
    image: python:3.12-slim
    command: python train.py

$ hpc-compose plan --show-script -f compose.yaml
spec is valid
service order: app
#SBATCH --job-name=my-app

Use hpc-compose when you want Docker Compose-style authoring on Slurm without adding Kubernetes, a long-running control plane, or custom cluster-side services.

Start with the Support Matrix before planning a real runtime workflow. Linux is the maintained runtime target; macOS is intended for authoring, validation, rendering, and inspection.

Safe First Path

These commands are safe to run from a laptop, workstation, or login node: new writes a local starter spec, and plan is purely static. plan does not call sbatch, import images, or write a script file:

hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

plan validates the spec and resolves service order; plan --show-script adds the rendered batch script. Expected output includes:

spec is valid
service order: app
Rendered script:
#SBATCH --job-name=my-app
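If you want to script this check, the expected lines can be matched against captured output. The following is a minimal sketch, not part of hpc-compose: check_plan_output is a hypothetical helper name, and it only looks for the three fragments listed above.

```shell
#!/bin/sh
# Sketch: confirm that captured `plan --show-script` output contains the
# lines documented above. check_plan_output is a hypothetical helper,
# not an hpc-compose subcommand.
check_plan_output() {
    # $1: text captured from `hpc-compose plan --show-script -f compose.yaml`
    printf '%s\n' "$1" | grep -qF -e 'spec is valid' || return 1
    printf '%s\n' "$1" | grep -qF -e 'service order: ' || return 1
    printf '%s\n' "$1" | grep -qF -e '#SBATCH --job-name=' || return 1
}

# Usage on a machine with hpc-compose installed:
#   out=$(hpc-compose plan --show-script -f compose.yaml) && check_plan_output "$out"
```

The helper returns non-zero as soon as one fragment is missing, which makes it easy to use as a gate in a local pre-submit script.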

For real cluster runs, configure a cache path visible from both the Slurm submission host and compute nodes, either in x-slurm.cache_dir, hpc-compose setup --cache-dir, or [defaults.cache] / [profiles.<name>.cache] settings. From a source checkout, you can also inspect the checked-in examples with hpc-compose plan -f examples/minimal-batch.yaml.

Run hpc-compose up -f compose.yaml only on a supported Linux Slurm submission host with the runtime backend your spec selects. If it fails, start with hpc-compose debug -f compose.yaml --preflight.

If you have a source checkout and want to exercise real sbatch without a cluster login, use the Local Slurm Dev Cluster as a host-backend smoke test.

Download the asciinema-style quickstart demo cast if you want the same flow as a terminal recording.

Terms To Know

Term	Meaning
spec	The YAML file that describes services, runtime backend, and Slurm settings.
allocation	The Slurm job allocation where all planned services run.
runtime backend	The mechanism used to launch services: Pyxis/Enroot, Apptainer, Singularity, or host.
preflight	Checks that inspect local tools, paths, backend support, and optional cluster profiles before a run.
prepare	The login-node image import/customization phase used before compute-node runtime.
tracked job	Submission metadata under `.hpc-compose/jobs/<job-id>.json` plus runtime artifacts under `<runtime-root>/<job-id>`, which let `status`, `ps`, `watch`, `logs`, `stats`, and `artifacts` reconnect later.
`x-slurm`	The spec section for Slurm settings and hpc-compose runtime extensions.

See the Glossary for the full set of terms.
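As an illustration of the tracked job layout, the sketch below lists job ids by reading the metadata file names. It assumes that .hpc-compose/jobs/ sits directly under the project directory you pass in; list_tracked_job_ids is a hypothetical helper, and hpc-compose jobs list remains the supported way to rediscover tracked runs.

```shell
#!/bin/sh
# Sketch: derive tracked job ids from .hpc-compose/jobs/<job-id>.json names.
# list_tracked_job_ids is a hypothetical helper; the directory location
# relative to the project root is an assumption.
list_tracked_job_ids() {
    # $1: project directory (defaults to the current directory)
    jobs_dir="${1:-.}/.hpc-compose/jobs"
    [ -d "$jobs_dir" ] || return 0
    for f in "$jobs_dir"/*.json; do
        [ -e "$f" ] || continue   # the glob matched nothing
        basename "$f" .json
    done
}

# Usage:
#   list_tracked_job_ids .
```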

What It Is For

model serving plus helper services inside one Slurm allocation
data and ETL pipelines with startup ordering or stage-completion dependencies
training jobs with checkpoint export, artifact tracking, and resume-aware reruns
explicit multi-node launch patterns that still fit inside one allocation

What It Is Not

hpc-compose is not a full Docker Compose runtime and is not a general cluster orchestrator.

Unsupported Compose features include:

build:
ports
networks / network_mode
Compose restart as a Docker key
deploy
dynamic node bin packing

For exact boundaries, read Execution Model, Slurm Capability Scope, and Spec Reference.

Support Matrix

This page separates what hpc-compose can build, what CI currently exercises, and what is officially supported for real workflows.

Support levels

Level	Meaning
Officially supported	Maintained target for user-facing workflows and issue triage
CI-tested	Exercised in the repository’s automated checks today
Release-built	Prebuilt archive is published, but that is not a promise of full runtime support

Officially supported

Platform	Scope	Notes
Linux `x86_64`	Full CLI and runtime workflows	Requires Slurm client tools plus at least one supported runtime backend: Pyxis/Enroot, Apptainer, Singularity, or host software modules
Linux `arm64`	Full CLI and runtime workflows	Same cluster requirements as Linux `x86_64`
macOS `x86_64`	Authoring and local non-runtime commands	Suitable for project-local authoring flows such as `new`, `setup`, `context`, `plan`, `validate`, `inspect`, `render`, and `completions`; not for Slurm/Enroot runtime commands
macOS `arm64`	Authoring and local non-runtime commands	Same scope as macOS `x86_64`

CI-tested

Platform	What is tested today
Ubuntu 24.04 `x86_64`	formatting, clippy, unit/integration tests, docs build, link checks, installer smoke tests, and coverage
macOS `arm64`	authoring-focused tests, validate/render/schema smoke tests, installer smoke tests, and Homebrew smoke tests
macOS `x86_64`	authoring-focused tests, validate/render/schema smoke tests, and Homebrew smoke tests

Current CI validates full runtime-facing behavior on Ubuntu and authoring/distribution behavior on macOS. Other published builds should be treated as lower-confidence until corresponding CI coverage exists.

Release-built

Platform	Status
Linux `x86_64`	Release archive published
Linux `arm64`	Release archive published
macOS `x86_64`	Release archive published
macOS `arm64`	Release archive published
Windows `x86_64`	Release archive published, but runtime workflows are not officially supported

Windows status

Windows archives are published so users can inspect the CLI surface or experiment with non-runtime commands, but Windows is currently release-built only:

Slurm plus HPC runtime workflows are not an officially supported Windows target.
Issues that are specific to Windows runtime execution may be closed as out of scope until the support policy changes.

Cluster assumptions for full support

For full runtime support on Linux, the target environment should provide:

sbatch, srun, and related Slurm client tools on the submission host
one supported runtime path:
- Pyxis container support in srun plus Enroot on the submission host,
- Apptainer on the submission host and compute nodes,
- Singularity on the submission host and compute nodes,
- or module/vendor software available on the host runtime path
shared storage for the resolved cache directory

Use Runtime Backends, Runbook, and Execution Model before adapting a real workload to a cluster.


Installation

For normal use, install from a published GitHub Release. Build from source when you are developing the project or need to inspect a local checkout before using it on a cluster.

Install From A Published Release

Latest release (zero edits)

The quickest path needs no version substitution. When HPC_COMPOSE_VERSION is unset, the installer resolves the latest published GitHub Release tag automatically and downloads the matching asset:

curl -fsSL https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/main/install.sh | sh

This runs the moving install.sh from main, but it always installs from a published releases/download/<tag>/... asset (never unreleased main).

Pinned release (reproducible / recommended for clusters)

For reproducible installs across a shared cluster, pin the exact release tag from the GitHub Releases page so every machine lands on the same build:

RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
  | env HPC_COMPOSE_VERSION="${RELEASE_TAG}" sh

The installer downloads the matching archive for the current Linux or macOS machine, verifies the published .sha256 sidecar, installs hpc-compose into ~/.local/bin by default, and installs shipped Unix manpages when present.

After installation, make sure the install directory is on your shell PATH and verify the binary:

export PATH="$HOME/.local/bin:$PATH"
command -v hpc-compose
hpc-compose --version
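If command -v prints nothing, the install directory is probably missing from PATH. A small sketch of that check follows; path_contains is a hypothetical helper name, and the directory is the default install location mentioned above.

```shell
#!/bin/sh
# Sketch: test whether a directory is already an entry in $PATH.
# path_contains is a hypothetical helper, not part of the installer.
path_contains() {
    # $1: directory to look for
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   path_contains "$HOME/.local/bin" || echo 'add ~/.local/bin to PATH'
```

Wrapping PATH in colons avoids false matches on prefixes such as /opt/tool versus /opt/tool/bin.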

Useful overrides:

RELEASE_TAG=vX.Y.Z

curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
  | env HPC_COMPOSE_INSTALL_DIR=/usr/local/bin HPC_COMPOSE_VERSION="$RELEASE_TAG" sh

Installer availability does not imply full runtime support. Check the Support Matrix before assuming a platform can run submission, prepare, or watch workflows end to end.

About The `main` Installer Script

Fetching install.sh from main without HPC_COMPOSE_VERSION does not install unreleased main:

curl -fsSL https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/main/install.sh | sh

That command runs the moving script from main, but the script resolves the latest published GitHub Release and downloads from releases/download/<tag>/.... Use the version-pinned command above for reproducible installs. Use a source checkout when you want unreleased code.

Manual Release Download

Prebuilt archives are published on the release page. Pick the archive that matches your platform.

Example for Linux x86_64:

RELEASE_TAG=vX.Y.Z

curl -L "https://github.com/NicolasSchuler/hpc-compose/releases/download/${RELEASE_TAG}/hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.tar.gz" -o hpc-compose.tar.gz
tar -xzf hpc-compose.tar.gz
./hpc-compose --help

Linux x86_64 releases use a musl target to avoid common cluster glibc mismatches. Unix release archives also contain share/man/man1/. See CLI Reference for browsing the installed man pages.
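The download URL in the example follows a regular pattern, which the sketch below spells out. release_asset_url is a hypothetical helper; only the x86_64-unknown-linux-musl target is named on this page, so any other target string is an assumption that should be checked against the release page.

```shell
#!/bin/sh
# Sketch: build the .tar.gz URL from a release tag and a target name,
# following the pattern of the Linux x86_64 example above.
# release_asset_url is a hypothetical helper.
release_asset_url() {
    # $1: release tag, $2: target (for example x86_64-unknown-linux-musl)
    printf 'https://github.com/NicolasSchuler/hpc-compose/releases/download/%s/hpc-compose-%s-%s.tar.gz\n' \
        "$1" "$1" "$2"
}

# Usage:
#   curl -L "$(release_asset_url "$RELEASE_TAG" x86_64-unknown-linux-musl)" -o hpc-compose.tar.gz
```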

Windows release archives are zip-only for inspection and checksum parity. The installer script and end-to-end Slurm runtime workflows target Unix-like systems; use Windows primarily through WSL or a remote Linux/macOS authoring environment.

Native Packages

Published Linux releases may include .deb and .rpm assets:

RELEASE_TAG=vX.Y.Z

sudo apt install "./hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.deb"
sudo dnf install "./hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.rpm"

Package availability does not change runtime support policy. Linux cluster workflows still need Slurm client tools, the selected runtime backend, and shared storage for the resolved cache directory.

Homebrew On macOS

The repository exposes a same-repo Homebrew tap:

brew install NicolasSchuler/hpc-compose/hpc-compose

The formula is refreshed by release automation when a Homebrew-published release is cut. Check brew info NicolasSchuler/hpc-compose/hpc-compose when you need to confirm the formula version before installing.

macOS support is for authoring and local non-runtime commands such as new, plan, validate, inspect, render, and completions; it is not a supported Slurm runtime target.

Verify A Release

Use GitHub-native verification as the primary trust path for published binaries.

Verify the release:

RELEASE_TAG=vX.Y.Z
gh release verify "$RELEASE_TAG" -R NicolasSchuler/hpc-compose

Verify a downloaded asset:

RELEASE_TAG=vX.Y.Z
ASSET="hpc-compose-${RELEASE_TAG}-x86_64-unknown-linux-musl.tar.gz"

gh release download "$RELEASE_TAG" -R NicolasSchuler/hpc-compose -p "$ASSET"
gh release verify-asset "$RELEASE_TAG" "./$ASSET" -R NicolasSchuler/hpc-compose

Verify the artifact attestation directly:

gh attestation verify "./$ASSET" \
  --repo NicolasSchuler/hpc-compose \
  --signer-workflow NicolasSchuler/hpc-compose/.github/workflows/release.yml

Published releases also ship SHA256SUMS and per-asset .sha256 files. Those checksums are primarily for installer compatibility, mirroring, and corruption checks; attestations are the stronger authenticity signal.
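For a corruption check on an air-gapped host, a sidecar can be compared by hand. The sketch below assumes the .sha256 file starts with the hex digest as its first whitespace-separated field (the layout sha256sum writes); that layout is an assumption, and sha256_of and verify_sidecar are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: compare a file's SHA-256 digest with the digest stored in a sidecar.
# Assumes the sidecar's first field on line 1 is the hex digest.
sha256_of() {
    # Prefer sha256sum (common on Linux); fall back to shasum (common on macOS).
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        shasum -a 256 "$1" | awk '{print $1}'
    fi
}

verify_sidecar() {
    # $1: downloaded asset, $2: its .sha256 sidecar
    expected=$(awk 'NR == 1 {print $1}' "$2")
    actual=$(sha256_of "$1")
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage:
#   verify_sidecar "$ASSET" "$ASSET.sha256" && echo "checksum matches"
```

A matching checksum only shows the download is intact; it does not replace the attestation checks above.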

Internal Mirrors And Cluster-Admin Installs

For internal mirrors, preserve release filenames exactly, including:

platform archives or native packages
SHA256SUMS
each per-asset .sha256 sidecar

Then point the installer at the mirrored base URL and pin the matching version:

RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
  | env HPC_COMPOSE_BASE_URL="https://mirror.example.org/hpc-compose/${RELEASE_TAG}" \
        HPC_COMPOSE_VERSION="$RELEASE_TAG" sh

HPC_COMPOSE_VERSION is required when HPC_COMPOSE_BASE_URL is set so the installer, mirrored assets, and checksum files stay aligned.
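A wrapper script for cluster admins can enforce that rule before it ever fetches the installer. This is a sketch of the rule as stated above, not the installer's own code; check_mirror_env is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: refuse to continue when a mirror base URL is set without a pinned
# version. check_mirror_env is a hypothetical helper.
check_mirror_env() {
    if [ -n "${HPC_COMPOSE_BASE_URL:-}" ] && [ -z "${HPC_COMPOSE_VERSION:-}" ]; then
        echo "HPC_COMPOSE_VERSION must be set when HPC_COMPOSE_BASE_URL is set" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_mirror_env || exit 1
```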

Build From Source

Use this path for development, unreleased testing, or local inspection. Building from source requires a Rust 1.88 or newer toolchain (the crate uses edition 2024):

git clone https://github.com/NicolasSchuler/hpc-compose.git
cd hpc-compose
cargo build --release
./target/release/hpc-compose --help

Before using a local build on a cluster workflow, validate the binary and one example spec. validate and plan are static and need no cache directory:

target/release/hpc-compose validate -f examples/minimal-batch.yaml
target/release/hpc-compose plan --verbose -f examples/minimal-batch.yaml

To point the binary at a shared cache when you do run a job, set HPC_COMPOSE_CACHE_DIR, pass --cache-dir, or set x-slurm.cache_dir in the spec.

Local Docs Commands

The repo ships two documentation layers:

mdbook for the user manual
cargo doc for contributor-facing crate internals

Useful commands:

mdbook build docs
mdbook serve docs
cargo doc --no-deps

Regenerate checked-in manpages from a checkout with:

cargo run --locked --features manpage-bin --bin gen-manpages
cargo test --locked --test release_metadata
man -l man/man1/hpc-compose.1


Quickstart

This is the shortest safe path from an empty shell to a static plan, a first real Slurm run, and one-command failure triage.

If Slurm terms such as sbatch, srun, allocation, job step, Pyxis, or Enroot are unfamiliar, read Slurm And Container Basics before the first real cluster run.

1. Install The CLI

The Installation page is the single home for install, verify, mirror, and source-build commands. Install the CLI from there, confirm that hpc-compose --version works, then return here.

2. Learn The Safe Authoring Path First

The safe authoring path runs entirely on a laptop, workstation, or login node — new writes a local starter spec and plan is purely static (no sbatch, no image import):

hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

plan validates the spec and resolves service order; plan --show-script adds the rendered batch script. Run that block first on macOS, a laptop, or any machine where you want to evaluate the authoring model before touching a real cluster. The Overview page covers the same walkthrough with full expected output.

If you want a guided learning path instead of a single starter template, run the Spec Metamorphosis tutorial:

hpc-compose evolve --output compose.yaml

The normal workflow to remember is:

hpc-compose plan -f compose.yaml
hpc-compose up -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
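Those three commands can be chained so that triage only runs when the submission fails. The following is a sketch for a supported Linux Slurm submission host; plan_up_or_debug and the HPC_COMPOSE_BIN override are hypothetical names introduced here, while the subcommands and flags are the ones shown above.

```shell
#!/bin/sh
# Sketch: plan first, then up, and fall back to debug --preflight on failure.
# HPC_COMPOSE_BIN is a hypothetical override used only by this sketch.
HPC_COMPOSE_BIN="${HPC_COMPOSE_BIN:-hpc-compose}"

plan_up_or_debug() {
    # $1: spec file (defaults to compose.yaml)
    spec="${1:-compose.yaml}"
    # Stop early if the static plan fails; nothing has been submitted yet.
    "$HPC_COMPOSE_BIN" plan -f "$spec" || return 1
    if ! "$HPC_COMPOSE_BIN" up -f "$spec"; then
        # Collect the triage report, then report the failure to the caller.
        "$HPC_COMPOSE_BIN" debug -f "$spec" --preflight
        return 1
    fi
}

# Usage:
#   plan_up_or_debug compose.yaml
```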

3. Choose A Starting Spec

Use the built-in starter templates when you want a fresh compose.yaml with your application name filled in:

hpc-compose new \
  --template minimal-batch \
  --name my-app \
  --output compose.yaml

Add --cache-dir '<shared-cache-dir>' when you want the generated file to include an explicit x-slurm.cache_dir. Otherwise the plan uses the active settings cache default or $HOME/.cache/hpc-compose.

From a source checkout, you can also inspect a known-good repository example:

hpc-compose plan -f examples/minimal-batch.yaml

The Examples page is the single selection guide for beginner, LLM, training, distributed, and pipeline workflows.

Use Spec Metamorphosis when you want to learn those concepts progressively in one evolving valid spec.

4. Pick And Test A Cache Directory

cache_dir is optional in the spec, but real clusters usually need a site-specific shared path because image preparation happens before the job starts and compute nodes must later see those artifacts.

Ask your cluster documentation or support team for a project scratch, work, or shared filesystem path, then test it:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

Persist it in project settings when you want the same value every time:

hpc-compose setup --profile-name dev --cache-dir "$CACHE_DIR" --default-profile dev --non-interactive

Or, if your spec sets an explicit x-slurm.cache_dir from an environment variable, persist that variable in a .env file next to the spec:

printf 'CACHE_DIR=%s\n' "$CACHE_DIR" > .env

Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm for x-slurm.cache_dir. Validation may accept those strings, but preflight reports them as unsafe because prepare happens before runtime and compute nodes must later see the cached artifacts.
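A quick local guard can catch those four locations before you commit a value to settings. This sketch only mirrors the list above and cannot prove that a path is shared with compute nodes; is_unsafe_cache_dir is a hypothetical helper, not the preflight implementation.

```shell
#!/bin/sh
# Sketch: flag cache directories under the node-local locations listed above.
# is_unsafe_cache_dir is a hypothetical helper; preflight remains the
# authoritative check.
is_unsafe_cache_dir() {
    # $1: candidate cache directory (absolute path)
    case "$1" in
        /tmp|/tmp/*|/var/tmp|/var/tmp/*|/private/tmp|/private/tmp/*|/dev/shm|/dev/shm/*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage:
#   if is_unsafe_cache_dir "$CACHE_DIR"; then echo "pick a shared path" >&2; fi
```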

5. Before Your First Cluster Run

Command category	Where to run it	Required tools	Notes
Authoring: `new`, `plan`, `validate`, `inspect`, `render`, `config`, `schema`	laptop, workstation, or login node	`hpc-compose`	`plan` is the recommended static pre-run check.
Local real-scheduler smoke test	source checkout on a machine with Docker/Podman	`docker compose` or `podman compose`	The Local Slurm Dev Cluster runs real local `sbatch`; use `runtime.backend: host`.
Prepare: `prepare`	Linux host with selected runtime backend	Pyxis needs Enroot; Apptainer needs `apptainer`; Singularity needs `singularity`; host backend needs no container runtime	Does not call `sbatch`, but needs runtime tools for image work.
Cluster checks: `preflight`, `doctor cluster-report`	Linux Slurm login node	Slurm client tools plus selected backend tools	Use `preflight --strict` when warnings should block launch.
Run: `up`, `run`	Linux Slurm login node	`sbatch`, `srun`, scheduler tools, selected backend tools	`up` is the normal cluster execution path.
Local launch: `up --local`	Linux host only	Enroot and `runtime.backend: pyxis`	Single-host only; not a distributed Slurm substitute.

For Pyxis, srun --help should mention --container-image.

The steps up to here only author specs, set up a cache path, and read static plans, so they are safe on a laptop, workstation, or login node. From this point on, the commands call sbatch, srun, and the runtime backend, so run them only on a supported Linux Slurm submission host.

6. Submit On A Real Cluster

When you move to a supported Linux submission host, the normal run is:

hpc-compose up -f compose.yaml

up runs preflight, prepares missing artifacts, renders the batch script, submits it through sbatch, then follows scheduler state and tracked logs.

With the default pyxis backend, the prepare step on the first run (or after cache eviction) imports your container image with enroot. That means a multi-GB download followed by an extract and a squashfs build, which can take several minutes. Later runs reuse the cache, and an interactive terminal streams live import sub-progress.

On an interactive TTY, up opens the full-screen watch UI; otherwise it falls back to line-oriented output. Add --watch-queue when you want line-oriented queue polling until the Slurm job reaches RUNNING before the normal watch view opens; --queue-warn-after <DURATION> controls the one-time long-pending warning. The watch UI holds the final screen on failures by default; use --hold-on-exit never|failure|always to tune that behavior. Use hpc-compose up --detach -f compose.yaml when you want submit-and-return behavior.

Success looks like:

the job is submitted or launched
a tracked job id is recorded
the watch UI or text follower shows scheduler progress
status, ps, and logs can reconnect to the tracked run later

7. If The First Cluster Run Fails

Symptom	Best next command	Why
Missing `sbatch`, `srun`, `enroot`, `apptainer`, or `singularity`	`hpc-compose debug -f compose.yaml --preflight`	Reruns prerequisite checks and keeps the latest tracked context in one report.
`srun` does not advertise `--container-image`	`hpc-compose doctor cluster-report`	Pyxis support is unavailable or not loaded on that node.
Job submitted but no service log appeared	`hpc-compose debug -f compose.yaml`	Shows scheduler state, batch log tail, service log hints, and the next command.
Cache path warning or error	`hpc-compose debug -f compose.yaml --preflight`	Confirms whether `x-slurm.cache_dir` is shared and writable.
Services start in the wrong order	`hpc-compose plan --explain --verbose -f compose.yaml`	Shows normalized dependencies, readiness gates, and planner hints before running.

The longer symptom guide is Troubleshooting.

8. Revisit A Tracked Run Later

hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose stats -f compose.yaml
hpc-compose logs -f compose.yaml --follow

Use jobs list first when you need to rediscover tracked runs under the current repo tree. Use ps for a stable per-service snapshot, watch to reconnect to the live UI, and logs --follow for a text-only follower.

From A Source Checkout

If you are developing from a local checkout instead of an installed binary:

cargo build --release
target/release/hpc-compose validate -f examples/minimal-batch.yaml
target/release/hpc-compose plan -f examples/minimal-batch.yaml
target/release/hpc-compose plan --show-script -f examples/minimal-batch.yaml

Why hpc-compose

This is the canonical explainer for hpc-compose.

hpc-compose exists because two common approaches leave a gap:

plain sbatch scripts give you control, but multi-service coordination, startup ordering, and repeatability stay ad hoc
Docker Compose is familiar, but its networking and orchestration assumptions do not map cleanly to one Slurm allocation

hpc-compose takes the narrow path between them: a Compose-like authoring model that still produces one inspectable Slurm job.

The Pain in Current Slurm Workflows

Once a job stops being a single process, the friction climbs quickly:

helper services need explicit startup ordering
cluster-specific environment setup gets mixed into hand-written shell
debugging starts from generated state you never inspected beforehand
repeated workflows drift because the real behavior lives across scripts, notes, and local conventions

This is especially common in research ML and HPC-adjacent work where one job may need:

a serving process plus a client
a database plus a worker
a training step plus checkpoint export and resume handling

Why Docker Compose Does Not Fit Slurm Directly

Docker Compose is good at expressing a small multi-service application on one machine. Slurm solves a different problem: scheduling one batch allocation onto shared cluster resources.

That mismatch is why hpc-compose leaves several Compose features out by design. See Slurm Capability Scope for the exact unsupported-features list.

The omissions are deliberate. The point is not to emulate all of Compose on a cluster. The point is to keep a familiar authoring shape for the subset that maps cleanly to one Slurm job.

The Narrow Execution Model

hpc-compose keeps the execution model explicit: a compose-like spec is planned and rendered on the submission host into one batch script, which Slurm runs as one allocation. See Execution Model for the full spec->sbatch->srun pipeline.

That explicitness gives you a few important properties:

one inspectable unit of submission
one obvious place to look when the job fails
one explicit product boundary instead of hidden orchestration behavior

One Real Example

app-redis-worker.yaml is a good example of the intended shape:

one Redis service
one dependent worker service
TCP readiness gating before the worker starts
both services living inside the same allocation

That is awkward to hand-roll repeatedly with cluster scripts alone, but it does not justify a full orchestrator. This is the exact middle ground hpc-compose targets.
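For comparison, this is roughly what a hand-rolled TCP readiness gate looks like in a cluster script. It is a sketch of the general technique, not the code hpc-compose generates; wait_for_tcp is a hypothetical helper, and it relies on bash's /dev/tcp redirection, so it needs bash rather than a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Sketch: poll a TCP port until it accepts a connection or a timeout expires.
# wait_for_tcp is a hypothetical helper; /dev/tcp is a bash-only feature.
wait_for_tcp() {
    # $1: host, $2: port, $3: timeout in seconds
    deadline=$(( $(date +%s) + $3 ))
    while [ "$(date +%s)" -le "$deadline" ]; do
        # Opening /dev/tcp/<host>/<port> in a subshell attempts a connection
        # and closes it again when the subshell exits.
        if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Usage (host and port are placeholders):
#   wait_for_tcp 127.0.0.1 6379 60 && echo "dependency is accepting connections"
```

Repeating this by hand for every dependent service is the kind of boilerplate a readiness setting in the spec is meant to replace.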

If you want the smallest possible first run, start with minimal-batch.yaml. If you want the smallest concrete inference flow, start with llm-curl-workflow-workdir.yaml.

Why the Inspectable Path Matters

The authoring flow is designed to answer the practical questions before you launch:

hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

That lets you confirm:

whether the spec is valid
what service order will run
what image and cache behavior the planner inferred
what batch script you are actually handing to Slurm

For a Slurm-first tool, that inspectability matters more than feature breadth.

When Not To Use `hpc-compose`

Do not use hpc-compose when you need:

custom container networking
broad Docker Compose compatibility
a long-running orchestration control plane
dynamic cross-node scheduling instead of explicit x-slurm.placement node selectors

If that list rules out your workload, that is not a failure of the tool. It is the intended product boundary.

Slurm and Container Basics

This page is for users who know shell scripts, Python jobs, or Docker images, but are new to Slurm and HPC container runtimes.

It is not a Slurm administration guide. The goal is to explain the vocabulary you will see in generated hpc-compose scripts and in cluster error messages.

The Short Mental Model

The important point is that hpc-compose does not replace Slurm. It writes one inspectable Slurm batch script and uses Slurm to run the planned services inside one allocation. For the full spec->sbatch->srun pipeline, see Execution Model.

Slurm Terms In Plain Language

Term	Meaning for `hpc-compose` users
Login node	The machine where you edit files, run `plan`, run `preflight`, and submit jobs. Do not run long compute work here.
Compute node	A worker machine where Slurm runs your job after it starts.
Partition	A named queue or resource pool. Sites often use partitions to separate CPU, GPU, debug, and large jobs.
Job	A submitted unit of work managed by Slurm. `hpc-compose up` submits one job.
Allocation	The nodes, CPUs, memory, GPUs, and wall time reserved for a job.
Batch script	A shell script submitted with `sbatch`. It contains `#SBATCH` directives and normal shell commands.
Job step	A launched process group inside the allocation. `hpc-compose` launches services as `srun` steps.
Task	Usually one process or rank. More `ntasks` means more processes, not more CPU threads per process.
`cpus_per_task`	CPU threads requested for each task. This is common for threaded Python, OpenMP, or data-loader-heavy jobs.
`gres`	Slurm’s generic resource request field, commonly used for GPUs.

If you only remember one distinction: sbatch gets the allocation; srun starts work inside it.
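The task and cpus_per_task rows combine by simple multiplication: a request for 4 tasks with 2 CPUs each asks Slurm for 8 CPUs in total. A small worked sketch follows; total_cpus is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: total CPUs requested = ntasks * cpus_per_task.
# total_cpus is a hypothetical helper used only for the arithmetic.
total_cpus() {
    # $1: ntasks, $2: cpus_per_task
    echo $(( $1 * $2 ))
}

total_cpus 4 2   # prints 8
total_cpus 1 8   # prints 8: same CPU count, but one process with 8 threads
```

Both requests reserve the same number of CPUs; they differ in how many processes Slurm launches.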

A Minimal `sbatch` Script

A traditional Slurm script often looks like this:

#!/usr/bin/env bash
#SBATCH --job-name=hello-slurm
#SBATCH --partition=<partition>
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

set -euo pipefail

hostname
python -c 'print("hello from a Slurm job")'

Submit it from a Slurm login node:

sbatch hello.sbatch

sbatch returns a job id. The job may wait in the queue before it starts, and Slurm normally writes batch output to a file such as slurm-<job-id>.out unless the script or site policy sets another output path.
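By default sbatch reports the id in a line of the form "Submitted batch job <id>". The sketch below extracts the id and derives the default log name; the helper names are hypothetical, and the log name only holds when neither the script nor site policy overrides the output path.

```shell
#!/bin/sh
# Sketch: pull the job id out of sbatch's default output and derive the
# default batch log name. Both helpers are hypothetical.
job_id_from_sbatch() {
    # stdin: sbatch output, for example "Submitted batch job 12345"
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

default_batch_log() {
    # $1: job id
    printf 'slurm-%s.out\n' "$1"
}

# Usage on a Slurm login node:
#   id=$(sbatch hello.sbatch | job_id_from_sbatch)
#   tail -f "$(default_batch_log "$id")"
```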

Where `hpc-compose` Fits

The equivalent hpc-compose starting point is a spec:

name: hello-slurm

x-slurm:
  job_name: hello-slurm
  partition: <partition>
  time: "00:10:00"
  cpus_per_task: 2
  mem: 4G

services:
  app:
    image: python:3.11-slim
    command: python -c "import socket; print('hello from', socket.gethostname())"

Preview the generated Slurm script before submitting:

hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

Run it on a supported Slurm login node:

hpc-compose up -f compose.yaml

up runs preflight checks, prepares missing runtime artifacts, renders the batch script, calls sbatch, records tracked job metadata, and follows scheduler/log output.

How YAML Maps To Slurm

hpc-compose translates top-level and service x-slurm fields into #SBATCH directives and srun arguments. For the exact field-by-field mapping and the full command surface (sbatch, srun, render, up, tracked follow-ups), see Spec Reference and CLI Reference. Prefer first-class fields when they exist; use raw submit_args or extra_srun_args only for site-specific options that hpc-compose does not model directly.
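For the five fields used in the hello-slurm example, the spec names line up with the sbatch flags from the hand-written script once underscores become hyphens. The sketch below illustrates only that naming correspondence; sbatch_directive is a hypothetical helper, and the assumption that other fields follow the same pattern should be checked against Spec Reference.

```shell
#!/bin/sh
# Sketch: render one #SBATCH line from an x-slurm style field name and value.
# Illustrative only; the real mapping is documented in Spec Reference.
sbatch_directive() {
    # $1: field name such as cpus_per_task, $2: value
    flag=$(printf '%s' "$1" | tr '_' '-')
    printf '#SBATCH --%s=%s\n' "$flag" "$2"
}

sbatch_directive job_name hello-slurm   # prints #SBATCH --job-name=hello-slurm
sbatch_directive cpus_per_task 2        # prints #SBATCH --cpus-per-task=2
```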

When debugging, inspect the generated script:

hpc-compose plan --show-script -f compose.yaml

If a job was submitted but failed before service logs appeared, inspect Slurm state and batch output through:

hpc-compose debug -f compose.yaml

Pyxis And Enroot Basics

Slurm itself is the scheduler. Container support depends on what the cluster installed. The default runtime.backend: pyxis path uses the Pyxis Slurm plugin plus the Enroot unprivileged runtime, and hpc-compose maps each service into a generated srun --container-* launch.

For the Pyxis support check, the Enroot/Apptainer/Singularity/host tooling differences, and how to choose a backend, see Runtime Backends.

Why Shared Storage Matters

hpc-compose prepare can run before the Slurm job starts, but services run later on compute nodes, so the resolved runtime cache must be visible from both places. For why the cache must live on shared storage and the operational cache configuration, see Execution Model and Cache Management.

The same rule applies to host paths mounted through volumes: the compute node must be able to read the path when the service starts.

Small Checks That Explain A Lot

These commands are useful in tiny smoke tests:

hostname
env | grep '^SLURM_' | sort
python -c 'import socket; print(socket.gethostname())'
cat /etc/os-release

Inside a container, cat /etc/os-release should describe the container image. Outside the container, it describes the host. That simple distinction helps diagnose whether a command is running where you expect.
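Those checks can be folded into a short "where am I" probe for a smoke-test service. This is a sketch with hypothetical helper names; it assumes the os-release file has a PRETTY_NAME line and uses SLURM_JOB_ID, which Slurm sets inside a job.

```shell
#!/bin/sh
# Sketch: report the OS the command sees and whether it runs inside a Slurm job.
# os_pretty_name and in_slurm_job are hypothetical helpers.
os_pretty_name() {
    # $1: path to an os-release file (defaults to /etc/os-release)
    sed -n 's/^PRETTY_NAME=//p' "${1:-/etc/os-release}" | tr -d '"'
}

in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Usage as a service command:
#   echo "os: $(os_pretty_name)"
#   in_slurm_job && echo "slurm job: $SLURM_JOB_ID"
```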

Common Beginner Mistakes

Symptom	Likely misunderstanding	Next step
`plan` looks fine but `up` fails immediately	Static validation is not the same as cluster readiness.	Run `hpc-compose debug -f compose.yaml --preflight` on the login node.
`srun` does not accept `--container-image`	Pyxis is not available or not loaded in Slurm.	Read Runtime Backends and use the site-supported backend.
Cache warnings mention local paths	The cache path is not shared between login and compute nodes.	Configure `x-slurm.cache_dir` or `setup --cache-dir` with shared storage.
A GPU job waits longer than expected	The request may be larger than available idle resources.	Check site queue policy and start with the smallest useful request.
More CPUs were requested but only one process appears	`cpus_per_task` adds threads per task; it does not create more tasks.	Use `ntasks` for more processes/ranks, and make the application use them.
Docker Compose `ports` or service DNS do not work	This is one Slurm allocation, not a Docker Compose network.	See the networking stance in Execution Model.

Runtime Backends

runtime.backend selects how each service is launched inside the Slurm step. The default is pyxis.

For a beginner explanation of Slurm steps, Pyxis, Enroot, and shared runtime caches, start with Slurm And Container Basics.

runtime:
  backend: pyxis

Backend Summary

Backend	Launch shape	Required tools	Image/artifact shape	Notes
`pyxis`	`srun --container-*`	Slurm with Pyxis support plus Enroot on the submission host	remote images or local `.sqsh` / `.squashfs`	Default path and the only backend supported by local `up --local` launches.
`apptainer`	`srun` plus `apptainer exec/run`	`apptainer` on submission and compute nodes	remote images prepared or reused as `.sif`; local `.sif` accepted	Use when the site standardizes on Apptainer instead of Pyxis.
`singularity`	`srun` plus `singularity exec/run`	`singularity` on submission and compute nodes	remote images prepared or reused as `.sif`; local `.sif` accepted	Similar to Apptainer for sites that still use Singularity.
`host`	direct `srun` command	Slurm client tools and host software/modules	no container image	Services must set `command` or `entrypoint`; image prepare and container bind mounts are not applied.

For Pyxis, check support with:

srun --help | grep container-image

For all backends, preflight checks the selected backend tools:

hpc-compose preflight -f compose.yaml

On the first pyxis/Enroot run, prepare imports the image with enroot — download, extract, then squashfs build — which can take several minutes for a multi-GB image; later runs reuse the cached .sqsh. The extraction scratch defaults to the shared cache (<cache_dir>/enroot/tmp); on shared NFS/Lustre/GPFS storage you can redirect it to node-local storage with x-slurm.enroot_temp_dir (or cache.enroot_temp_dir) to avoid slow imports and Stale file handle errors, while the layer cache and final .sqsh stay on the shared cache. See Files and Directories.

When the prepare scratch is node-local, also watch prepare-time bind mounts. x-runtime.prepare.mounts (and enroot prepare-hook mounts) are applied on the login node, so a mount whose source is on a network or shared filesystem can become a new failure point during prepare. Prefer a dependency-only prepare: install dependencies into the image during prepare (pip install -r requirements.txt, uv pip install, …) and mount your source tree as a runtime volume (services.<name>.volumes) rather than a prepare.mounts entry, so prepare stays independent of network-FS mounts. examples/dev-python-app.yaml shows source mounted at runtime with dependencies baked in during prepare. preflight checks prepare mount sources (an absolute source is flagged with a hint that it may be a cluster-workspace or site-storage path that needs provisioning), and a prepare command that fails with bind mounts active lists the active mounts and suggests this pattern.

Installing Python packages (PEP 668 / externally-managed images)

How you install dependencies in prepare depends on the base image’s Python:

pip install works on the official python:*/python:*-slim images (Python from python.org, installed under /usr/local) and on Conda-based images such as pytorch/pytorch:*. The shipped Python examples use these, so a plain pip install --no-cache-dir <pkgs> is fine.
pip install is blocked on images whose Python comes from the distribution package manager — e.g. apt install python3 on an ubuntu/debian or nvidia/cuda:*-ubuntu* base. These ship an EXTERNALLY-MANAGED marker (PEP 668), so python -m pip install … fails with “externally managed environment”.

For an externally-managed image, do not reach for pip install --break-system-packages. Use one of:

x-runtime:
  prepare:
    commands:
      # Option A: a dedicated venv that can still see the image's system packages
      # (e.g. a CUDA build of torch baked into the base image):
      - python3 -m venv --system-site-packages /opt/venv
      - /opt/venv/bin/pip install --no-cache-dir <your-extra-deps>
      # Option B: uv, installed without pip via its standalone installer. uv also
      # honors PEP 668 (uv pip install --system is refused on an externally-managed
      # interpreter), so let uv create the same kind of venv and install into it:
      - curl -LsSf https://astral.sh/uv/install.sh | sh
      - $HOME/.local/bin/uv venv --system-site-packages /opt/venv
      - $HOME/.local/bin/uv pip install --python /opt/venv/bin/python --no-cache <your-extra-deps>
services:
  trainer:
    # With either option, run the venv's interpreter so the extra deps are importable:
    command: ["/opt/venv/bin/python", "train.py"]

--system-site-packages keeps framework packages that are baked into the base image (such as a CUDA-matched PyTorch) visible inside the venv, so you only install the extras on top.
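
To tell which of the two cases a base image falls into, you can look for the PEP 668 marker directly. The following is a minimal sketch, not part of hpc-compose: the function name externally_managed_marker and its stdlib_dir override are hypothetical, and it assumes the marker sits in the interpreter's stdlib directory as PEP 668 describes.

```python
import sysconfig
from pathlib import Path


def externally_managed_marker(stdlib_dir=None):
    """Return the PEP 668 marker path if it exists, else None.

    stdlib_dir is a hypothetical override used for testing; by default the
    running interpreter's stdlib directory is inspected.
    """
    base = Path(stdlib_dir) if stdlib_dir else Path(sysconfig.get_path("stdlib"))
    marker = base / "EXTERNALLY-MANAGED"
    return marker if marker.is_file() else None


if __name__ == "__main__":
    marker = externally_managed_marker()
    if marker:
        print(f"externally managed ({marker}): install into a venv")
    else:
        print("no EXTERNALLY-MANAGED marker: plain pip install should work")
```

Run it with the image's own python3 (for example as a one-off command inside the container) before deciding how to write the prepare commands.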

Local Mode

up --local, test --local, dev, and tmux are intentionally narrow:

Linux only
runtime.backend: pyxis only
Pyxis-compatible Enroot tooling on the host
single-host specs only
no distributed or partitioned placement
no service-level MPI
no Slurm arrays or scheduler dependencies

Use local mode to inspect and debug a Pyxis/Enroot single-host launch path. dev adds file-change restart requests to the local supervisor, and tmux tails tracked local service logs in panes. Neither command changes the process-supervision model, and local mode is not a replacement for Slurm distributed execution.

Host Runtime Notes

runtime.backend: host runs service commands directly under srun. It is useful for module-based workflows or nested schedulers that already manage their own software environment.

Because there is no container:

image is optional
service volumes are rejected
x-runtime.prepare and x-enroot.prepare are rejected
x-slurm.mpi.host_mpi.bind_paths is rejected

Use top-level or service-level x-env for host modules, Spack views, and environment variables.

Execution Model

This page explains the few runtime rules that matter most when a Compose mental model meets Slurm and HPC runtime backends.

What runs where

Stage	Where it runs	What happens
`plan`, `validate`, `inspect`, `preflight`	login node or local shell	Parse the spec, resolve paths, preview the runtime plan, and check prerequisites
`prepare`	login node or local shell with the selected runtime backend	Import base images and build prepared runtime artifacts
`up`	login node or local shell with Slurm access	Run preflight, prepare missing artifacts, render the batch script, call `sbatch`, and watch by default
Batch script and services	compute-node allocation	Launch the planned services through `srun` and the selected runtime backend
`status`, `ps`, `watch`, `stats`, `logs`, `artifacts`	login node or local shell	Read tracked metadata and job outputs after submission

The main consequence is simple: image preparation and validation happen before the job starts, but the containers themselves run later inside the Slurm allocation.

Service failure policies inside one job

hpc-compose does not provide a separate long-running orchestrator. Service failure handling happens inside the rendered batch script for the current allocation.

mode: fail_job keeps fail-fast behavior and stops the job on the first non-zero service exit.
mode: ignore records the failure but allows the rest of the job to continue.
mode: restart_on_failure only reacts to non-zero process exits. It does not restart on successful exits, and it does not use cross-attempt or cross-requeue history.

For restart_on_failure, the batch script enforces two limits during one live execution:

a lifetime cap through max_restarts
a rolling-window cap through max_restarts_in_window within window_seconds

If a service omits the rolling-window fields, hpc-compose still enables crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
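
The interaction between the two caps can be modeled in a few lines. This is a toy sketch of the policy described above, not the rendered batch script: the class name RestartBudget, the order in which the caps are checked, and the exact window-boundary comparison are assumptions made for illustration.

```python
from collections import deque


class RestartBudget:
    """Toy model of a lifetime cap plus a rolling-window cap."""

    def __init__(self, max_restarts, window_seconds=60, max_restarts_in_window=None):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        # Default described above: the window cap falls back to max_restarts.
        self.max_in_window = (
            max_restarts if max_restarts_in_window is None else max_restarts_in_window
        )
        self.total = 0
        self.recent = deque()  # timestamps of granted restarts

    def decide(self, exit_code, now):
        """Return 'done', 'restart', 'lifetime_exhausted', or 'window_exhausted'."""
        if exit_code == 0:
            return "done"  # successful exits are never restarted
        # Forget restarts that have aged out of the rolling window (assumed >=).
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if self.total >= self.max_restarts:
            return "lifetime_exhausted"
        if len(self.recent) >= self.max_in_window:
            return "window_exhausted"
        self.total += 1
        self.recent.append(now)
        return "restart"


budget = RestartBudget(max_restarts=5, window_seconds=60, max_restarts_in_window=3)
print([budget.decide(42, t) for t in (0, 10, 20, 30)])
# ['restart', 'restart', 'restart', 'window_exhausted']
```

With these settings the fourth crash inside one minute is blocked by the rolling window even though the lifetime cap of five has not been reached, which is the crash-loop case the window guard exists for.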

Use status to inspect the tracked policy state after submission. The text view reports:

state service 'worker': failure_policy=restart_on_failure restarts=1/5 window=1/3@60s last_exit=42 completed=no

Use logs to inspect the corresponding restart messages from the batch script when you need to distinguish lifetime-cap exhaustion from rolling-window exhaustion.

Use per-service x-slurm.hooks when you want host-side notifications around those policy transitions. on: restart runs before a granted relaunch; on: window_exhausted runs when the rolling-window guard blocks another restart. These hooks are best-effort and do not change the service policy outcome.

Which paths must be shared

The resolved cache directory must be visible from both the login node and the compute nodes. It may come from x-slurm.cache_dir, project settings, or the built-in $HOME/.cache/hpc-compose fallback.
Relative host paths in volumes, local image paths, and x-runtime.prepare.mounts resolve against the compose file directory.
Each submitted job writes per-job runtime state under <runtime-root>/<job-id> on the host. <runtime-root> defaults to <submit-dir>/.hpc-compose and can be overridden with x-slurm.runtime_root.
The active job workspace is mounted into containerized services at /hpc-compose/job. For ordinary runs that workspace is <runtime-root>/<job-id>; for resume-aware attempts it is <runtime-root>/<job-id>/attempts/<attempt>, with top-level paths kept as the latest view.
Multi-node jobs also populate /hpc-compose/job/allocation/{primary_node,nodes.txt} and export allocation-wide HPC_COMPOSE_NODE... variables plus service-scoped HPC_COMPOSE_SERVICE_NODE... variables.

Use /hpc-compose/job for small shared state inside the allocation, such as ready files, request payloads, logs, metrics, or teardown signals.
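
As one way to use that directory, a service can publish a ready file that another service polls for. This is an application-level sketch under stated assumptions, separate from the built-in depends_on and readiness checks: the helper names mark_ready and wait_ready, the .ready file naming, and the JOB_DIR environment override are all hypothetical.

```python
import os
import time
from pathlib import Path

# /hpc-compose/job is the mount point described above; JOB_DIR is a
# hypothetical override that makes the helpers testable outside a job.
JOB_DIR = Path(os.environ.get("JOB_DIR", "/hpc-compose/job"))


def mark_ready(name, job_dir=JOB_DIR):
    """Publish <name>.ready atomically so readers never see a partial file."""
    job_dir = Path(job_dir)
    tmp = job_dir / f".{name}.ready.tmp"
    tmp.write_text("ready\n")
    os.replace(tmp, job_dir / f"{name}.ready")


def wait_ready(name, job_dir=JOB_DIR, timeout=60.0, interval=0.5):
    """Poll for <name>.ready; return True if it appeared before the timeout."""
    target = Path(job_dir) / f"{name}.ready"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if target.exists():
            return True
        time.sleep(interval)
    return target.exists()
```

The producer calls mark_ready("db") once it can serve requests, and the consumer calls wait_ready("db") before its first request. Writing to a temporary name and renaming keeps the handoff atomic on a single filesystem.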

Enroot runtime paths

The generated batch script sets three Enroot runtime paths scoped per job under the resolved cache directory:

Variable	Value	Purpose
`ENROOT_CACHE_PATH`	`$CACHE_ROOT/runtime/$SLURM_JOB_ID/cache`	Enroot image cache for the current job
`ENROOT_DATA_PATH`	`$CACHE_ROOT/runtime/$SLURM_JOB_ID/data`	Enroot data directory for the current job
`ENROOT_TEMP_PATH`	`$CACHE_ROOT/runtime/$SLURM_JOB_ID/tmp`	Enroot temp directory for the current job

These paths are created at batch startup and are available inside the batch script and to tooling that reads Enroot environment variables. They are not injected into service containers.

The cache must live on storage shared between login and compute nodes because prepare runs on the login node while services run on compute nodes; node-local /tmp fails because each node sees a different filesystem. For the operational list of invalid cache paths and cache configuration, see Cache Management.
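
The path layout in the table can be reproduced outside a job when you want to locate a job's Enroot directories from the login node. This is a small sketch that only mirrors the table; the function name enroot_runtime_paths is hypothetical, and cache_root stands for whatever the resolved cache directory is at your site.

```python
from pathlib import PurePosixPath


def enroot_runtime_paths(cache_root, job_id):
    """Mirror the per-job Enroot layout: <cache_root>/runtime/<job_id>/{cache,data,tmp}."""
    base = PurePosixPath(cache_root) / "runtime" / str(job_id)
    return {
        "ENROOT_CACHE_PATH": str(base / "cache"),
        "ENROOT_DATA_PATH": str(base / "data"),
        "ENROOT_TEMP_PATH": str(base / "tmp"),
    }


for name, value in enroot_runtime_paths("/cluster/shared/hpc-compose-cache", 12345).items():
    print(f"{name}={value}")
```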

Networking inside the allocation

Single-node services share the host network on one node.
In a multi-node job, helper services stay on the allocation’s primary node by default.
A distributed service may span the full allocation, or services may use x-slurm.placement to select explicit allocation node subsets.
Partitioned services should use service-scoped metadata such as HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT, HPC_COMPOSE_SERVICE_NODELIST, and HPC_COMPOSE_SERVICE_NODELIST_FILE.
ports, custom Docker networks, and service-name DNS are not part of the model.
Use depends_on plus readiness when a dependent service must wait for real availability rather than process start.
Use depends_on with condition: service_completed_successfully when a dependent service should wait for a one-shot stage to exit successfully.

Use 127.0.0.1 only when both sides are intentionally on the same node. For multi-node distributed or partitioned runs, derive rendezvous addresses from allocation or service metadata files and environment variables instead of relying on localhost.
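
A service can apply that rule in code by preferring the service-scoped metadata and refusing to guess on multi-node runs. The sketch below is illustrative only: the function name rendezvous_address and the port are hypothetical, and it assumes HPC_COMPOSE_SERVICE_PRIMARY_NODE holds a hostname that the other nodes can resolve and that HPC_COMPOSE_SERVICE_NODE_COUNT is a plain integer.

```python
import os


def rendezvous_address(port, env=None):
    """Pick a rendezvous host:port from service metadata, not from localhost."""
    env = os.environ if env is None else env
    primary = env.get("HPC_COMPOSE_SERVICE_PRIMARY_NODE", "")
    node_count = int(env.get("HPC_COMPOSE_SERVICE_NODE_COUNT", "1"))
    if primary:
        return f"{primary}:{port}"
    if node_count > 1:
        # Localhost would point each rank at its own node, so fail loudly.
        raise RuntimeError("multi-node service without a primary node; refusing 127.0.0.1")
    return f"127.0.0.1:{port}"


if __name__ == "__main__":
    # 29500 is an arbitrary example port, not something hpc-compose assigns.
    print(rendezvous_address(29500))
```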

If a service binds its TCP port before it is actually ready, prefer HTTP or log-based readiness over plain TCP readiness.

`volumes` vs `x-runtime.prepare`

Mechanism	Use it for	When it is applied	Reuse behavior
`volumes`	fast-changing source code, model directories, input data, checkpoint paths	at runtime inside the allocation	reads live host content every normal run
`x-runtime.prepare.commands`	slower-changing dependencies, tools, and image customization	before submission on the login node	cached until the prepared artifact changes

Recommended default:

keep active source trees in volumes
keep slower-changing dependency installation in x-runtime.prepare.commands
use prepare.mounts only when the prepare step truly needs host files

Warning

If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.
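
Before mounting a directory, you can scan it for symlinks whose targets fall outside it. This is a hedged helper sketch, not an hpc-compose feature: the function name escaping_symlinks is hypothetical, and it only checks host-side path containment, which is the condition the warning describes.

```python
from pathlib import Path


def escaping_symlinks(mount_root):
    """Return symlinks under mount_root whose resolved targets leave mount_root."""
    root = Path(mount_root).resolve()
    bad = []
    for path in root.rglob("*"):
        if path.is_symlink():
            target = path.resolve()
            # Path.is_relative_to requires Python 3.9 or newer.
            if not target.is_relative_to(root):
                bad.append(path)
    return sorted(bad)


if __name__ == "__main__":
    import sys

    for link in escaping_symlinks(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{link} -> {link.resolve()}")
```

Any path it reports is likely to exist on the host but be unreachable inside the container unless the target directory is mounted as well.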

Command vocabulary

The normal run is hpc-compose up -f compose.yaml. See Quickstart for the full end-to-end description.
The tracked follow-up tools are status for scheduler/log summaries, ps for a stable per-service snapshot, and watch when you want to reconnect to the live TUI later.
The debugging flow runs validate, inspect, preflight, and prepare as separate steps when you need more visibility.

Read Runtime Backends before changing runtime.backend, Runbook for the operational workflow, Examples for starting points, and Spec Reference for exact field behavior.

Slurm Capability Scope

This page makes the hpc-compose Slurm boundary explicit. It is a tool for compiling one Compose-like application into one Slurm allocation with one or more srun steps. Those steps can use Pyxis/Enroot, Apptainer, Singularity, or host runtime software. It is not a general frontend for the full Slurm command surface.

First-class support

These capabilities are modeled, validated, and intentionally supported by the planner, renderer, and tracked-job workflow.

Area	Support
Allocation model	One Slurm allocation per application
Submission flow	`new`, `plan`, `validate`, `config`, `inspect`, `preflight`, `prepare`, `render`, `up`, `when`, `alloc`, `run`, `debug`
Tracked job workflow	`status`, `ps`, `watch`, `stats`, `score`, `logs`, `down`, `cancel`, `artifacts`, `clean`, cache inspection/pruning
Top-level Slurm fields	`job_name`, `partition`, `account`, `qos`, `time`, `nodes`, `ntasks`, `ntasks_per_node`, `cpus_per_task`, `mem`, `gres`, `gpus`, GPU/CPU binding fields, `constraint`, `output`, `error`, `chdir`
Service step fields	`nodes`, `placement`, `ntasks`, `ntasks_per_node`, `cpus_per_task`, `gres`, `gpus`, GPU/CPU binding fields, `mpi`
Multi-node model	Single-node jobs, full-allocation distributed steps, and explicit node-index partitioning within one allocation
Runtime orchestration	`depends_on`, readiness checks, one-shot completion dependencies, service failure policies, primary-node helper placement, explicit co-location through `placement.share_with`
Service hooks	Per-service `prologue` and `epilogue` lifecycle hooks, plus host-side `restart` and `window_exhausted` event hooks
Runtime workflow	Pyxis/Enroot `.sqsh`, Apptainer/Singularity `.sif`, host runtime commands, `x-runtime.prepare`, shared cache handling
Scratch and staging	`x-slurm.scratch`, `stage_in`, `stage_out`, per-service scratch opt-out, raw `#BB`/`#DW` burst-buffer directives
Job tracking	Scheduler state via `squeue`/`sacct`, step stats via `sstat`, tracked logs, runtime state, metrics, artifacts, resume metadata
Advisory cluster weather	`weather` summarizes current node and queue conditions from read-only Slurm probes without reserving resources or changing submission behavior
Conditional submission	`when` actively monitors typed conditions, then submits one normal `hpc-compose` allocation
Canary right-sizing	`germinate` submits one short canary, writes `latest-canary.json`, and recommends resource settings without rewriting the spec
Hyperparameter sweeps	`sweep submit` expands one embedded sweep into many independent single-allocation jobs, then `sweep status` aggregates their tracked state
Cross-job rendezvous	Provider/client discovery through shared-cache JSON records under one cluster-visible cache directory

Raw pass-through

These capabilities are usable, but hpc-compose does not model or validate their semantics beyond passing them through to Slurm.

Mechanism	What it allows
`x-slurm.submit_args`	Raw `#SBATCH ...` lines for site-specific flags such as mail settings, reservations, or other submit-time options
`services.<name>.x-slurm.extra_srun_args`	Raw `srun` arguments for site-specific launch flags such as exclusivity settings
Existing reservations	Joining an already-created reservation through raw submit args is supported as pass-through

Pass-through is appropriate when a site-specific flag is useful but does not justify a first-class schema field. hpc-compose rejects line breaks and null bytes in raw #SBATCH entries so one list entry cannot emit multiple directives, but it does not validate the Slurm semantics of those flags.
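
The line-break and null-byte guard is simple to picture. The following is a sketch of the idea, not the hpc-compose implementation: the function name check_submit_arg is hypothetical, and it assumes each list entry holds only the flag text and that the renderer adds the #SBATCH prefix.

```python
def check_submit_arg(entry):
    """Reject characters that could smuggle a second directive into one entry."""
    for bad, label in (("\n", "line feed"), ("\r", "carriage return"), ("\0", "null byte")):
        if bad in entry:
            raise ValueError(f"submit arg contains a {label}: {entry!r}")
    # The flag itself is passed through untouched; Slurm decides whether it is valid.
    return f"#SBATCH {entry}"


print(check_submit_arg("--mail-type=END"))
# #SBATCH --mail-type=END
```

An entry such as "--mail-type=END\n#SBATCH --exclusive" raises ValueError instead of rendering two directives, while a misspelled but single-line flag still passes and is left for Slurm to reject.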

Unsupported or out of scope

These capabilities are intentionally outside the product seam.

Area	Status
Admin-plane Slurm management	Out of scope
`sacctmgr` account administration	Out of scope
Reservation creation or lifecycle management	Out of scope
Federation / multi-cluster control	Out of scope
Cross-cluster service discovery	Out of scope; rendezvous is same-cluster shared-storage coordination only
Generic `scontrol` mutation	Out of scope
Broad cluster inspection tools such as a full `sinfo` / `sprio` / `sreport` frontend	Out of scope; `weather` is limited to a compact advisory snapshot
Background submit daemons or reservations	Out of scope; `when` is a foreground advisory monitor and does not reserve resources
Dynamic scheduling or bin packing across nodes	Not supported; use explicit `x-slurm.placement` selectors
Heterogeneous jobs	Not supported
Slurm arrays	Supported only through `x-slurm.array` for detached Slurm submissions. Local mode and live watch do not fan out array tasks; sweeps deliberately submit many normal allocations instead of Slurm arrays.
Compose `build`, `ports`, custom networks, `restart` policy, `deploy`	Not supported

Non-goals

hpc-compose should not grow into a generic Slurm administration layer. In particular, it will not broaden into sacctmgr, reservation management, federation control, or generic scontrol mutation. Those are real Slurm features, but they do not fit the “one application, one allocation, tracked runtime workflow” seam this tool is built around.

Examples

These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.

There are two starting points:

built-in starter templates generated by hpc-compose new
repository example files copied directly from examples/

Use the CLI recommendation flow when you want a ranked starting point, or the coverage map when you want to inspect every shipped example by workflow or tag:

hpc-compose examples recommend
hpc-compose examples recommend 'vllm worker'
hpc-compose examples recommend 'multi-node training' --tag gpu
hpc-compose examples list --tag mpi
hpc-compose examples search 'vllm worker'

hpc-compose examples recommend is static and authoring-only: it uses the checked-in example registry, tags, and prerequisite notes; it does not inspect the cluster, contact Slurm, or submit jobs. Each result explains why it matched and prints safe next commands such as hpc-compose new, cp, plan, and plan --show-script.

Before launching anything, run the safe authoring path first:

hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

If you are reading from a source checkout, you can run the same static checks directly against examples/minimal-batch.yaml.

Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the cache default from settings or the built-in fallback. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

Start Here: The Four Promoted Examples

These four examples are the recommended starting points.

`minimal-batch.yaml`

Demonstrates: one service, no dependencies, no image prepare step
Expected prerequisites: any machine for plan; a Linux Slurm login node plus the selected runtime backend for up
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/minimal-batch.yaml
Success signal: the batch log prints Hello from Slurm!

`app-redis-worker.yaml`

Demonstrates: multi-service startup ordering plus TCP readiness inside one allocation
Expected prerequisites: a normal Slurm + Enroot submission host and shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/app-redis-worker.yaml
Success signal: worker.log shows a successful Redis PING followed by repeated INCR jobs calls

`llm-curl-workflow-workdir.yaml`

Demonstrates: one GPU-backed LLM service plus one client service in the same job
Expected prerequisites: a GGUF model at $HOME/models/model.gguf, a GPU-capable Slurm target, and shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/llm-curl-workflow-workdir.yaml
Success signal: curl_client.log contains a JSON response from /v1/chat/completions

`training-resume.yaml`

Demonstrates: checkpoint export, resume-aware reruns, and attempt-aware training state
Expected prerequisites: shared storage for x-slurm.resume.path plus shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/training-resume.yaml
Success signal: results/<job-id>/ contains exported checkpoints and later attempts resume from the previously saved epoch

Beginner Ladder

For a guided version of the first five concepts, run hpc-compose evolve --output compose.yaml. The progressive-complexity lesson walks through minimal, second-service, readiness, failure-policy, and multi-node-placement as one evolving valid spec.

Use this ordering when you are new to the project:

Stage	Start here	Why
Authoring only	`minimal-batch.yaml` with `plan` and `plan --show-script`	Confirms the tool understands a spec without touching Slurm.
First cluster run	`minimal-batch.yaml` on a Linux Slurm login node	Smallest real submission and log-check path.
Single-node multi-service	`app-redis-worker.yaml`	Shows `depends_on` plus TCP readiness.
GPU or LLM serving	`llm-curl-workflow-workdir.yaml`, `llama-app.yaml`, or `vllm-openai.yaml`	Adds accelerator resources and service/client coordination.
Durable training	`training-checkpoints.yaml` or `training-resume.yaml`	Adds artifacts, checkpoints, and resume semantics.
Distributed launch	`multi-node-mpi.yaml`, `multi-node-torchrun.yaml`, or framework-specific examples below	Adds allocation-wide or explicitly placed multi-node services.

Built-In Starter Templates

Use built-in templates when you want hpc-compose to write a fresh compose.yaml with your application name filled in for you.

hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml

If the workflow you want is not listed by --list-templates, copy the closest repository example directly from examples/.

Broader Example Matrix

The matrix below covers the full set of runnable examples, including the four promoted starts. “Built-in template” means hpc-compose new --template <name> can scaffold it; “repository file” means copy the YAML from examples/ directly. Generate the same coverage map from the CLI with hpc-compose examples coverage --format markdown.

Example	Availability	Tags	What it demonstrates	When to start from it
`minimal-batch.yaml`	Built-in template	`beginner`, `batch`, `single-service`	Smallest single-service batch job.	You are new to hpc-compose and want the smallest possible file.
`dev-python-app.yaml`	Built-in template	`dev`, `python`, `prepare`, `hot-reload`	Mounted source code plus x-runtime.prepare.commands for dependencies.	You want an iterative source-mounted development workflow.
`dev-python-smoke.yaml`	Repository file	`test`, `python`, `dev`, `finite`	Finite test variant of the source-mounted Python app.	You want to test a development spec without a long-running process.
`cuda-probe.yaml`	Repository file	`gpu`, `cuda`, `probe`, `nvidia-smi`, `diagnostics`	Lightweight compute-node GPU/CUDA probe: hostname, nvidia-smi, and device files.	You want a fast nvidia-smi check that GPU allocation works before any real training run.
`jupyter.yaml`	Built-in template	`notebook`, `jupyter`, `gpu`, `interactive`	Tracked JupyterLab notebook server with log readiness on a GPU allocation.	You want an interactive notebook on a compute node; pair with `hpc-compose notebook`.
`app-redis-worker.yaml`	Built-in template	`multi-service`, `readiness`, `redis`, `tcp`	Multiple services with startup ordering and TCP readiness.	Your workload depends on multi-service startup ordering.
`restart-policy.yaml`	Built-in template	`failure-policy`, `restart`, `resilience`	Bounded restart_on_failure with rolling-window crash-loop guards.	You need transient-failure retries without letting a service spin forever.
`llm-curl-workflow.yaml`	Built-in template	`llm`, `curl`, `inference`, `readiness`	Repo-local LLM service with a dependent curl client.	You want the smallest concrete inference workflow under the repository tree.
`llm-curl-workflow-workdir.yaml`	Built-in template	`llm`, `curl`, `inference`, `workdir`	Home-directory LLM workflow for direct login-node use.	You want the smallest real-cluster inference workflow.
`llama-app.yaml`	Built-in template	`llm`, `gpu`, `model-serving`, `readiness`	GPU-backed service, mounted model files, and dependent app service.	You need accelerator resources or a model-serving pattern.
`llama-uv-worker.yaml`	Built-in template	`llm`, `uv`, `worker`, `python`, `llama`	llama.cpp serving plus a source-mounted Python worker run through uv.	You want the GGUF server plus mounted worker pattern.
`hf-stage-model.yaml`	Repository file	`llm`, `gpu`, `model-serving`, `huggingface`, `stage-in`	Cluster-side hf:// stage_in of a pinned HuggingFace model into a GPU service.	You want hpc-compose to download a pinned model inside the allocation, not on your laptop.
`vllm-openai.yaml`	Built-in template	`llm`, `vllm`, `openai`, `gpu`	vLLM serving with an in-job Python client.	You want vLLM-based inference instead of llama.cpp.
`vllm-uv-worker.yaml`	Built-in template	`llm`, `vllm`, `uv`, `worker`, `python`	vLLM serving plus a source-mounted Python worker run through uv.	You want a common LLM stack with mounted app code.
`eval-harness.yaml`	Built-in template	`llm`, `vllm`, `eval`, `lm-eval-harness`, `openai`, `artifacts`, `sweep`, `gpu`	vLLM OpenAI server with HTTP /health readiness plus an lm-eval-harness client and a results.json artifact, including a model/tasks sweep stub.	You want to benchmark a served model with lm-eval-harness against a loopback OpenAI endpoint.
`training-checkpoints.yaml`	Built-in template	`training`, `gpu`, `checkpoints`, `artifacts`	GPU training with checkpoints exported to shared storage.	You need durable checkpoint outputs but not automatic resume semantics.
`training-resume.yaml`	Built-in template	`training`, `gpu`, `resume`, `checkpoints`	GPU training with a shared resume directory and attempt-aware checkpoints.	The run should resume from shared storage across retries or later submissions.
`training-sweep.yaml`	Repository file	`training`, `sweep`, `hyperparameters`	Embedded sweep parameters with interpolation defaults.	You want many independent trial allocations from one sweep block.
`training-tensorboard.yaml`	Repository file	`training`, `gpu`, `tensorboard`, `sidecar`, `http-readiness`, `artifacts`	GPU training writing TensorBoard events to a shared logdir with an HTTP-readiness TensorBoard sidecar.	You want a training run with a live TensorBoard sidecar and exported event-file artifacts.
`fairseq-preprocess.yaml`	Built-in template	`training`, `nlp`, `cpu`, `preprocess`	CPU-heavy NLP data preprocessing with parallel workers.	You need a CPU-bound data preprocessing pipeline.
`canary-right-size.yaml`	Repository file	`training`, `canary`, `rightsize`, `metrics`	Deliberately over-requested training probe for germinate.	Your first question is whether a large GPU or memory request is justified.
`mpi-hello.yaml`	Built-in template	`distributed`, `mpi`, `hello`	MPI hello world using service-level x-slurm.mpi.	You need a small first-class MPI workload.
`mpi-pmix-v4-host-mpi.yaml`	Built-in template	`distributed`, `mpi`, `pmix`, `host-mpi`	Versioned PMIx launch plus host MPI bind/env configuration.	Your site requires a host MPI stack inside containers.
`multi-node-mpi.yaml`	Built-in template	`distributed`, `mpi`, `multi-node`	Primary-node helper plus one allocation-wide distributed MPI step.	You want a minimal multi-node MPI pattern without extra orchestration.
`multi-node-partitioned.yaml`	Repository file	`distributed`, `multi-node`, `placement`, `partitioned`	Disjoint node ranges, fractional selection, and explicit co-location.	Multiple distributed roles need explicit node ranges or share_with co-location.
`multi-node-torchrun.yaml`	Built-in template	`distributed`, `torchrun`, `gpu`, `training`	Allocation-wide torchrun launch using the primary node as rendezvous.	You want a multi-node GPU training starting point.
`multi-node-deepspeed.yaml`	Built-in template	`distributed`, `deepspeed`, `gpu`, `training`	DeepSpeed no-SSH multi-node training with generated rendezvous env.	You want distributed fine-tuning without hand-written rendezvous setup.
`multi-node-accelerate.yaml`	Built-in template	`distributed`, `accelerate`, `hugging-face`, `training`	Hugging Face Accelerate multi-machine launch.	You want an Accelerate-based training or fine-tuning starting point.
`multi-node-horovod.yaml`	Built-in template	`distributed`, `horovod`, `mpi`, `gpu`	Horovod rank-per-GPU launch through Slurm MPI.	You want Horovod without SSH fanout.
`multi-node-jax.yaml`	Built-in template	`distributed`, `jax`, `gpu`, `training`	JAX distributed training with generated coordinator env.	You want a JAX distributed starting point.
`nccl-tests.yaml`	Built-in template	`distributed`, `nccl`, `mpi`, `gpu`, `fabric`	MPI-backed NCCL all-reduce test job for GPU fabric debugging.	You need to debug NCCL, InfiniBand, UCX, or OFI before real training.
`ray-symmetric.yaml`	Built-in template	`distributed`, `ray`, `symmetric`	Ray symmetric-run across one Slurm allocation.	You want a modern Ray-on-Slurm starting point without an autoscaler.
`ray-head-workers.yaml`	Built-in template	`distributed`, `ray`, `workers`	Ray head plus workers inside one Slurm allocation.	You need explicit Ray head/worker control for an older or site-specific setup.
`dask-scheduler-workers.yaml`	Built-in template	`distributed`, `dask`, `workers`	Dask scheduler on the primary node plus allocation workers.	You want Dask CLI deployment inside one Slurm allocation.
`spark-standalone.yaml`	Built-in template	`distributed`, `spark`, `workers`	Spark standalone master, workers, and app submission inside one allocation.	You need a conservative Spark standalone pattern without external cluster management.
`flux-nested.yaml`	Built-in template	`distributed`, `flux`, `nested`	Nested Flux instance launched inside a Slurm allocation.	You want Flux scheduling inside an existing Slurm allocation.
`postgres-etl.yaml`	Built-in template	`workflow`, `postgres`, `etl`, `python`	PostgreSQL plus a Python data processing job.	You need a database-backed batch pipeline.
`nextflow-bridge.yaml`	Built-in template	`workflow`, `nextflow`, `bridge`	Nextflow command wrapper inside one hpc-compose allocation.	You want hpc-compose tracking around a workflow-engine run.
`snakemake-bridge.yaml`	Built-in template	`workflow`, `snakemake`, `bridge`	Snakemake command wrapper inside one hpc-compose allocation.	You want hpc-compose tracking around a Snakemake run.
`multi-stage-pipeline.yaml`	Built-in template	`workflow`, `pipeline`, `artifacts`	Two-stage data pipeline coordinating through the shared job mount.	You need file-based stage-to-stage handoff.
`pipeline-dag.yaml`	Built-in template	`workflow`, `dag`, `pipeline`, `depends-on`	One-shot preprocess -> train -> postprocess DAG with completion dependencies.	You need stage completion, not service readiness, to gate downstream work.
`rendezvous-model-server.yaml`	Repository file	`workflow`, `rendezvous`, `model-serving`	Provider job that registers a model-server endpoint in the shared cache.	One Slurm allocation should publish a service for later jobs.
`rendezvous-client.yaml`	Repository file	`workflow`, `rendezvous`, `client`	Separate client job resolving HPC_COMPOSE_RDZV_MODEL_SERVER_URL.	A later job should discover a provider through shared storage.

Which Example Should I Start From?

Run hpc-compose examples recommend with no query for the default beginner path, or pass a short workflow description when you already know the shape you need:

hpc-compose examples recommend
hpc-compose examples recommend 'checkpoint resume training'
hpc-compose examples recommend 'workflow engine bridge'
hpc-compose examples recommend 'separate rendezvous jobs' --format json

The recommendation output is the maintained chooser. It reuses the same registry metadata that feeds the example coverage table, so new examples, tags, and prerequisite notes only need to be updated in one place.

Companion notes for the more involved examples live alongside the example assets.

Development Workflow Recipe

examples/dev-python-app.yaml mounts examples/app/ and runs a long-lived Python process, so it is best for hot reload:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach

examples/dev-python-smoke.yaml keeps the same mounted-source shape but uses a finite command, so it is suitable for smoke tests:

hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml

Adaptation Checklist

Copy the closest repository example to your own compose.yaml, or run hpc-compose new --template <name> --name my-app --output compose.yaml when a matching built-in template exists.
Configure a cache path visible from both the login node and compute nodes through hpc-compose setup --cache-dir, x-slurm.cache_dir, or [defaults.cache] / [profiles.<name>.cache].
Override CACHE_DIR before running repository examples that use ${CACHE_DIR:-...}, or replace the default cache path in your copied file.
Replace the example image, command, environment, and volumes with your workload.
Keep active source in volumes and keep slower-changing dependency installation in x-runtime.prepare.commands.
Add readiness to services that must be reachable before dependents continue.
Adjust top-level or per-service x-slurm settings for your cluster.
Run hpc-compose plan -f compose.yaml before the first run, and hpc-compose debug -f compose.yaml --preflight if that run fails.
Run hpc-compose up on a cluster only from a supported Linux Slurm submission host with the selected runtime backend available.

Guided Authoring Tutorial

hpc-compose evolve is an interactive authoring tutorial. It starts from a minimal valid spec and progressively rewrites the same output file through increasingly realistic HPC workflow features.

The command is safe to run on a laptop or login node:

it validates and plans candidate specs,
it writes only the selected compose file,
it does not prepare images,
it does not call sbatch,
it does not run preflight.

Canonical Lesson

evolve currently ships one lesson:

hpc-compose evolve --describe-lesson progressive-complexity

The progressive-complexity path contains five valid snapshots:

Step id	What it teaches	Safe follow-up
`minimal`	One service and one single-node Slurm allocation	`hpc-compose plan -f compose.yaml`
`second-service`	A dependent service and startup ordering	`hpc-compose plan -f compose.yaml`
`readiness`	`readiness` plus `depends_on.condition: service_healthy`	`hpc-compose plan --show-script -f compose.yaml`
`failure-policy`	`restart_on_failure` with bounded retries and a rolling crash-loop window	`hpc-compose inspect -f compose.yaml`
`multi-node-placement`	A two-node allocation with explicit non-overlapping service placement	`hpc-compose plan -f compose.yaml`

The final step can validate anywhere, but running it requires a Slurm target that can grant a two-node allocation and a runtime backend available on that cluster.

Interactive Flow

Start the tutorial:

hpc-compose evolve --output compose.yaml

At each step, the command prints:

a short explanation,
the concepts being introduced,
a compact diff from the last accepted spec,
and the validation summary for the candidate.

Controls:

Enter, y, or a accepts the step and writes compose.yaml.
s skips the current step.
q quits and keeps the last accepted valid spec.
? prints prompt help.

Transcript Example

$ hpc-compose evolve --output compose.yaml
Step 1/5: Minimal batch spec
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml

Step 2/5: Add a dependent service
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml

Step 3/5: Gate on readiness
Accept this step? [Y/a/s/q/?]
wrote /path/to/compose.yaml

Inspect the accepted readiness-gated spec:

hpc-compose plan -f compose.yaml

Then continue the tutorial to failure policies and multi-node placement:

Accept this step? [Y/a/s/q/?]

For automation or docs examples, accept through a specific step noninteractively:

hpc-compose evolve --yes --until readiness --format json --output compose.yaml

Non-Goals

evolve does not mutate arbitrary existing specs.
evolve is not a full-screen TUI.
evolve does not submit jobs.

For a fresh single-template scaffold, use hpc-compose new. For choosing among the broader runnable examples, use Examples.

Task Guide

Use this page when you know what you want to do, but not yet which command or example should be your starting point.

First run

Read Quickstart.
Run hpc-compose evolve --output compose.yaml if you want a guided progression from minimal through multi-node-placement.
Run hpc-compose new --list-templates if you want to inspect the built-in starter templates before choosing one.
Run hpc-compose examples recommend for a static, no-Slurm starting-point recommendation with match reasons and safe next commands. Add a workflow description, such as hpc-compose examples recommend 'vllm worker', when you want registry-backed recommendations for a narrower shape.
Run hpc-compose examples list or hpc-compose examples search 'vllm worker' when you want to browse the broader example coverage map by workflow or tag.
Start from minimal-batch with hpc-compose new --template minimal-batch --name my-app --output compose.yaml.
Before running on a cluster, configure a shared cache with hpc-compose setup --cache-dir '<shared-cache-dir>' or explicit x-slurm.cache_dir. If you copy a repository example that uses CACHE_DIR, override it for your cluster before running.
Run hpc-compose plan -f compose.yaml before the first real run. Add --show-script when you want to inspect the generated launcher without writing a file.
Run hpc-compose up -f compose.yaml only from a supported Linux Slurm submission host.

Remember directory/data/env settings once

Run hpc-compose setup to create or update the project-local settings file (.hpc-compose/settings.toml).
Use hpc-compose --profile dev up so compose path, env files, env vars, and binary paths come from the selected profile.
Run hpc-compose context --format json to inspect resolved paths plus value sources. Interpolation variables are scoped to names referenced by the compose file and sensitive-looking values are redacted unless you add --show-values.
Use --settings-file <PATH> when you need an explicit settings file instead of upward discovery.

Migrate from Docker Compose

Read Docker Compose Migration.
Replace build: with image: plus x-runtime.prepare.commands.
Replace service-name networking with 127.0.0.1 or explicit allocation metadata where appropriate.

Pick a starting example

Browse the annotated catalog and chooser in Examples; it owns the per-example filename, tag, and prerequisite map.
Run hpc-compose examples recommend '<workflow description>' for a registry-backed starting point, e.g. 'multi-service app', 'multi-node training', 'checkpoint resume training', or 'vllm worker'.

Single-node multi-service app

Use Execution Model to confirm which services can rely on localhost.
Add depends_on and readiness only where ordering really matters.

Multi-node distributed training

Use generated distributed metadata such as HPC_COMPOSE_DIST_RDZV_ENDPOINT, HPC_COMPOSE_DIST_NODE_RANK, and HPC_COMPOSE_DIST_NPROC_PER_NODE instead of Docker-style service discovery.
Put cluster-specific NCCL/UCX/OFI fabric variables in .hpc-compose/cluster.toml under [distributed.env] so specs stay portable.
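
A minimal sketch of how a service command might consume the generated metadata, assuming a PyTorch workload launched with torchrun, that HPC_COMPOSE_DIST_RDZV_ENDPOINT has host:port form, and that Slurm's SLURM_NNODES is visible to the service. The helper name build_torchrun_cmd, the train.py script, and the fallback values are hypothetical. The function only prints the command, so it is safe to run anywhere:

```shell
# Hypothetical launcher helper: print a torchrun command built from the
# generated distributed metadata instead of Docker-style service names.
# The fallback values exist only so the sketch runs outside an allocation.
build_torchrun_cmd() {
  printf 'torchrun --nnodes=%s --nproc_per_node=%s --node_rank=%s --rdzv_backend=c10d --rdzv_endpoint=%s %s\n' \
    "${SLURM_NNODES:-1}" \
    "${HPC_COMPOSE_DIST_NPROC_PER_NODE:-1}" \
    "${HPC_COMPOSE_DIST_NODE_RANK:-0}" \
    "${HPC_COMPOSE_DIST_RDZV_ENDPOINT:-127.0.0.1:29500}" \
    "$1"
}

build_torchrun_cmd train.py
```

Printing the command first makes it easy to confirm the rank and endpoint values in service logs before replacing the final line with a real launch.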

Checkpoint and resume workflows

See Artifacts and Resume for the export-vs-resume split.
Keep the canonical resume source in x-slurm.resume.path, not in exported artifact bundles.

LLM serving workflows

Use volumes for model directories and fast-changing code.
Use x-runtime.prepare.commands for slower-changing dependencies.

Debug cluster readiness

Run hpc-compose validate -f compose.yaml.
Run hpc-compose validate -f compose.yaml --strict-env when default interpolation fallbacks should be treated as failures.
Run hpc-compose plan --verbose -f compose.yaml.
Run hpc-compose preflight -f compose.yaml.
Run hpc-compose debug -f compose.yaml --preflight after a failed tracked run.
Run hpc-compose doctor readiness -f compose.yaml --service <name> to inspect the normalized readiness probe, or add --run when the target service, tunnel, or log file is already reachable from the current host.
Read Troubleshooting.

Cache and artifact management

Use hpc-compose cache list to inspect imported/prepared artifacts.
Use hpc-compose cache inspect -f compose.yaml to see per-service reuse expectations.
Use hpc-compose --profile dev cache prune --age 14 when you want age-based cleanup to follow the active context cache dir.
Use hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>' when you want a direct cache cleanup that does not depend on compose resolution.
Use hpc-compose artifacts -f compose.yaml after a run to export tracked payloads.

Find and clean tracked runs

Use hpc-compose jobs list to scan the current repo tree for tracked runs.
Use hpc-compose ps -f compose.yaml when you want a one-shot per-service runtime table.
Use hpc-compose watch -f compose.yaml to reconnect to the live watch UI for the latest tracked job.
Use hpc-compose jobs list --disk-usage when you need a quick size estimate before deleting old state.
Use hpc-compose clean -f compose.yaml --dry-run --age 7 to preview what a cleanup would remove.
Use hpc-compose clean -f compose.yaml --all --format json when automation needs a stable cleanup report for one compose context, including effective latest IDs plus stale-pointer diagnostics.

Automation and scripting with JSON output

Prefer --format json for machine-readable output on non-streaming commands such as new, plan, validate, render, prepare, preflight, config, inspect, debug, status, ps, stats, score, artifacts, down, cancel, setup, cache list/cache inspect/cache prune, clean, and context. For up, --format json requires --detach or --dry-run.
Include context --format json when automation needs resolved compose path, binaries, referenced interpolation vars, and runtime path roots.
Use hpc-compose stats --format jsonl or --format csv when downstream tooling wants row-oriented metrics.
Streaming commands such as logs --follow, watch, and completions keep their native text or script output.

Migrate a docker-compose.yaml

This guide helps you convert an existing docker-compose.yaml into an hpc-compose spec for Slurm clusters using Pyxis/Enroot, Apptainer, Singularity, or host runtimes.

At a glance

Docker Compose feature	hpc-compose equivalent
`image`	`image` (same syntax, auto-prefixed with `docker://`)
`command`	`command` (string or list, same syntax)
`entrypoint`	`entrypoint` (string or list, same syntax)
`environment`	`environment` (map or list, same syntax)
`volumes`	`volumes` (host:container bind mounts, same syntax)
`depends_on`	`depends_on` (list or map with `condition: service_started` / `service_healthy` / `service_completed_successfully`)
`working_dir`	`working_dir` (requires explicit `command` or `entrypoint`)
`build`	Not supported. Use `image` + `x-runtime.prepare.commands` instead.
`ports`	Not supported. Use host networking semantics instead. `127.0.0.1` works only when both sides run on the same node.
`networks` / `network_mode`	Not supported. There is no Docker-style overlay network or service-name DNS layer.
`restart`	Not supported as a Compose key. Use `services.<name>.x-slurm.failure_policy`.
`deploy`	Not supported. Use `x-slurm` for resource allocation.
`healthcheck`	Supported for a constrained TCP/HTTP subset and normalized into `readiness`; use explicit `readiness` for anything more complex.
Resource limits (`cpus`, `mem_limit`)	Use `x-slurm.cpus_per_task`, `x-slurm.mem`, `x-slurm.gpus`

Side-by-side: web app + Redis

Docker Compose

version: "3.9"
services:
  redis:
    image: redis:7
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      redis:
        condition: service_healthy
    environment:
      REDIS_HOST: redis
    volumes:
      - ./app:/workspace
    working_dir: /workspace
    command: python -m main

hpc-compose

version: "1"
name: my-app

x-slurm:
  job_name: my-app
  time: "01:00:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: /cluster/shared/hpc-compose-cache

services:
  redis:
    image: redis:7
    command: redis-server --save "" --appendonly no
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 6379
      timeout_seconds: 30

  app:
    image: python:3.11-slim
    depends_on:
      redis:
        condition: service_healthy
    environment:
      REDIS_HOST: 127.0.0.1
    volumes:
      - ./app:/workspace
    working_dir: /workspace
    command: python -m main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir redis fastapi uvicorn

Key changes

version: "3.9" → version: "1" or remove the field. hpc-compose uses this as its own spec schema version, not a Docker Compose compatibility version.
build: . → image: python:3.11-slim + x-runtime.prepare.commands for dependencies.
ports → Removed. Services communicate via 127.0.0.1 because they run on the same node.
REDIS_HOST: redis → REDIS_HOST: 127.0.0.1. No DNS service names; use localhost.
healthcheck → readiness with type: tcp.
Added x-slurm block for Slurm resource allocation (time, memory, CPUs).
Configured a shared cache for image storage, either through x-slurm.cache_dir as shown or project settings.

Key differences

Networking

Docker Compose creates isolated networks where services find each other by name. In hpc-compose, helper services on the same node share the host network directly, and multi-node distributed steps must use explicit rendezvous addresses. Replace service hostnames with 127.0.0.1 only when both sides intentionally stay on one node. For multi-node runs, derive the rendezvous host from /hpc-compose/job/allocation/primary_node or HPC_COMPOSE_PRIMARY_NODE.
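
A minimal sketch of resolving the rendezvous host inside a service, assuming the metadata file holds a single hostname on its first line. Checking the environment variable before the file is an arbitrary choice here, and the helper name primary_node and port 29500 are hypothetical:

```shell
# Resolve the rendezvous host for a multi-node step.
# Prefers HPC_COMPOSE_PRIMARY_NODE, then the allocation metadata file.
# Assumption: the file holds a single hostname on its first line.
primary_node() {
  metadata_file=${1:-/hpc-compose/job/allocation/primary_node}
  if [ -n "${HPC_COMPOSE_PRIMARY_NODE:-}" ]; then
    printf '%s\n' "$HPC_COMPOSE_PRIMARY_NODE"
  elif [ -r "$metadata_file" ]; then
    head -n 1 "$metadata_file"
  else
    echo "primary node metadata not found" >&2
    return 1
  fi
}

# Example: build an endpoint for a hypothetical coordinator on port 29500.
if host=$(primary_node 2>/dev/null); then
  echo "coordinator endpoint: ${host}:29500"
fi
```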

Building images

Docker Compose uses build: to run a Dockerfile. hpc-compose uses x-runtime.prepare.commands instead:

# Docker Compose
app:
  build:
    context: .
    dockerfile: Dockerfile

# hpc-compose
app:
  image: python:3.11-slim
  x-runtime:
    prepare:
      commands:
        - pip install --no-cache-dir -r /tmp/requirements.txt
      mounts:
        - ./requirements.txt:/tmp/requirements.txt

Prefer volumes for fast-changing source code and x-runtime.prepare.commands for slower-changing dependencies. x-enroot.prepare remains accepted as a Pyxis/Enroot compatibility spelling, but new specs should use x-runtime.prepare.

Health checks vs readiness

Docker Compose uses healthcheck with a test command, interval, timeout, and retries. hpc-compose accepts a constrained healthcheck subset and normalizes it into readiness, which supports four probe types:

# TCP: wait for a port to accept connections
readiness:
  type: tcp
  host: 127.0.0.1
  port: 6379
  timeout_seconds: 30

# HTTP: wait for an endpoint to return an expected status
readiness:
  type: http
  url: http://127.0.0.1:8080/health
  status_code: 200
  timeout_seconds: 30

# Log: wait for a pattern in service output
readiness:
  type: log
  pattern: "Server started"
  timeout_seconds: 60

# Sleep: fixed delay
readiness:
  type: sleep
  seconds: 5

Supported healthcheck migration patterns:

["CMD", "nc", "-z", HOST, PORT]
["CMD-SHELL", "nc -z HOST PORT"]
recognized curl probes against http:// or https:// URLs
recognized wget --spider probes against http:// or https:// URLs

Still unsupported:

arbitrary custom command probes
interval
retries
start_period
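
To make the simplest supported pattern concrete, here is a rough sketch that turns a CMD-SHELL style "nc -z HOST PORT" probe into a readiness block by hand. The helper healthcheck_to_readiness is hypothetical and is not how hpc-compose performs its own normalization, which may emit additional fields:

```shell
# Hypothetical helper: turn a CMD-SHELL style "nc -z HOST PORT" probe into
# a readiness block. Anything else is reported as needing manual conversion.
healthcheck_to_readiness() {
  # Word-split the probe on purpose to look at its individual arguments.
  set -- $1
  if [ "$#" -eq 4 ] && [ "$1" = "nc" ] && [ "$2" = "-z" ]; then
    printf 'readiness:\n  type: tcp\n  host: %s\n  port: %s\n' "$3" "$4"
  else
    echo "unsupported probe: write an explicit readiness block" >&2
    return 1
  fi
}

healthcheck_to_readiness "nc -z 127.0.0.1 6379"
```

A probe such as "redis-cli ping" falls into the unsupported branch, which matches the arbitrary custom command case listed above.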

Resource allocation

Docker Compose uses deploy.resources or top-level cpus/mem_limit. hpc-compose uses Slurm-native resource settings:

x-slurm:
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1

services:
  app:
    x-slurm:
      cpus_per_task: 4
      gpus: 1

Restart policies

Docker Compose supports restart: always, on-failure, etc. hpc-compose does not accept the Compose restart: key, but it does support per-service restart behavior through services.<name>.x-slurm.failure_policy.

services:
  app:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3

restart_on_failure retries only on non-zero exits. It enforces both a lifetime restart cap and a rolling-window crash-loop cap during one live batch-script execution. If you omit the rolling-window fields, hpc-compose defaults to window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>. Use mode: fail_job (default) for fail-fast behavior, or mode: ignore for non-critical sidecars.
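
The interaction of the two caps can be sketched as a small decision function. This is an illustration only: the helper restart_allowed is hypothetical, and the exact boundary handling (a restart counts as inside the window while its age is strictly less than window_seconds) is an assumption, not the documented behavior:

```shell
# Hypothetical helper: decide whether one more restart fits both caps.
# Usage: restart_allowed NOW MAX_RESTARTS WINDOW_SECONDS MAX_IN_WINDOW [PAST_RESTART_TIMES...]
restart_allowed() {
  now=$1; max_restarts=$2; window=$3; max_in_window=$4
  shift 4
  total=0
  recent=0
  for t in "$@"; do
    total=$((total + 1))
    # Count restarts that happened inside the rolling window ending at NOW.
    if [ $((now - t)) -lt "$window" ]; then
      recent=$((recent + 1))
    fi
  done
  [ "$total" -lt "$max_restarts" ] && [ "$recent" -lt "$max_in_window" ]
}

# Lifetime budget of 10, but at most 3 restarts in any 60 second window.
# Three restarts at t=50, 70, 90 are all recent at t=100, so the loop stops.
if restart_allowed 100 10 60 3 50 70 90; then
  echo "restart"
else
  echo "crash loop: stop restarting"
fi
```

With the same history but a later clock, for example NOW=200, the window is empty again and the remaining lifetime budget applies.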

Practical mapping:

Compose restart: "no" -> omit failure_policy or use mode: fail_job
Compose restart: on-failure[:N] -> use mode: restart_on_failure with max_restarts: N when you want a similar lifetime retry budget
Compose restart: always / unless-stopped -> no direct equivalent; hpc-compose intentionally keeps restart handling bounded within one batch job

The rolling-window fields have no direct Docker Compose equivalent. They exist to stop fast crash loops inside one Slurm allocation without giving up a larger lifetime retry budget for transient failures.
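
The practical mapping can be sketched as a lookup that prints a failure_policy fragment to paste under services.<name>.x-slurm. The helper compose_restart_to_policy is hypothetical and only covers the cases listed above:

```shell
# Hypothetical helper: print a failure_policy fragment for a Compose
# restart value, following the practical mapping above.
compose_restart_to_policy() {
  case "$1" in
    no)
      printf 'failure_policy:\n  mode: fail_job\n' ;;
    on-failure)
      printf 'failure_policy:\n  mode: restart_on_failure\n' ;;
    on-failure:*)
      printf 'failure_policy:\n  mode: restart_on_failure\n  max_restarts: %s\n' "${1#on-failure:}" ;;
    always|unless-stopped)
      echo "no direct equivalent for restart: $1" >&2
      return 1 ;;
    *)
      echo "unknown restart value: $1" >&2
      return 2 ;;
  esac
}

compose_restart_to_policy on-failure:5
```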

What to do about unsupported features

Feature	Alternative
`build`	Use `image` + `x-runtime.prepare.commands`. Mount build context files with `x-runtime.prepare.mounts` if needed.
`ports`	Not needed. Services share `127.0.0.1` on one node.
`networks` / `network_mode`	Not needed. Same-node services share the host network, and there is no overlay network to configure.
`restart`	Use `services.<name>.x-slurm.failure_policy` (`fail_job`, `ignore`, `restart_on_failure`).
`deploy`	Use `x-slurm` for resources.
Service DNS names	Use `127.0.0.1` for same-node helpers, or explicit host metadata such as `HPC_COMPOSE_PRIMARY_NODE` for distributed runs.
Named volumes	Use host-path bind mounts in `volumes`.
`.env` file	Supported. `.env` in the compose file directory is loaded automatically.

Migration checklist

Replace Compose version: — Use version: "1" or omit the field; values like "3.9" are rejected by hpc-compose.
Remove build: — Replace with image: pointing to a base image. Move dependency installation to x-runtime.prepare.commands.
Remove ports: — Use host-network semantics instead of container port publishing.
Remove networks: / network_mode: — There is no Docker-style overlay network or service-name DNS layer.
Remove Compose restart: — use services.<name>.x-slurm.failure_policy when you need per-service restart behavior.
Remove deploy: — Use x-slurm for resource allocation.
Replace service hostnames — Change any service-name references (e.g. redis, postgres) to 127.0.0.1 for same-node helpers, or to explicit allocation metadata for distributed runs.
Replace healthcheck: — Convert to readiness: with type: tcp, type: http, type: log, or type: sleep.
Add x-slurm: — Set time, mem, cpus_per_task, and optionally gpus, partition, account.
Set cache storage — Point x-slurm.cache_dir or setup --cache-dir to shared storage visible from login and compute nodes.
Validate — Run hpc-compose validate -f compose.yaml to check the converted spec.
Inspect — Run hpc-compose inspect --verbose -f compose.yaml to confirm the planner understood your intent.
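
Before running validate, a quick scan can list checklist keys that are still present in a copied file. This is a rough sketch: find_unsupported_keys is a hypothetical helper based on a line-oriented grep rather than a YAML parser, so it can miss unusual formatting, and hpc-compose validate remains the real check:

```shell
# Hypothetical pre-check: list Compose keys from the checklist that still
# appear in a file. Prints matching lines with line numbers.
find_unsupported_keys() {
  grep -nE '^[[:space:]]*(build|ports|networks|network_mode|restart|deploy):' "$1"
}

# Demo on a throwaway file.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
services:
  app:
    build: .
    ports:
      - "8000:8000"
    image: python:3.11-slim
EOF

find_unsupported_keys "$demo_file" || echo "no unsupported keys found"
rm -f "$demo_file"
```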

Operate a Real Cluster Run

This runbook is the normal real-cluster flow for adapting an hpc-compose spec on a supported Linux Slurm submission host.

If you are new to Slurm, read Slurm And Container Basics first. If you are adapting to HAICORE@KIT, read HAICORE Guide alongside this runbook.

Commands below assume hpc-compose is on your PATH. If you are running from a local checkout, replace hpc-compose with target/release/hpc-compose.

Compose-aware commands accept -f / --file. When omitted, hpc-compose uses the active context compose file from .hpc-compose/settings.toml, then falls back to compose.yaml in the current directory. Global context flags are available everywhere:

--profile <NAME> selects a profile from .hpc-compose/settings.toml.
--settings-file <PATH> uses an explicit settings file instead of upward auto-discovery.

Read Slurm And Container Basics, Execution Model, Runtime Backends, and Support Matrix before adapting a workflow to a new cluster.

Before You Start

Make sure you have:

a Linux submission host with srun and sbatch,
the runtime backend selected by runtime.backend,
scontrol when x-slurm.nodes > 1,
Pyxis support in srun when runtime.backend: pyxis (srun --help should mention --container-image),
shared storage for the resolved cache directory,
local source trees or local .sqsh / .sif images in place,
registry credentials when your cluster or registry requires them.

Backend-specific requirements are listed in Runtime Backends. Cluster profile generation and MPI smoke probes are covered in Cluster Profiles.

The Operational Spine

For a new spec on a real cluster, work the numbered steps below in order:

Choose a starter from Examples, or run hpc-compose new --template <name> --name my-app --output compose.yaml. See Choose A Starting Example.
Run hpc-compose setup once and verify resolved values with hpc-compose context --format json. See Project-Local Settings.
Choose the cache directory early. See Choose A Cache Directory Early.
Adapt the example and adjust cluster-specific resource settings. See Adapt The Example.
Validate the spec. See Validate The Spec.
Plan the run. See Plan The Run.
Launch with up. See Normal Run: Use up.
When debugging cluster readiness, prepare, or rendering, break out preflight, prepare, and render separately. See steps 6–8.
Inspect the tracked run. See Inspect A Tracked Run.
Manage cache and old state. See Manage Cache And Old State.

If a run fails, start with hpc-compose debug -f compose.yaml --preflight, then follow the First Triage flow in Troubleshooting.

For a minimal cluster smoke test from a checkout, set CACHE_DIR to shared storage and run scripts/cluster_smoke.sh. It validates, preflights, and renders by default; set HPC_COMPOSE_SMOKE_SUBMIT=1 only when you intentionally want it to launch the smoke job.

Project-Local Settings

hpc-compose can discover .hpc-compose/settings.toml by walking upward from the current directory. You can also pin a file with --settings-file.

Typical setup flow:

hpc-compose setup
hpc-compose context
hpc-compose --profile dev context --format json

Non-interactive setup is available for scripting:

hpc-compose setup --profile-name dev --compose-file compose.yaml --env-file .env --env-file .env.dev --cache-dir '<shared-cache-dir>' --default-profile dev --non-interactive

Settings file shape:

version = 1
default_profile = "dev"

[defaults]
compose_file = "compose.yaml"
env_files = [".env"]
login_host = "login01.hpc.example.edu"
login_user = "<username>"

[defaults.env]
CACHE_DIR = "/cluster/shared/hpc-compose-cache"

[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"

[profiles.dev]
compose_file = "compose.yaml"
env_files = [".env", ".env.dev"]

[profiles.dev.env]
RESUME_DIR = "/shared/$USER/runs/my-run"
MODEL_DIR = "$HOME/models"

[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"

[resource_profiles.cpu-small]
time = "00:30:00"
cpus_per_task = 4
mem = "16G"

[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"

Resolution precedence is fixed:

CLI flags
selected profile values
shared settings defaults
built-in CLI defaults

Use context whenever you want to inspect effective compose path, binaries, interpolation variables, runtime paths, and per-field sources.
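
The fixed precedence amounts to a first-match lookup. A minimal sketch, assuming an empty string stands for "not set at this level"; resolve_setting and the compose.dev.yaml filename are hypothetical:

```shell
# Return the first non-empty candidate, mirroring the fixed order:
# CLI flag, selected profile, shared defaults, built-in default.
resolve_setting() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# No -f flag given, the profile names a compose file, defaults name another.
resolve_setting "" "compose.dev.yaml" "compose.yaml" "compose.yaml"
```

hpc-compose context reports the same decision per field, together with the source each value came from.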

Resource profiles are referenced from YAML with x-slurm.resources: gpu-small. They are Slurm resource defaults, not the same thing as the global --profile setting selector, and explicit x-slurm values in the spec override profile defaults.
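
As an illustration, the following writes a small spec that references the gpu-small profile from the settings example and overrides one value explicitly. The scratch directory and the service contents are placeholders:

```shell
# Write a demo spec into a scratch directory for inspection.
demo_dir=$(mktemp -d)
cat > "$demo_dir/compose.yaml" <<'EOF'
version: "1"
name: my-app

x-slurm:
  resources: gpu-small   # defaults from [resource_profiles.gpu-small]
  time: "02:00:00"       # explicit value overrides the profile's 01:00:00

services:
  app:
    image: python:3.11-slim
    command: python -m main
EOF
cat "$demo_dir/compose.yaml"
```

Running hpc-compose plan against a file like this, from a project whose settings define gpu-small, is a static way to confirm which values win.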

login_host is the SSH login host. It is the default SSH destination for hpc-compose up --remote when you do not pass --remote=<host>, and it also names the host shown in notebook, reach, and pull connection/tunnel hints and in the machine-readable hpc-compose notebook --format json output. A profile’s login_host overrides the shared default.

login_user is the SSH username applied to a bare login host, so the resolved destination becomes user@host. A profile’s login_user overrides the shared default. The login user for up --remote is resolved with this precedence: an explicit user@ already present in --remote=<dest> or login_host wins; then the HPC_COMPOSE_REMOTE_USER environment variable; then settings login_user (profile over defaults); then the User from your ~/.ssh/config. Persist both values with hpc-compose setup --profile-name <name> --login-host <host> --login-user <user> (written into [profiles.<name>]) or edit settings.toml directly.
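
The login-user precedence can be sketched as follows. The helper resolve_remote_dest is hypothetical and only mirrors the order described above; the final fallback is left to ssh itself, which applies any User entry from ~/.ssh/config:

```shell
# Hypothetical helper mirroring the documented login-user precedence.
# Usage: resolve_remote_dest DEST [SETTINGS_LOGIN_USER]
#   DEST is the --remote value or the configured login_host.
resolve_remote_dest() {
  dest=$1
  settings_user=${2:-}
  case "$dest" in
    *@*) printf '%s\n' "$dest"; return 0 ;;   # explicit user@ wins
  esac
  if [ -n "${HPC_COMPOSE_REMOTE_USER:-}" ]; then
    printf '%s@%s\n' "$HPC_COMPOSE_REMOTE_USER" "$dest"
  elif [ -n "$settings_user" ]; then
    printf '%s@%s\n' "$settings_user" "$dest"
  else
    # Bare host: ssh applies the User from ~/.ssh/config, if any.
    printf '%s\n' "$dest"
  fi
}

resolve_remote_dest login01 alice
```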

If the login node requires an OTP/2FA on every SSH session, use SSH connection multiplexing (ControlMaster/ControlPersist) so you authenticate once and reused tunnels skip the prompt — see Run a Notebook or IDE Session.
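
A minimal multiplexing sketch using standard OpenSSH client options. The host name is taken from the settings example; the ControlPath location and the 10 minute persistence are illustrative choices, and the snippet is written to a scratch file so you can review it before merging it into ~/.ssh/config by hand:

```shell
# Write a sample ssh_config snippet to a scratch file for review.
snippet=$(mktemp)
cat > "$snippet" <<'EOF'
Host login01.hpc.example.edu
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 10m
EOF
cat "$snippet"
```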

An editor schema for settings.toml is available:

hpc-compose schema --kind settings

For TOML editor integration, write that schema to a file (hpc-compose schema --kind settings > hpc-compose-settings.schema.json) and point your TOML language server at the local path.

Choose A Starting Example

The maintained selection guide is Examples. It includes:

four promoted beginner paths,
a novice ladder from authoring to distributed workloads,
the full repository example matrix,
companion notes for LLM worker examples,
an adaptation checklist.

Keep docs/src/examples.md as the single source of truth for example selection. The embedded YAML source appendix is Example Source.

1. Choose A Cache Directory Early

Set the cache default to a path visible from both the login node and compute nodes:

[profiles.dev.cache]
dir = "/cluster/shared/hpc-compose-cache"

Or set x-slurm.cache_dir directly in the spec when the cache path should travel with that file:

x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache

Quick recipe:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

Rules:

Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm.
If cache_dir is unset in the spec, resolution checks profile cache settings, then defaults cache settings, then $HOME/.cache/hpc-compose.
The default may work on some clusters, but a shared project/work/scratch path is safer.
Validation can accept unsafe local paths; preflight reports them as policy errors.
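
A rough local pre-check for the rules above. is_unsafe_cache_path is a hypothetical helper that only looks at the path prefix; it cannot prove that compute nodes see the same directory, which is what preflight and a real run establish:

```shell
# Return success when the path sits under a location the rules forbid.
is_unsafe_cache_path() {
  case "$1" in
    /tmp|/tmp/*|/var/tmp|/var/tmp/*|/private/tmp|/private/tmp/*|/dev/shm|/dev/shm/*) return 0 ;;
    *) return 1 ;;
  esac
}

candidate=${CACHE_DIR:-$HOME/.cache/hpc-compose}
if is_unsafe_cache_path "$candidate"; then
  echo "unsafe cache dir: $candidate"
elif [ -d "$candidate" ] && [ -w "$candidate" ]; then
  echo "cache dir looks usable from this host: $candidate"
else
  echo "cache dir missing or not writable: $candidate"
fi
```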

More cache details are in Cache Management.

2. Adapt The Example

Start with the nearest example and then change:

image
command / entrypoint
volumes
environment
x-slurm resource settings
x-runtime.prepare commands for dependencies or tooling

Recommended pattern:

Put fast-changing application code in volumes.
Put slower-changing dependency installation in x-runtime.prepare.commands.
Add readiness only to services that other services truly depend on.

3. Validate The Spec

hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env

Use validate first when changing field names, dependency shape, command/entrypoint form, paths, x-slurm, x-runtime, or compatibility x-enroot blocks.

If validate fails, fix that before doing anything more expensive. Use --strict-env when missing interpolation variables should fail instead of consuming ${VAR:-default} or ${VAR-default} fallbacks.
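
The two fallback forms differ in how they treat an empty value. The snippet below shows plain POSIX shell behavior as an analogy; that hpc-compose interpolation follows the same rule is an assumption, so confirm it against the Spec Reference:

```shell
# Plain POSIX shell semantics for the two fallback forms.
unset DEMO_VAR
echo "unset, colon form: ${DEMO_VAR:-default}"     # fallback is used
echo "unset, plain form: ${DEMO_VAR-default}"      # fallback is used

DEMO_VAR=""
echo "empty, colon form: ${DEMO_VAR:-default}"     # fallback is used
echo "empty, plain form: [${DEMO_VAR-default}]"    # empty value is kept
```

In both forms a missing variable silently takes the default, which is the situation --strict-env turns into a failure.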

4. Plan The Run

hpc-compose plan -f compose.yaml
hpc-compose plan --verbose -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

Check:

service order,
allocation geometry and service step geometry,
normalized image references,
host-to-container mount mappings,
resolved environment values,
runtime artifact paths,
cache hit/miss expectations.

plan is purely static: it parses, validates, builds the normalized runtime plan, and can print the generated script to stdout, but it does not run preflight, prepare images, call sbatch, or write hpc-compose.sbatch. Add --explain for planner hints about cache paths, missing artifacts, resume/artifact settings, and the next command. plan --verbose can print secrets from resolved environment values.

5. Normal Run: Use `up`

hpc-compose up -f compose.yaml

up is the preferred end-to-end cluster flow. It runs preflight unless disabled, prepares images unless skipped, renders the script, calls sbatch, records tracked job metadata, polls scheduler state, and streams logs. It also uses a spec-scoped lock under .hpc-compose/locks/ so two concurrent up invocations against the same compose file do not race through prepare/render/submit.

Useful options:

--script-out path/to/job.sbatch keeps a copy of the rendered script.
--force-rebuild refreshes imported and prepared artifacts.
--skip-prepare reuses existing prepared artifacts.
--no-preflight skips the preflight phase.
--detach submits or launches, records tracking metadata, and returns without watching.
--format text|json is accepted with --detach or --dry-run.
--watch-queue waits in line-oriented queue output until the Slurm job reaches RUNNING, then opens the normal watch view.
--queue-warn-after <DURATION> warns once when --watch-queue stays PENDING longer than the threshold; the default is 10m, and 0 disables the warning.
--watch-mode auto|tui|line selects the live output mode.
--hold-on-exit never|failure|always controls whether the TUI stays open after the job reaches a terminal scheduler state.
--resume-diff-only prints resume-sensitive config diffs without launching.
--allow-resume-changes confirms intentional resume-coupled config drift.

up --local is Linux + Pyxis-only and single-host. See Runtime Backends.

Array jobs should be submitted with up --detach; use SLURM_ARRAY_TASK_ID in the service command and output patterns such as %A_%a for task-specific logs. Scheduler dependencies declared with x-slurm.after_job or x-slurm.dependency are passed to sbatch --dependency=... at submit time. Arrays and scheduler dependencies are not supported by up --local.
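
A minimal sketch of a service command that picks its input from the array index. SLURM_ARRAY_TASK_ID is set by Slurm for array tasks; the helper shard_for_task, the shard naming scheme, and the fallback index of 0 (so the sketch runs outside Slurm) are hypothetical:

```shell
# Hypothetical helper: map an array task index to an input shard name.
shard_for_task() {
  printf 'shard-%03d.jsonl\n' "${1:-0}"
}

task_id=${SLURM_ARRAY_TASK_ID:-0}
echo "task ${task_id} processes $(shard_for_task "$task_id")"
```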

For conditional submission on a busy partition, use when:

hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4 --poll-interval 120s
hpc-compose when -f compose.yaml --after-job 12345
hpc-compose when -f compose.yaml --between 22:00-06:00

when is a foreground monitor. Interrupt it with Ctrl-C to stop waiting before the job is submitted. It runs preflight, image preparation, and script rendering before the wait begins, so submission is immediate once the conditions match; use --skip-prepare only when the required runtime artifacts already exist. --detach applies after submission: it still waits in the foreground for conditions, then returns after tracking metadata is written instead of opening the watch view.

Idle-node checks are advisory, not reservations. Another user can still submit first, and Slurm may queue the job after when calls sbatch. Keep polling gentle on shared login nodes: the default --poll-interval is 60s (minimum 5s); reserve very short intervals for brief, intentional watches.

For interactive development inside one allocation, use alloc:

hpc-compose alloc -f compose.yaml
hpc-compose run app -- python -m pytest

Inside the allocation shell, run SERVICE -- CMD reuses the active allocation with srun instead of submitting a new sbatch job. alloc exports HPC_COMPOSE_* metadata for the compose file, cache directory, runtime backend, and allocated nodes. For interactive notebook sessions inside an allocation, see Notebook.

5b. Submit From Your Laptop With `up --remote`

hpc-compose up runs on a Linux Slurm login node. macOS (and any host without Slurm) is authoring-only, so to submit from a laptop, delegate the run to a login node over SSH:

# Uses the configured login_host (with login_user, if set, as user@host):
hpc-compose up --remote -f compose.yaml
# Or target a specific host or ~/.ssh/config alias:
hpc-compose up --remote=login01 -f compose.yaml
# Or pass the SSH user inline:
hpc-compose up --remote=alice@login01 -f compose.yaml

The SSH destination comes from --remote=<dest> when given, otherwise from login_host. The login user follows the precedence documented in Project-Local Settings: an inline user@ wins, then HPC_COMPOSE_REMOTE_USER, then settings login_user (profile over defaults), then your ~/.ssh/config User.
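The precedence above can be sketched as a small lookup. This is an illustration of the documented order, not the tool's own code; resolve_login_user and its dictionary arguments are hypothetical.

```python
def resolve_login_user(remote_dest, env, profile_settings, default_settings):
    """Return the SSH login user, or None to let ~/.ssh/config decide."""
    # 1. An inline user@ in --remote=<dest> wins.
    if remote_dest and "@" in remote_dest:
        return remote_dest.split("@", 1)[0]
    # 2. Then the HPC_COMPOSE_REMOTE_USER environment variable.
    if env.get("HPC_COMPOSE_REMOTE_USER"):
        return env["HPC_COMPOSE_REMOTE_USER"]
    # 3. Then settings login_user, with the profile taking priority over defaults.
    for settings in (profile_settings, default_settings):
        if settings.get("login_user"):
            return settings["login_user"]
    # 4. Otherwise defer to the User entry in ~/.ssh/config.
    return None
```

Returning None in the last step models "pass no user and let ssh apply its own configuration".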

What `--remote` stages

--remote rsyncs the compose project to a per-project staging directory on the login node (~/.hpc-compose-remote/<project>), including project settings such as .hpc-compose/settings.toml and .hpc-compose/cluster.toml while excluding tracked job/runtime state. It then runs hpc-compose up there over SSH, streaming the output back and propagating the remote exit code. Behavioral up flags such as --detach, --dry-run, --no-preflight, --skip-prepare, --force-rebuild, --allow-resume-changes, --resume-diff-only, --format, --print-endpoints, --watch-mode line, and --hold-on-exit are forwarded; without --detach the default remote run streams in line mode.

The staged root is the settings base: the directory that contains .hpc-compose/settings.toml. Place that file (or run hpc-compose setup) at the repo root so your whole source tree is staged. If your compose file lives in a subdirectory (for example hpc/haicore/compose.yaml) and there is no repo-root settings file, only that subdirectory is staged and the rest of your source tree is hidden from the job; hpc-compose prints a warning when it stages only a subdir.

--remote stages your repo only. It does not allocate cluster workspaces (for example ws_allocate) or create site storage directories — provision those yourself first, or a missing host bind-mount path blocks preflight. See Repo staging vs cluster workspace provisioning.

Before the (potentially expensive) rsync, up --remote probes the login node for hpc-compose — on PATH or in ~/.local/bin — and reads its version. If the remote binary is missing or older than your local version, up --remote downloads and installs the newest release into ~/.local/bin with the official installer (curl -fsSL https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/main/install.sh | sh), reusing the same multiplexed SSH connection so an OTP login node prompts only once. No root is needed, and the release tarball is checksum-verified. The delegated command runs the resolved absolute binary path, so an install in ~/.local/bin that is not on the non-interactive SSH PATH still works.

Control this with --remote-install <auto|never|force> (default auto) or the HPC_COMPOSE_REMOTE_INSTALL environment variable:

auto (default): install only when the remote binary is missing or older than your local version.
force: always reinstall the newest release before delegating.
never: only probe. If the remote binary is missing or old, fail with an actionable error that prints the manual install command. Use this on locked-down or air-gapped login nodes.

If the install fails (for example, the login node has no outbound network), hpc-compose prints the manual install one-liner and a clear error. Set HPC_COMPOSE_REMOTE_INSTALL_URL to point the installer at a mirror.
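The three --remote-install modes reduce to a small decision table. The sketch below assumes versions compare as (major, minor, patch) tuples, which may not match how hpc-compose compares versions; should_install is a hypothetical helper.

```python
def should_install(mode, remote_version, local_version):
    """Decide whether to install hpc-compose on the login node.

    Versions are (major, minor, patch) tuples (an assumption of this sketch).
    remote_version is None when the probe found no binary.
    """
    stale = remote_version is None or remote_version < local_version
    if mode == "force":
        return True          # always reinstall
    if mode == "auto":
        return stale         # install only when missing or older
    if mode == "never":
        if stale:
            # Probe only: surface an error instead of installing.
            raise RuntimeError("remote hpc-compose is missing or old; install manually")
        return False
    raise ValueError(f"unknown mode: {mode}")
```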

Connection details and first run

Connection details belong in your ~/.ssh/config (port, identity, jump host), so --remote=<host> stays a bare host or alias. For an ad-hoc host not in your config, set HPC_COMPOSE_REMOTE_SSH_OPTS (whitespace-split ssh flags, e.g. -p 2222 -i ~/.ssh/cluster). Every connection reuses one SSH ControlMaster, so a login node that requires OTP/2FA prompts only once within the ControlPersist window.

On the first remote run (or after cache eviction) the login-node prepare step imports your image with enroot — a multi-GB download plus extract and squashfs build — which can take several minutes; later runs reuse the cache. See Prepare Images Separately When Needed.

This is a thin delegation: it re-stages the project on each run and does not maintain a persistent login session. It is not up --local (that launches on the current host); --remote and --local cannot be combined.

Inspect a remote run from your laptop

The follow-up commands take the same --remote flag, so the metrics/logs workflow stays laptop-native — you don’t have to SSH into the staged checkout or know its internal paths. After a successful up --remote, hpc-compose prints the exact commands to run (fill in the Slurm job id it reported):

hpc-compose stats --remote=alice@login01 -f compose.yaml --job-id <job-id>   # GPU util / memory / power
hpc-compose logs  --remote=alice@login01 -f compose.yaml --job-id <job-id>
hpc-compose score --remote=alice@login01 -f compose.yaml --job-id <job-id>
hpc-compose pull  --remote=alice@login01 -f compose.yaml --job-id <job-id>

These reuse the same host/login-user/staging context as up --remote: they SSH into the existing remote stage (no re-sync) and stream the output back, reusing the same SSH ControlMaster so an OTP node still prompts only once. They require the project to have been staged by a prior up --remote. pull --remote prints the same rsync command from the login-node context; run that printed command from your laptop to copy the artifact bundle locally.

6. Run Preflight When Debugging Cluster Readiness

hpc-compose preflight -f compose.yaml
hpc-compose preflight --verbose -f compose.yaml
hpc-compose preflight -f compose.yaml --strict

preflight checks selected-backend tools, Slurm tools, cache path policy, local mounts/images, registry credentials, cluster profile compatibility, distributed-readiness hazards, metrics collector tools, and resume path safety.

Generate a cluster capability profile on the target login node when you want validation and preflight to catch partition/backend/QOS/GPU/MPI mismatches earlier:

hpc-compose doctor cluster-report

See Cluster Profiles for generated profile details, site policy packs, and MPI smoke probes.

7. Prepare Images Separately When Needed

hpc-compose prepare -f compose.yaml
hpc-compose prepare -f compose.yaml --force-rebuild

Use this when you want to build or refresh prepared images before submission, confirm cache reuse behavior, or debug preparation separately from job submission.

prepare needs the selected runtime backend tools, but it does not call sbatch.

8. Render The Batch Script

hpc-compose render -f compose.yaml --output /tmp/job.sbatch

This is useful when debugging generated srun arguments, mounts, environment passing, launch order, and readiness waits.

9. Inspect A Tracked Run

hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose status -f compose.yaml --array
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose stats -f compose.yaml --format jsonl

Use Runtime Observability for tracked state, replay, logs, metrics, and machine-readable output. For a failed run, start with the First Triage flow in Troubleshooting. Use Artifacts and Resume for artifact bundles and resume-aware attempts.

10. Manage Cache And Old State

Cache Management owns cache inspection, pruning, and cleanup of old tracked runs (cache prune, jobs list --disk-usage, clean --age). For first triage of a failed run, see Troubleshooting.

What Changed And What Should I Run?

If you changed…	Typical next step
YAML planning/runtime settings only	`plan --verbose`, then `up`
Base image, `x-runtime.prepare.commands`, or prepare env	`up --force-rebuild`, or `prepare --force-rebuild` when debugging separately
Mounted runtime source under `volumes`	Usually just `up`
Cache entries this plan no longer references	`cache prune --all-unused -f compose.yaml`
`hpc-compose` itself	Expect cache misses on the next `prepare` or `up`, then optionally prune old entries

Monitor a Run
Manage the Cache and Clean Up
Troubleshoot a Failed Run
Develop and Smoke-Test Locally
Onboard a Cluster Site
Notebook

Monitor a Run

Status, watch, logs.

After a real submission, hpc-compose writes per-job runtime artifacts under:

<runtime-root>/<job-id>/

<runtime-root> defaults to <submit-dir>/.hpc-compose and can be overridden with x-slurm.runtime_root. The tracked submission record lives next to the compose file under .hpc-compose/jobs/<job-id>.json, and together those paths let follow-up commands reconnect without resubmitting.

Common Commands

hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose watch -f compose.yaml --hold-on-exit always
hpc-compose replay -f compose.yaml --speed 10
hpc-compose watch -f compose.yaml --watch-mode line
hpc-compose logs -f compose.yaml --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose stats -f compose.yaml
hpc-compose stats -f compose.yaml --accounting
hpc-compose inspect -f compose.yaml --rightsize
hpc-compose score 12345
hpc-compose germinate -f compose.yaml
hpc-compose sweep status -f compose.yaml
hpc-compose sweep list -f compose.yaml
hpc-compose diff 12345 12346 -f compose.yaml

Command	Use it for
`status`	Scheduler state, batch log path, runtime paths, and failure-policy state.
`ps`	Stable per-service snapshot with readiness, status, restart counters, and log path.
`watch`	Live terminal UI; falls back to line-oriented output on non-interactive terminals.
`replay`	Best-effort DVR for a tracked run, reconstructed from existing runtime artifacts.
`logs`	Text log output, optionally focused, searched, or coarsely time-filtered.
`stats`	Tracked metrics, Slurm step statistics, and optional accounting rollups.
`inspect --rightsize`	Post-run request-versus-usage recommendations for memory, CPUs, GPUs, and walltime.
`score`	0-100 post-run efficiency score with GPU, memory, compute-time, and kWh components.
`germinate`	Short canary submission; see Right-Size With Canary Runs.
`sweep status` / `sweep list`	Inspect sweep trials and manifests; see Hyperparameter Sweeps.
`diff`	Compact comparison between two tracked submissions.

Use --format json on non-streaming commands when automation needs stable fields. stats also supports --format csv and --format jsonl.

Watch UI

On an interactive terminal, watch and the default up follow mode open a live view with service state on the left and log output on the right. The UI automatically switches to a compact single-column view on narrow or short terminals. It keeps a detailed status view while the job runs and, by default, holds the final screen on failures so the failing service, final scheduler state, and next diagnostic commands stay visible.

Keybindings:

Key	Action
`j`, `Down`, `Tab`	Move to the next service.
`k`, `Up`	Move to the previous service.
`g` / `G`	Jump to the first or last service.
`/`	Filter services by name; press `Enter` to apply or `Esc` to cancel.
`f`	Find within log content; matches are highlighted and counted in the log header.
`Space`	Pause or resume log following.
`PgUp` / `PgDn`	Scroll the visible log pane while paused.
`End`	Return to live-follow mode at the newest log lines.
`a`	Toggle between the selected service log and all tracked service logs.
`w`	Toggle wrapping of long log lines (otherwise they are truncated).
`o`	Cycle service ordering between spec order and triage (failed, then unhealthy, first).
`r`	Request a restart of the selected service (local supervised jobs; see note below).
`Enter`	Open a detail panel for the selected service (placement, ntasks, nodelist, restart policy, timings, assertions); `Esc`/`Enter` closes and `j`/`k` switches service.
`y`	Copy a ready-to-run `logs` command for the selected service to the system clipboard (OSC 52; works over SSH).
`?`	Toggle in-UI help.
`q` / `Ctrl-C`	Leave the watch view without cancelling the job.

Log lines are colored by inferred severity: lines mentioning error/fatal/panic show in red and warn/warning in yellow (subject to the active color policy).
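A rough sketch of that severity inference follows. It assumes case-insensitive substring matching and that red takes priority when a line mentions both levels; neither detail is confirmed here, and severity_color is a hypothetical name.

```python
import re

# Assumption: matching is case-insensitive and substring-based.
RED = re.compile(r"error|fatal|panic", re.IGNORECASE)
YELLOW = re.compile(r"warn", re.IGNORECASE)  # also matches "warning"


def severity_color(line):
    """Return 'red', 'yellow', or None for a log line."""
    if RED.search(line):
        return "red"
    if YELLOW.search(line):
        return "yellow"
    return None
```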

Use --hold-on-exit never|failure|always on up or watch to control whether the final TUI stays open after a terminal scheduler state. When the view is held, press d, l, or s to print the exact debug, logs, or stats command after leaving the alternate screen.

The r restart action writes a request consumed by the local Pyxis/Enroot supervisor, the same mechanism hpc-compose dev uses for file-watch reloads; it applies to local supervised jobs and is reported as unavailable for Slurm batch jobs. Run hpc-compose dev --tui to get this live view during a dev session: file-watching keeps reloading changed services in the background while the watch UI (including r for an on-demand restart) runs in the foreground. Without --tui, dev keeps its line-oriented output, which is friendlier for CI and logs.

The watch and replay views repaint only the rows that change between refreshes, which keeps the display flicker-free and minimizes bytes sent over SSH. Three environment variables tune the live view:

Variable	Effect
`HPC_COMPOSE_WATCH_REFRESH_MS`	Scheduler/log refresh cadence in milliseconds (default 1000, clamped to 100–60000).
`HPC_COMPOSE_WATCH_METRICS_REFRESH_MS`	Metrics refresh cadence in milliseconds (default 5000, clamped to 500–600000).
`HPC_COMPOSE_WATCH_MOUSE`	Set to a non-zero value to enable mouse capture; the scroll wheel then drives the log pane. Off by default so native terminal text selection keeps working.

These display preferences can also be set per-project in .hpc-compose/settings.toml under a [watch] section; environment variables take precedence over the file:

[watch]
sort = "triage"          # spec | triage
wrap = true
refresh_ms = 500         # 100–60000
metrics_refresh_ms = 2000 # 500–600000
mouse = false
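The interaction between environment variables, the [watch] section, defaults, and clamping can be sketched as follows. This assumes values from the settings file are clamped to the same ranges as the environment variables; resolve_ms and the tuple layout are hypothetical.

```python
def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))


# (environment variable, [watch] key, default, minimum, maximum)
REFRESH = ("HPC_COMPOSE_WATCH_REFRESH_MS", "refresh_ms", 1000, 100, 60000)
METRICS = ("HPC_COMPOSE_WATCH_METRICS_REFRESH_MS", "metrics_refresh_ms", 5000, 500, 600000)


def resolve_ms(spec, env, watch_settings):
    """Pick the refresh cadence: environment first, then settings, then default."""
    env_name, key, default, low, high = spec
    if env_name in env:
        raw = int(env[env_name])
    elif key in watch_settings:
        raw = int(watch_settings[key])
    else:
        raw = default
    return clamp(raw, low, high)
```

With the example [watch] section above and no environment overrides, this sketch resolves to 500 ms for scheduler/log refresh and 2000 ms for metrics.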

Use hpc-compose up --watch-queue when you want explicit queue polling before the watch view opens. It prints queue state changes, pending reason, and expected start time when Slurm exposes them; --queue-warn-after <DURATION> controls the one-time long-pending warning.

Use --watch-mode line when you are recording output, using a screen reader, running in CI, or working in a terminal where alternate-screen UIs are inconvenient. Line mode preserves detailed scheduler and log updates without alternate-screen control codes.

Replay

hpc-compose replay reconstructs a best-effort execution timeline after the run. It reuses the watch-style view, but reads only artifacts that already exist under the tracked job directory. This makes it useful for rewinding to the time a service failed, comparing the nearest prior metrics sample, or sharing a deterministic text/JSON summary without querying Slurm again.

hpc-compose replay -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --job-id 12345 --service trainer
hpc-compose replay -f compose.yaml --format json

Replay controls:

Key	Action
`Space`	Pause or play the replay.
`+` / `-`	Move between speed presets such as `1x`, `10x`, and `100x`.
`Left` / `Right`	Seek backward or forward by five seconds.
`[` / `]`	Jump to the previous or next reconstructed event.
`Home` / `End`	Jump to the first or final replay frame.
`/`, `f`, `a`, `w`, `o`, `PgUp`, `PgDn`, `q`	Same filter, find, log-pane, wrap, sort, scroll, and quit behavior as `watch`.

A timeline scrubber under the header shows the playback cursor and reconstructed event ticks between the start and end of the run.

Replay data sources:

Source	What replay uses	Fidelity notes
`state.json`	Final per-service state, start/finish times, exit code fallback, placement metadata	This file is overwritten during the run, so intermediate readiness and scheduler transitions are not exact.
`service-exits/*.jsonl`	Append-only service exit markers and restart evidence	Multiple exits reconstruct failure/restart sequences, but accepted restart relaunch time is inferred.
`metrics/*.jsonl`	Historical GPU and Slurm sampler rows	Replay shows the latest metrics sample at or before the cursor and never displays future metrics as current.
`logs/*.log`	Service log tails in the replay UI	Service logs do not include guaranteed per-line timestamps, so log panes are contextual tails, not exact log-time scrubbing.
Scheduler commands	Not queried during replay	Historical queue state, pending reason changes, and accounting gaps are not reconstructed.
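The "latest sample at or before the cursor" rule for metrics can be sketched with a binary search. The (timestamp, value) tuple shape and the sample_at name are assumptions for illustration, not the layout of metrics/*.jsonl.

```python
from bisect import bisect_right


def sample_at(cursor, samples):
    """Return the latest (timestamp, value) sample at or before cursor.

    samples must be sorted by timestamp. Returns None when every sample lies
    after the cursor, so a future sample is never shown as current.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, cursor)
    return samples[i - 1] if i > 0 else None
```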

Use --format json when notebooks, dashboards, or experiment records need the reconstructed events, frame summaries, artifact paths, and fidelity notes.

Checkpoints

hpc-compose checkpoints reports the attempt and requeue history of a tracked job from LOCAL tracked state only. It contacts no scheduler and reads nothing from the cluster filesystem, so it is safe to run from a laptop against a synced tracked directory.

hpc-compose checkpoints -f compose.yaml
hpc-compose checkpoints --job-id 12345
hpc-compose checkpoints --format json

The history derives from the per-attempt state.json files written under .hpc-compose/<job>/attempts/<n>/. These per-attempt directories are produced only when x-slurm.resume is configured and the job is requeued: each requeue records a new 0-based attempt index (attempts = highest index + 1, requeues = attempts - 1). A non-resume job has no attempts/ directory and writes a single top-level state.json, which checkpoints reports as one attempt with zero requeues and no per-attempt index.

For each attempt, the command reports the earliest service start, the latest service finish, the derived duration, the job status, and the job exit code. A missing or unreadable per-attempt state.json is skipped and surfaced under degraded[] rather than failing the command, and a gap in the 0-based attempt indices (for example, an early attempt reaped by retention) is flagged as a truncated history so requeue counts are not silently miscounted.
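The attempt arithmetic and gap detection described above can be sketched as follows. summarize_attempts is a hypothetical helper that takes the 0-based indices found under attempts/<n>/.

```python
def summarize_attempts(indices):
    """Return (attempts, requeues, missing) for 0-based attempt indices.

    An empty input models a non-resume job: one attempt, zero requeues.
    missing lists gaps in the index sequence, which signal truncated history.
    """
    if not indices:
        return 1, 0, []
    attempts = max(indices) + 1          # attempts = highest index + 1
    requeues = attempts - 1              # requeues = attempts - 1
    missing = sorted(set(range(attempts)) - set(indices))
    return attempts, requeues, missing
```

Counting from the highest index rather than the number of directories is what keeps requeue counts correct when an early attempt has been reaped.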

--format json emits one object: {job_id, compose_file, submitted_at, resume_configured, attempts, requeues, current_attempt, is_resume, resume_dir, entries[], degraded[]}. This is distinct from the artifacts --bundle checkpoints export, which copies model checkpoint files rather than describing attempt history. See Artifacts and Resume for the attempt directory layout.

Logs

Runtime logs live under:

<runtime-root>/<job-id>/logs/<service>.log

Unless x-slurm.output is set, real submissions also write the top-level batch log under <runtime-root>/logs/hpc-compose-<job-id>.out. Check the batch log first when a job fails before any service log appears.

Service names containing non-alphanumeric characters are encoded in log filenames. Prefer [a-zA-Z0-9_-] in service names for readability.

Each service log is bracketed by timestamped lifecycle markers so a run does not look stuck before it produces output. A [hpc-compose] <ts> service <name>: container starting via srun … line is written just before the container launch (which is where srun scheduling and the first-use image extract happen), and a [hpc-compose] <ts> service <name>: command exited rc=<code> line is written when the command finishes. The gap between the start marker and the command’s own first line is the container-launch time, not a hang.

Use --grep <pattern> to print only matching raw log lines across selected service logs. Use --since <duration> for coarse time-bounded initial output, for example 30s, 15m, 2h, 1d, or 1h30m. Because service logs do not include line timestamps, --since filters by each log file’s modification time rather than by individual line time. Follow mode still starts from the current end of each selected log and applies --grep to appended lines.
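To make the --since semantics concrete, the sketch below parses the duration forms listed above and applies the window to a file modification time. The accepted grammar is an assumption based on those examples; parse_duration and include_log are hypothetical names.

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert strings such as '30s', '15m', or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", text)
    # Reject input with leftover characters, such as '10x' or an empty string.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * UNITS[u] for n, u in parts)


def include_log(mtime, now, since):
    """Keep a log file whose modification time falls inside the window."""
    return now - mtime <= parse_duration(since)
```

Because the filter works on file modification time, a log that was written to recently is shown in full, including older lines.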

Event Hooks

Per-service x-slurm.hooks can run host-side observability scripts when restart_on_failure accepts a restart or when the rolling restart window blocks a crash loop. Hook stdout/stderr is appended to that service’s log, and non-zero hook exits are logged without changing the restart or failure outcome.

Use on: restart for retry notifications and on: window_exhausted for crash-loop alerts. Event hooks receive service identity, exit code, Slurm attempt, and restart-window counters through HPC_COMPOSE_* environment variables; see Spec reference for the full list.
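A hook script can stay generic by reporting whatever HPC_COMPOSE_* variables it receives, which avoids hard-coding names from the Spec reference. The sketch below assumes the hook may be any executable script; hook_report is a hypothetical helper.

```python
import os


def hook_report(env):
    """Format every HPC_COMPOSE_* variable as sorted 'NAME=value' lines."""
    names = sorted(k for k in env if k.startswith("HPC_COMPOSE_"))
    return [f"{name}={env[name]}" for name in names]


if __name__ == "__main__":
    # Hook stdout is appended to the service log, so plain prints are enough.
    for line in hook_report(os.environ):
        print(line)
```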

Metrics

When x-slurm.metrics is enabled, sampler files are written under:

<runtime-root>/<job-id>/metrics/
  meta.json
  gpu.jsonl
  gpu_processes.jsonl
  slurm.jsonl
  diagnostics/

The sampler can collect GPU snapshots through nvidia-smi and job-step CPU/memory snapshots through sstat. Collector failures are best-effort: missing nvidia-smi, missing sstat, or unsupported queries do not fail the batch job itself.

Add --accounting to stats when you need post-run sacct rollups for reporting. The accounting summary includes allocated CPU-hours, total CPU-hours when available, allocated GPU-hours, allocation-based memory byte-seconds, and observed maximum RSS. Memory byte-seconds are labeled as allocation-based because Slurm’s standard accounting fields do not reliably provide true usage-based memory-seconds across all clusters.
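As a worked illustration of the allocation-based quantities, the sketch below assumes each one is the allocated amount multiplied by elapsed time. Which sacct fields hpc-compose actually reads is not specified here, and both function names are hypothetical.

```python
def allocated_resource_hours(allocated_count, elapsed_seconds):
    """Allocated CPU-hours or GPU-hours: count multiplied by elapsed hours."""
    return allocated_count * elapsed_seconds / 3600


def allocated_memory_byte_seconds(allocated_bytes, elapsed_seconds):
    """Allocation-based memory byte-seconds: requested bytes times elapsed seconds."""
    return allocated_bytes * elapsed_seconds
```

For example, 8 allocated CPUs over a 90-minute job (5400 seconds) come to 12 allocated CPU-hours, regardless of how busy those CPUs were.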

Use hpc-compose inspect --rightsize -f compose.yaml after a tracked Slurm run to convert those observations into conservative resource suggestions. The assistant requires tracked submission metadata and compares explicit requests such as x-slurm.mem, x-slurm.time, x-slurm.gpus, and service x-slurm.cpus_per_task against sacct, sstat, and nvidia-smi sampler evidence. It only reports suggestions; it does not rewrite the compose file.

Use hpc-compose score <job-id> after a tracked Slurm run when you want a compact efficiency grade. The score reuses sampler history, sacct, sstat, and right-sizing recommendations, then reports GPU utilization, memory utilization, active compute-time versus requested walltime, and a best-effort kWh estimate. Energy uses sampled GPU power when available, otherwise falls back to power limits or configured TDP assumptions through --gpu-tdp-w, --cpu-watts-per-core, and --pue; it does not claim carbon intensity or emissions.
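A best-effort energy estimate from sampled power can be sketched as a numerical integral scaled by PUE. The trapezoid rule and the (timestamp_seconds, watts) sample shape are assumptions of this sketch, not a description of how score computes its figure.

```python
def estimate_kwh(samples, pue=1.0):
    """Integrate (timestamp_seconds, watts) samples and scale by PUE.

    Uses the trapezoid rule between consecutive samples. Fewer than two
    samples yield 0.0 because no interval can be integrated.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0)
    # 1 kWh = 3.6e6 joules.
    return joules / 3.6e6 * pue
```

A constant 300 W draw over one hour gives about 0.3 kWh, or about 0.45 kWh with a PUE of 1.5.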

Use hpc-compose experiment show <job-id> when you want all of that in one read-only object. A single call aggregates scheduler status, the post-run efficiency score, the artifact manifest, and submit-time provenance, so a notebook or experiment tracker can capture one run with one command (hpc-compose experiment show <job-id> --format json). It is static-safe: it contacts the scheduler only as much as status and score already do, writes nothing, and opens no connection. For each service with TCP or HTTP readiness it emits a per-service ssh -L tunnel hint, and next_commands carries SSH ControlMaster/ControlPath/ControlPersist multiplexing guidance so an OTP/2FA login node prompts you only once. Legacy records without provenance, non-terminal jobs without a complete efficiency report, and runs without an artifact manifest still produce a valid object with those fields omitted.

For a short canary run before a full run, use hpc-compose germinate; see Right-Size With Canary Runs.

Sweep Manifests

Sweep submission and monitoring (sweep submit, sweep status, sweep list) are covered in Hyperparameter Sweeps. Sweep-trial records do not replace normal latest.json or latest-run.json, so hpc-compose status, watch, and logs continue to target ordinary runs unless you pass an explicit job id.

Diffing Runs

Use hpc-compose diff <job-id-1> <job-id-2> to compare two tracked submissions. The compact text view highlights outcome, resource, and config changes; --format json returns the full uncapped diff for notebooks or experiment records. Older tracked jobs without config snapshots still compare outcome metadata and report a note that config comparison is unavailable.

N-Way Comparison Matrix

To compare more than two runs at once, drop the positional job ids and pass either --jobs a,b,c (an explicit comma-separated list of tracked job ids) or --across <sweep-id> (every submitted trial of a sweep; unsubmitted trials are skipped with a note). The result is a matrix with one column per run and one row per field that differs in at least one run — fields identical across every run are collapsed and omitted, so the output stays focused on what actually changed. The same outcome, provenance, resource, and config sections as the pairwise diff are projected across all runs.

Choose the output with --matrix-format text|csv|json (default text). --matrix-format csv emits a section,field,<job_id>... table for spreadsheets, while --matrix-format json serializes the full uncapped matrix (the text view caps the config section at 25 rows). This is a pure read-only projection over already-persisted records; like pairwise diff, it opens no connection and only probes the scheduler as much as status does.

hpc-compose diff --jobs 12345,12346,12347 --matrix-format json
hpc-compose diff --across sweep-1700000000-1234 --matrix-format csv
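The "collapse identical fields" behavior of the matrix can be sketched in a few lines. The input shape (job id mapped to a flat field dictionary) and the diff_matrix name are assumptions for illustration; the real matrix also groups fields into sections.

```python
def diff_matrix(runs):
    """Build {field: {job_id: value}} keeping only fields that differ.

    runs maps job id to {field: value}. A field missing from a run is
    treated as None, so it counts as a difference.
    """
    fields = sorted({f for values in runs.values() for f in values})
    matrix = {}
    for field in fields:
        row = {job: values.get(field) for job, values in runs.items()}
        # Compare by repr so unhashable values such as lists are handled.
        if len({repr(v) for v in row.values()}) > 1:
            matrix[field] = row
    return matrix
```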

Operate a Real Cluster Run
Troubleshoot a Failed Run
Manage the Cache and Clean Up
Artifacts and Resume
Hyperparameter Sweeps
Right-Size With Canary Runs

Manage the Cache and Clean Up

The resolved cache directory stores imported and prepared runtime artifacts. It comes from explicit x-slurm.cache_dir, then profile/default settings, then $HOME/.cache/hpc-compose. For real cluster runs, it must be visible from both the submission host and compute nodes; see Execution Model for why prepared artifacts must live on shared storage.

Choose A Cache Path

Use a project scratch, work, or shared filesystem path:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

You can record that path in project settings instead of every compose file:

hpc-compose setup --profile-name dev --cache-dir "$CACHE_DIR" --default-profile dev --non-interactive

Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm. Validation may accept those strings, but preflight reports them as unsafe because compute nodes must reuse artifacts prepared before submission.
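A quick pre-check for those unsafe locations might look like the sketch below. It compares path components only and does not resolve symlinks, so treat it as an approximation of what preflight reports; is_unsafe_cache_dir is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Roots the text above names as unsafe for a shared cache.
UNSAFE_ROOTS = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def is_unsafe_cache_dir(path):
    """Return True when path equals or sits under a node-local temp root."""
    candidate = PurePosixPath(path)
    return any(
        candidate == PurePosixPath(root) or PurePosixPath(root) in candidate.parents
        for root in UNSAFE_ROOTS
    )
```

Comparing whole path components keeps a directory such as /tmpfoo from being flagged by accident.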

Inspect Cache State

hpc-compose cache list
hpc-compose cache inspect -f compose.yaml
hpc-compose cache inspect -f compose.yaml --service app

Use cache inspect to answer:

which artifact is being reused
whether a prepared image came from a cached manifest
whether a service rebuilds on every prepare because prepare mounts are present

Staged-Input Cache (Datasets/Models)

Staged datasets and models live in a content-addressed store under the same shared cache root, at cache_dir/datasets/<key> and cache_dir/models/<key>. The key is derived from the input spec (its source URI and pinned revision), so identical staged inputs are materialized once and reused on every later run. Each staged directory carries a sidecar manifest (<key>.dataset.json or <key>.model.json) so cache list and cache prune cover staged inputs alongside image artifacts.
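To illustrate why identical inputs map to one directory, the sketch below derives a key from the source URI and pinned revision. SHA-256 and the newline separator are assumptions; the actual key derivation used by hpc-compose is not documented here, and staged_key is a hypothetical name.

```python
import hashlib


def staged_key(source_uri, revision):
    """Derive a deterministic key from a source URI and pinned revision."""
    # Assumption: a newline separates the two parts before hashing.
    payload = f"{source_uri}\n{revision}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Any change to the URI or the revision produces a different key, so a new revision is materialized alongside the old one rather than overwriting it.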

The store itself never fetches anything; it is a pure on-disk store. Fetching and materializing inputs over the network is approval-gated and belongs to the hf:// stage-in work; cache, plan, and prepare never run it automatically.

Prune Cache Entries

Prune old entries by age:

hpc-compose --profile dev cache prune --age 14 --yes

Prune artifacts not referenced by the current plan:

hpc-compose cache prune --all-unused -f compose.yaml --yes

Prune one cache directory directly:

hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>' --yes

--age and --all-unused are mutually exclusive.

Rendezvous Records

Cross-job rendezvous records live under the same shared cache root and are pruned separately (rendezvous list, rendezvous prune). See Cross-Job Rendezvous for placement, TTL, and ownership rules.

Clean Up Old Tracked Runs

Tracked job metadata and logs accumulate in .hpc-compose/. Preview disk usage and cleanup before deleting:

hpc-compose jobs list --disk-usage
hpc-compose clean -f compose.yaml --age 7 --dry-run
hpc-compose clean -f compose.yaml --age 7

After Upgrading

Cache keys include the tool version, so upgrading hpc-compose invalidates existing cached artifacts. Expect a full rebuild on the next prepare or up, then optionally prune old entries:

hpc-compose cache prune --age 0 --yes

Operate a Real Cluster Run
Monitor a Run
Troubleshoot a Failed Run
Execution Model
Cross-Job Rendezvous
CLI Reference

Troubleshoot a Failed Run

Use this page when the safe authoring path worked but the first real cluster run failed.

For background on Slurm allocations, sbatch, srun, Pyxis, and Enroot, see Slurm And Container Basics. For HAICORE-specific storage and runtime checks, see HAICORE Guide.

First Triage

hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env
hpc-compose plan --verbose -f compose.yaml
hpc-compose lint --fix --dry-run -f compose.yaml
hpc-compose debug -f compose.yaml --preflight

plan --verbose can print resolved environment values and final mount mappings. Treat its output as sensitive when the spec contains secrets. validate and lint emit “Did you mean …” suggestions for misspelled service keys and dependency conditions. lint --fix --dry-run previews auto-fixes (for example, making an implicit depends_on condition explicit) without writing. debug is read-only unless --preflight is passed; with --preflight, it reruns prerequisite checks and includes those findings in the triage report.

Common Symptoms

Symptom	Likely cause	Next step
`required binary '...' was not found`	Selected backend or Slurm client tool is not on `PATH`.	Run `debug --preflight`; pass `--enroot-bin`, `--apptainer-bin`, `--singularity-bin`, `--srun-bin`, or `--sbatch-bin` as needed.
`srun does not advertise --container-image`	Pyxis support is unavailable or not loaded.	Move to a supported login node, load the site module, or choose another backend.
Cache directory warning/error	The resolved cache directory is not shared, writable, or policy-safe.	Choose a shared project/work/scratch path through `x-slurm.cache_dir` or `setup --cache-dir`, then rerun `debug --preflight`.
Missing local mount or image path	Relative paths are resolved from the compose file directory.	Check paths relative to the copied `compose.yaml`.
Mounted symlink exists on the host but fails in the container	The symlink target is outside the mounted directory.	Copy the real file into the mounted directory or mount the target directory.
Anonymous pull or registry warning	Registry credentials are missing or rate limits apply.	Configure credentials before relying on private or rate-limited images.
Services start in the wrong order	Dependency condition or readiness is too weak.	Use `service_healthy` with `readiness`, or `service_completed_successfully` for DAG stages.
No service logs exist	The batch script failed before launching a service.	Use `debug` to see scheduler state, the tracked top-level batch log tail, and missing-log hints.
`dev` reports no watchable source directories	Services only mount files, missing paths, cache paths, or container-only paths.	Mount the source as a host directory or pass `hpc-compose dev --watch-paths ./src -f compose.yaml`.
Readiness never passes	Probe target, pattern, host, or dependency timing does not match the real service.	Inspect the service log with `logs --service <name>` and try a finite `hpc-compose test --local` or short `test --submit` spec.
Smoke test times out	The spec is long-running, readiness blocks forever, or the scheduler job never reaches terminal state.	Make the smoke spec finite, lower service readiness timeouts, and use `--format json` to inspect the failed phase and service reason.
`tmux` is unavailable or attach fails	`tmux` is not installed or the shell is non-interactive.	Install `tmux`, pass `--tmux-bin <PATH>`, or create the dashboard with `--no-attach`.
Local mode is unsupported	Local workflows require a Linux host with Pyxis-compatible Enroot behavior.	Use authoring commands on non-Linux hosts, then run `test --submit` or `up` on a supported Slurm login node.
`up --remote` reports the remote `hpc-compose` is missing or older	The login node has no `hpc-compose` on `PATH` or `~/.local/bin`, or has an older version than your local one.	Default `--remote-install auto` downloads and installs the newest release into `~/.local/bin` over the same SSH connection. On a locked-down/air-gapped node, use `--remote-install never` and install manually with the printed one-liner.
`up --remote` job cannot see part of your source tree	The compose file lives in a subdirectory with no repo-root settings, so only that subdir was staged (watch for the “staged only a subdir” warning).	Put `.hpc-compose/settings.toml` at the repo root (or run `hpc-compose setup` there) so the whole source tree is staged.
`--skip-prepare` reports the runtime image is not prepared	`--skip-prepare` reuses an existing image cache and builds nothing; on a first run (or after cache eviction) the image does not exist yet.	Run `hpc-compose up` or `hpc-compose prepare` once without `--skip-prepare`, then reuse the cache with `--skip-prepare`.
enroot import fails at `Creating squashfs filesystem...` with `Stale file handle`	The default extraction scratch (`<cache_dir>/enroot/tmp`) is on a shared NFS/Lustre/GPFS filesystem, where the extract-then-`mksquashfs` import triggers ESTALE.	Point the prepare scratch at node-local storage (opt-in): set `x-slurm.enroot_temp_dir` in the spec (e.g. `/tmp/${USER}-hpc-compose-enroot`), `cache.enroot_temp_dir` in `.hpc-compose/settings.toml`, or `HPC_COMPOSE_ENROOT_TEMP_DIR`. `hpc-compose` retries once on a clean temp dir before failing.
prepare command fails when a `prepare.mounts` source is on a network filesystem	The prepare step binds that source on the login node, where a network/shared-FS mount can fail.	Use a dependency-only prepare (install deps into the image, mount the source as a runtime `volumes` entry), or ensure the mount source is stable on the login node. `examples/dev-python-app.yaml` shows the pattern.
enroot import fails with `manifest unknown` / `manifest not found` / `401 Unauthorized`	The image tag does not exist on the registry (often a typo, or a tag that was never published), or the pull needs credentials.	Verify the reference exists before submitting: `skopeo inspect docker://<image>` or `docker manifest inspect <image>`. `hpc-compose lint` (HPC007) warns about mutable/`latest` tags but cannot confirm a tag exists on the registry; for private images configure registry credentials.

Readiness Issues

Use depends_on with condition: service_healthy when a dependent must wait for a dependency’s readiness probe. The plain list form of depends_on is equivalent to condition: service_started.

Use condition: service_completed_successfully for one-shot DAG stages where the next service should start only after the previous stage exits with status 0, such as preprocess -> train -> postprocess.

When a TCP port opens before the service is fully usable, prefer HTTP or log-based readiness over TCP readiness.
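
The three dependency conditions above can be modeled as a small gate function. This is a minimal sketch for reasoning about ordering, not hpc-compose's implementation; the state keys (`started`, `healthy`, `exit_code`) and function names are hypothetical.

```python
# Hypothetical model of the depends_on conditions; not hpc-compose's code.
STARTED = "service_started"
HEALTHY = "service_healthy"
COMPLETED_OK = "service_completed_successfully"


def dependency_satisfied(condition, dep_state):
    """Return True when a dependency's observed state meets the condition.

    dep_state uses illustrative keys: 'started' (bool), 'healthy' (bool),
    and 'exit_code' (int, or None while the service is still running).
    """
    if condition == STARTED:
        return dep_state["started"]
    if condition == HEALTHY:
        return dep_state["started"] and dep_state["healthy"]
    if condition == COMPLETED_OK:
        return dep_state["exit_code"] == 0
    raise ValueError(f"unknown condition: {condition}")


def can_start(depends_on, states):
    """depends_on maps dependency name -> condition; states maps name -> state."""
    return all(
        dependency_satisfied(condition, states[name])
        for name, condition in depends_on.items()
    )


# preprocess -> train -> postprocess as one-shot DAG stages.
states = {
    "preprocess": {"started": True, "healthy": False, "exit_code": 0},
    "train": {"started": True, "healthy": False, "exit_code": None},
}
print(can_start({"preprocess": COMPLETED_OK}, states))  # True: train may start
print(can_start({"train": COMPLETED_OK}, states))       # False: postprocess waits
```

Note that a dependency that has merely started satisfies service_started even if it is not yet usable, which is why service_healthy with a real readiness probe is the safer gate for servers.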

Inspect the normalized readiness probe without starting or submitting anything:

hpc-compose doctor readiness -f compose.yaml --service api

If the service is already running, tunneled, or otherwise reachable from the current host, run the same probe host-side:

hpc-compose doctor readiness -f compose.yaml --service api --run
hpc-compose doctor readiness -f compose.yaml --service api --run --log-file .hpc-compose/<job-id>/logs/api.log

doctor readiness --run does not launch services, prepare images, or call Slurm. It only checks the selected readiness target from the current host, which makes it useful before testing a dependent service or while debugging an already tracked run.

For hpc-compose test, readiness failures are terminal smoke-test failures. A service with configured readiness must become healthy and then complete successfully; ignored sidecars are still expected to pass in a smoke spec.

Preview A Run

Use plan for the static preview. It never prepares images, runs preflight, calls sbatch, or writes hpc-compose.sbatch:

hpc-compose plan --show-script -f compose.yaml

Use up --dry-run only when you intentionally want to exercise preflight, prepare, and render without calling sbatch:

hpc-compose up --dry-run -f compose.yaml

Clean Old Tracked Runs

Cleaning up accumulated tracked job metadata and logs is covered in Manage the Cache and Clean Up.

Develop and Smoke-Test Locally

test, dev, and tmux are the local-development command layer. They reuse the same prepare, render, local supervisor, runtime state, and tracking paths as up, so a run started by one command remains visible to status, ps, logs, stats, watch, and debug.

Smoke-Test Specs

Use test for finite specs that prove a workflow starts, satisfies readiness gates, and exits cleanly:

hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 --timeout 180s -f compose.smoke.yaml
hpc-compose test --submit --format json -f compose.smoke.yaml

test requires exactly one execution mode:

--local runs the rendered local supervisor on the current host.
--submit calls sbatch; it defaults to --time 00:01:00 and --timeout 180s. This is a real-scheduler operation that consumes an allocation, so it needs explicit user approval before running.

A smoke test passes only when every service:

appears in tracked runtime state,
launched at least once,
passed readiness when readiness is configured,
completed successfully.

Services with failure_policy.mode: ignore still have to complete successfully for test to pass. That makes smoke tests stricter than production runs by design: ignored sidecars are useful operationally, but they should not silently hide a broken spec test.
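
The pass criteria above can be written as one predicate per service. This is a sketch under stated assumptions: the `ServiceResult` fields are illustrative and do not mirror hpc-compose's tracked-state schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceResult:
    # Field names are hypothetical, chosen to mirror the four criteria.
    name: str
    tracked: bool                 # appears in tracked runtime state
    launch_count: int             # number of launches observed
    readiness_configured: bool
    became_ready: bool
    exit_code: Optional[int]      # None while still running
    failure_policy_mode: str = "fail"  # "ignore" marks a sidecar; other names assumed


def service_passes(s: ServiceResult) -> bool:
    if not s.tracked or s.launch_count < 1:
        return False
    if s.readiness_configured and not s.became_ready:
        return False
    # failure_policy_mode is deliberately not consulted: ignored services
    # must still exit 0 for the smoke test to pass.
    return s.exit_code == 0


def smoke_test_passes(services) -> bool:
    return all(service_passes(s) for s in services)


services = [
    ServiceResult("app", True, 1, True, True, 0),
    ServiceResult("sidecar", True, 1, False, False, 1, failure_policy_mode="ignore"),
]
print(smoke_test_passes(services))      # False: the ignored sidecar exited non-zero
print(smoke_test_passes(services[:1]))  # True
```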

Making Long-Running Specs Finite

Production services often run forever. For smoke tests, create a finite variant of the spec or override the service command in a copied file:

services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -c
      - "import main; print('smoke ok', flush=True)"

Keep the same image, mounts, environment, dependencies, and readiness where possible. Change only the command or entrypoint needed to prove startup and exit. If a dependent service uses condition: service_healthy, keep the upstream readiness probe real enough to catch wiring mistakes.

Hot Reload

dev is local-only:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose dev -f compose.yaml --watch-paths ./src --debounce-ms 500

It infers watch roots from host directories mounted through service volumes. File mounts, container-only paths, cache paths, missing paths, and non-directory paths are ignored. --watch-paths adds an explicit directory and restarts every service when it changes.
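
As a rough illustration of the inference rule, the sketch below keeps only host paths that exist and are directories. It is simplified on purpose: it assumes plain `host:container` volume strings, treats entries without a colon as container-only, and omits cache-path filtering. The function name is hypothetical.

```python
import os
import tempfile


def infer_watch_roots(volumes, compose_dir):
    """Return sorted host directories found in 'host:container' volume strings."""
    roots = set()
    for entry in volumes:
        if ":" not in entry:
            continue  # treated as a container-only path in this sketch
        host = entry.split(":", 1)[0]
        path = os.path.normpath(os.path.join(compose_dir, host))
        if os.path.isdir(path):  # skips file mounts and missing paths
            roots.add(path)
    return sorted(roots)


with tempfile.TemporaryDirectory() as project:
    os.mkdir(os.path.join(project, "app"))
    with open(os.path.join(project, "config.toml"), "w", encoding="utf-8") as fh:
        fh.write("x = 1\n")
    volumes = [
        "./app:/workspace",             # host directory: watched
        "./config.toml:/etc/app.toml",  # file mount: ignored
        "./missing:/data",              # missing path: ignored
        "/scratch",                     # container-only: ignored
    ]
    roots = infer_watch_roots(volumes, project)
    print([os.path.basename(r) for r in roots])  # ['app']
```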

File changes write restart requests into the tracked run’s dev control directory. The local supervisor handles those requests as development restarts, so readiness and completion state reset for the affected service without consuming the restart counters used by failure_policy.mode: restart_on_failure (max_restarts/max_restarts_in_window).

By default, Ctrl-C stops the local supervisor. Add --keep-running when you want to leave the tracked local run alive after exiting the watch loop.

Tmux Dashboard

tmux is a log dashboard, not a process supervisor:

hpc-compose tmux -f compose.yaml
hpc-compose tmux -f compose.yaml --job-id local-123
hpc-compose tmux -f compose.yaml --session demo --no-attach

Without --job-id, it launches a new local run. With --job-id, it attaches to an existing tracked local run. Each pane tails one service log with tail -F, and pane titles use service names. Use --no-attach when running from a non-interactive terminal or CI smoke check.

Shared Local Constraints

up --local, test --local, dev, and tmux share the same current constraints:

Linux hosts only
runtime.backend: pyxis only
Pyxis-compatible Enroot tooling on the host
single-host specs only
no distributed or partitioned placement
no service-level MPI
no Slurm arrays or scheduler dependencies

Use these commands to author and debug single-host launch behavior. Use test --submit or up on a Slurm login node for real scheduler behavior, or use the Local Slurm Dev Cluster from a source checkout when you want a throwaway real sbatch smoke test without a cluster login.

Example Recipe

The source-mounted app in examples/dev-python-app.yaml is intentionally long-running, so it is a good dev target:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach

The companion examples/dev-python-smoke.yaml keeps the same mounted source pattern but uses a finite command:

hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml

Local Slurm Dev Cluster

The local Slurm dev cluster is source-checkout tooling for running hpc-compose against a real throwaway scheduler on a laptop. It starts one privileged Docker/Podman container with slurmctld, slurmd, slurmdbd, MariaDB, and the current checkout’s hpc-compose binary.

Use it when you want a real scheduler smoke test before moving to a shared cluster. It is not a dry-run: scripts/devcluster.sh run ... calls real sbatch inside the local container. The job consumes only the local throwaway Slurm node.

Preview Levels

Goal	Command	Scheduler contact	Writes runtime state
Static authoring preview	`hpc-compose plan --show-script -f compose.yaml`	No	No
Preflight, prepare, and render without submission	`hpc-compose up --dry-run -f compose.yaml`	No `sbatch`	Writes the rendered script
Real local scheduler smoke test	`scripts/devcluster.sh run compose.yaml`	Local dev-cluster `sbatch`	Yes, inside the mounted project

Use plan first for fast static feedback. Use up --dry-run when you want the same preflight and preparation path as submission but no sbatch. Use the dev cluster when you specifically want to exercise hpc-compose’s real up -> sbatch -> slurmd -> sacct path without a cluster login.

Requirements

A source checkout of this repository. Release archives install the CLI and manpages, not the dev-cluster wrapper and Dockerfile.
docker compose or podman compose, with the engine running.
Support for privileged containers. The local node needs cgroup access for slurmd; treat it as a disposable developer machine workflow.

Quickstart

From the repository root:

scripts/devcluster.sh up
scripts/devcluster.sh sinfo
scripts/devcluster.sh run dev-cluster/specs/hello.yaml
scripts/devcluster.sh down

To smoke-test another project tree with the same local Slurm node:

scripts/devcluster.sh up --project /path/to/project
scripts/devcluster.sh run compose.yaml
scripts/devcluster.sh down

Specs run in the dev cluster should use runtime.backend: host. That keeps the local loop tractable and avoids nesting Pyxis/Enroot or Apptainer inside Docker/Podman. If your production spec uses a container backend, keep a small host-backend smoke variant for local scheduler validation and revalidate the container runtime on the real cluster.

Automated Check

Maintainers can run the checked-in real-scheduler suite with:

DEVCLUSTER_E2E_DOWN=1 scripts/devcluster_e2e.sh

The script boots the cluster, runs every spec under dev-cluster/specs, asserts that each spec has an explicit expected outcome, and verifies scheduler-backed commands such as status, ps, logs, and score where applicable. CI runs the same harness as a separate Dev Cluster E2E job with a cached image build.

Scope

Validated locally:

sbatch submission against a real controller
service ordering and readiness gates
multi-service composition inside one allocation
terminal accounting through sacct
scheduler-facing observability for tracked runs
expected failure propagation for negative smoke specs
sbatch --array fan-out with per-task accounting and status --array
the restart_on_failure supervisor draining to COMPLETED through real restarts
cancel driving a running job to the CANCELLED terminal state, with tracked-state teardown
artifact teardown collection resolved by pull/artifacts against a real manifest
scheduler inter-job dependencies (after_job holds a consumer until the producer ends)
failure_policy: ignore and depends_on: service_completed_successfully ordering
tracked-state readers over a real run (experiment, replay, debug, checkpoints, jobs, clean)
the host-backend resume dir resolving to a real on-node path
alloc + run reusing one allocation via srun

Still validate on the real cluster:

Pyxis/Enroot, Apptainer, or Singularity runtime behavior
GPU execution
site-specific modules, filesystems, partitions, and accounting policy
multi-node network and placement behavior

Run a Notebook or IDE Session

hpc-compose notebook launches a tracked interactive server — JupyterLab or VS Code (code tunnel) — as a single-service Slurm job, waits for it to become ready, and prints the connection URL. The session is a normal tracked job: manage it with hpc-compose status and stop it with hpc-compose cancel.

Use it when you want an interactive IDE or notebook on a compute node (for example, on a GPU partition) without hand-writing sbatch glue.

Kinds

`--kind`	Default image	Connection
`jupyter` (default)	`jupyter/scipy-notebook:latest`	Local URL + SSH tunnel hint; you forward the port from your laptop.
`vscode`	none (requires `--image`)	A `https://vscode.dev/tunnel/...` link. VS Code tunnels outbound, so no port forwarding is needed.

Quickstart

# JupyterLab on one GPU, with your project mounted
hpc-compose notebook --kind jupyter --gpus 1 \
  --volume ./project:/workspace --working-dir /workspace

# VS Code tunnel (supply an image containing the `code` CLI)
hpc-compose notebook --kind vscode --image ghcr.io/example/code:1 --gpus 1

After readiness, hpc-compose prints the URL. For Jupyter on Slurm it also prints a ready-to-copy SSH command:

Open: http://127.0.0.1:8888/lab?token=<generated>

On your laptop, forward the port:
  ssh -N -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=10m -L 8888:<compute-node>:8888 <login-node>
then open the URL above in your browser.
The ControlMaster options reuse one authenticated connection, so a login node that requires an OTP/2FA only prompts on the first connection within ControlPersist.

The printed command already carries the SSH connection-multiplexing options (the same ones reach, pull, and experiment emit), so a login node that requires an OTP/2FA prompts only on the first connection of your session.

For VS Code, open the printed vscode.dev link directly in a browser — no tunnel is required.

If your login node demands a one-time password on every SSH session, keep SSH connection multiplexing enabled so you authenticate once and every later tunnel (and rsync/scp) reuses the master connection. The printed Jupyter tunnel command already includes these options; the equivalent persistent ~/.ssh/config form is:

# ~/.ssh/config
Host <login-node>
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m

Establish the master once (entering the OTP), then the forward runs without re-authenticating until ControlPersist expires:

ssh -fN <login-node>                          # OTP entered here, once
ssh -L 8888:<compute-node>:8888 <login-node>  # reuses the master — no OTP

hpc-compose only prints the tunnel command; it never opens a connection or stores credentials, so the OTP step stays entirely under your control.

Local mode

--local runs the server on the current host (login node or workstation) through the same local supervisor used by dev. The printed URL points at 127.0.0.1 directly:

hpc-compose notebook --kind jupyter --local --volume ./src:/workspace

Local mode requires a Linux host with Pyxis-compatible Enroot tooling, like the rest of the local-development command layer.

Managing the session

The notebook is a tracked job, so the standard commands work:

hpc-compose status -f <compose>          # scheduler + service state
hpc-compose logs -f <compose> --follow   # tail the notebook log
hpc-compose cancel -f <compose>          # stop and release the allocation

By default notebook detaches after printing the URL (the job keeps running). Pass --follow to stream logs in the foreground instead.

Security

For Jupyter, hpc-compose generates a random auth token and embeds it in the printed URL, so the link is unguessable but self-contained. Override it with --token if you prefer. Do not share the printed URL: it grants access to the notebook session.

For VS Code, code tunnel performs GitHub device-flow authentication the first time; --accept-server-license-terms is passed automatically.

Authoring notes

Images and users. jupyter/scipy-notebook runs as the non-root jovyan user. Bind-mounted host directories must be writable by that user (typically uid 1000). Use --working-dir to point at your mounted workspace and adjust ownership on the host if needed.
VS Code images. There is no universal default code image; supply one with --image that contains the VS Code CLI.
Readiness. hpc-compose waits for a log pattern (/lab?token= for Jupyter, vscode.dev/tunnel/ for VS Code) before printing the URL. Use --ready-timeout (default 10m) to bound the wait; first-run image pulls happen during prepare, before the readiness clock starts.
Declarative counterpart. The same workflow is available as a compose file via the jupyter template (hpc-compose new --template jupyter), so you can commit it to a repo and launch with hpc-compose up.

Use Secrets

hpc-compose resolves named secrets from local files or environment variables and feeds them into the interpolation map as first-class, redacted values. This keeps literal secret values out of the environment: blocks you author in the spec and ensures they are hidden in config, context, and inspect output by default. The rendered batch script can still carry resolved values; see Secrets at rest.

Declaring secrets

Add a top-level secrets: block mapping a secret name to exactly one source:

secrets:
  hf_token:
    file: ./secrets/hf.txt       # value = file contents (trimmed)
  db_password:
    env: DB_PASSWORD             # value = named environment variable

file: reads the value from a file relative to the compose file directory.
env: reads the value from the named variable in the resolved environment (process env, .env, or settings env_files).
Each secret must set exactly one of file or env.
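
The source rules above amount to a small resolver. The following is a minimal sketch, assuming the declaration has already been parsed into a dict and the environment has already been merged; `resolve_secret` and its error messages are hypothetical, not hpc-compose's API.

```python
import os
import tempfile


def resolve_secret(name, decl, compose_dir, env):
    """Resolve one secrets: entry to its value.

    decl is the mapping under the secret name, for example
    {"file": "./secrets/hf.txt"} or {"env": "DB_PASSWORD"}.
    env is the already-resolved environment mapping.
    """
    sources = [key for key in ("file", "env") if key in decl]
    if len(sources) != 1:
        raise ValueError(f"secret {name!r} must set exactly one of file or env")
    if sources[0] == "file":
        # Paths are relative to the compose file directory.
        path = os.path.join(compose_dir, decl["file"])
        with open(path, encoding="utf-8") as fh:
            return fh.read().strip()  # file contents, trimmed
    variable = decl["env"]
    if variable not in env:
        raise KeyError(f"secret {name!r}: variable {variable} is not set")
    return env[variable]


with tempfile.TemporaryDirectory() as compose_dir:
    os.mkdir(os.path.join(compose_dir, "secrets"))
    with open(os.path.join(compose_dir, "secrets", "hf.txt"), "w", encoding="utf-8") as fh:
        fh.write("hf_example_value\n")
    env = {"DB_PASSWORD": "example-password"}
    hf_token = resolve_secret("hf_token", {"file": "./secrets/hf.txt"}, compose_dir, env)
    db_password = resolve_secret("db_password", {"env": "DB_PASSWORD"}, compose_dir, env)

print(hf_token == "hf_example_value", db_password == "example-password")  # True True
```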

Using secrets

Reference a secret anywhere interpolation works — most commonly in a service environment: block:

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    environment:
      HF_TOKEN: ${hf_token}
      DB_PASSWORD: ${db_password}
    command: python -m train

The resolved value flows through the normal environment: render path (--container-env= + the launcher env array) into the container. No new mount machinery is required.

Redaction

A value resolved through secrets: is tagged as a secret source. It is always redacted in diagnostic output regardless of its name:

$ hpc-compose config -f compose.yaml
...
    environment:
      HF_TOKEN: <redacted>
      MODEL: llama

Name-based redaction also still applies to any sensitive-named value, even when it was not declared as a secret. A name triggers redaction when (after upper-casing) it contains any of these substrings:

SECRET   TOKEN     PASSWORD   PASSWD
API_KEY  ACCESS_KEY  PRIVATE_KEY  CREDENTIAL
AUTH     COOKIE    SESSION    BEARER

Matching is a case-insensitive substring test, so names such as SESSION_DIR, AUTH_MODE, or MY_API_KEY_PATH are redacted too. Pass --show-values on config or context to reveal secrets when you have a legitimate need:

hpc-compose config -f compose.yaml --show-values
hpc-compose context   # table of interpolation vars; secret-sourced values show SOURCE 'secret' and are redacted

The raw secret value never appears in config, context, or inspect output by default. inspect does not expose a --show-values escape hatch; use config --show-values or context --show-values for trusted local diagnostics.
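
The name-based rule is easy to check by hand with a few lines of Python. This sketch uses the substring list given above; the helper names and the `secret_names` parameter (standing in for values tagged as secret-sourced) are assumptions for illustration only.

```python
SENSITIVE_SUBSTRINGS = (
    "SECRET", "TOKEN", "PASSWORD", "PASSWD",
    "API_KEY", "ACCESS_KEY", "PRIVATE_KEY", "CREDENTIAL",
    "AUTH", "COOKIE", "SESSION", "BEARER",
)


def is_sensitive_name(name):
    """Case-insensitive substring test against the documented list."""
    upper = name.upper()
    return any(part in upper for part in SENSITIVE_SUBSTRINGS)


def redact(environment, secret_names=(), show_values=False):
    """Return a copy of environment that is safe to print.

    secret_names lists variables whose values came from secrets: and are
    redacted regardless of their name.
    """
    if show_values:
        return dict(environment)
    return {
        key: "<redacted>" if key in secret_names or is_sensitive_name(key) else value
        for key, value in environment.items()
    }


env = {"HF_TOKEN": "abc", "MODEL": "llama", "session_dir": "/tmp/s", "WEIGHTS": "w"}
print(redact(env, secret_names={"WEIGHTS"}))
# {'HF_TOKEN': '<redacted>', 'MODEL': 'llama', 'session_dir': '<redacted>', 'WEIGHTS': '<redacted>'}
```

Because the test is a plain substring match, a harmless name such as AUTHOR would also be redacted; rename the variable if you need its value visible in diagnostics.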

Secrets at rest

Redaction only governs diagnostic output. The rendered Slurm script and the persisted job state can carry resolved secret values, because the environment is materialized so the job can run. These files are written owner-only (mode 0600); even so, keep secret-bearing compose specs, rendered scripts, and tracked state under a non-group-readable directory (for example a private cache or scratch path), and avoid committing them to shared or version-controlled locations.

Resolution order

Secrets are resolved after process environment variables, and each resolved value is tagged with the secret source. Declaring a secret is authoritative for its name: an explicit declaration overrides a same-named variable from a lower-precedence source. For env: sources, the named variable is read from the full resolved environment (including .env and settings env_files).

What is not included

hpc-compose ships local file: and env: sources only. Backend integrations (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) are intentionally deferred — they would require either shelling out to the vault/gcloud CLIs or adding a client crate, which conflicts with the project’s minimal-dependency stance. You can bridge to them today by writing the fetched value into a file or exporting it as an environment variable, then referencing it through secrets:.

File-mount injection to /run/secrets/<name> (Docker Compose semantics) is also deferred; env-var injection through environment: covers the common case.

Run Hyperparameter Sweeps

hpc-compose sweep turns one compose file with an embedded sweep block into many independent tracked Slurm jobs. Each trial is a normal sbatch submission with its own allocation, rendered script, job record, and scheduler state. The sweep manifest ties those jobs together for listing and aggregate status.

Quickstart

Start from a spec that can run with ordinary defaults, then add a top-level sweep block:

name: training-sweep

x-slurm:
  time: "00:20:00"
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full

services:
  trainer:
    image: python:3.11-slim
    environment:
      LR: "${lr:-0.001}"
      BATCH_SIZE: "${batch_size:-32}"
    command: ["python", "train.py"]

Preview the expansion first:

hpc-compose sweep submit -f examples/training-sweep.yaml --dry-run

Then submit the trials:

hpc-compose sweep submit -f examples/training-sweep.yaml
hpc-compose sweep status -f examples/training-sweep.yaml
hpc-compose sweep list -f examples/training-sweep.yaml

Matrix Modes

matrix: full expands the full Cartesian product over sorted parameter names, so the example above produces six trials in stable t000, t001, … order.
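
A quick way to predict the expansion is to compute the product yourself. This sketch assumes the first sorted parameter name varies slowest, which is `itertools.product`'s order; the exact trial order hpc-compose assigns may differ, so confirm it with sweep submit --dry-run.

```python
from itertools import product


def expand_full(parameters):
    """Return [(label, variables)] for the Cartesian product over sorted names."""
    names = sorted(parameters)
    trials = []
    for index, values in enumerate(product(*(parameters[n] for n in names))):
        trials.append((f"t{index:03}", dict(zip(names, values))))
    return trials


parameters = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
trials = expand_full(parameters)
print(len(trials))  # 6
for label, variables in trials:
    print(label, variables)
```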

Random sampling selects without replacement:

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix:
    random: 5
    seed: "paper-table-2"

With a seed, the selected trials are stable across machines. Without a seed, sweep submit derives one from the new sweep id and persists it in the manifest.
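
Seeded sampling without replacement can be illustrated with Python's standard library. This is only an analogy: hpc-compose uses its own generator, so the same seed string will not select the same trials here as in a real sweep. The function name is hypothetical.

```python
import random
from itertools import product


def sample_trials(parameters, count, seed):
    """Pick count distinct combinations, reproducibly for a given seed."""
    names = sorted(parameters)
    combos = [
        dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))
    ]
    rng = random.Random(seed)          # private generator, seeded from the string
    return rng.sample(combos, count)   # raises ValueError if count > len(combos)


parameters = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
picked = sample_trials(parameters, 5, "paper-table-2")
again = sample_trials(parameters, 5, "paper-table-2")
print(picked == again)  # True: same seed, same selection
```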

Interpolation Rules

Sweep parameter names are interpolation variable names. Values may be scalar strings, numbers, or booleans. For each trial, those variables override values from the environment and settings before planning, preparing, and rendering.

Reserved variables are also available:

Variable	Value
`HPC_COMPOSE_SWEEP_ID`	The persisted sweep id.
`HPC_COMPOSE_SWEEP_TRIAL`	The stable trial label such as `t000` (or `t000r0` when `replicates` > 1).
`HPC_COMPOSE_SWEEP_TRIAL_INDEX`	Zero-based trial index.
`HPC_COMPOSE_SWEEP_REPLICATE`	Zero-based replicate index within the config (`0` when `replicates: 1`).
`HPC_COMPOSE_SWEEP_SEED`	Deterministic per-replicate seed; present only when `replicates` > 1.

Normal commands still treat sweep as metadata. If plan, up, or render encounters ${lr} without a default, it fails unless lr is provided in the environment or settings. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, and use sweep submit --dry-run as the validation path for missing sweep-only variables.

Replicates

Set replicates: N to submit N seeded trials per parameter config. This is sweep sugar for repeating each combination so noise can be averaged out:

sweep:
  parameters:
    lr: [0.001, 0.01]
  matrix: full
  replicates: 3
  objective:
    direction: minimize
    log_pattern: 'final loss=([0-9.]+)'

With replicates: 1 (the default) the expansion is byte-identical to a non-replicated sweep: trial ids stay t000, t001, … and no replicate seed is injected. With replicates > 1 each config c fans out into t{c:03}r0 … t{c:03}r{N-1} (for example t000r0, t000r1, t000r2), each its own Slurm allocation. The example above submits 2 configs × 3 replicates = 6 trials.

Each replicate gets a deterministic seed exposed as HPC_COMPOSE_SWEEP_SEED, derived as the hex SHA-256 digest of <sweep_id>:<config_key>:<replicate> (where config_key is the name=value;… join of the config’s sorted variables). Re-expanding the same sweep block with the same sweep id always reproduces the same seed, so a training script can feed HPC_COMPOSE_SWEEP_SEED to its RNG and recover the same run. HPC_COMPOSE_SWEEP_REPLICATE carries the zero-based replicate index.
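
The derivation can be sketched with `hashlib`. Treat the details as assumptions: how values are stringified, whether the join has a trailing separator, and the text encoding are not specified above, so digests from this sketch may not match the ones hpc-compose produces. The helper names are hypothetical.

```python
import hashlib


def config_key(variables):
    """name=value pairs joined with ';' over sorted variable names (assumed format)."""
    return ";".join(f"{name}={variables[name]}" for name in sorted(variables))


def replicate_seed(sweep_id, variables, replicate):
    """Hex SHA-256 digest of '<sweep_id>:<config_key>:<replicate>'."""
    material = f"{sweep_id}:{config_key(variables)}:{replicate}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def trial_label(config_index, replicate, replicates):
    """t000 style when replicates == 1, t000r0 style otherwise."""
    if replicates == 1:
        return f"t{config_index:03}"
    return f"t{config_index:03}r{replicate}"


variables = {"lr": 0.01, "batch_size": 32}
print(config_key(variables))   # batch_size=32;lr=0.01
print(trial_label(0, 2, 3))    # t000r2
seed_hex = replicate_seed("sweep-123", variables, 0)
print(len(seed_hex))           # 64
```

Inside a training script, a hex seed like this can be turned into an integer with `int(os.environ["HPC_COMPOSE_SWEEP_SEED"], 16)` before passing it to an RNG; reduce it modulo the RNG's accepted range if the library requires a bounded seed.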

sweep status, sweep observe, and sweep results group the trials of each config and report a mean±std(n) rollup (population standard deviation, so n=1 reports std=0). Crucially, best_trial ranks on the per-config group mean, not the single luckiest replicate, and sweep observe reports the winning config’s mean objective:

replicate rollup (mean+/-std over n replicates per config):
  lr=0.001: mean=0.034000 std=0.002160 n=3 (3 replicate(s))
  lr=0.01:  mean=0.041000 std=0.001414 n=3 (3 replicate(s))
best config: t000r0 (mean objective=0.034)
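
To see why ranking on the group mean matters, the sketch below reproduces the rollup with the standard library. The objective values are made up for illustration, and the `maximize` direction name is an assumption (only minimize appears above).

```python
from statistics import fmean, pstdev


def rollup(objectives_by_config):
    """Map config -> (mean, population std, n)."""
    return {
        config: (fmean(values), pstdev(values), len(values))
        for config, values in objectives_by_config.items()
    }


def best_config(objectives_by_config, direction="minimize"):
    """Rank configs on their group mean, not on any single replicate."""
    groups = rollup(objectives_by_config)
    pick = min if direction == "minimize" else max
    return pick(groups, key=lambda config: groups[config][0])


# Illustrative numbers: lr=0.01 has the single best replicate (0.020)
# but the worse mean, so it does not win.
objectives = {
    "lr=0.001": [0.030, 0.034, 0.038],
    "lr=0.01": [0.020, 0.050, 0.053],
}
for config, (mean, std, n) in rollup(objectives).items():
    print(f"{config}: mean={mean:.6f} std={std:.6f} n={n}")
print(best_config(objectives))  # lr=0.001
```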

The fanout guard below counts materialized runs (combinations × replicates), so a 40-config matrix with replicates: 3 counts as 120 runs and is rejected unless --max-trials is raised to at least 120.

Fanout Guard

By default, submitted sweeps are capped at 100 trials. Larger matrices fail before calling sbatch:

hpc-compose sweep submit -f examples/training-sweep.yaml

Raise the explicit ceiling when the fanout is intentional:

hpc-compose sweep submit -f examples/training-sweep.yaml --max-trials 500

The guard applies to real submissions. Dry runs can inspect any matrix size.
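
The guard's arithmetic is simple enough to check before submitting. A minimal sketch, with a hypothetical function name and error text:

```python
def check_fanout(combinations, replicates=1, max_trials=100, dry_run=False):
    """Return the materialized run count, or raise if a real submit exceeds the cap."""
    runs = combinations * replicates
    if not dry_run and runs > max_trials:
        raise ValueError(
            f"{runs} runs exceed the cap of {max_trials}; raise --max-trials to proceed"
        )
    return runs


print(check_fanout(6))                       # 6: the quickstart sweep fits
print(check_fanout(40, 3, dry_run=True))     # 120: dry runs are not capped
print(check_fanout(40, 3, max_trials=500))   # 120: allowed with a higher ceiling
```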

Status Output

sweep status loads the manifest, queries the tracked state for submitted jobs, and aggregates:

completed
failed
running
pending
unknown
missing_tracking
submit_failed

Use JSON for notebooks, dashboards, or CI automation:

hpc-compose sweep submit -f examples/training-sweep.yaml --format json
hpc-compose sweep status -f examples/training-sweep.yaml --format json
hpc-compose sweep status -f examples/training-sweep.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f examples/training-sweep.yaml --format json

The JSON includes the sweep id, manifest path, matrix mode, persisted seed, trial variables, job ids, record paths, and per-trial status. When the sweep used replicates, it also carries a groups array with the per-config mean±std(n) rollup.

Manifest Layout

Sweep state is stored beside normal tracked jobs:

.hpc-compose/
  sweeps/
    latest.json
    <sweep-id>/
      sweep.json
      t000.sbatch
      t001.sbatch
  jobs/
    <job-id>.json

With replicates > 1 the per-trial scripts are named t000r0.sbatch, t000r1.sbatch, … (config index plus replicate index) instead of the flat t000.sbatch.

Sweep-trial records have kind: sweep_trial and include sweep metadata. They do not update the normal latest.json or latest-run.json pointers, so status, watch, and logs for ordinary runs keep their existing meaning.

Objectives and Early Termination

Declare an objective block to have sweep observe parse a metric from each terminal trial, rank trials, and record the best on the manifest:

sweep:
  parameters: { lr: [0.001, 0.01, 0.1] }
  matrix: full
  objective:
    direction: minimize
    log_pattern: 'final loss=([0-9.]+)'

The trial workload prints the metric to its service log (e.g. final loss=0.034). Two parse sources are supported (set exactly one):

log_pattern: a regex matched against the trial’s primary service log; the capture group selected by group (default 1) is parsed as a float.
json_path + json_field: read a JSON field from the trial’s artifact-collected tree.

hpc-compose sweep observe -f train.yaml             # parse + rank + print best
hpc-compose sweep observe -f train.yaml --format json
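
Before submitting a sweep, it can help to test a log_pattern against a sample log line. The sketch below uses Python's `re` module, whose syntax is close to but not identical with other regex engines, and it assumes the last match in the log wins; hpc-compose's actual choice of match is not stated here. The function name is hypothetical.

```python
import re


def parse_objective(log_text, log_pattern, group=1):
    """Return the captured metric as a float, or None when nothing matches."""
    matches = list(re.finditer(log_pattern, log_text))
    if not matches:
        return None
    # Assumption: use the last occurrence, so later reports override earlier ones.
    return float(matches[-1].group(group))


log_text = "epoch 1 loss=0.512\nepoch 2 loss=0.120\nfinal loss=0.034\n"
print(parse_objective(log_text, r"final loss=([0-9.]+)"))  # 0.034
print(parse_objective(log_text, r"accuracy=([0-9.]+)"))    # None
```

A character class such as [0-9.]+ also captures a trailing period (for example in "final loss=0.034."), which then fails float parsing, so print the metric without trailing punctuation.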

Early termination stops the sweep once a threshold is met. Use --watch --stop-when to poll and auto-stop:

hpc-compose sweep observe -f train.yaml --watch --stop-when 'objective < 0.05' --poll-interval 30s

Or stop manually after inspecting sweep observe output:

hpc-compose sweep stop -f train.yaml --yes --reason 'objective threshold met'

sweep stop cancels every non-terminal trial via scancel and records the stop on the manifest. --stop-when uses a tiny grammar: objective < N, objective <= N, objective > N, or objective >= N, evaluated against the best observed value.
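
The --stop-when grammar is small enough to model in a few lines. This is a sketch of the four documented forms only; whitespace handling and error messages are assumptions, and the function names are hypothetical.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
# Two-character operators are listed first so "<=" is not read as "<".
_EXPR = re.compile(r"^\s*objective\s*(<=|>=|<|>)\s*(\S+)\s*$")


def parse_stop_when(expression):
    """Return (comparison function, threshold) for 'objective <op> N'."""
    match = _EXPR.match(expression)
    if match is None:
        raise ValueError(f"unsupported --stop-when expression: {expression!r}")
    return _OPS[match.group(1)], float(match.group(2))


def should_stop(expression, best_observed):
    """Evaluate the condition against the best observed objective so far."""
    if best_observed is None:
        return False  # nothing parsed yet, keep polling
    compare, threshold = parse_stop_when(expression)
    return compare(best_observed, threshold)


print(should_stop("objective < 0.05", 0.034))  # True
print(should_stop("objective < 0.05", 0.080))  # False
print(should_stop("objective >= 0.9", 0.9))    # True
```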

Bayesian/adaptive trial selection is intentionally out of scope. The objective writeback, ranking, and stop machinery here are the foundation any future optimizer would build on.

Scaling Reports

Set objective.scaling_axis to the name of a numeric sweep parameter (for example nodes or model_size) to enable a post-hoc scaling report:

sweep:
  parameters:
    nodes: [1, 2, 4, 8]
  matrix: full
  objective:
    direction: minimize
    log_pattern: 'final loss=([0-9.]+)'
    scaling_axis: nodes

scaling_axis must name a key under sweep.parameters, and every value of that parameter must parse as a number. Both are checked at validate time (including sweep submit --dry-run), so a typo or a non-numeric axis is rejected with a clear message before anything is submitted.

Run sweep observe --scaling to print the report alongside the usual ranked table:

hpc-compose sweep observe -f train.yaml --scaling
hpc-compose sweep observe -f train.yaml --scaling --format json

The report pairs each config group’s mean objective with its axis value, reports a log-log least-squares slope (ln(objective) vs ln(axis)), and computes speedup/efficiency relative to a baseline group:

scaling (minimize objective vs nodes):
  baseline nodes=1
  nodes=1 mean=0.800000 runtime=100s speedup=1.000x efficiency=100.0% (n=1)
  nodes=2 mean=0.400000 runtime=50s speedup=2.000x efficiency=100.0% (n=1)
  nodes=4 mean=0.200000 runtime=25s speedup=4.000x efficiency=100.0% (n=1)
  log-log slope (objective vs nodes): -1.0000

The report is purely read-only, post-hoc analysis over the persisted manifest and tracked local state: it reuses the same terminal-only scheduler/runtime probe as sweep observe and never opens a new connection. Runtime is taken from the maximum observed service duration of each terminal trial; trials that are non-terminal or report no runtime are skipped rather than zero-filled. The baseline is the smallest-axis group that has runtime data. The report is print/JSON-only and is never written back to the manifest, so omitting --scaling leaves observe output byte-identical.
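
The numbers in the sample report can be reproduced by hand. The sketch below fits the log-log slope by ordinary least squares and assumes efficiency is speedup divided by the axis ratio relative to the baseline, which matches the sample output above but is not confirmed as hpc-compose's exact definition. Function names are hypothetical.

```python
import math


def loglog_slope(points):
    """Least-squares slope of ln(objective) against ln(axis).

    points is a list of (axis, objective) pairs with positive values and
    at least two distinct axis values.
    """
    xs = [math.log(axis) for axis, _ in points]
    ys = [math.log(objective) for _, objective in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def speedup_table(runtimes):
    """runtimes maps axis value -> runtime in seconds; baseline is the smallest axis."""
    base_axis = min(runtimes)
    base_runtime = runtimes[base_axis]
    rows = []
    for axis in sorted(runtimes):
        speedup = base_runtime / runtimes[axis]
        efficiency = speedup / (axis / base_axis)  # assumed definition
        rows.append((axis, speedup, efficiency))
    return rows


slope = loglog_slope([(1, 0.8), (2, 0.4), (4, 0.2)])
print(f"{slope:.4f}")  # -1.0000
for axis, speedup, efficiency in speedup_table({1: 100, 2: 50, 4: 25}):
    print(f"nodes={axis} speedup={speedup:.3f}x efficiency={efficiency:.1%}")
```

A slope near -1 means the objective halves each time the axis doubles; a slope near 0 means the axis has little effect on the objective.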

Limitations

Sweeps must be embedded in the same compose file. sweep.spec is not supported.
Each trial is a separate Slurm allocation. Sweeps are not Slurm arrays.
x-slurm.array is rejected during sweep submit.
Trials submit sequentially. If a submission fails, later trials are not submitted and the partial manifest is kept.
sweep status summarizes scheduler/tracking state; use sweep observe to parse and rank objectives.
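
The sequential-submission limitation can be sketched as a loop that stops at the first failure and keeps what was already recorded. This is an illustration under assumptions: `submit_trials`, the `submit` callable, and the manifest entry shape are hypothetical, not the tool's internals.

```python
def submit_trials(trials, submit):
    """Submit trials one at a time; stop at the first failure (hypothetical helper)."""
    manifest = []
    for trial in trials:
        try:
            job_id = submit(trial)
        except RuntimeError as err:
            # Later trials are not submitted; the partial manifest is kept.
            return manifest, err
        manifest.append({"trial": trial, "job_id": job_id})
    return manifest, None


def fake_submit(trial):
    # Stand-in for a real submission; fails on the third trial.
    if trial == "lr=0.1":
        raise RuntimeError("submission failed")
    return f"job-for-{trial}"


manifest, error = submit_trials(["lr=0.001", "lr=0.01", "lr=0.1", "lr=1.0"], fake_submit)
print(len(manifest), "trials recorded; error:", error)
```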

Right-Size With Canary Runs

hpc-compose germinate submits a short Slurm canary for an existing compose spec, forces runtime metrics on, waits for the canary to finish, and prints conservative resource recommendations for the original spec.

germinate is the pre-run probe: it submits a fresh short canary to estimate requests before you commit to a full run. inspect --rightsize is the post-run counterpart: it derives recommendations from the metrics a completed tracked run already produced. Use germinate when you have no run yet; use inspect --rightsize after a real run.

Canaries are short probes, not benchmark truth. They are useful for catching obvious over-requests such as asking for many GPUs when only one device is touched, or requesting far more memory than the process ever approaches during startup. They are not a substitute for full-run profiling when a workload has long warmup, data-dependent memory, lazy model loading, or late training phases.

Basic Workflow

hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --format json
hpc-compose germinate -f compose.yaml --canary-time 00:01:00 --metrics-interval 5

The canary keeps partition, account, QoS, constraints, cache, runtime backend, and service topology from the original plan. It minimizes CPU, memory, and GPU requests only in memory (the compose file on disk is not changed), writes latest-canary.json, and leaves the normal latest.json untouched.

Dry-run the canary script without submitting:

hpc-compose germinate -f compose.yaml --dry-run --script-out canary.sbatch

Output

Text output includes the canary job id, the standard right-sizing observations, and a YAML patch you can apply manually:

x-slurm:
  mem: 16G
services:
  trainer:
    x-slurm:
      cpus_per_task: 4

JSON output includes the same patch plus the full right-sizing report:

hpc-compose germinate -f compose.yaml --format json

Recommendation Rules

CPU recommendations use observed CPU demand with conservative headroom and round up.
Memory recommendations use the strongest available evidence from sampler rows, sstat, and sacct, then round to Slurm-friendly units.
GPU recommendations shrink only when GPU sampler evidence shows fewer active devices.
Walltime is observed but not down-sized from a short canary run.

Caveats

Warmup-heavy jobs can look smaller than steady-state jobs.
Data-dependent memory may peak after the canary exits.
Lazy model loading can under-report memory and GPU use if no real request hits the model.
Distributed training may need full topology even when a canary only exercises startup.
Failed, OOM-like, time-limit, malformed-metrics, and missing-metrics cases are reported as diagnostics rather than YAML rewrites.

Start from examples/canary-right-size.yaml when you want a small, explicit spec to practice the workflow.

Connect Jobs Across Allocations

hpc-compose cross-job rendezvous lets independent Slurm jobs coordinate through the shared cache directory. A provider job registers an address under <cache_dir>/rendezvous/<name>/latest.json; a later client job resolves that record and receives stable HPC_COMPOSE_RDZV_* environment variables.

This is same-cluster shared-storage discovery. It does not create DNS, tunnels, authentication, authorization, or a service mesh. Use it only inside a same-user or trusted shared-project cache boundary.

Provider

name: model-server

x-slurm:
  cache_dir: ${CACHE_DIR}

services:
  model:
    image: python:3.12-slim
    command: python -m http.server 8000
    readiness:
      type: tcp
      port: 8000
    x-slurm:
      rendezvous:
        register:
          name: model-server
          port: 8000
          protocol: http
          path: /
          ttl_seconds: 3600

Provider registration is declarative. If readiness is configured, the rendered script registers after the readiness check succeeds. On cleanup, it removes latest.json only when the current job still owns the latest record.

Client

name: model-client

x-slurm:
  cache_dir: ${CACHE_DIR}
  rendezvous: model-server

services:
  client:
    image: curlimages/curl:8.10.1
    command: curl -fsS "$HPC_COMPOSE_RDZV_MODEL_SERVER_URL"

Clients receive generic variables such as HPC_COMPOSE_RDZV_URL, plus name-scoped variables such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL, HPC_COMPOSE_RDZV_MODEL_SERVER_HOST, and HPC_COMPOSE_RDZV_MODEL_SERVER_PORT.
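
A client written in Python could read these variables along the following lines. This is a sketch under assumptions: the name-to-variable mapping (uppercase, non-alphanumerics replaced by `_`) is inferred from the single `model-server` example above, and `rdzv_env_name` and `rdzv_url` are hypothetical helper names.

```python
import os
import re


def rdzv_env_name(name, field):
    """Build a name-scoped variable such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL.

    Assumption: the mapping is inferred from the documented example only.
    """
    scoped = re.sub(r"[^A-Za-z0-9]", "_", name).upper()
    return f"HPC_COMPOSE_RDZV_{scoped}_{field}"


def rdzv_url(name, env=os.environ):
    """Prefer the name-scoped URL, then fall back to the generic one."""
    return env.get(rdzv_env_name(name, "URL")) or env.get("HPC_COMPOSE_RDZV_URL")


url = rdzv_url("model-server")
print(url if url else "no rendezvous record in the environment")
```

Preferring the name-scoped variable keeps the client unambiguous if the generic variables ever refer to a different provider.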

Debugging CLI

hpc-compose rendezvous list --cache-dir "$CACHE_DIR"
hpc-compose rendezvous resolve model-server --cache-dir "$CACHE_DIR"
hpc-compose rendezvous register model-server --host node01 --port 8000 --job-id 12345 --cache-dir "$CACHE_DIR"
hpc-compose rendezvous prune --cache-dir "$CACHE_DIR"

register is mainly for debugging and custom workflows. Normal provider jobs should use services.<name>.x-slurm.rendezvous.register.

TTL and Staleness

Records have a TTL. Resolution ignores expired records, and prune removes expired latest and historical JSON files. If the provider job exits cleanly, cleanup removes the latest pointer only if it still points at that job, so a newer provider is not deregistered by an older job finishing later.
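
The expiry and ownership rules can be sketched as follows. The record field names (`job_id`, `registered_at`, `ttl_seconds`) and the helper names are assumptions made for illustration; the real record layout may differ.

```python
import json
import time
from pathlib import Path


def resolve(latest: Path, now=None):
    """Return the latest record, or None when it is missing or expired."""
    now = time.time() if now is None else now
    if not latest.exists():
        return None
    record = json.loads(latest.read_text())
    # Hypothetical fields: registered_at (epoch seconds) and ttl_seconds.
    if now > record["registered_at"] + record["ttl_seconds"]:
        return None
    return record


def cleanup(latest: Path, job_id: str) -> bool:
    """Remove latest.json only if this job still owns it."""
    if not latest.exists():
        return False
    record = json.loads(latest.read_text())
    if record.get("job_id") != job_id:
        return False  # a newer provider registered; leave its record alone
    latest.unlink()
    return True
```

The ownership check in `cleanup` is what keeps an older job that finishes late from deregistering a newer provider.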

Requirements

x-slurm.cache_dir must point to storage visible from the login node and compute nodes.
Provider and client jobs must use the same cache directory.
Names are single safe path components: ASCII letters, digits, ., _, and -.
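
The naming rule can be checked with a short regular expression. This is a sketch, not the tool's validator: `is_safe_rendezvous_name` is a hypothetical helper, and rejecting `.` and `..` is an added assumption based on the phrase "safe path components".

```python
import re

# ASCII letters, digits, '.', '_' and '-' only.
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_rendezvous_name(name: str) -> bool:
    # Assumption: "." and ".." match the character set but are rejected here,
    # because they are not usable as a single directory name.
    return name not in {".", ".."} and _SAFE_NAME.fullmatch(name) is not None


for candidate in ["model-server", "team/model", ".."]:
    print(candidate, is_safe_rendezvous_name(candidate))
```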

See examples/rendezvous-model-server.yaml and examples/rendezvous-client.yaml for a runnable pair.

Artifacts and Resume

Artifacts are collected after a run for export and provenance. Resume state is the canonical live checkpoint a later attempt loads on restart. Keep those roles separate: exported checkpoints are retrieval output, while the shared resume path is what a restarted run reads first.

Artifacts: Collection vs. Export

Artifact handling has two stages, and only the first is automatic.

1. Collection — automatic, at teardown (compute node)

When x-slurm.artifacts is enabled, the in-job teardown collects the declared paths into the tracked runtime directory:

<runtime-root>/<job-id>/artifacts/
  manifest.json
  payload/...

For resume-aware runs, the active attempt writes first under <runtime-root>/<job-id>/attempts/<attempt>/artifacts/; the top-level artifacts path is kept as the latest view.

This stage only fills the runtime payload — it never writes to export_dir.

2. Export: explicit, after the job finishes

Copying the collected payload into the configured export_dir is a separate, explicit step. Run it after the job finishes:

hpc-compose artifacts -f compose.yaml
hpc-compose artifacts -f compose.yaml --bundle checkpoints --tarball

export_dir is populated only by hpc-compose artifacts. Nothing runs it for you: down tears the job down without exporting, and pull only prints an rsync line that copies the runtime payload to your laptop (it does not touch export_dir). If downstream jobs read <export_dir>/<job-id>, run hpc-compose artifacts before down. When an export_dir is configured, hpc-compose surfaces this step in the “Next:” hints after up, status, and experiment.

export_dir is resolved relative to the compose file and expands ${SLURM_JOB_ID} from tracked metadata. Named bundles are written under <export_dir>/bundles/<bundle>/, and provenance JSON is written under <export_dir>/_hpc-compose/bundles/<bundle>.json.
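
To make the layout concrete, the sketch below computes where a named bundle and its provenance file would land. It is an illustration only: `export_paths` is a hypothetical helper, the job id is passed in by hand instead of being read from tracked metadata, and the example paths are made up.

```python
from pathlib import PurePosixPath


def export_paths(compose_file, export_dir, bundle, job_id):
    """Compute bundle and provenance paths for one export (hypothetical helper)."""
    # Expand ${SLURM_JOB_ID}, then resolve relative to the compose file's directory.
    expanded = export_dir.replace("${SLURM_JOB_ID}", job_id)
    root = PurePosixPath(compose_file).parent / expanded
    return {
        "payload": root / "bundles" / bundle,
        "provenance": root / "_hpc-compose" / "bundles" / f"{bundle}.json",
    }


paths = export_paths("/proj/compose.yaml", "./results/${SLURM_JOB_ID}", "checkpoints", "12345")
print(paths["payload"])
print(paths["provenance"])
```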

The bundle name default is reserved for top-level x-slurm.artifacts.paths.

Resume-Aware Runs

When x-slurm.resume is enabled, hpc-compose:

mounts the shared resume path into every service at /hpc-compose/resume
injects HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME
writes attempt-specific runtime outputs under <runtime-root>/<job-id>/attempts/<attempt>/
keeps <runtime-root>/<job-id>/{logs,metrics,artifacts,state.json} pointed at the latest attempt for compatibility

Use the shared resume directory for the canonical checkpoint a restarted run should load next. Treat exported artifacts as retrieval and provenance output after the attempt finishes, not as the primary live resume source.
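
A service could use the injected variables roughly like this. The sketch rests on assumptions: the checkpoint file name `checkpoint.json` is hypothetical, and the truthy spellings accepted for HPC_COMPOSE_IS_RESUME (`1` or `true`) are a guess, so check the value your runs actually receive.

```python
import json
import os
from pathlib import Path

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def resume_dir(env=os.environ):
    # Fall back to the documented mount point when the variable is absent.
    return Path(env.get("HPC_COMPOSE_RESUME_DIR", "/hpc-compose/resume"))


def load_or_init_state(env=os.environ):
    """Load the canonical checkpoint on a resumed attempt, else start fresh."""
    checkpoint = resume_dir(env) / CHECKPOINT_NAME
    # Assumption: a truthy flag is spelled "1" or "true".
    is_resume = env.get("HPC_COMPOSE_IS_RESUME", "").lower() in {"1", "true"}
    if is_resume and checkpoint.exists():
        return json.loads(checkpoint.read_text())
    return {"step": 0}


def save_state(state, env=os.environ):
    """Write the checkpoint via a temporary file and a rename."""
    target_dir = resume_dir(env)
    target_dir.mkdir(parents=True, exist_ok=True)
    tmp = target_dir / (CHECKPOINT_NAME + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(target_dir / CHECKPOINT_NAME)
```

Writing through a temporary file and renaming means a job killed mid-write does not leave a truncated checkpoint as the one the next attempt loads.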

Useful Commands

hpc-compose up --resume-diff-only -f compose.yaml
hpc-compose up --allow-resume-changes -f compose.yaml
hpc-compose artifacts -f compose.yaml

Wire Up CI

hpc-compose ships fast, authoring-time commands (validate, lint) that are well-suited to pre-commit hooks and CI. This page covers three drop-in integrations: a pre-commit hook, a reusable GitHub Actions workflow, and a GitLab CI snippet.

All integrations require the hpc-compose binary to be installed first. The latest installer is:

curl -fsSL https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/main/install.sh | sh

For pinned tags, checksum verification, and other install variants, see Installation. The CI snippets below pin a release tag for reproducible runs.

Pre-commit

The repository ships a .pre-commit-hooks.yaml defining two local hooks that run hpc-compose validate and hpc-compose lint against compose.yaml. Because hpc-compose is not distributed via pip, the hooks use language: system and require the binary to already be on PATH.

Add this repo to your .pre-commit-config.yaml, replacing vX.Y.Z with the latest release tag from the GitHub Releases page:

repos:
  - repo: https://github.com/NicolasSchuler/hpc-compose
    rev: vX.Y.Z  # pin to a release tag
    hooks:
      - id: hpc-compose-validate
      - id: hpc-compose-lint

By default the hooks run when a top-level compose.yaml is staged. If your project uses compose.yml or a nested spec, override both entry and files so the hook checks the file that triggered it:

      - id: hpc-compose-validate
        entry: hpc-compose validate -f compose.yml
        files: ^compose\.yml$
      - id: hpc-compose-lint
        entry: hpc-compose lint -f compose.yml --allow-warnings
        files: ^compose\.yml$

hpc-compose-validate fails on any spec error.
hpc-compose-lint runs with --allow-warnings, so warnings are reported but do not block the commit. To enforce both, pair hpc-compose-validate with a strict CI lint (below).

GitHub Actions

Reusable workflow

The simplest integration calls the maintained reusable workflow, which installs a pinned release and runs validate + lint. Replace vX.Y.Z with the latest release tag from the GitHub Releases page:

jobs:
  hpc-compose:
    uses: NicolasSchuler/hpc-compose/.github/workflows/hpc-compose-lint.yml@vX.Y.Z
    with:
      compose-file: compose.yaml
      version: vX.Y.Z
      strict: true

Set strict: true to fail on lint warnings, or strict: false (default) to allow warnings. Pin both the uses: ref and version to the same release tag.

Inline snippet

For repos that prefer an inline step, install a pinned release and run both commands directly. Replace vX.Y.Z with the latest release tag from the GitHub Releases page:

jobs:
  lint:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Install hpc-compose
        run: |
          set -euo pipefail
          curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/vX.Y.Z/install.sh" \
            | env HPC_COMPOSE_VERSION="vX.Y.Z" sh
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
      - run: hpc-compose validate -f compose.yaml
      - run: hpc-compose lint -f compose.yaml --allow-warnings

GitLab CI

GitLab runners typically do not provide hpc-compose, so install it inside the job first. Replace vX.Y.Z with the latest release tag from the GitHub Releases page:

hpc-compose-lint:
  image: alpine:3.20
  rules:
    - changes: [compose.yaml]
  variables:
    HPC_COMPOSE_VERSION: vX.Y.Z
  before_script:
    - apk add --no-cache curl ca-certificates
    - |
      curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${HPC_COMPOSE_VERSION}/install.sh" \
        | env HPC_COMPOSE_VERSION="${HPC_COMPOSE_VERSION}" sh
    - export PATH="${HOME}/.local/bin:${PATH}"
  script:
    - hpc-compose validate -f compose.yaml
    - hpc-compose lint -f compose.yaml --allow-warnings

Strict vs. warnings

validate always fails on structural spec errors. lint emits warning-level findings (HPC001–HPC007, HPC900); by default these fail the command, so add --allow-warnings when they should be advisory only. A common setup is:

pre-commit / local: lint --allow-warnings (fast feedback, advisory).
CI (merge gate): lint without --allow-warnings, or strict: true (enforce).
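
The difference between the two modes comes down to one decision, sketched here in Python. `lint_passes` is a hypothetical helper that models the documented behavior; it does not call the CLI.

```python
def lint_passes(spec_errors: int, warnings: int, allow_warnings: bool) -> bool:
    """Model whether a lint run succeeds under the documented rules."""
    if spec_errors:
        return False  # structural errors always fail
    if warnings and not allow_warnings:
        return False  # strict mode: warnings block the run
    return True


print("pre-commit, 2 warnings:", lint_passes(0, 2, allow_warnings=True))
print("merge gate, 2 warnings:", lint_passes(0, 2, allow_warnings=False))
```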

See CLI Reference for the full lint rule table and Troubleshooting for related workflows.

CLI Reference

This page maps the public hpc-compose CLI by workflow. Use Quickstart for the shortest install-and-run path, Runbook for real-cluster operations, and Spec Reference for YAML field behavior.

Command Index

Jump to the section that documents each command group:

Commands	Section
`new` / `init`, `examples`, `evolve`, `setup`, `context`, `completions`	Authoring and Setup
`--profile`, `--settings-file`, `setup`, `context`, `validate --strict-env`, `lint`, `schema`	Settings-aware commands
`plan`, `validate`, `lint`, `config`, `schema`, `inspect`, `preflight`, `doctor`, `weather`, `prepare`, `render`, `up`, `test`, `dev`, `tmux`, `germinate`, `sweep`, `when`, `alloc`, `run`, `shell`, `notebook`	Plan and Run
`lint` finding codes (`HPC001`-`HPC900`)	Lint rules
`debug`, `status`, `ps`, `watch`, `replay`, `checkpoints`, `logs`, `inspect --rightsize`, `stats`, `score`, `diff`, `artifacts`, `reach`, `pull`, `experiment`, `cancel`, `down`, `jobs`, `clean`, `rendezvous`	Tracked Runtime
`cache list`, `cache inspect`, `cache prune`	Cache Maintenance
`--<tool>-bin` overrides	Tool overrides

Manual Pages

Every command also ships a Unix man page, generated from the same definitions as this reference:

man hpc-compose
man hpc-compose-up
man hpc-compose-checkpoints
man hpc-compose-sweep-submit

Release archives install them under share/man/man1/. From a source checkout, regenerate them with cargo run --features manpage-bin --bin gen-manpages.

Common Flags

Flag	Use it for	Notes
`--profile <NAME>`	Select a profile from the project-local settings file	Applies to every command.
`--settings-file <PATH>`	Use an explicit settings file	Bypasses upward discovery of `.hpc-compose/settings.toml`.
`-f`, `--file <FILE>`	Select the compose file on compose-aware commands	When omitted, `hpc-compose` uses the active context compose file or falls back to `compose.yaml`.
`--color auto|always|never`	Control colored output	
`--quiet`	Suppress non-essential progress labels	Useful when a wrapper only needs command output and errors.
`--format json`	Machine-readable output	Preferred on non-streaming commands.

Authoring and Setup

Command	Use it for	Notes
`new` (alias: `init`)	Generate a starter compose file from a built-in template	Use `--list-templates` and `--describe-template <name>` to inspect templates before writing a file. `--cache-dir` is optional and writes an explicit `x-slurm.cache_dir`.
`examples`	Search and recommend shipped examples and starter templates	Use `examples recommend` for a no-Slurm starting-point chooser, `examples list` or `examples search` to browse, and `examples coverage` to generate the docs coverage table.
`evolve`	Learn spec features through a progressive valid-spec tutorial	Use `--list-lessons`, `--describe-lesson <id>`, and `--until <step>` to inspect or stop at a lesson step. `--format json` requires `--yes`.
`setup`	Create or update the project-local settings file	Records compose path, env files, env vars, binary overrides, and an optional profile cache default.
`context`	Print the resolved execution context	Shows the selected profile, binaries, interpolation vars, runtime paths, and value sources.
`completions`	Generate shell completion scripts	Supports Bash, Zsh, Fish, PowerShell, and Elvish through Clap’s completion generator.

hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml
hpc-compose examples list
hpc-compose examples list --tag mpi --format json
hpc-compose examples search 'vllm worker'
hpc-compose examples recommend 'multi-node training' --tag gpu
hpc-compose examples recommend --format json
hpc-compose examples coverage --format markdown
hpc-compose evolve --list-lessons
hpc-compose evolve --describe-lesson progressive-complexity
hpc-compose evolve --output compose.yaml --name my-app
hpc-compose evolve --yes --until readiness --format json
hpc-compose setup
hpc-compose setup --profile-name dev --cache-dir '<shared-cache-dir>' --default-profile dev --non-interactive
hpc-compose context --format json
hpc-compose context --show-values --format json
hpc-compose completions zsh

`evolve` Options

evolve is authoring-only: it validates and writes candidate specs but does not prepare images, run preflight, or submit jobs. The default lesson is progressive-complexity, with steps minimal, second-service, readiness, failure-policy, and multi-node-placement.

--list-lessons prints shipped lessons.
--describe-lesson <LESSON> prints lesson steps and concepts.
--lesson <LESSON> selects the lesson to run.
--until <STEP> stops after a step id such as readiness.
--yes accepts steps noninteractively.
--format json is available for list/describe and for --yes runs.
--force allows overwriting the output file.

Settings-aware commands

Use these commands and global flags when you want the project-local settings file (.hpc-compose/settings.toml) to remember compose path, env files, env vars, and binary overrides. The YAML these commands operate on is documented in Spec Reference.

Command or flag	Purpose	Notes
`--profile <NAME>`	Select the profile from settings	Global flag; applies to every subcommand.
`--settings-file <PATH>`	Use an explicit settings file	Global flag; bypasses upward auto-discovery of `.hpc-compose/settings.toml`.
`hpc-compose setup`	Create or update the project-local settings file	Interactive by default; supports `--non-interactive` with `--profile-name`, `--compose-file`, `--env-file`, `--env`, `--binary`, `--cache-dir`, and `--default-profile`. `--login-host <host>` and `--login-user <user>` persist the `up --remote` SSH destination onto the selected profile (`[profiles.<name>]`).
`hpc-compose context`	Print fully resolved execution context	Shows selected settings/profile, compose path, binaries, referenced interpolation vars, runtime paths, and value sources; supports `--format json`. Sensitive-looking interpolation values are redacted unless `--show-values` is passed.
`hpc-compose validate --strict-env`	Fail when interpolation fell back to defaults	Detects when `${VAR:-...}` or `${VAR-...}` consumed fallback values because `VAR` was missing.
`hpc-compose lint`	Run opinionated authoring checks	Builds on validation and planning, then reports stable finding codes for risky dependency, memory, and shared-write patterns. Auto-fixable findings can be applied with `--fix` (preview with `--fix --dry-run`). See Lint rules.
`hpc-compose schema`	Print the checked-in JSON Schema	Useful for editor integration and authoring tools. Defaults to the compose schema; pass `--kind settings` to print the `settings.toml` authoring schema. Rust validation remains the semantic source of truth.
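
The idea behind validate --strict-env can be sketched as a scan for `${VAR:-...}` and `${VAR-...}` references whose variable is not set. This is an illustration, not the tool's parser: `consumed_fallbacks` is a hypothetical helper, nested defaults are not handled, and treating an empty value as a fallback for `:-` follows shell semantics, which is an assumption here.

```python
import re

# Matches ${NAME:-default} and ${NAME-default}; nested ${...} is not handled.
_FALLBACK = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(:?-)([^}]*)\}")


def consumed_fallbacks(text, env):
    """List variables whose default value would be used (hypothetical helper)."""
    hits = []
    for name, op, _default in _FALLBACK.findall(text):
        missing = name not in env
        empty = env.get(name) == ""
        # Assumption (shell semantics): ':-' also falls back on empty values.
        if missing or (op == ":-" and empty):
            hits.append(name)
    return hits


spec = "cache_dir: ${CACHE_DIR:-/tmp/cache}\naccount: ${ACCOUNT-default}\nhome: ${HOME}\n"
print(consumed_fallbacks(spec, {"ACCOUNT": "proj42", "HOME": "/home/me"}))
```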

Plan and Run

Command	Use it for	Notes
`plan`	Validate and preview the static runtime plan	Recommended before every first run. `--show-script` prints the generated launcher to stdout without writing a file; `--explain` adds actionable cache, resume, preflight, and next-command hints.
`validate`	Check YAML shape and field validation	Add `--strict-env` when interpolation fallbacks should fail.
`lint`	Run stricter opinionated static checks	Flags risky-but-valid specs such as weak dependency readiness, unusual memory/CPU ratios, ignored services that can write shared paths, node-local cache or volume paths, and implicit `depends_on` conditions. Warnings fail by default; add `--allow-warnings` to make warning-only results successful. Pass `--fix` to apply auto-fixable findings in place (preview with `--fix --dry-run`).
`config`	Show the fully interpolated effective config	Use `--format json` when you need stable machine-readable snapshots or resume diffs. `config --variables` reports only interpolation variables referenced by the compose file and redacts sensitive-looking names unless `--show-values` is passed.
`schema`	Print the checked-in JSON Schema	Use it for editor integration and authoring tools. Defaults to the compose schema; pass `--kind settings` for the `settings.toml` authoring schema. The compose schema is also published with the docs site for YAML Language Server and SchemaStore consumption. Rust validation remains the semantic source of truth.
`inspect`	View the normalized runtime plan	`--verbose` shows resolved argv and final mount mappings with secret values redacted. Add `--dependencies` for a service DAG in text, DOT, or JSON form.
`preflight`	Check host and cluster prerequisites	Use `--strict` when warnings should block a later run.
`doctor cluster-report`	Generate a best-effort cluster capability profile	Writes `.hpc-compose/cluster.toml` by default; use `--out -` to print the TOML profile.
`doctor readiness`	Explain or run one service readiness probe from the current host	Does not start services or submit jobs. Use `--run` only against an already reachable endpoint, tracked log, tunnel, or login-node-visible service.
`doctor mpi-smoke`	Render or run a small MPI probe for one service	Reports requested/advertised MPI types, MPI profile metadata, discovered MPI installs, host MPI binds/env, and rendered `srun`; add `--submit` to consume a Slurm allocation.
`doctor fabric-smoke`	Render or run MPI/NCCL/UCX/OFI smoke probes for one MPI service	Use `--checks auto` or a comma-separated list such as `mpi,nccl`; render-only by default, `--submit` consumes a Slurm allocation.
`weather`	Show advisory live cluster conditions	One-shot dashboard from `sinfo`, `squeue`, optional `sshare`, and optional `sprio`; does not reserve resources or change submission behavior.
`prepare`	Import images and build prepared runtime artifacts	Use `--force-rebuild` when the base image or prepare inputs changed.
`render`	Write the generated launcher script without submitting	Good for reviewing the final batch script.
`up`	Run the one-command launch/watch/logs workflow	Preferred normal run on a real cluster. Uses a spec-scoped `.hpc-compose/locks/*.up.lock` to prevent concurrent `up` races.
`test`	Smoke-test a finite spec end to end	Requires explicit `--local` or `--submit`; every service must start, pass configured readiness, and complete successfully.
`dev`	Run local hot-reload mode	Watches bind-mounted source directories and restarts affected services through the local supervisor.
`tmux`	Open a multi-pane local service log dashboard	Tails one tracked local service log per pane; tmux does not own service processes.
`germinate`	Submit a short canary (default one minute) and recommend resource settings	Writes `latest-canary.json`, keeps normal `latest.json` untouched, and prints a manual YAML patch.
`sweep submit`	Submit many independent trials from a top-level `sweep` block	Each trial is a tracked Slurm allocation. Use `--dry-run` first and `--max-trials` for intentional fanout above 100.
`when`	Submit after cluster conditions are met	Prepares and renders now, then monitors typed conditions such as idle nodes, prior job completion, or a local time window before calling `sbatch`.
`alloc`	Open an interactive Slurm allocation for iterative service runs	Uses top-level `x-slurm` allocation settings, exports `HPC_COMPOSE_*`, and lets `run SERVICE -- CMD` reuse the active allocation.
`run`	Launch a one-off command	Service mode uses an existing compose service. Image mode uses `--image IMAGE -- CMD` and builds an ephemeral one-service plan.
`shell`	Open an interactive Pyxis shell	Thin wrapper around `srun --pty --container-image=<image> bash -l`.
`notebook`	Launch a tracked JupyterLab or VS Code notebook server	Submits a single-service Slurm job (or `--local`), waits for readiness, and prints the connection URL plus an SSH tunnel hint for Jupyter. `--format json` emits `{url, tunnel_hint, compute_node, login_host, job_id, next_commands}` as a single object. Set `login_host` in settings so the tunnel names your real SSH login host. Stop with `hpc-compose cancel`.

hpc-compose plan -f compose.yaml
hpc-compose plan --explain -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
hpc-compose validate -f compose.yaml
hpc-compose lint -f compose.yaml
hpc-compose lint -f compose.yaml --allow-warnings
hpc-compose lint -f compose.yaml --fix
hpc-compose lint -f compose.yaml --fix --dry-run
hpc-compose lint -f compose.yaml --format json
hpc-compose config -f compose.yaml
hpc-compose config -f compose.yaml --variables
hpc-compose schema > hpc-compose.schema.json
hpc-compose inspect --verbose -f compose.yaml
hpc-compose inspect --dependencies -f compose.yaml
hpc-compose inspect --dependencies --dependencies-format dot -f compose.yaml
hpc-compose preflight -f compose.yaml
hpc-compose doctor cluster-report
hpc-compose doctor readiness -f compose.yaml --service api
hpc-compose doctor readiness -f compose.yaml --service api --run
hpc-compose doctor readiness -f compose.yaml --service api --run --log-file .hpc-compose/<job-id>/logs/api.log
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --script-out mpi-smoke.sbatch
hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --submit
hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto --script-out fabric-smoke.sbatch
hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks mpi,nccl --submit
hpc-compose weather
hpc-compose weather --format json
hpc-compose prepare -f compose.yaml
hpc-compose render -f compose.yaml --output job.sbatch
hpc-compose up -f compose.yaml
hpc-compose up --hold-on-exit always -f compose.yaml
hpc-compose up --watch-queue --queue-warn-after 15m -f compose.yaml
hpc-compose up --detach --format json -f compose.yaml
hpc-compose up --detach --format json --print-endpoints -f compose.yaml
hpc-compose test --local -f compose.yaml
hpc-compose test --submit --time 00:01:00 -f compose.yaml
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach
hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --format json
hpc-compose germinate -f compose.yaml --dry-run --script-out canary.sbatch
hpc-compose sweep submit -f compose.yaml --dry-run
hpc-compose sweep submit -f compose.yaml --max-trials 200
hpc-compose sweep results -f compose.yaml --format csv > runs.csv
hpc-compose sweep results -f compose.yaml --include score,energy --format json
hpc-compose score --sweep sweep-1700000000-1234 -f compose.yaml --format json
hpc-compose stats --sweep sweep-1700000000-1234 -f compose.yaml --format json
hpc-compose sweep status -f compose.yaml --format json
hpc-compose sweep list -f compose.yaml
hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345
hpc-compose when -f compose.yaml --between 22:00-06:00
hpc-compose when --detach --format json -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose alloc -f compose.yaml
hpc-compose run app -- python -m smoke_test
hpc-compose run --image docker://python:3.12 --resources cpu-small -- python -V
hpc-compose shell --image docker://ubuntu:24.04
hpc-compose notebook --kind jupyter --gpus 1 --volume ./project:/workspace
hpc-compose notebook --kind vscode --image ghcr.io/example/code:1 --gpus 1
hpc-compose notebook --local --kind jupyter
hpc-compose notebook --kind jupyter --format json

Lint rules

hpc-compose lint emits stable finding codes after validation and planning succeed. Warning-level findings fail the command by default; pass --allow-warnings to downgrade them to advisory so a warning-only run still succeeds.

Code	Severity	Trigger	Recommendation
`HPC001`	warning	A service uses `depends_on` with `condition: service_started` against an upstream service that has no `readiness` probe. The dependency may fire before the upstream is actually ready.	Add a `readiness` block to the upstream service, or switch to `service_completed_successfully` for one-shot dependencies.
`HPC002`	warning	`x-slurm.mem` gives fewer than 512 MiB or more than 512 GiB per requested CPU. Very low ratios may OOM; very high ratios may queue poorly or violate site policy.	Adjust `x-slurm.mem` or CPU/task counts to land in the expected band.
`HPC003`	warning	A service with `failure_policy.mode: ignore` has a writable mount from a shared cache path. Ignored failures can leave corrupt state for subsequent jobs.	Use a read-only mount, write to job-local scratch, or avoid `mode: ignore` for services that mutate shared state.
`HPC004`	warning	`x-slurm.cache_dir` resolves under a node-local root (`/tmp`, `/var/tmp`, `/private/tmp`, `/dev/shm`). Compute nodes typically cannot see these paths, so the cache is rebuilt every job.	Point `x-slurm.cache_dir` at shared storage visible from both login and compute nodes. Advisory only; `--fix` will not rewrite paths.
`HPC005`	warning	A service `volumes` host path lives under a node-local root. The mount will be missing or empty on compute nodes.	Move the host path under shared storage, or use job-local scratch. Advisory only; `--fix` will not rewrite paths.
`HPC006`	warning (fixable)	A `depends_on` edge has no explicit `condition:` (list-form `depends_on: [name]`, or mapping form with the `condition:` key omitted). The implicit `service_started` default is easy to misread.	Make the condition explicit. `hpc-compose lint --fix` writes the current default for you, preserving comments and formatting everywhere else.
`HPC007`	warning	A service’s remote image uses a mutable or missing tag (`:latest`, or no tag at all) instead of an immutable pin. Such tags drift, so a rerun can silently pull a different image — and a typo’d tag is only caught at Enroot import time.	Pin by digest (`repo/name@sha256:...`), or at least an explicit non-`latest` version tag, for reproducible runs. Advisory only; `lint --fix` will not rewrite image references.
`HPC900`	warning	Cluster profile advisory: the site cluster profile (`doctor cluster-report`) detected a runtime-plan mismatch such as a shared-cache path, port-range overlap, or MPI configuration concern.	Inspect the finding message for the specific cluster-level concern and adjust the spec or cluster profile accordingly.
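
Three of the triggers in the table are simple enough to sketch in Python. These are illustrations of the stated rules, not the linter's code: the function names are hypothetical, memory is assumed to be already converted to bytes, symlinks are not resolved, and the image check is a plain string heuristic.

```python
from pathlib import PurePosixPath

MIB = 1024 ** 2
GIB = 1024 ** 3
NODE_LOCAL_ROOTS = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def hpc002(mem_bytes: int, cpus: int) -> bool:
    """True when memory per CPU is below 512 MiB or above 512 GiB."""
    per_cpu = mem_bytes / cpus
    return per_cpu < 512 * MIB or per_cpu > 512 * GIB


def hpc004(cache_dir: str) -> bool:
    """True when cache_dir is a node-local root or lives under one."""
    path = PurePosixPath(cache_dir)
    roots = [PurePosixPath(root) for root in NODE_LOCAL_ROOTS]
    return any(path == root or root in path.parents for root in roots)


def hpc007(image: str) -> bool:
    """True when an image has no digest and either no tag or the latest tag."""
    if "@sha256:" in image:
        return False
    last = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" not in last:
        return True
    return last.rsplit(":", 1)[1] == "latest"


print(hpc002(1 * GIB, 4), hpc004("/tmp/hpc-cache"), hpc007("python:latest"))
```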

Auto-fixable findings

hpc-compose lint --fix applies every fixable finding directly to the compose file. Today only HPC006 (implicit depends_on condition) is auto-fixable, because the rewrite is deterministic and semantics-preserving: the implicit service_started default is written out verbatim, so the rendered Slurm script is byte-identical.

hpc-compose lint -f compose.yaml --fix --dry-run   # preview the diff
hpc-compose lint -f compose.yaml --fix              # apply in place

The rewriter edits only the located depends_on block; comments, blank lines, and author formatting elsewhere are preserved byte-for-byte. A safety gate re-parses and re-plans the file after each run; if anything fails to reload, the original file is restored. Path findings (HPC004, HPC005) are intentionally not auto-fixed because the correct replacement is cluster-specific.

Editor Schema

The checked-in schema is draft-07 JSON Schema and is published with the docs site at /schema/hpc-compose.schema.json. SchemaStore should associate it only with hpc-compose-specific filenames: hpc-compose.yaml, hpc-compose.yml, *.hpc-compose.yaml, and *.hpc-compose.yml. Generic compose.yaml remains a supported input file, but it is intentionally not claimed for zero-config editor association.
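As a rough illustration of the filename association described above, the sketch below checks a filename against the four listed patterns. It uses Python's `fnmatchcase` as a stand-in for whatever glob matching an editor or SchemaStore actually performs, and the helper name `is_auto_associated` is hypothetical.

```python
from fnmatch import fnmatchcase

# Filename patterns listed in the text for zero-config schema association.
PATTERNS = (
    "hpc-compose.yaml",
    "hpc-compose.yml",
    "*.hpc-compose.yaml",
    "*.hpc-compose.yml",
)


def is_auto_associated(filename: str) -> bool:
    """Hypothetical helper: True when the filename matches a listed pattern.

    Assumption: case-sensitive glob matching approximates editor behavior.
    """
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


print(is_auto_associated("train.hpc-compose.yaml"))  # True
print(is_auto_associated("compose.yaml"))            # False
```

A generic compose.yaml is still valid input for the CLI; it simply needs an explicit schema mapping in the editor.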

`up` Options

Useful workflow flags:

--local runs a Pyxis/Enroot plan on the current Linux host instead of calling sbatch.
--detach submits or launches and returns after tracking metadata is written.
--format text|json is accepted with --detach or --dry-run.
--watch-queue waits in line-oriented queue output until the Slurm job reaches RUNNING, then opens the normal watch view.
--queue-warn-after <DURATION> warns once when --watch-queue stays PENDING longer than the threshold; the default is 10m, and 0 disables the warning.
--watch-mode auto|tui|line selects the live output mode.
--hold-on-exit never|failure|always controls whether the TUI stays open after the job reaches a terminal scheduler state.
--allow-resume-changes acknowledges an intentional change to resume-coupled config between tracked runs.
--resume-diff-only prints the resume-sensitive config diff without submitting.
--script-out <PATH> keeps a copy of the rendered batch script.
--remote[=<DEST>] delegates submission to a login node over SSH; without <DEST> it uses login_host from settings. <DEST> may be a host, an ~/.ssh/config alias, or user@host. Without an inline user@, the SSH user is taken from HPC_COMPOSE_REMOTE_USER, then settings login_user (profile over defaults), then your ~/.ssh/config. It cannot be combined with --local, --watch-queue, --script-out, or non-detached --watch-mode tui. See Submit From Your Laptop With up --remote.
--remote-install <auto|never|force> (default auto; env HPC_COMPOSE_REMOTE_INSTALL) controls remote auto-install. up --remote probes the login node for hpc-compose and, under auto, installs the newest release into ~/.local/bin when it is missing or older than your local version. force always reinstalls before delegating; never only probes and fails with the manual install command when the binary is missing or old (use on locked-down or air-gapped login nodes).
--force-rebuild refreshes imported and prepared artifacts before launch.
--skip-prepare reuses an already-prepared image cache and builds nothing. On a first run (or after cache eviction) the image is not prepared yet, so preflight reports it as not-yet-prepared; run up/prepare once without --skip-prepare first. See Troubleshooting.
--keep-failed-prep leaves the failed Enroot rootfs behind for inspection.
Array jobs (x-slurm.array) require --detach because live watch/log fan-out is not array-aware yet.
Scheduler dependencies from x-slurm.after_job and x-slurm.dependency are passed as sbatch --dependency=....
stats --sweep <ID> and score --sweep <ID> require a real sweep id from sweep list; latest is not a special sentinel for those options.
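The SSH user precedence described for --remote above can be sketched as a small lookup chain. This is an illustrative model only, not the tool's implementation: the function name and the dictionary-shaped settings arguments are hypothetical, and treating empty values as unset is an assumption.

```python
from typing import Mapping, Optional


def resolve_remote_user(
    dest: str,
    env: Mapping[str, str],
    profile_settings: Mapping[str, str],
    default_settings: Mapping[str, str],
) -> Optional[str]:
    """Return the SSH user, or None to defer to ~/.ssh/config.

    Order from the text: inline user@, then HPC_COMPOSE_REMOTE_USER,
    then settings login_user (profile over defaults), then ssh config.
    """
    if "@" in dest:
        # An inline user@host always wins.
        return dest.split("@", 1)[0]
    env_user = env.get("HPC_COMPOSE_REMOTE_USER")
    if env_user:  # assumption: an empty value counts as unset
        return env_user
    for settings in (profile_settings, default_settings):
        user = settings.get("login_user")
        if user:
            return user
    # Nothing configured: let ssh pick the user from ~/.ssh/config.
    return None


# Hypothetical values for illustration.
print(resolve_remote_user("alice@login1", {}, {}, {"login_user": "bob"}))  # alice
print(resolve_remote_user("login1", {}, {"login_user": "carol"}, {"login_user": "bob"}))  # carol
```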

Tool overrides

Commands that interact with Slurm or container runtimes accept --<tool>-bin <PATH> flags to point at non-default executables. This is useful when tools live outside PATH or when testing against fake binaries.

Flag	Default	Accepted by
`--sbatch-bin`	`sbatch`	`up`, `when`, `germinate`, `test`, `run`, `notebook`, `sweep submit`, `preflight`, `debug`, `doctor`
`--srun-bin`	`srun`	`up`, `when`, `alloc`, `germinate`, `test`, `run`, `notebook`, `shell`, `sweep submit`, `preflight`, `debug`, `doctor`
`--squeue-bin`	`squeue`	`up`, `when`, `germinate`, `test`, `run`, `notebook`, `watch`, `status`, `stats`, `ps`, `inspect`, `score`, `diff`, `reach`, `experiment show`, `sweep status`, `sweep observe`, `sweep stop`, `sweep results`, `debug`, `weather`
`--sacct-bin`	`sacct`	`up`, `when`, `germinate`, `test`, `run`, `notebook`, `watch`, `status`, `stats`, `ps`, `inspect`, `score`, `diff`, `reach`, `experiment show`, `sweep status`, `sweep observe`, `sweep stop`, `sweep results`, `debug`
`--salloc-bin`	`salloc`	`alloc`
`--scontrol-bin`	`scontrol`	`alloc`, `sweep submit`, `preflight`, `debug`, `doctor`
`--sinfo-bin`	`sinfo`	`when`, `weather`
`--scancel-bin`	`scancel`	`test`, `cancel`, `down`, `sweep observe`, `sweep stop`
`--sstat-bin`	`sstat`	`germinate`, `stats`, `inspect`, `score`, `experiment show`, `sweep results`
`--sshare-bin`	`sshare`	`weather`
`--sprio-bin`	`sprio`	`weather`
`--enroot-bin`	`enroot`	`up`, `when`, `alloc`, `germinate`, `test`, `dev`, `tmux`, `run`, `notebook`, `sweep submit`, `prepare`, `preflight`, `debug`, `doctor`
`--apptainer-bin`	`apptainer`	`up`, `when`, `alloc`, `germinate`, `test`, `dev`, `tmux`, `run`, `notebook`, `sweep submit`, `prepare`, `preflight`, `debug`, `doctor`
`--singularity-bin`	`singularity`	`up`, `when`, `alloc`, `germinate`, `test`, `dev`, `tmux`, `run`, `notebook`, `sweep submit`, `prepare`, `preflight`, `debug`, `doctor`
`--huggingface-cli-bin`	`huggingface-cli`	`up`, `when`, `alloc`, `germinate`, `test`, `dev`, `tmux`, `run`, `notebook`, `sweep submit`, `prepare`
`--tmux-bin`	`tmux`	`tmux`

Note: doctor accepts the --*-bin overrides only through its deprecated top-level flag form, not the recommended doctor <subcommand> forms (doctor cluster-report, doctor readiness, doctor mpi-smoke, doctor fabric-smoke), which reject them with an “unexpected argument” error.

Settings profiles can also configure these via [defaults.binaries] or [profiles.<name>.binaries] (see Runbook).

`germinate` Canary Runs

germinate is the conservative right-sizing workflow:

hpc-compose germinate -f compose.yaml
hpc-compose germinate -f compose.yaml --canary-time 00:01:00 --metrics-interval 5
hpc-compose germinate -f compose.yaml --pending-timeout 30m --format json

Useful options:

--canary-time <TIME> defaults to 00:01:00.
--metrics-interval <SECONDS> defaults to 5 and is forced on in the canary plan.
--pending-timeout <DURATION> defaults to 30m.
--min-cpus <N>, --min-mem <MEM>, and --min-gpus <N> set canary floors.
--dry-run renders the canary script without calling sbatch.
--skip-prepare, --force-rebuild, --keep-failed-prep, --no-preflight, and --script-out match the normal preparation flags.

The command rejects x-slurm.array and never rewrites your compose file automatically. See Right-Size With Canary Runs.

`sweep` Hyperparameter Sweeps

sweep expands the top-level sweep block in a compose file. Each generated trial is rendered and submitted as an independent tracked Slurm job; sweep status and sweep list read the persisted manifest under .hpc-compose/sweeps/.

When the sweep sets replicates: N, each parameter config fans out into N seeded trials (t000r0, t000r1, …). sweep status, sweep observe, and sweep results then add a per-config mean±std(n) rollup (text and a groups array in JSON), sweep results --format csv gains config_key and replicate columns, and best_trial ranks on the per-config group mean rather than the single best replicate.
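The difference between ranking on the group mean and ranking on the single best replicate can be seen in a small sketch. It assumes a sample standard deviation (the text does not say which estimator the tool uses), reports 0.0 for a single replicate, and uses hypothetical config keys and objective values.

```python
from statistics import mean, stdev


def rollup(trials, direction="minimize"):
    """Group (config_key, objective) pairs and rank configs by group mean.

    Assumption: sample standard deviation; 0.0 when only one replicate exists.
    """
    groups = {}
    for config_key, value in trials:
        groups.setdefault(config_key, []).append(value)
    summary = {
        key: {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
            "n": len(values),
        }
        for key, values in groups.items()
    }
    pick = min if direction == "minimize" else max
    best = pick(summary, key=lambda key: summary[key]["mean"])
    return summary, best


# Hypothetical losses: t000 holds the single best replicate (0.10),
# but t001 has the lower mean and therefore wins when minimizing.
trials = [("t000", 0.10), ("t000", 0.50), ("t001", 0.25), ("t001", 0.27)]
summary, best = rollup(trials)
print(best)  # t001
```

Ranking on the mean keeps one lucky seed from deciding the winner, which is the point of running replicates in the first place.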

hpc-compose sweep submit -f train.yaml --dry-run
hpc-compose sweep submit -f train.yaml --max-trials 200
hpc-compose sweep submit -f train.yaml --format json
hpc-compose sweep status -f train.yaml
hpc-compose sweep status -f train.yaml --sweep-id sweep-123 --format json
hpc-compose sweep list -f train.yaml --format json

sweep submit options:

Option	Use it for
`-f`, `--file <FILE>`	Select the compose file containing the embedded `sweep` block.
`--dry-run`	Expand and validate all trials without writing manifests, scripts, or job records.
`--max-trials <N>`	Permit real submissions above the default 100-trial fanout guard.
`--skip-prepare`	Reuse existing prepared artifacts and skip image preparation.
`--force-rebuild`	Refresh imported/prepared artifacts for each submitted trial.
`--no-preflight`	Skip preflight checks before trial submission.
`--format text|json`	Choose text or JSON output.

sweep status options:

Option	Use it for
`-f`, `--file <FILE>`	Select the compose file whose sweep manifests should be read.
`--sweep-id <ID>`	Inspect a specific sweep instead of `.hpc-compose/sweeps/latest.json`.
`--format text|json`	Choose text or JSON output.

sweep list options:

Option	Use it for
`-f`, `--file <FILE>`	Select the compose file whose sweep directory should be scanned.
`--format text|json`	Choose text or JSON output.

sweep observe options:

Option	Use it for
`-f`, `--file <FILE>`	Select the compose file whose sweep manifest should be observed.
`--sweep-id <ID>`	Observe a specific sweep instead of the latest.
`--watch`, `--stop-when <EXPR>`	Poll until a terminal trial satisfies the objective threshold, then stop the sweep.
`--poll-interval <DURATION>`, `--timeout <DURATION>`	Tune the `--watch` polling cadence and deadline.
`--scaling`	Print a read-only post-hoc scaling report (objective vs `objective.scaling_axis`: log-log slope plus speedup/efficiency over terminal trials).
`--format text|json`	Choose text or JSON output.

sweep stop options:

Option	Use it for
`-f`, `--file <FILE>`	Select the compose file whose sweep manifest should be stopped.
`--sweep-id <ID>`	Stop a specific sweep instead of the latest.
`--reason <REASON>`	Record a free-form stop reason on the manifest.
`--yes`	Skip the interactive confirmation prompt.
`--format text|json`	Choose text or JSON output.

sweep stop cancels every still-running or pending trial of a sweep with scancel and records the stop on the manifest. Use it after sweep observe to realize early termination once an objective threshold is met.

See Hyperparameter Sweeps for the sweep spec shape, interpolation rules, status categories, and current limitations.

`when` Conditional Submission

when is a foreground monitor for constrained partitions and off-hour workflows. It runs the normal pre-submit work first, then polls until every supplied condition is true:

hpc-compose when -f compose.yaml --partition gpu8 --free-nodes 4
hpc-compose when -f compose.yaml --after-job 12345 --after-job-condition afterok
hpc-compose when -f compose.yaml --between 22:00-06:00

All conditions must hold (logical AND). --free-nodes counts only idle rows from sinfo -h -p <partition> -o "%T|%D" and requires --partition to match x-slurm.partition. --after-job polls squeue first and then sacct; afterok and afternotok fail immediately when the prior job reaches a terminal state that can never satisfy the requested condition. --between uses local login-node wall-clock time and supports wraparound windows such as 22:00-06:00.
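Two of the checks above lend themselves to a short sketch: the wraparound test for --between and the idle-row count for --free-nodes. Both functions are illustrative assumptions rather than the tool's code. In particular, treating the window end as exclusive and matching the state string exactly as `idle` are guesses about details the text does not specify.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """True when `now` falls inside start-end, allowing wraparound past midnight.

    Assumption: start is inclusive and end is exclusive.
    """
    if start <= end:
        return start <= now < end
    # Wraparound window such as 22:00-06:00 spans midnight.
    return now >= start or now < end


def idle_nodes(sinfo_output: str) -> int:
    """Sum node counts from `%T|%D` rows whose state is exactly "idle"."""
    total = 0
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line:
            continue
        state, count = line.split("|", 1)
        if state == "idle":  # assumption: suffixed states are not counted
            total += int(count)
    return total


print(in_window(time(23, 30), time(22, 0), time(6, 0)))  # True
print(in_window(time(12, 0), time(22, 0), time(6, 0)))   # False
print(idle_nodes("idle|3\nmixed|2\nallocated|5\nidle|1\n"))  # 4
```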

Useful options:

--poll-interval <DURATION> defaults to 60s; the minimum is 5s.
--timeout <DURATION> gives up if conditions are not met; 0s performs one check.
--detach returns after submission and tracking metadata are written.
--format json is accepted with --detach and returns the condition summaries plus normal submission metadata.
--skip-prepare, --force-rebuild, --keep-failed-prep, --no-preflight, and --script-out match the corresponding up preparation flags.

Example JSON automation:

hpc-compose when --detach --format json -f compose.yaml --partition gpu8 --free-nodes 4

There is no x-when YAML field. Conditional submission is intentionally a CLI workflow layered over the normal compose spec.

`up --local`

up --local launches a Pyxis/Enroot plan on the current host instead of calling sbatch. It is useful for local authoring and script inspection, not for distributed Slurm execution.

hpc-compose up --local --dry-run -f compose.yaml

Current constraints:

Linux hosts only
runtime.backend: pyxis only
single-host specs only
no distributed or partitioned placement
no services.<name>.x-slurm.extra_srun_args
no services.<name>.x-slurm.mpi
no x-slurm.array
no scheduler dependencies from x-slurm.after_job or x-slurm.dependency
reservation-related x-slurm.submit_args are ignored
x-slurm.error is ignored, and local batch stderr is written into the tracked local batch log

up --local follows the tracked local launch immediately, just like up does for a submitted job. Add --detach when you want to launch and return.

In local mode the batch script also exports HPC_COMPOSE_BACKEND_OVERRIDE=local, HPC_COMPOSE_LOCAL_ENROOT_BIN pointing to the resolved enroot binary, and HPC_COMPOSE_LOCAL_BIN_DIR containing a generated srun shim. These variables are internal to hpc-compose and not intended for direct use in compose specs.

Development Workflow

test, dev, and tmux are intentionally small workflows layered over the same render/prepare/tracking machinery as up. See Development Workflow for the smoke-test guide, hot-reload behavior, and local-mode constraints.

test is for finite smoke specs:

hpc-compose test --local -f compose.yaml
hpc-compose test --submit --time 00:01:00 --timeout 180s -f compose.yaml
hpc-compose test --submit --format json -f compose.yaml

Success means all tracked services appear in runtime state, launched at least once, passed readiness when readiness is configured, and completed successfully. Long-running application specs should use a smoke-test variant of the command or service entrypoint that exits after proving the workflow.

Useful test options:

Option	Use it for
`--local`	Run the finite smoke spec through the local supervisor.
`--submit`	Submit the finite smoke spec to Slurm; required before any scheduler submission happens.
`--time <TIME>`	Override Slurm wall time for `--submit`; defaults to `00:01:00`.
`--wait-timeout <DURATION>` (alias `--timeout`)	Stop waiting and best-effort cancel/cleanup after the timeout; defaults to `180s`.
`--format json`	Emit phase status, job id, script path, per-service results, and failure reason for automation.

dev is local-only and watches host directories from service volumes:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose dev -f compose.yaml --watch-paths ./src --debounce-ms 500

Directory bind mounts are mapped back to affected services. File mounts, missing paths, container-only paths, cache paths, and non-directory paths are ignored. --watch-paths adds an explicit directory and restarts all services when it changes. By default, leaving dev stops the local supervisor; use --keep-running when you want the tracked local job to continue.

Useful dev options:

Option	Use it for
`--watch-paths <PATH>`	Add an explicit watch root when mounted source directories cannot be inferred.
`--debounce-ms <N>`	Coalesce rapid file changes before requesting a restart.
`--keep-running`	Leave the local supervisor alive when the watch loop exits.
`--tui`	Open the live watch TUI while file-watching restarts services in the background.

tmux opens a log dashboard for local runs:

hpc-compose tmux -f compose.yaml
hpc-compose tmux -f compose.yaml --job-id local-123
hpc-compose tmux -f compose.yaml --session demo --no-attach

When --job-id is omitted, tmux launches a new local run first. Each pane runs tail -F against one tracked service log and uses the service name as the pane title.

Useful tmux options:

Option	Use it for
`--job-id <ID>`	Attach the dashboard to an existing tracked local run.
`--session <NAME>`	Choose the tmux session name instead of `hpc-compose-<job-id>`.
`--no-attach`	Create/update the dashboard without requiring an interactive terminal.
`--lines <N>`	Set the initial `tail -n` history for each pane.

`run` and `shell`

run has two forms:

hpc-compose run [-f compose.yaml] SERVICE -- CMD [ARGS...]
hpc-compose run --image IMAGE [--resources NAME] [--time T] [--mem M] [--cpus-per-task N] [--gpus N] [--partition P] [--env K=V] [--dataset PATH] [--output DIR] [--local] -- CMD [ARGS...]

Service mode reuses the named service’s image, environment, mounts, working directory, and prepare rules, clears depends_on, and submits a fresh tracked run job. When launched inside hpc-compose alloc, service mode detects HPC_COMPOSE_ALLOCATION=1 and SLURM_JOB_ID, prints the active allocation id, runs the one-service launcher inside the allocation with srun, and records the latest run metadata against the allocation job id. Image mode creates an ephemeral one-service plan from CLI flags, then follows the normal render/prepare/submit path. --resources refers to [resource_profiles.<name>] in settings; it is not the global --profile selector.

Image mode also accepts two batch-inference flags (both image-mode-only; using either without --image is an error):

--dataset <PATH> binds an existing shared-filesystem path read-only into the container and exposes its in-container location as HPC_COMPOSE_DATASET_DIR. The path is filesystem-based only; remote or registry schemes such as hf:// are rejected, and a non-existent path fails before any submission. Copy datasets onto the shared filesystem first.
--output <DIR> turns on artifact export: the in-container path exposed as HPC_COMPOSE_OUTPUT_DIR is collected after the job and exported into <DIR> (recorded as the run’s artifact_export_dir). Have the in-job command write its results under $HPC_COMPOSE_OUTPUT_DIR.

hpc-compose run --image docker://python:3.12 --dataset /scratch/data --output ./results -- python infer.py

alloc requests an interactive allocation through salloc:

hpc-compose alloc -f compose.yaml
hpc-compose alloc -f compose.yaml -- bash -lc 'hpc-compose run app -- python -m pytest'

It runs preflight and image preparation by default, accepts the matching up preparation flags (--no-preflight, --skip-prepare, --force-rebuild, and --keep-failed-prep), rejects x-slurm.array, and exports allocation metadata such as HPC_COMPOSE_COMPOSE_FILE, HPC_COMPOSE_CACHE_DIR, HPC_COMPOSE_NODELIST_FILE, and HPC_COMPOSE_PRIMARY_NODE.

shell is intentionally thinner:

hpc-compose shell --image IMAGE [--resources NAME] [--time T] [--mem M] [--cpus-per-task N] [--gpus N] [--partition P] [--env K=V]

It calls srun --pty directly with Pyxis --container-image and defaults to bash -l. It does not render an sbatch script or create tracked job metadata.

notebook launches a tracked interactive server:

hpc-compose notebook [--kind jupyter|vscode] [--image IMAGE] [--port N] [--token TOKEN]
                     [--volume HOST:CONTAINER]... [--working-dir PATH] [--tunnel-name NAME]
                     [--ready-timeout DURATION] [--follow] [--dry-run] [--local] [-- ARGS...]
                     [--resources NAME] [--time T] [--mem M] [--cpus-per-task N] [--gpus N]
                     [--partition P] [--env K=V]

It synthesizes a one-service compose job from the preset, runs the normal preflight/prepare/render path, submits (or launches locally with --local), waits for a log readiness signal, then prints the connection URL — a localhost Jupyter URL plus an SSH tunnel hint for Jupyter on Slurm, or the scraped vscode.dev link for VS Code. The session is a tracked job of kind notebook (see Notebook Sessions); stop it with hpc-compose cancel. --kind vscode requires --image because no universal default code image is shipped.

Accessible and Automation-Friendly Output

Use plain or structured output when terminal styling, progress labels, or alternate-screen interfaces make automation or assistive tooling harder:

hpc-compose --color never plan -f compose.yaml
hpc-compose --quiet validate -f compose.yaml
hpc-compose watch -f compose.yaml --watch-mode line
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose status -f compose.yaml --format json

context and config --variables intentionally scope interpolation variables to names referenced by the compose file. Values whose names look secret-bearing are shown as <redacted> by default; add --show-values only in trusted local diagnostics. A name triggers redaction when its upper-cased form contains any of these substrings: SECRET, TOKEN, PASSWORD, PASSWD, API_KEY, ACCESS_KEY, PRIVATE_KEY, CREDENTIAL, AUTH, COOKIE, SESSION, BEARER. Because the test is a substring match, names such as SESSION_DIR or AUTH_MODE also match.
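The substring rule is easy to model, which helps when predicting whether a variable name will be masked. The sketch below is not the tool's code: the function names are hypothetical, and only the marker list and the `<redacted>` placeholder come from the text.

```python
# Markers listed in the text; a name is checked after upper-casing.
MARKERS = (
    "SECRET", "TOKEN", "PASSWORD", "PASSWD", "API_KEY", "ACCESS_KEY",
    "PRIVATE_KEY", "CREDENTIAL", "AUTH", "COOKIE", "SESSION", "BEARER",
)


def is_redacted(name: str) -> bool:
    """True when the upper-cased name contains any marker as a substring."""
    upper = name.upper()
    return any(marker in upper for marker in MARKERS)


def display(name: str, value: str, show_values: bool = False) -> str:
    """Return the value, or <redacted> when the name looks secret-bearing."""
    if show_values or not is_redacted(name):
        return value
    return "<redacted>"


# Hypothetical variable names and values for illustration.
print(display("HF_TOKEN", "abc123"))        # <redacted>
print(display("session_dir", "/tmp/run"))   # <redacted>
print(display("DATA_DIR", "/shared/data"))  # /shared/data
```

Because harmless names such as SESSION_DIR are masked too, renaming a variable is usually simpler than reaching for --show-values.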

Tracked Runtime

Command	Use it for	Notes
`debug`	Diagnose the latest tracked run	Shows scheduler state, per-service state, batch and service log tails, missing-log hints, and a recommended next command. Add `--preflight` to rerun prerequisite checks.
`status`	Summarize scheduler state, the top-level batch log, per-service outcomes, and failure-policy state	Prefer `--format json` for automation. Add `--array` to include merged `squeue --array` and `sacct --array` task rows.
`ps`	Show a stable per-service runtime snapshot	Useful when you want a point-in-time view instead of the live TUI.
`watch`	Reconnect to the live watch UI	Falls back to line-oriented output on non-interactive terminals.
`reach`	Print the SSH tunnel to reach a tracked service from a laptop	Resolves the compute node from tracked status and the port from the service’s TCP/HTTP readiness, then prints an `ssh -L` command (with `ControlMaster` multiplexing so an OTP login node prompts once) or runs it in the foreground with `--open`. Pass `--port` for services without TCP/HTTP readiness; `--format json` emits `{service, job_id, compute_node, login_host, local_port, remote_port, url, ssh_command}`.
`experiment`	Read-only aggregator for one tracked run	Parent command; the `experiment show` subcommand aggregates a single run into one object.
`experiment show`	Aggregate one tracked run into a single read-only object	Combines scheduler status, the post-run efficiency score, the artifact manifest, and submit-time provenance into one object. `--format json` emits `{job_id, name, state, services[], provenance, results, efficiency, next_commands}`; each service carries `{name, nodelist, status, tunnel_hint}` with an `ssh -L` `ControlMaster` hint for TCP/HTTP readiness. Defaults to the latest tracked run; energy flags (`--pue`, `--gpu-tdp-w`, `--cpu-watts-per-core`) tune the embedded efficiency report. Static-safe: contacts the scheduler only as much as `status`/`score` do, writes nothing, and opens no connection. Example: `hpc-compose experiment show 12345 --format json`.
`replay`	Reanimate a tracked job timeline from existing artifacts	Best-effort DVR view built from final state, service-exit markers, metrics JSONL, and logs. Use `--speed` or `--format json` as needed.
`checkpoints`	Show attempt and requeue history from tracked state	Reads LOCAL tracked state only: the per-attempt `state.json` files written under `.hpc-compose/<job>/attempts/<n>/` when `x-slurm.resume` is configured (each requeue is a new attempt), or the single latest `state.json` otherwise (reported as one attempt, zero requeues). Reports per-attempt start/finish/duration and exit code. Contacts no scheduler and reads nothing from the cluster filesystem; missing or unreadable state degrades into `degraded[]` notes instead of failing. `--format json` emits `{job_id, compose_file, submitted_at, resume_configured, attempts, requeues, current_attempt, is_resume, resume_dir, entries[], degraded[]}`. Not to be confused with the `artifacts --bundle checkpoints` model-checkpoint export.
`logs`	Print tracked service logs	Add `--follow`, `--grep <pattern>`, or coarse `--since <duration>` as needed.
`inspect --rightsize`	Suggest conservative resource request reductions after a tracked run	Uses tracked `sacct`, `sstat`, and sampler evidence; supports `--job-id` and `--format json`.
`stats`	Report tracked runtime metrics, step stats, and optional accounting	Supports `--accounting`, `--format json`, `--format jsonl`, and `--format csv`.
`score`	Score post-run resource efficiency	Supports positional job ids, `--format json`, `--pue`, `--gpu-tdp-w`, and `--cpu-watts-per-core`.
`diff`	Compare two tracked job submissions, or an N-way matrix of several runs	Pairwise: two positional job ids, compact text by default, `--format json` for full detail. N-way matrix: `--across <SWEEP_ID>` compares every submitted trial of a sweep, or `--jobs a,b,c` compares an explicit list. The matrix shows one column per run and one row per field that differs in at least one run (fields identical across all runs are collapsed); pick `--matrix-format text|csv|json` (CSV emits `section,field,<job_id>...` for spreadsheets).
`artifacts`	Export tracked artifact bundles after a run	Use `--bundle <name>` and `--tarball` when needed.
`pull`	Print the `rsync` command to copy a tracked job’s artifacts to a laptop	Resolves the artifact payload directory from tracked state and prints an `rsync` line (with `ControlMaster` multiplexing so an OTP login node prompts once); `--into <DIR>` sets the local destination, `--format json` emits `{job_id, bundles, login_host, cluster_path, into, files, bytes, suggested_command, ssh_multiplex_hint}` (`login_host` omitted when not set). Read-only: copies nothing and opens no connection.
`cancel`	Cancel the latest tracked job or an explicit job id	Uses tracked metadata instead of making you retype paths.
`down`	Cancel a tracked job and clean tracked state	Supports `--purge-cache` when the tracked snapshot names concrete cache artifacts.
`jobs list`	Scan the current repo tree for tracked runs	Start here when you need to rediscover an older run.
`clean`	Remove old tracked job directories for one compose context	Use `--dry-run` first when you are unsure.
`rendezvous list`	List live shared-cache service records	Defaults to the resolved cache dir; `--cache-dir` inspects a specific cache.
`rendezvous resolve NAME`	Resolve one provider record	Prints endpoint fields or JSON for automation.
`rendezvous register NAME`	Manually register a provider record	Intended for debugging and custom workflows; declarative specs usually register providers.
`rendezvous prune`	Remove expired provider records	Cleans stale latest and historical rendezvous JSON files.

Add --remote[=<HOST>] to status, ps, stats, score, logs, or pull to run that command on the login node’s staged checkout from a prior up --remote, over SSH, streaming output back. With no value it uses the configured login_host; pass user@host to override.

hpc-compose debug -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
hpc-compose jobs list
hpc-compose status -f compose.yaml --format json
hpc-compose status -f compose.yaml --array
hpc-compose status -f compose.yaml --job-id 12345_7 --array
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml --watch-mode line
hpc-compose watch -f compose.yaml --hold-on-exit always
hpc-compose replay -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --job-id 12345 --service app
hpc-compose replay -f compose.yaml --format json
hpc-compose checkpoints -f compose.yaml
hpc-compose checkpoints --job-id 12345 --format json
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose inspect -f compose.yaml --rightsize
hpc-compose stats -f compose.yaml --format jsonl
hpc-compose stats -f compose.yaml --accounting --format csv
hpc-compose score 12345
hpc-compose diff 12345 12346 -f compose.yaml
hpc-compose diff --jobs 12345,12346,12347 --matrix-format json
hpc-compose diff --across sweep-1700000000-1234 --matrix-format csv
hpc-compose artifacts -f compose.yaml --bundle checkpoints --tarball
hpc-compose down -f compose.yaml --yes
hpc-compose cancel -f compose.yaml --yes
hpc-compose clean -f compose.yaml --age 7 --dry-run
hpc-compose rendezvous list
hpc-compose rendezvous resolve model-server
hpc-compose rendezvous register model-server --host node01 --port 8000 --job-id 12345
hpc-compose rendezvous prune

Cache Maintenance

Command	Use it for	Notes
`cache list`	Inspect cached image artifacts and staged dataset/model entries	Works without a compose file.
`cache inspect`	Show cache reuse expectations for the current plan	Supports `--service <name>` for one service.
`cache prune`	Remove old or unused cache entries	Covers image artifacts and staged dataset/model entries; `--age` and `--all-unused` are mutually exclusive.

hpc-compose cache list
hpc-compose cache inspect -f compose.yaml --service app
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>' --yes
hpc-compose cache prune --all-unused -f compose.yaml --yes

Spec Reference

This page describes the Compose subset that hpc-compose accepts today. Unknown or unsupported fields are rejected unless this page explicitly says otherwise.

How To Use This Reference

This page is intentionally complete. If you are new, start with Quickstart, Examples, and Runtime Backends, then use the table below to jump into the field group you need.

Need	Section
Overall YAML shape	Top-level shape and Top-level fields
Shared templates and overrides	`extends`
Runtime backend choice	`runtime` and Runtime Backends
Slurm allocation settings	`x-slurm`
Resource profiles	Resource profiles
Hyperparameter sweeps	`sweep` and Hyperparameter Sweeps
Secrets	`secrets` and Secrets
Service command, image, env, and mounts	Service fields, Image rules, `command` and `entrypoint`, `environment`, `volumes`
Startup ordering	`depends_on`, `readiness`, and `healthcheck`
Post-run contracts	`assert`
Multi-node placement and MPI	Multi-node placement rules, `services.<name>.x-slurm.placement`, and `services.<name>.x-slurm.mpi`
Prepared images	`x-runtime.prepare` and `x-enroot.prepare`
Metrics, artifacts, and resume	`x-slurm.metrics`, `x-slurm.artifacts`, and `x-slurm.resume`
Runtime env vars in services	Allocation metadata inside services
Unsupported Compose features	Unsupported Compose keys

Top-level shape

name: demo
version: "1"

runtime:
  backend: pyxis

x-slurm:
  time: "00:30:00"

services:
  app:
    image: python:3.11-slim
    command: python -m main

Top-level fields

Field	Shape	Default	Notes
`extends`	string	omitted	Top-level authoring-only path to a base spec. The base is resolved before interpolation, validation, planning, and `config` output.
`name`	string	omitted	Used as the Slurm job name when `x-slurm.job_name` is not set.
`version`	string `"1"` or integer `1`	`1`	hpc-compose spec schema version. Omit for v1 or set explicitly to `"1"`; Docker Compose values such as `"3.9"` are rejected after migration.
`runtime`	mapping	`backend: pyxis`	Selects the service runtime backend and GPU passthrough policy.
`services`	mapping	required	Must contain at least one service.
`steps`	mapping	alias for `services`	Use either `services` or `steps`, not both.
`modules`	list of strings	omitted	List-only shorthand for top-level `x-env.modules.load`; cannot be combined with `x-env.modules`.
`x-env`	mapping	omitted	Structured host-side module, Spack view, and environment setup shared by all services.
`x-slurm`	mapping	omitted	Top-level Slurm settings and shared runtime defaults.
`sweep`	mapping	omitted	Embedded hyperparameter sweep metadata consumed by `hpc-compose sweep submit/status/list`. Normal commands treat it as metadata.
`secrets`	mapping	omitted	Named secret sources resolved into the interpolation map and redacted in `config`/`context` output. See `secrets`.

`extends`

extends is an authoring feature for sharing base specs and service templates without copying large cluster-specific blocks. It is resolved before interpolation, validation, planning, rendering, tracked metadata, and hpc-compose config; the effective config no longer contains any extends keys.

Top-level extends points at a base YAML file:

extends: cluster-base.yaml

x-slurm:
  time: "02:00:00"

services:
  trainer:
    command: python train.py

Service-level extends supports three forms:

services:
  api:
    extends: base-service

  worker:
    extends: service-templates.yaml

  trainer:
    extends:
      file: ml-templates.yaml
      service: gpu-worker

Rules:

Top-level extends must be a file path string.
A service string that looks like a YAML file path, such as base.yaml, ../base.yml, or a path with a separator, uses the same service name from that file. Other strings refer to a service in the same file.
A service mapping can select { file, service }; omit file to select a service from the same file.
Extends chains are resolved recursively, and cycles are rejected.
Maps merge recursively. Sequences append base-first. Child scalars replace base scalars.
Service volumes merge by container target, so a child mount for /data replaces the base mount for /data while unrelated base mounts are kept.
Relative host paths in the final plan still resolve against the leaf compose file passed with -f.
There is no delete or unset syntax in this version.
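
The merge rules above can be sketched with a hypothetical base file and leaf file. The file names, paths, and values are illustrative assumptions, and the third document is the effective result expected from the stated rules, not captured `hpc-compose config` output.

```yaml
# cluster-base.yaml (hypothetical base file)
x-slurm:
  partition: gpu
  time: "01:00:00"
  setup:
    - module load enroot
services:
  trainer:
    image: python:3.12-slim
    volumes:
      - /shared/data:/data
      - /shared/models:/models
---
# compose.yaml (leaf file passed with -f)
extends: cluster-base.yaml
x-slurm:
  time: "02:00:00"            # child scalar replaces the base scalar
  setup:
    - source /shared/env.sh   # sequences append base-first
services:
  trainer:
    command: python train.py
    volumes:
      - /scratch/data:/data   # same container target: replaces the base /data mount
---
# Expected effective config (no extends key remains)
x-slurm:
  partition: gpu
  time: "02:00:00"
  setup:
    - module load enroot
    - source /shared/env.sh
services:
  trainer:
    image: python:3.12-slim
    command: python train.py
    volumes:                  # entry order shown here is illustrative
      - /scratch/data:/data
      - /shared/models:/models
```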

`sweep`

sweep defines trial variables for hpc-compose sweep submit. It is a top-level metadata block; every generated trial is still planned, rendered, submitted, and tracked as a normal one-allocation job.

Full Cartesian product:

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full

Random sample without replacement:

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix:
    random: 5
    seed: "optional-stable-seed"

Rules:

parameters must contain at least one key, and every value list must contain at least one scalar.
Parameter keys must be valid interpolation variable names: [A-Za-z_][A-Za-z0-9_]*.
Parameter keys must not use the reserved HPC_COMPOSE_SWEEP_ prefix.
Parameter values may be strings, numbers, or booleans. They are passed to interpolation as strings.
matrix: full expands the Cartesian product deterministically over sorted parameter names.
matrix.random must be at least 1 and cannot exceed the total number of combinations.
matrix.seed is optional. If omitted, sweep submit derives a seed from the new sweep id and persists it.
replicates (optional, default 1) submits N seeded replicate trials per parameter config. Each replicate is a separate allocation with a deterministic per-replicate seed; sweep status/observe roll up mean±std(n) per config. The --max-trials guard counts combinations × replicates. replicates: 0 is rejected; replicates: 1 is byte-identical to a non-replicated sweep (legacy t000 trial ids). See Hyperparameter Sweeps.
objective (optional) declares how sweep observe parses and ranks each terminal trial. Set direction (minimize/maximize) and exactly one parse source: log_pattern (a regex with optional capture group, default 1) or json_path + json_field. Optionally set objective.scaling_axis to the name of a numeric sweep parameter (e.g. nodes) to enable the read-only sweep observe --scaling report; the named parameter must exist under sweep.parameters and all its values must parse as numbers (both checked at validate time). See Scaling Reports.
sweep.spec is not supported; embed the sweep in the same compose file.
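
As a sketch of how the optional keys combine, the block below adds replicates and an objective to a full matrix. It assumes that replicates and objective sit directly under sweep, as the rule names suggest, and the log_pattern value targets a hypothetical log line format.

```yaml
sweep:
  parameters:
    lr: [0.001, 0.01]
    nodes: [1, 2]           # all values numeric, so usable as scaling_axis
  matrix: full              # 2 x 2 = 4 parameter configs
  replicates: 3             # 4 configs x 3 replicates = 12 trials against --max-trials
  objective:
    direction: minimize
    log_pattern: 'val_loss=([0-9.]+)'   # hypothetical log format; capture group 1 holds the metric
    scaling_axis: nodes     # must name a key under sweep.parameters
```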

For each trial, sweep variables override existing interpolation variables from .env, environment, settings, or --env. These reserved variables are also available:

Variable	Meaning
`HPC_COMPOSE_SWEEP_ID`	Persisted sweep id.
`HPC_COMPOSE_SWEEP_TRIAL`	Trial label such as `t000` (or `t000r0` with replicates).
`HPC_COMPOSE_SWEEP_TRIAL_INDEX`	Zero-based trial index.
`HPC_COMPOSE_SWEEP_REPLICATE`	Zero-based replicate index within the config (`0` when `replicates: 1`).
`HPC_COMPOSE_SWEEP_SEED`	Deterministic per-replicate seed; present only when `replicates` > 1.

Normal commands do not expand the sweep matrix. If the runnable spec contains ${lr} with no default, ordinary plan, up, and render still fail unless lr is provided. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, or use hpc-compose sweep submit --dry-run to validate sweep-only variables.
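
A minimal sketch of a spec that stays runnable outside a sweep: the default in ${lr:-0.001} lets ordinary plan and up succeed, while sweep submit overrides lr per trial. The train.py script and the LR and TRIAL variable names are assumptions for illustration.

```yaml
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
  matrix: full

services:
  trainer:
    image: python:3.12-slim
    command: python train.py
    environment:
      LR: ${lr:-0.001}                             # default keeps plan/up runnable
      TRIAL: ${HPC_COMPOSE_SWEEP_TRIAL:-manual}    # reserved variable, set only for sweep trials
```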

hpc-compose sweep submit rejects x-slurm.array, because every sweep trial is already its own allocation. See Hyperparameter Sweeps for manifests, status aggregation, objective ranking via sweep observe, and early termination via sweep stop.

`secrets`

secrets maps secret names to local file: or env: sources. Each value is resolved into the interpolation map tagged as a secret, so ${name} works in environment: and is redacted in config/context/inspect output regardless of its name.

secrets:
  hf_token:
    file: ./secrets/hf.txt
  db_password:
    env: DB_PASSWORD
services:
  app:
    image: redis:7
    environment:
      HF_TOKEN: ${hf_token}

See Secrets for the full redaction model, resolution order, and what is deferred (Vault/KMS, /run/secrets file mounts).

`x-env`

x-env is structured host-side software setup. It is available at the top level and under services.<name>.

x-env:
  modules:
    - cuda/12.4
    - openmpi/5
  spack:
    view: /shared/spack/views/ml
  env:
    HDF5_USE_FILE_LOCKING: "FALSE"

services:
  app:
    image: python:3.11-slim
    x-env:
      modules:
        purge: false
        load:
          - netcdf/4.9
      env:
        OMP_NUM_THREADS: "8"

Supported forms:

modules: [name, ...]
modules: { purge: bool, load: [name, ...] }
spack: { view: /path/to/view }
env: { KEY: VALUE }

Rules:

Top-level x-env renders before x-slurm.setup.
Service-level x-env renders immediately before that service’s srun.
env entries are exported on the host and forwarded into Pyxis containers.
Service-level x-env.env overrides top-level x-env.env when the same variable is set.
Top-level modules: [...] and service-level modules: [...] are shorthand for the matching x-env.modules.load list. The shorthand is list-only and cannot be combined with x-env.modules at the same scope.
spack.view prepends bin, lib, lib64, and Python site-package paths only when those directories exist.
Modules and Spack views are host-side setup. Container filesystem visibility still requires explicit volumes, x-slurm.mpi.host_mpi.bind_paths, or other site-specific binds.
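
The shorthand rule can be illustrated by two specs that should be equivalent. This sketch assumes purge may be omitted from the structured form; the module names are placeholders.

```yaml
# Shorthand form
modules:
  - cuda/12.4
services:
  app:
    image: python:3.11-slim
    modules:
      - netcdf/4.9
---
# Equivalent structured form
x-env:
  modules:
    load:
      - cuda/12.4
services:
  app:
    image: python:3.11-slim
    x-env:
      modules:
        load:
          - netcdf/4.9
```

Mixing modules and x-env.modules at the same scope is rejected, so pick one spelling per scope.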

Settings and lint commands

CLI behavior for the settings-aware commands (--profile, --settings-file, setup, context, validate --strict-env, lint, schema) and the full lint-rule table (HPC001-HPC900, including auto-fix) now live in the CLI Reference; see Settings-aware commands and Lint rules. This page describes only the YAML these commands operate on.

`x-slurm`

These fields live under the top-level x-slurm block.

Field	Shape	Default	Notes
`resources`	string	omitted	Name of a `[resource_profiles.<name>]` entry in `.hpc-compose/settings.toml`. Profile values are defaults only; explicit `x-slurm` fields win.
`job_name`	string	`name` when present	Rendered as `#SBATCH --job-name`.
`partition`	string	omitted	Passed through to `#SBATCH --partition`.
`account`	string	omitted	Passed through to `#SBATCH --account`.
`qos`	string	omitted	Passed through to `#SBATCH --qos`.
`time`	string	omitted	Passed through to `#SBATCH --time`.
`nodes`	positive integer	omitted	Slurm allocation node count. Defaults to `1` when omitted.
`ntasks`	positive integer	omitted	Passed through to `#SBATCH --ntasks`.
`ntasks_per_node`	positive integer	omitted	Passed through to `#SBATCH --ntasks-per-node`.
`cpus_per_task`	positive integer	omitted	Top-level Slurm CPU request.
`mem`	string	omitted	Passed through to `#SBATCH --mem`.
`gres`	string	omitted	Passed through to `#SBATCH --gres`.
`gpus`	positive integer	omitted	Used only when `gres` is not set.
`gpus_per_node`	positive integer	omitted	Passed through to `#SBATCH --gpus-per-node`.
`gpus_per_task`	positive integer	omitted	Passed through to `#SBATCH --gpus-per-task`.
`cpus_per_gpu`	positive integer	omitted	Passed through to `#SBATCH --cpus-per-gpu`.
`mem_per_gpu`	string	omitted	Passed through to `#SBATCH --mem-per-gpu`.
`gpu_bind`	string	omitted	Passed through to `#SBATCH --gpu-bind`.
`cpu_bind`	string	omitted	Passed through to `#SBATCH --cpu-bind`.
`mem_bind`	string	omitted	Passed through to `#SBATCH --mem-bind`.
`distribution`	string	omitted	Passed through to `#SBATCH --distribution`.
`hint`	string	omitted	Passed through to `#SBATCH --hint`.
`constraint`	string	omitted	Passed through to `#SBATCH --constraint`.
`output`	string	omitted	Passed through to `#SBATCH --output`.
`error`	string	omitted	Passed through to `#SBATCH --error`.
`chdir`	string	omitted	Passed through to `#SBATCH --chdir`.
`array`	string	omitted	Slurm array spec such as `0`, `1-10`, `1-10:2`, `0,3,8-12`, or `0-99%10`. Rendered as `#SBATCH --array`.
`after_job`	string or mapping	omitted	Scheduler dependency on a prior job id. String shorthand means `afterany:<id>`; mapping supports `{ id, condition }`.
`dependency`	string	omitted	Currently supports `singleton`, combined with `after_job` when both are set.
`cache_dir`	string	settings profile, settings defaults, then `$HOME/.cache/hpc-compose`	Must resolve to shared storage visible from the login node and the compute nodes.
`enroot_temp_dir`	string	`<cache_dir>/enroot/tmp`	Override for enroot’s prepare-time temporary extraction scratch (`ENROOT_TEMP_PATH`), separate from `cache_dir`. Point it at fast node-local storage (e.g. `/tmp/${USER}-hpc-compose-enroot`) when the shared cache filesystem raises `Stale file handle` (ESTALE) errors during image import; the final image and layer cache still live under `cache_dir`.
`runtime_root`	string	`<submit_dir>/.hpc-compose`	Directory that holds per-job runtime state (`<runtime_root>/<job_id>/{logs,metrics,state.json,artifacts}`). Relative values resolve against the submit directory. Must be visible from both login and compute nodes; node-local overrides are rejected by preflight.
`cleanup`	mapping	omitted	Teardown cleanup policy. `cleanup.runtime_cache` (`never` | `on_success` | `always`, default `never`) controls whether the batch teardown trap removes the per-job enroot runtime cache.
`scratch`	mapping	omitted	Optional scratch path mounted into services and exposed as `HPC_COMPOSE_SCRATCH_DIR`.
`stage_in`	list of mappings	omitted	Copy or rsync host paths, or fetch an `hf://` model/dataset, before services launch.
`stage_out`	list of mappings	omitted	Copy or rsync paths during teardown, optionally by outcome.
`burst_buffer`	mapping	omitted	Raw `#BB` / `#DW` directives for site-specific burst-buffer systems.
`metrics`	mapping	omitted	Enables runtime metrics sampling.
`artifacts`	mapping	omitted	Enables tracked artifact collection and export metadata.
`resume`	mapping	omitted	Enables checkpoint-aware resume semantics with a shared host path mounted into every service.
`notify`	mapping	omitted	First-class Slurm email notification settings.
`setup`	list of strings	omitted	Raw shell lines inserted into the generated batch script before service launches.
`submit_args`	list of strings	omitted	Extra raw Slurm arguments appended as `#SBATCH ...` lines.
`rendezvous`	string, list, or mapping	omitted	Resolve cross-job service records from the shared cache and inject `HPC_COMPOSE_RDZV_*` env vars.
`parallelism`	mapping `{ tensor, pipeline }`	omitted	Descriptive tensor/pipeline geometry. Validation-only: no `#SBATCH`/`srun` flag is emitted. See `x-slurm.parallelism`.
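
As a quick orientation, a typical single-node GPU block built only from fields in the table might look like the sketch below. The account, partition, and paths are placeholder values; the comments name the documented pass-through flag and do not show exact rendered text.

```yaml
x-slurm:
  job_name: my-app          # --job-name
  partition: gpu            # --partition
  account: my-project       # --account
  time: "02:00:00"          # --time
  nodes: 1
  cpus_per_task: 8
  mem: 32G                  # --mem
  gpus: 1                   # used because gres is not set
  output: logs/%j.out       # --output (%j is Slurm's job id pattern)
  cache_dir: /cluster/shared/hpc-compose-cache   # must be visible on login and compute nodes
```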

`x-slurm.parallelism`

parallelism records the tensor (tensor) and pipeline (pipeline) sizes a job intends to use. Both fields are required and must be at least 1. It is purely descriptive: it lowers onto the existing single-srun-per-service placement and emits no extra #SBATCH or srun flags.

When gpus_per_node is set at the same scope, validation cross-checks that tensor * pipeline == nodes * gpus_per_node (where nodes defaults to 1 when omitted). A mismatch fails validate/config with a scoped diagnostic; the check is skipped entirely when gpus_per_node is not set.

x-slurm:
  nodes: 2
  gpus_per_node: 4
  parallelism:
    tensor: 4
    pipeline: 2 # 4 * 2 == 2 * 4
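
For contrast, here is a sketch of a geometry that the cross-check described above should reject, followed by one where the check is skipped. The values are illustrative.

```yaml
# Expected to fail validate/config: tensor * pipeline != nodes * gpus_per_node
x-slurm:
  nodes: 2
  gpus_per_node: 4
  parallelism:
    tensor: 4
    pipeline: 1 # 4 * 1 = 4, but 2 * 4 = 8
---
# Check skipped: gpus_per_node is not set
x-slurm:
  nodes: 2
  parallelism:
    tensor: 4
    pipeline: 1
```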

Resource profiles

Resource profiles are reusable settings defaults, distinct from the settings profiles selected with the global --profile flag. Define them in .hpc-compose/settings.toml:

[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"

Reference one from the spec:

x-slurm:
  resources: gpu-small
  mem: 64G

The profile fills only omitted resource fields. In the example above, partition, time, gpus, and cpus_per_task come from the profile, while the explicit mem: 64G wins. Profiles intentionally exclude behavior such as job_name, cache_dir, arrays, dependencies, submit_args, setup hooks, scratch/staging, artifacts, resume, notify, and metrics.

Allowed profile fields are: partition, account, qos, time, nodes, ntasks, ntasks_per_node, cpus_per_task, mem, gres, gpus, gpus_per_node, gpus_per_task, cpus_per_gpu, mem_per_gpu, gpu_bind, cpu_bind, mem_bind, distribution, hint, and constraint.

`x-slurm.array`

x-slurm:
  array: 0-99%10
  output: logs/%A_%a.out
services:
  worker:
    image: python:3.12-slim
    command: python worker.py

array accepts Slurm list, range, step, and concurrency forms such as 0, 1-10, 1-10:2, 0,3,8-12, and 0-99%10. Values with spaces, null bytes, malformed ranges, negative numbers, zero step, or zero concurrency are rejected.

Array jobs currently require hpc-compose up --detach; live watch/log fan-out for per-task array elements is future work. --local rejects array specs. Slurm provides SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT, SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN, and SLURM_ARRAY_TASK_STEP; for Pyxis jobs, hpc-compose forwards these names into the container when x-slurm.array is set. Prefer output patterns such as %A_%a so task logs do not overwrite each other.
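
A small sketch of a per-task worker that reads its array index at runtime. It relies on string-form commands not being expanded at plan time; worker.py and its --shard flag are hypothetical.

```yaml
x-slurm:
  array: 0-9%2                  # ten tasks, at most two running at once
  output: logs/%A_%a.out        # one log per array task
services:
  worker:
    image: python:3.12-slim
    # $SLURM_ARRAY_TASK_ID is resolved by the shell inside the running task
    command: python worker.py --shard "$SLURM_ARRAY_TASK_ID"
```

Submit it with hpc-compose up --detach, since array jobs currently require detached submission.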

`x-slurm.after_job` and `x-slurm.dependency`

x-slurm:
  after_job:
    id: "12345"
    condition: afterok
  dependency: singleton

after_job: "12345" is shorthand for afterany:12345. Mapping form accepts id plus condition, where condition is afterany, afterok, or afternotok. Job ids must be numeric Slurm ids such as 12345, or array elements such as 12345_7.

dependency: singleton is separate because Slurm’s singleton dependency does not take a job id. When both fields are set, hpc-compose submits one command-line dependency string such as --dependency=afterok:12345,singleton.

Dependencies are passed to sbatch as CLI arguments, not rendered as #SBATCH lines, because dependency job ids are commonly dynamic. --local rejects scheduler dependencies.
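
The accepted forms can be sketched side by side. The job ids are placeholders, and the dependency strings in the comments are what the rules above imply, not captured output.

```yaml
# String shorthand: wait for job 12345 to end in any state
x-slurm:
  after_job: "12345"          # expected: --dependency=afterany:12345
---
# Mapping form on an array element: run only if it failed
x-slurm:
  after_job:
    id: "12345_7"
    condition: afternotok     # expected: --dependency=afternotok:12345_7
---
# singleton alone takes no job id
x-slurm:
  dependency: singleton
```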

`x-slurm.setup`

x-slurm:
  setup:
    - module load enroot
    - source /shared/env.sh

Shape: list of strings
Default: omitted
Notes:
- Each line is emitted verbatim into the generated bash script.
- The script runs under set -euo pipefail.
- Shell quoting and escaping are the user’s responsibility.
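
Because the script runs under set -euo pipefail, setup lines need to tolerate unset variables and non-zero exits explicitly. A minimal sketch, with DATA_ROOT as a hypothetical variable:

```yaml
x-slurm:
  setup:
    - module load enroot
    # set -u: give a possibly unset variable a default before reading it
    - 'export DATA_ROOT="${DATA_ROOT:-/shared/data}"'
    # set -e: a failing command aborts the job, so guard optional checks
    - 'test -d "$DATA_ROOT" || echo "warning: $DATA_ROOT is missing" >&2'
```

Single-quoting each entry in YAML keeps the shell quoting intact when the line is emitted.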

`x-slurm.submit_args`

x-slurm:
  submit_args:
    - "--mail-type=END"
    - "--mail-user=user@example.com"
    - "--reservation=gpu-reservation"

Shape: list of strings
Default: omitted
Notes:
- Each entry is emitted as #SBATCH {arg}.
- Entries are rejected if they contain line breaks or null bytes.
- Entries are not validated against Slurm option syntax.
- First-class fields reject conflicting raw entries for the same option. Use x-slurm.array, x-slurm.after_job, or x-slurm.dependency instead of raw --array or --dependency.

`x-slurm.notify`

x-slurm:
  notify:
    email:
      to: user@example.com
      on: [end, fail]

Field	Shape	Default	Notes
`notify.email`	mapping	omitted	Required when `notify` is present.
`notify.email.to`	string	required	Rendered as `#SBATCH --mail-user`.
`notify.email.on`	list of events	`[end, fail]`	Rendered as `#SBATCH --mail-type`.

Supported events:

Event	Slurm mail type
`start`	`BEGIN`
`end`	`END`
`fail`	`FAIL`
`all`	`ALL`

Rules:

When on is omitted or empty, it defaults to [end, fail].
If all is present, it replaces all other events.
notify cannot be combined with raw --mail-type or --mail-user entries in x-slurm.submit_args.
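
As a sketch of the all rule, the block below lists two events but should collapse to a single mail type. The directive text in the comments follows the mapping table above; its exact rendering is an assumption.

```yaml
x-slurm:
  notify:
    email:
      to: user@example.com
      on: [start, all]   # all replaces the other events
# Expected directives:
#   #SBATCH --mail-user=user@example.com
#   #SBATCH --mail-type=ALL
```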

`x-slurm.cache_dir`

Shape: string
Default precedence: explicit x-slurm.cache_dir, then [profiles.<name>.cache].dir, then [defaults.cache].dir, then $HOME/.cache/hpc-compose.
Notes:
- Relative paths and environment variables are resolved against the compose file directory.
- Settings cache paths are resolved against the settings base directory.
- Paths under /tmp, /var/tmp, /private/tmp, and /dev/shm are accepted by parsing and planning, but preflight reports them as unsafe because they are not valid shared-cache locations for login-node prepare plus compute-node reuse.
- The path must be visible from both the login node and the compute nodes.

Settings example:

[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"

[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"
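
An explicit spec-level value takes precedence over both settings entries. The sketch below also pairs the shared cache with a node-local enroot_temp_dir, using the example path from the x-slurm field table:

```yaml
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache        # shared: login and compute nodes
  enroot_temp_dir: /tmp/${USER}-hpc-compose-enroot    # node-local extraction scratch only
```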

`x-slurm.runtime_root`

Shape: string
Default: <submit_dir>/.hpc-compose, where <submit_dir> is the directory you submit from.
Notes:
- Holds per-job runtime state at <runtime_root>/<job_id>/ (logs/, metrics/, state.json, artifacts/).
- Relative paths resolve against the submit directory; absolute paths are used as-is.
- The resolved path is baked into the rendered JOB_ROOT, so a running job does not depend on $SLURM_SUBMIT_DIR being set or shared-visible.
- Set an override to relocate bulky runtime state (for example, onto a shared scratch project space) while submission metadata stays next to the compose file.
- An override under /tmp, /var/tmp, /private/tmp, or /dev/shm is rejected by preflight because it would not be visible from the compute nodes. The default layout is governed by the submission directory and is not policed here.
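
A minimal override sketch, assuming a shared project scratch path (the path itself is a placeholder). The comments spell out the layout described in the notes above for a job with id 12345.

```yaml
x-slurm:
  runtime_root: /cluster/scratch/my-project/hpc-compose-runs
# Per-job state for job 12345 then lives under:
#   /cluster/scratch/my-project/hpc-compose-runs/12345/logs/
#   /cluster/scratch/my-project/hpc-compose-runs/12345/metrics/
#   /cluster/scratch/my-project/hpc-compose-runs/12345/state.json
#   /cluster/scratch/my-project/hpc-compose-runs/12345/artifacts/
```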

`runtime`

runtime:
  backend: apptainer
  gpu: auto

Field	Shape	Default	Notes
`backend`	`pyxis`, `apptainer`, `singularity`, or `host`	`pyxis`	Selects the runtime used inside Slurm steps.
`gpu`	`auto`, `none`, or `nvidia`	`auto`	For Apptainer/Singularity, controls `--nv`; `auto` enables it when Slurm GPU resources are requested.

Backend notes:

pyxis uses srun --container-* flags and Enroot .sqsh artifacts.
apptainer and singularity build or reuse .sif artifacts and launch them through apptainer exec/run or singularity exec/run inside srun.
host runs commands directly under srun; services must set command or entrypoint, and image prepare blocks, service volumes, and x-slurm.mpi.host_mpi.bind_paths are not allowed because no container bind mount is applied.
x-enroot.prepare is a Pyxis/Enroot compatibility spelling. Prefer x-runtime.prepare for new specs, especially with Apptainer/Singularity.
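
For comparison, a host-backend spec might look like the sketch below: no image, no volumes, and an explicit command, with host software supplied through modules. The module name and train.py are assumptions.

```yaml
runtime:
  backend: host

services:
  app:
    modules:
      - python/3.12          # hypothetical site module
    command: python train.py # required: host services must set command or entrypoint
    environment:
      OMP_NUM_THREADS: "8"
```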

`x-slurm.scratch`, `stage_in`, `stage_out`, and `burst_buffer`

x-slurm:
  scratch:
    scope: shared
    base: /scratch/$USER/jobs
    mount: /scratch
    cleanup: on_success
  stage_in:
    - from: /shared/input
      to: /scratch/input
      mode: rsync
  stage_out:
    - from: /scratch/output
      to: /shared/results/${SLURM_JOB_ID}
      when: always
      mode: copy
  burst_buffer:
    directives:
      - "#BB create_persistent name=data capacity=100G"

scratch.base is a host path. scratch.mount is the container-visible mount point.
scratch.scope is node_local or shared; cluster profiles can warn when a shared scratch path does not look shared.
scratch.cleanup is always, on_success, or never.
stage_in runs before services launch; stage_out runs during teardown.
mode is rsync or copy; rsync falls back to cp -R when rsync is unavailable.
stage_out.when is always, on_success, or on_failure.
${SLURM_JOB_ID} is preserved in scratch and staging paths for runtime expansion.
burst_buffer.directives entries are emitted as raw batch-script directives and must start with #BB or #DW.
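
Outcome-specific staging can be sketched as two stage_out entries with different when values. The paths are placeholders that follow the pattern of the example above.

```yaml
x-slurm:
  scratch:
    scope: node_local
    base: /tmp/$USER/jobs
    mount: /scratch
    cleanup: always
  stage_out:
    - from: /scratch/output
      to: /shared/results/${SLURM_JOB_ID}   # kept literal until runtime
      when: on_success
      mode: rsync                           # falls back to cp -R if rsync is missing
    - from: /scratch/debug
      to: /shared/failed/${SLURM_JOB_ID}
      when: on_failure
      mode: copy
```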

Staging HuggingFace models and datasets (`hf://`)

A stage_in entry can fetch a HuggingFace model or dataset instead of copying a filesystem path. Set the typed hf block (in place of from) together with a to destination:

x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  stage_in:
    - to: /models/llama-3.1-8b
      hf:
        repo: meta-llama/Llama-3.1-8B
        revision: 0e9e39f249a16976918f6564b8830bc894c89659
        kind: model        # or `dataset` for a dataset repo

Each stage_in entry sets exactly one of from (a filesystem path) or hf (a HuggingFace source); setting both or neither is rejected at validation time.
hf.revision must be an immutable pin — a commit SHA or an explicit immutable tag. Floating refs such as main, master, or HEAD are rejected so the rendered job is reproducible.
hf.kind is model (default) or dataset; datasets are fetched with huggingface-cli download --repo-type dataset.
The download runs inside the Slurm allocation on the compute node, never on your laptop or over SSH — this preserves the OTP-per-SSH and laptop-driven contract. The rendered batch script contains a guarded huggingface-cli download <repo> --revision <sha> --local-dir <cache-path> step; it never mounts or passes an hf:// URI to the container runtime.
Artifacts land in a content-addressed directory under x-slurm.cache_dir (<cache_dir>/{models,datasets}/<key>) and are guarded by a completion marker, so a repeated job reuses the staged copy instead of re-downloading. The staged copy is then materialized into the entry’s to path for the service.
For gated repos, export HF_TOKEN in the job environment; it is imported at runtime by huggingface-cli and is never written into the rendered script, the submission record, or the cache manifest. HF_HOME/HF_HUB_CACHE are honored only via ${VAR:-default} guards.
Override the CLI invoked inside the job with --huggingface-cli-bin <PATH> (default huggingface-cli).
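
A sketch mixing a filesystem entry with a dataset fetch, to show that each entry picks exactly one source. The repo name is hypothetical and the revision is a placeholder standing in for a real commit SHA.

```yaml
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  stage_in:
    - from: /shared/input          # filesystem source
      to: /scratch/input
      mode: rsync
    - to: /datasets/my-dataset     # HuggingFace source: hf replaces from
      hf:
        repo: my-org/my-dataset    # hypothetical repo
        revision: 0123456789abcdef0123456789abcdef01234567   # placeholder; pin a real commit SHA
        kind: dataset
```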

Per-service scratch opt-out

When top-level x-slurm.scratch is configured, every service receives the scratch mount by default. To exclude an individual service (for example, a sidecar that should not see job-local scratch), set services.<name>.x-slurm.scratch.enabled: false:

services:
  helper:
    image: busybox
    command: /bin/true
    x-slurm:
      scratch:
        enabled: false

Multi-node placement rules

x-slurm.nodes > 1 reserves a multi-node allocation.
Helper services remain single-node steps and are pinned to the allocation’s primary node.
When a multi-node job has exactly one service, that service defaults to the distributed full-allocation step.
Services may use services.<name>.x-slurm.placement to select explicit allocation node indices.
Overlapping explicit placements are rejected unless one side sets allow_overlap: true or uses share_with.
A service spanning more than one node may use readiness.type: sleep or readiness.type: log. TCP and HTTP readiness are allowed for such a service only with an explicit non-local host or URL.

`x-slurm.metrics`

x-slurm:
  metrics:
    interval_seconds: 5
    collectors: [gpu, slurm]

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables runtime metrics sampling.
- If the block is present and enabled is omitted, metrics sampling is enabled.
- interval_seconds defaults to 5 and must be at least 1.
- collectors defaults to [gpu, slurm].
- Supported collectors:
  - gpu samples device and process telemetry through nvidia-smi
  - slurm samples job-step CPU and memory data through sstat
- In multi-node jobs, gpu sampling launches one best-effort sampler task per allocated node and writes node metadata into GPU rows; legacy rows without node remain readable as primary-node samples.
- Sampler files are written under the active job workspace’s metrics/ directory and are also visible inside containers at /hpc-compose/job/metrics. For ordinary runs that is <runtime-root>/<job-id>/metrics; for resume-aware attempts it is <runtime-root>/<job-id>/attempts/<attempt>/metrics.
- Diagnostics are written under metrics/diagnostics/ when available, including nvidia-smi topo -m, nvidia-smi -q, selected fabric/GPU environment variables, and best-effort ibstat, ibv_devinfo, ucx_info -v, and fi_info output.
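
Two small variations, as a sketch: a CPU-only job that samples less often with a single collector, and a block kept in place but switched off. The enabled: false form is inferred from the note on the enabled key.

```yaml
# Slurm-only sampling every 30 seconds
x-slurm:
  metrics:
    interval_seconds: 30
    collectors: [slurm]   # sstat only; no nvidia-smi sampling
---
# Block present but sampling disabled (inferred usage of enabled)
x-slurm:
  metrics:
    enabled: false
```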

`x-slurm.rendezvous`

Client-side cross-job discovery resolves records from <cache_dir>/rendezvous/<name>/latest.json before launching services:

x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  rendezvous: model-server

The mapping form supports multiple names and a timeout:

x-slurm:
  rendezvous:
    discover:
      - model-server
      - tokenizer
    timeout_seconds: 60
    require: true

Resolved records become generic variables such as HPC_COMPOSE_RDZV_URL and name-scoped variables such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL.
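
A consuming service can read the injected variable at runtime. In this sketch, client.py and its --server flag are hypothetical; the variable name follows the name-scoped pattern above.

```yaml
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  rendezvous: model-server

services:
  client:
    image: python:3.12-slim
    # resolved by the shell in the running service, not at plan time
    command: python client.py --server "$HPC_COMPOSE_RDZV_MODEL_SERVER_URL"
```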

`x-slurm.artifacts`

x-slurm:
  artifacts:
    collect: always
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/metrics/**
    bundles:
      checkpoints:
        paths:
          - /hpc-compose/job/checkpoints/*.pt

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables tracked artifact collection.
- collect defaults to always. Supported values are always, on_success, and on_failure.
- export_dir is required and is resolved relative to the compose file directory when hpc-compose artifacts runs.
- ${SLURM_JOB_ID} is preserved in export_dir until hpc-compose artifacts expands it from tracked metadata.
- paths remains supported as the implicit default bundle.
- bundles is optional. Bundle names must match [A-Za-z0-9_-]+, and default is reserved for top-level paths.
- At least one source path must be present in paths or bundles.
- Every source path must be an absolute container-visible path rooted at /hpc-compose/job.
- Paths under /hpc-compose/job/artifacts are rejected.
- Collection happens during batch teardown and is best-effort.
- Collected payloads and manifest.json are written under the active job workspace’s artifacts/ directory. For ordinary runs that is <runtime-root>/<job-id>/artifacts; for resume-aware attempts it is <runtime-root>/<job-id>/attempts/<attempt>/artifacts.
- hpc-compose artifacts --bundle <name> exports only the selected bundle or bundles.
- hpc-compose artifacts --tarball also writes one <bundle>.tar.gz archive per exported bundle.
- Export writes per-bundle provenance metadata under <export_dir>/_hpc-compose/bundles/<bundle>.json.

`x-slurm.resume`

x-slurm:
  resume:
    path: /shared/$USER/runs/my-run

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables resume semantics.
- path is required and must be an absolute host path.
- /hpc-compose/... paths are rejected because path must point at shared host storage, not a container-visible path.
- /tmp and /var/tmp technically validate, but preflight warns because those paths are not reliable resume storage.
- When enabled, hpc-compose mounts path into every service at /hpc-compose/resume.
- Services also receive HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME.
- The canonical resume source is the shared path, not exported artifact bundles.
- Attempt-specific runtime state moves under <runtime-root>/<job-id>/attempts/<attempt>/, and the top-level logs, metrics, artifacts, and state.json paths continue to point at the latest attempt for compatibility.
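
A sketch of a service that writes checkpoints through the injected variables. train.py and its --checkpoint-dir flag are assumptions; the script only echoes the resume variables because their exact value format is not specified here.

```yaml
x-slurm:
  resume:
    path: /shared/$USER/runs/my-run

services:
  trainer:
    image: python:3.12-slim
    script: |
      echo "attempt=$HPC_COMPOSE_ATTEMPT is_resume=$HPC_COMPOSE_IS_RESUME"
      # HPC_COMPOSE_RESUME_DIR points at the shared path mounted at /hpc-compose/resume
      python train.py --checkpoint-dir "$HPC_COMPOSE_RESUME_DIR"
```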

Tracked-record provenance

Every tracked submission record auto-pins best-effort provenance so a run self-describes what produced it: the hpc-compose tool version, the git state of the working tree (HEAD SHA, dirty flag, and branch — read locally and static-safe, null outside a git repository or when git is unavailable, and never fabricated), and the per-service image reference as launched. This is tracked-record metadata, not a compose field, so there is no YAML key to set. hpc-compose diff surfaces provenance deltas in a dedicated provenance section.

Allocation metadata inside services

Every service receives:

HPC_COMPOSE_JOB_DIR
HPC_COMPOSE_PRIMARY_NODE
HPC_COMPOSE_NODE_COUNT
HPC_COMPOSE_NODELIST
HPC_COMPOSE_NODELIST_FILE
HPC_COMPOSE_SERVICE_PRIMARY_NODE
HPC_COMPOSE_SERVICE_NODE_COUNT
HPC_COMPOSE_SERVICE_NODELIST
HPC_COMPOSE_SERVICE_NODELIST_FILE

HPC_COMPOSE_JOB_DIR is the per-job scratch directory and the portable way to write working files. Under the container backends it resolves to /hpc-compose/job, where the job directory is bind-mounted; under the host backend it resolves to the real on-node job path, because nothing is mounted at /hpc-compose/job. Writing under $HPC_COMPOSE_JOB_DIR keeps a spec working unchanged across backends and lands files where artifact collection looks: artifacts.paths declared as /hpc-compose/job/** are collected from the same location. Do not hard-code /hpc-compose/job in host service commands: that path is not mounted there and writing to it requires root.
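
A backend-portable sketch: the service writes under $HPC_COMPOSE_JOB_DIR while the artifact path is declared under /hpc-compose/job. The out directory, train.py, and its --out flag are assumptions.

```yaml
x-slurm:
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/out/**    # declared container-style for every backend

services:
  app:
    image: python:3.12-slim
    script: |
      # works on container and host backends alike
      mkdir -p "$HPC_COMPOSE_JOB_DIR/out"
      python train.py --out "$HPC_COMPOSE_JOB_DIR/out"
```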

The allocation-wide data is also written under /hpc-compose/job/allocation/primary_node and /hpc-compose/job/allocation/nodes.txt. Service-scoped node lists are written under /hpc-compose/job/allocation/service-nodelists/.

Multi-node services also receive distributed launch helpers:

HPC_COMPOSE_DIST_MASTER_ADDR
HPC_COMPOSE_DIST_MASTER_PORT
HPC_COMPOSE_DIST_RDZV_ENDPOINT
HPC_COMPOSE_DIST_NNODES
HPC_COMPOSE_DIST_NODE_RANK
HPC_COMPOSE_DIST_LOCAL_RANK
HPC_COMPOSE_DIST_GLOBAL_RANK
HPC_COMPOSE_DIST_NPROC_PER_NODE
HPC_COMPOSE_DIST_WORLD_SIZE
HPC_COMPOSE_DIST_HOSTFILE

HPC_COMPOSE_DIST_NPROC_PER_NODE is derived from a service environment override, GPU requests, ntasks_per_node, then 1. The distributed hostfile is written under /hpc-compose/job/allocation/distributed-hostfiles/. When a discovered .hpc-compose/cluster.toml contains [distributed.env], those profile variables are injected only for multi-node services; explicit service environment values win on name conflicts and are still the durable config source.
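
As a sketch, these helpers map onto torchrun's standard launch flags. This assumes the service starts one launcher process per node and that the image provides torchrun; train.py is hypothetical. If your site launches one task per GPU instead, use the rank variables directly and skip torchrun.

```yaml
x-slurm:
  nodes: 2
  gpus_per_node: 4

services:
  trainer:
    image: pytorch/pytorch
    # folded scalar: YAML joins these lines into one single-line shell command
    command: >-
      torchrun
      --nnodes "$HPC_COMPOSE_DIST_NNODES"
      --nproc_per_node "$HPC_COMPOSE_DIST_NPROC_PER_NODE"
      --node_rank "$HPC_COMPOSE_DIST_NODE_RANK"
      --master_addr "$HPC_COMPOSE_DIST_MASTER_ADDR"
      --master_port "$HPC_COMPOSE_DIST_MASTER_PORT"
      train.py
```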

Services that configure services.<name>.x-slurm.mpi also receive:

HPC_COMPOSE_MPI_TYPE
HPC_COMPOSE_MPI_PROFILE when x-slurm.mpi.profile is set
HPC_COMPOSE_MPI_IMPLEMENTATION when x-slurm.mpi.implementation is set or implied by x-slurm.mpi.profile
HPC_COMPOSE_MPI_HOSTFILE

The MPI hostfile is written under /hpc-compose/job/allocation/mpi-hostfiles/ and contains the service’s effective node list. When ntasks_per_node is known, each host line includes slots=<ntasks_per_node>. For a single-node service with ntasks but no ntasks_per_node, the hostfile uses slots=<ntasks>. Otherwise it emits one node per line without slots.

MPI services also forward common PMI, PMIx, and Slurm rank variables into the container through Pyxis --container-env, including PMI_RANK, PMI_SIZE, PMIX_RANK, PMIX_NAMESPACE, SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS, and SLURM_TASKS_PER_NODE.

Services that configure services.<name>.x-slurm.parallelism also receive:

HPC_COMPOSE_TP_SIZE (the declared tensor value)
HPC_COMPOSE_PP_SIZE (the declared pipeline value)

These are descriptive literal exports. They are emitted for every service that declares parallelism, including single-node services, and are per-service only: a top-level x-slurm.parallelism block is validated and shown in config --effective but does not by itself export env into services.
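
A per-service declaration, sketched with illustrative values. Per the rule above, only this service-level form exports the two variables; train.py is hypothetical.

```yaml
services:
  trainer:
    image: python:3.12-slim
    command: python train.py
    x-slurm:
      parallelism:
        tensor: 4      # exported as HPC_COMPOSE_TP_SIZE=4
        pipeline: 2    # exported as HPC_COMPOSE_PP_SIZE=2
```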

`gres` and `gpus`

When both gres and gpus are set at the same level, gres takes priority and gpus is ignored.

Service fields

Field	Shape	Default	Notes
`extends`	string or mapping	omitted	Authoring-only service template reference. See `extends`.
`image`	string	required unless `runtime.backend: host`	Can be a remote image reference, a local `.sqsh` / `.squashfs` path for Pyxis, or a local `.sif` path for Apptainer/Singularity.
`command`	string or list of strings	omitted	Shell form or exec form.
`entrypoint`	string or list of strings	omitted	Must use the same form as `command` when both are present.
`script`	string	omitted	Multi-line shell script sugar for `command: ["/bin/sh", "-lc", script]`; mutually exclusive with `command` and `entrypoint`.
`environment`	mapping or list of `KEY=VALUE` strings	omitted	Both forms normalize to key/value pairs.
`modules`	list of strings	omitted	List-only shorthand for service `x-env.modules.load`; cannot be combined with service `x-env.modules`.
`volumes`	list of `host_path:container_path` strings	omitted	Runtime bind mounts. Host paths resolve against the compose file directory.
`working_dir`	string	omitted	Valid only when the service also has an explicit `command` or `entrypoint`.
`depends_on`	list or mapping	omitted	Dependency list with `service_started` or `service_healthy` conditions.
`readiness`	mapping	omitted	Post-launch readiness gate.
`healthcheck`	mapping	omitted	Compose-compatible sugar for a subset of `readiness`. Mutually exclusive with `readiness`.
`assert`	mapping	omitted	Post-run service contract checked during batch cleanup and surfaced in `status`.
`x-env`	mapping	omitted	Structured host-side module, Spack view, and environment setup for this service.
`x-slurm`	mapping	omitted	Per-service Slurm overrides.
`x-runtime`	mapping	omitted	Backend-neutral image preparation rules.
`x-enroot`	mapping	omitted	Pyxis/Enroot preparation compatibility alias.

Image rules

Remote images

Any image reference without an explicit :// scheme is prefixed with docker://.
Explicit schemes are allowed only for docker://, dockerd://, and podman://.
Other schemes are rejected.
Shell variables in the image string are expanded at plan time.
Unset variables expand to empty strings.
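
As a rough illustration of the scheme rules above, the following Python sketch normalizes a remote image reference. The function name normalize_remote_image is hypothetical and not part of hpc-compose; the sketch ignores local image paths and shell-variable expansion.

```python
# Hypothetical helper: mirrors the documented scheme rules only.
ALLOWED_SCHEMES = ("docker://", "dockerd://", "podman://")


def normalize_remote_image(ref: str) -> str:
    """Return the reference with an explicit scheme, or raise ValueError."""
    if "://" not in ref:
        # No explicit scheme: the reference is prefixed with docker://.
        return "docker://" + ref
    if ref.startswith(ALLOWED_SCHEMES):
        return ref
    scheme = ref.split("://", 1)[0]
    raise ValueError(f"unsupported image scheme: {scheme}://")


print(normalize_remote_image("python:3.12-slim"))            # -> docker://python:3.12-slim
print(normalize_remote_image("podman://localhost/app:dev"))  # -> podman://localhost/app:dev
```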

Local images

Pyxis local image paths must point to .sqsh or .squashfs files.
Apptainer/Singularity local image paths must point to .sif files.
Relative paths are resolved against the compose file directory.
Paths that look like build contexts are rejected.

`command`, `entrypoint`, and `script`

command and entrypoint each accept either:

a string, interpreted as shell form
a list of strings, interpreted as exec form

Rules:

If both fields are present, they must use the same form.
Mixed string/array combinations are rejected.
If neither field is present, the image default entrypoint and command are used.
If working_dir is set, at least one of command or entrypoint must also be set.
A multi-line string-form command is automatically normalized to ["/bin/sh", "-lc", command] so YAML block scalars run as one shell script.
Single-line string-form command remains shell form.
script is a convenience field for multi-line shell snippets and normalizes to command: ["/bin/sh", "-lc", script].
script cannot be combined with command or entrypoint.
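
The rules above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the hpc-compose implementation: normalize_launch is a hypothetical name, and the sketch treats a string as multi-line when it still contains a newline after trimming surrounding whitespace.

```python
def normalize_launch(service: dict) -> dict:
    """Apply the command/entrypoint/script rules to one service mapping."""
    command = service.get("command")
    entrypoint = service.get("entrypoint")
    script = service.get("script")

    if script is not None:
        if command is not None or entrypoint is not None:
            raise ValueError("script cannot be combined with command or entrypoint")
        # script is sugar for an exec-form shell invocation.
        return {"command": ["/bin/sh", "-lc", script]}

    if command is not None and entrypoint is not None:
        # Both must be strings (shell form) or both lists (exec form).
        if isinstance(command, str) != isinstance(entrypoint, str):
            raise ValueError("command and entrypoint must use the same form")

    if service.get("working_dir") and command is None and entrypoint is None:
        raise ValueError("working_dir requires command or entrypoint")

    result = {}
    if entrypoint is not None:
        result["entrypoint"] = entrypoint
    if command is not None:
        # Assumption: "multi-line" means a newline remains after trimming.
        if isinstance(command, str) and "\n" in command.strip():
            command = ["/bin/sh", "-lc", command]
        result["command"] = command
    # An empty result means the image defaults are used.
    return result
```

A single-line string such as "python train.py" passes through unchanged, while a YAML block scalar with several lines becomes one exec-form shell invocation.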

`environment`

Accepted forms:

environment:
  APP_ENV: prod
  LOG_LEVEL: info

environment:
  - APP_ENV=prod
  - LOG_LEVEL=info

Rules:

List items must use KEY=VALUE syntax.
.env from the compose file directory is loaded automatically when present.
Shell environment variables override .env; .env fills only missing variables.
environment, x-runtime.prepare.env, and compatibility x-enroot.prepare.env values support $VAR, ${VAR}, ${VAR:-default}, and ${VAR-default} interpolation.
Missing variables without defaults are errors.
Use $$ for a literal dollar sign in interpolated fields.
String-form shell snippets are still literal. For example, $PATH inside a string-form command is not expanded at plan time.
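
A minimal Python sketch of these interpolation rules follows. The interpolate function is hypothetical. The distinction between ${VAR:-default} (default used when the variable is unset or empty) and ${VAR-default} (default used only when unset) follows the POSIX shell convention and is an assumption here, since the rules above do not spell it out.

```python
import re

_PATTERN = re.compile(
    r"\$(?:"
    r"(?P<escaped>\$)"                          # $$ -> literal dollar sign
    r"|(?P<named>[A-Za-z_][A-Za-z0-9_]*)"       # $VAR
    r"|\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)"    # ${VAR ...
    r"(?:(?P<op>:?-)(?P<default>[^}]*))?\}"     # ... :-default} or -default}
    r")"
)


def interpolate(value: str, shell_env: dict, dotenv: dict) -> str:
    """Expand variables in value; shell_env wins over dotenv."""
    # Shell variables override .env; .env only fills missing names.
    env = {**dotenv, **shell_env}

    def replace(match: re.Match) -> str:
        if match.group("escaped"):
            return "$"
        name = match.group("named") or match.group("braced")
        op = match.group("op")
        current = env.get(name)
        if op == ":-" and not current:
            return match.group("default")   # unset or empty (assumed)
        if op == "-" and current is None:
            return match.group("default")   # unset only (assumed)
        if current is None:
            raise KeyError(f"missing variable without default: {name}")
        return current

    return _PATTERN.sub(replace, value)


print(interpolate("${APP_ENV:-dev} costs $$5", {}, {}))  # -> dev costs $5
```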

`volumes`

Accepted form:

volumes:
  - ./app:/workspace
  - /shared/data:/data
  - /shared/reference:/reference:ro

Rules:

Host paths are resolved against the compose file directory.
Runtime mounts accept host_path:container_path and host_path:container_path:ro|rw.
Pyxis mounts are passed through srun --container-mounts=...; Apptainer/Singularity mounts are passed as --bind.
Every containerized service also gets an automatic shared mount at /hpc-compose/job, backed by the active job workspace on the host. For ordinary runs that is <runtime-root>/<job-id>; for resume-aware attempts it is <runtime-root>/<job-id>/attempts/<attempt>.
/hpc-compose/job is reserved and cannot be used as an explicit volume destination.

Warning

If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.

`depends_on`

Accepted forms:

depends_on:
  - redis

depends_on:
  redis:
    condition: service_started

depends_on:
  redis:
    condition: service_healthy

Rules:

List form means condition: service_started.
Map form accepts condition: service_started, condition: service_healthy, and condition: service_completed_successfully.
service_healthy requires the dependency service to define readiness.
service_started waits only until the dependency process has been launched and is still alive.
service_healthy waits for the dependency readiness check to succeed.
service_completed_successfully waits for the dependency to exit with status 0 before launching the dependent service, which is useful for one-shot DAG stages such as preprocess -> train -> postprocess.
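
To make the two accepted forms and the resulting ordering concrete, here is a small Python sketch. It is illustrative only: normalize_depends_on and launch_order are hypothetical names, the map form is assumed to always carry an explicit condition, and the real planner may order independent services differently.

```python
from graphlib import TopologicalSorter

VALID_CONDITIONS = {
    "service_started",
    "service_healthy",
    "service_completed_successfully",
}


def normalize_depends_on(depends_on) -> dict:
    """Return {dependency_name: condition} for either accepted form."""
    if depends_on is None:
        return {}
    if isinstance(depends_on, list):
        # List form means condition: service_started.
        return {name: "service_started" for name in depends_on}
    normalized = {}
    for name, spec in depends_on.items():
        condition = spec["condition"]  # sketch assumes an explicit condition
        if condition not in VALID_CONDITIONS:
            raise ValueError(f"unknown condition for {name}: {condition}")
        normalized[name] = condition
    return normalized


def launch_order(services: dict) -> list:
    """Order services so every dependency precedes its dependents."""
    graph = {}
    for name, spec in services.items():
        deps = normalize_depends_on(spec.get("depends_on"))
        for dep, condition in deps.items():
            target = services.get(dep, {})
            # service_healthy needs a readiness gate on the dependency.
            if condition == "service_healthy" and not (
                "readiness" in target or "healthcheck" in target
            ):
                raise ValueError(f"{dep} must define readiness for service_healthy")
        graph[name] = set(deps)
    return list(TopologicalSorter(graph).static_order())
```

For a one-shot chain where train depends on preprocess and postprocess depends on train, both with service_completed_successfully, launch_order returns preprocess, train, postprocess.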

`readiness`

Supported types:

Sleep

readiness:
  type: sleep
  seconds: 5

seconds is required.

TCP

readiness:
  type: tcp
  host: 127.0.0.1
  port: 6379
  timeout_seconds: 30

host defaults to 127.0.0.1.
timeout_seconds defaults to 60.

Log

readiness:
  type: log
  pattern: "Server started"
  timeout_seconds: 60

timeout_seconds defaults to 60.

HTTP

readiness:
  type: http
  url: http://127.0.0.1:8080/health
  status_code: 200
  timeout_seconds: 30

status_code defaults to 200.
timeout_seconds defaults to 60.
The readiness check polls the URL through curl.

`healthcheck`

healthcheck is accepted as migration sugar and is normalized into the readiness model.

services:
  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "nc", "-z", "127.0.0.1", "6379"]
      timeout: 30s

Rules:

healthcheck and readiness are mutually exclusive.
Supported probe forms are a constrained subset:
- ["CMD", "nc", "-z", HOST, PORT]
- ["CMD-SHELL", "nc -z HOST PORT"]
- recognized curl probes against http:// or https:// URLs
- recognized wget --spider probes against http:// or https:// URLs
timeout maps to timeout_seconds.
disable: true disables readiness for that service.
interval, retries, and start_period are recognized but rejected.
HTTP-style healthchecks normalize to readiness.type: http with status_code: 200.
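
The following Python sketch shows how the nc-style probe forms might map onto the readiness model. Several points are assumptions rather than documented behavior: the function name healthcheck_to_readiness is hypothetical, the nc probe is assumed to become readiness.type: tcp, and only plain durations such as 30s are handled.

```python
def healthcheck_to_readiness(healthcheck: dict):
    """Map an nc-style healthcheck onto a readiness mapping (sketch)."""
    if healthcheck.get("disable") is True:
        return None  # disable: true turns readiness off for the service
    for key in ("interval", "retries", "start_period"):
        if key in healthcheck:
            raise ValueError(f"healthcheck.{key} is not supported")

    test = healthcheck["test"]
    if test[0] == "CMD":
        words = list(test[1:])
    elif test[0] == "CMD-SHELL":
        words = test[1].split()
    else:
        raise ValueError("unsupported healthcheck test form")
    if len(words) != 4 or words[:2] != ["nc", "-z"]:
        raise ValueError("only the nc -z HOST PORT probe is sketched here")

    # Assumption: an nc -z probe corresponds to a TCP readiness check.
    readiness = {"type": "tcp", "host": words[2], "port": int(words[3])}
    timeout = healthcheck.get("timeout")
    if timeout is not None:
        # Assumption: only "<N>s" durations are handled in this sketch.
        readiness["timeout_seconds"] = int(str(timeout).removesuffix("s"))
    return readiness
```

Under those assumptions, the redis example above would normalize to a tcp readiness check against 127.0.0.1:6379 with timeout_seconds: 30.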

`assert`

assert defines post-run contracts for a service. Checks run in the rendered script’s cleanup() after services are reaped and before artifact collection or stage-out. Any failed assertion marks the job failed, even when the service uses x-slurm.failure_policy.mode: ignore.

services:
  train:
    image: trainer:latest
    command: python train.py
    assert:
      exit_code: 0
      artifacts_contain: "model/*.pt"
      max_duration_seconds: 7200

Field	Shape	Notes
`exit_code`	integer `0..255`	Expected final service exit code.
`artifacts_contain`	string	Glob that must match at least one path. Relative patterns resolve under `/hpc-compose/job`; absolute patterns must stay under `/hpc-compose/job`.
`max_duration_seconds`	positive integer	Maximum wall-clock seconds from first service launch to terminal service exit, including restart time.

At least one assertion field is required. Assertion results are written into runtime state.json; hpc-compose status --format json includes them under each service’s assertions object.

Service-level `x-slurm`

These fields live under services.<name>.x-slurm.

Field	Shape	Default	Notes
`nodes`	positive integer	omitted	Legacy shorthand: `1` for a helper step, or the full top-level allocation node count for a full-allocation distributed service. Partial multi-node counts require `placement.node_count`.
`placement`	mapping	omitted	Explicit node-index placement inside the allocation.
`ntasks`	positive integer	omitted	Adds `--ntasks` to that service’s `srun`.
`ntasks_per_node`	positive integer	omitted	Adds `--ntasks-per-node` to that service’s `srun`.
`cpus_per_task`	positive integer	omitted	Adds `--cpus-per-task` to that service’s `srun`.
`gpus`	positive integer	omitted	Adds `--gpus` when `gres` is not set.
`gres`	string	omitted	Adds `--gres` to that service’s `srun`. Takes priority over `gpus`.
`gpus_per_node`	positive integer	omitted	Adds `--gpus-per-node` to that service’s `srun`.
`gpus_per_task`	positive integer	omitted	Adds `--gpus-per-task` to that service’s `srun`.
`cpus_per_gpu`	positive integer	omitted	Adds `--cpus-per-gpu` to that service’s `srun`.
`mem_per_gpu`	string	omitted	Adds `--mem-per-gpu` to that service’s `srun`.
`gpu_bind`	string	omitted	Adds `--gpu-bind` to that service’s `srun`.
`cpu_bind`	string	omitted	Adds `--cpu-bind` to that service’s `srun`.
`mem_bind`	string	omitted	Adds `--mem-bind` to that service’s `srun`.
`distribution`	string	omitted	Adds `--distribution` to that service’s `srun`.
`hint`	string	omitted	Adds `--hint` to that service’s `srun`.
`time_limit`	string	omitted	Advisory per-service time limit. Validated against Slurm time formats but not passed to `srun`. `inspect` surfaces warnings when the limit exceeds allocation time or conflicts with dependencies. Accepted formats: `MM`, `MM:SS`, `HH:MM:SS`, `D-HH`, `D-HH:MM`, `D-HH:MM:SS`.
`extra_srun_args`	list of strings	omitted	Appended directly to the service’s `srun` command.
`mpi`	mapping	omitted	Adds first-class MPI launch metadata and `srun --mpi=<type>`.
`failure_policy`	mapping	omitted	Per-service failure handling (`fail_job`, `ignore`, `restart_on_failure`).
`prologue`	string or mapping	omitted	Per-service shell hook run before each launch attempt. String shorthand runs on the host.
`epilogue`	string or mapping	omitted	Per-service shell hook run after each service exit attempt. String shorthand runs on the host.
`hooks`	list of mappings	omitted	Host-side event hooks for failure-policy transitions such as accepted restarts and crash-loop window exhaustion.
`scratch`	mapping	omitted	Per-service scratch opt-out. Set `enabled: false` to exclude a service from the shared scratch mount when top-level `x-slurm.scratch` is configured.
`rendezvous`	mapping	omitted	Provider registration config for cross-job service discovery.
`parallelism`	mapping `{ tensor, pipeline }`	omitted	Descriptive per-service tensor/pipeline geometry. Validation-only and cross-checked against this service’s `gpus_per_node`. See `x-slurm.parallelism`.
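
The time_limit formats listed in the table follow Slurm's time syntax. As a rough sketch of how such a value could be converted to seconds for comparison against the allocation time, consider the following Python function. The name slurm_time_to_seconds is hypothetical and this is not the validator hpc-compose uses.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, or D-HH:MM:SS to seconds."""
    days_text, dash, clock = value.rpartition("-")
    fields = [int(part) for part in clock.split(":")]  # ValueError on junk

    if dash:
        # D-HH, D-HH:MM, D-HH:MM:SS: fields after the dash start at hours.
        if len(fields) > 3:
            raise ValueError(f"too many fields in {value!r}")
        days = int(days_text)
        hours, minutes, seconds = fields + [0] * (3 - len(fields))
    elif len(fields) <= 2:
        # MM or MM:SS: fields start at minutes.
        days, hours = 0, 0
        minutes, seconds = fields + [0] * (2 - len(fields))
    elif len(fields) == 3:
        # HH:MM:SS
        days = 0
        hours, minutes, seconds = fields
    else:
        raise ValueError(f"too many fields in {value!r}")

    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(slurm_time_to_seconds("30"))    # -> 1800
print(slurm_time_to_seconds("1-12"))  # -> 129600
```

Note that a bare number means minutes, not seconds, which is an easy detail to get wrong when comparing a service limit against the allocation limit.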

`services.<name>.x-slurm.rendezvous`

Provider-side registration atomically writes a record to the shared cache. When readiness is configured, the record is written only after readiness succeeds:

services:
  model:
    image: python:3.12-slim
    command: python -m http.server 8000
    readiness:
      type: tcp
      port: 8000
    x-slurm:
      rendezvous:
        register:
          name: model-server
          port: 8000
          protocol: http
          path: /
          ttl_seconds: 3600

Names are single safe path components using ASCII letters, digits, ., _, and -. Rendezvous is same-cluster shared-storage coordination only; it does not provide DNS, tunneling, or authentication.

`services.<name>.x-slurm.prologue` / `epilogue`

services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      prologue: |
        module load cuda/12.1
        nvidia-smi
      epilogue:
        context: container
        script: |
          tar czf /shared/logs-${SLURM_JOB_ID}.tar.gz /hpc-compose/job/logs

Shape: either a block string, or a mapping with script and optional context.
context: host (default) or container.
Hook scripts are emitted as trusted shell and are not Compose-interpolated, so runtime variables such as ${SLURM_JOB_ID} are preserved.
Hooks run once per service launch attempt, including restart_on_failure retries.
Host hooks run in the generated batch supervisor on the allocation’s primary execution context. Container hooks wrap the service command inside the container and can use /hpc-compose/job.
Hook stdout/stderr is written to the service log.
Container hooks require an explicit command or entrypoint; image-default services cannot be wrapped.

`services.<name>.x-slurm.hooks`

services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      failure_policy:
        mode: restart_on_failure
      hooks:
        - on: restart
          context: host
          script: |
            echo "Service $HPC_COMPOSE_SERVICE_NAME restarted (attempt $HPC_COMPOSE_ATTEMPT)" >> /shared/restart.log
        - on: window_exhausted
          script: |
            curl -X POST "$WEBHOOK_URL" -d '{"alert": "crash loop detected"}'

Shape: list of mappings with on, script, and optional context.
on: restart or window_exhausted.
context: host only. Omitted context defaults to host; container is rejected for event hooks.
restart runs after a non-zero exit has passed the lifetime and rolling-window guards, after restart counters are recorded, and before backoff/relaunch.
window_exhausted runs only when the rolling-window guard blocks another restart. It does not run for lifetime max_restarts exhaustion.
Event hooks are best-effort observability hooks. A non-zero hook exit is logged to the service log and does not change the restart or failure-policy outcome.
Event hook scripts are emitted as trusted shell and are not Compose-interpolated.
Event hooks receive HPC_COMPOSE_HOOK_PHASE, HPC_COMPOSE_SERVICE_NAME, HPC_COMPOSE_SERVICE_LOG, HPC_COMPOSE_SERVICE_EXIT_CODE, HPC_COMPOSE_ATTEMPT, HPC_COMPOSE_RESTART_COUNT, HPC_COMPOSE_MAX_RESTARTS, HPC_COMPOSE_WINDOW_SECONDS, HPC_COMPOSE_MAX_RESTARTS_IN_WINDOW, and HPC_COMPOSE_RESTART_FAILURES_IN_WINDOW.

`services.<name>.x-slurm.placement`

services:
  a:
    image: app:a
    x-slurm:
      placement: { node_range: "0-3" }
  b:
    image: app:b
    x-slurm:
      placement: { node_range: "4-7" }
  ps:
    image: app:b
    x-slurm:
      placement: { share_with: b }

Exactly one selector is required:

Field	Shape	Notes
`node_range`	string	Zero-based inclusive allocation indices, for example `"0-3"` or `"0-3,6"`.
`node_count`	integer	Selects this many eligible nodes starting at `start_index`, default `0`.
`node_percent`	integer `1..100`	Selects `ceil(percent * eligible_nodes / 100)`, minimum one node.
`share_with`	string	Reuses another service’s resolved node set for explicit co-location.

Optional fields:

start_index: applies to node_count and node_percent.
exclude: zero-based allocation indices removed from the eligible set and passed to srun --exclude.
allow_overlap: permits intentional overlap with another explicit placement.

Node indices are resolved against the Slurm allocation order from scontrol show hostnames "$SLURM_JOB_NODELIST". At runtime, containers receive both allocation-wide metadata (HPC_COMPOSE_NODELIST) and service-scoped metadata (HPC_COMPOSE_SERVICE_NODELIST, HPC_COMPOSE_SERVICE_NODELIST_FILE, HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT).
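
Two of the selectors involve small computations that are easy to check by hand. The Python sketch below expands a node_range string and evaluates the node_percent formula. Both function names are hypothetical, and the sketch does not model start_index, exclude, or overlap checks.

```python
def parse_node_range(spec: str) -> list:
    """Expand "0-3,6" into sorted zero-based indices [0, 1, 2, 3, 6]."""
    indices = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {part}")
            indices.update(range(first, last + 1))  # ranges are inclusive
        else:
            indices.add(int(part))
    return sorted(indices)


def node_percent_count(percent: int, eligible_nodes: int) -> int:
    """ceil(percent * eligible_nodes / 100), never fewer than one node."""
    if not 1 <= percent <= 100:
        raise ValueError("node_percent must be in 1..100")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-percent * eligible_nodes // 100))


print(parse_node_range("0-3,6"))   # -> [0, 1, 2, 3, 6]
print(node_percent_count(30, 8))   # -> 3
```

With 8 eligible nodes, node_percent: 30 rounds 2.4 up to 3 nodes, and node_percent: 1 still selects one node because of the minimum.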

`services.<name>.x-slurm.mpi`

services:
  trainer:
    image: mpi-image:latest
    command: /usr/local/bin/train
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      mpi:
        type: pmix_v4
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 8
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_DIR: /opt/site/openmpi

Shape: mapping
Default: omitted
type is an exact srun --mpi=<type> plugin token. Common values include pmix, pmix_v4, pmi2, pmi1, and openmpi; use srun --mpi=list or hpc-compose doctor cluster-report on the target cluster to discover site-specific values.
Notes:
- Rendered as --mpi=<type> on the service’s srun command.
- profile is optional compatibility metadata used for validation, cluster-profile diagnostics, and doctor mpi-smoke output. Supported values are openmpi, mpich, and intel_mpi.
- profile does not auto-select or rewrite type; use the exact token that your cluster reports through srun --mpi=list.
- launcher defaults to srun; other launchers are rejected.
- implementation is optional metadata for diagnostics. Supported values are openmpi, mpich, intel_mpi, mvapich2, cray_mpi, hpe_mpi, and unknown.
- When both profile and implementation are set, they must describe the same MPI family.
- expected_ranks, when set, must match the resolved Slurm task geometry.
- host_mpi.bind_paths uses host_path:container_path[:ro|rw] syntax, is validated like service volumes, and is automatically mounted into the service.
- host_mpi.env is injected into the service environment after normal service environment entries.
- Cannot be combined with raw --mpi... entries in extra_srun_args.
- MPI services receive HPC_COMPOSE_MPI_TYPE and HPC_COMPOSE_MPI_HOSTFILE.
- MPI services also receive HPC_COMPOSE_MPI_PROFILE when profile is set and HPC_COMPOSE_MPI_IMPLEMENTATION when implementation is set or implied by profile.
- hpc-compose doctor mpi-smoke -f compose.yaml --service trainer renders a smoke probe for the service; add --submit to run it through Slurm. hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto extends the same pattern with NCCL, UCX, OFI, and InfiniBand diagnostics when available. Smoke plans keep allocation and MPI launch settings, but strip application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.

Profile-specific compatibility checks are intentionally conservative:

profile: openmpi expects a PMIx-capable type such as pmix or pmix_v*, with pmi2 accepted as a fallback.
profile: mpich expects pmi2 or a PMIx-capable setup.
profile: intel_mpi expects pmi2; preflight and doctor warn when no I_MPI_PMI_LIBRARY or cluster-profile PMI2 library is visible.
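
As an informal summary of the compatibility rules above, the following Python sketch checks a type token against a profile and verifies expected_ranks for the simple nodes times ntasks_per_node case. The function names are hypothetical, and the rank check assumes the task geometry is fully described by those two fields, which will not hold for every spec.

```python
def mpi_type_matches_profile(profile: str, mpi_type: str) -> bool:
    """Return True when the srun --mpi token fits the declared profile."""
    is_pmix = mpi_type == "pmix" or mpi_type.startswith("pmix_v")
    if profile == "openmpi":
        return is_pmix or mpi_type == "pmi2"  # pmi2 accepted as a fallback
    if profile == "mpich":
        return mpi_type == "pmi2" or is_pmix
    if profile == "intel_mpi":
        return mpi_type == "pmi2"
    raise ValueError(f"unsupported profile: {profile}")


def expected_ranks_ok(nodes: int, ntasks_per_node: int, expected_ranks: int) -> bool:
    """Assumption: total ranks equal nodes * ntasks_per_node."""
    return nodes * ntasks_per_node == expected_ranks


print(mpi_type_matches_profile("openmpi", "pmix_v4"))  # -> True
print(expected_ranks_ok(2, 4, 8))                      # -> True
```

The trainer example above passes both checks: pmix_v4 is PMIx-capable, and 2 nodes with 4 tasks per node give 8 ranks.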

`services.<name>.x-slurm.failure_policy`

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3

Field	Shape	Default	Notes
`mode`	`fail_job` \| `ignore` \| `restart_on_failure`	`fail_job`	`fail_job` keeps fail-fast behavior. `ignore` keeps the job running after non-zero exits. `restart_on_failure` restarts on non-zero exits only.
`max_restarts`	integer	`3` when `mode=restart_on_failure`	Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`backoff_seconds`	integer	`5` when `mode=restart_on_failure`	Fixed delay between restart attempts. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`window_seconds`	integer	`60` when `mode=restart_on_failure`	Rolling window for counting restart-triggering exits. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`max_restarts_in_window`	integer	resolved `max_restarts` when `mode=restart_on_failure`	Maximum restart-triggering exits allowed within `window_seconds`. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.

Rules:

In a multi-node allocation, implicit helper services are pinned to HPC_COMPOSE_PRIMARY_NODE.
Explicit service placements may not overlap unless one side sets placement.allow_overlap: true or uses placement.share_with.
max_restarts, backoff_seconds, window_seconds, and max_restarts_in_window are rejected unless mode: restart_on_failure.
Restart attempts count relaunches after the initial launch.
Restarts trigger only for non-zero exits.
restart_on_failure enforces both a lifetime cap (max_restarts) and a rolling-window cap (max_restarts_in_window within window_seconds) during one live batch-script execution.
If you omit the rolling-window fields, restart_on_failure still enables default crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
Services configured with mode: ignore cannot be used as dependencies in depends_on.

Examples:

Use the defaults when you only need bounded retries:

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure

That resolves to:

max_restarts: 3
backoff_seconds: 5
window_seconds: 60
max_restarts_in_window: 3

Use explicit fields when you need a larger lifetime budget but still want a tighter crash-loop guard:

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 8
        backoff_seconds: 10
        window_seconds: 60
        max_restarts_in_window: 3

Semantics:

The initial launch does not count as a restart.
restart_count counts granted relaunches after the initial launch.
max_restarts_in_window counts restart-triggering non-zero exits whose timestamps still satisfy now - event < window_seconds.
If a non-zero exit would exceed the rolling-window cap, the job fails immediately and that blocked exit is not recorded as a consumed restart.
Successful exits do not trigger restarts and do not add entries to the rolling window.
The rolling window is attempt-local to one live batch-script execution. It is not hydrated from prior state.json, resume metadata, or Slurm requeue history.
x-slurm.hooks can observe accepted restart events and blocked window_exhausted events without changing the policy decision.
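
The interaction between the lifetime cap and the rolling window can be sketched in Python as below. This is a model of the semantics described above, not the rendered batch-script logic: RestartGuard and its decision strings are hypothetical, and checking the lifetime cap before the window cap is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RestartGuard:
    max_restarts: int = 3
    window_seconds: float = 60
    max_restarts_in_window: Optional[int] = None  # defaults to max_restarts
    restart_count: int = 0
    recent_failures: deque = field(default_factory=deque)

    def __post_init__(self) -> None:
        if self.max_restarts_in_window is None:
            self.max_restarts_in_window = self.max_restarts

    def on_nonzero_exit(self, now: float) -> str:
        """Decide what happens after a non-zero exit at time now (seconds)."""
        # Keep only events that still satisfy now - event < window_seconds.
        while self.recent_failures and now - self.recent_failures[0] >= self.window_seconds:
            self.recent_failures.popleft()
        if self.restart_count >= self.max_restarts:
            return "max_restarts_exhausted"
        if len(self.recent_failures) >= self.max_restarts_in_window:
            # Blocked exit: not recorded as a consumed restart.
            return "window_exhausted"
        self.recent_failures.append(now)
        self.restart_count += 1
        return "restart"


guard = RestartGuard(max_restarts=8, max_restarts_in_window=3)
print([guard.on_nonzero_exit(t) for t in (0, 10, 20, 30)])
# -> ['restart', 'restart', 'restart', 'window_exhausted']
```

In this run the lifetime budget of 8 is far from spent, but the fourth failure inside 60 seconds trips the crash-loop guard, and restart_count stays at 3.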

Tracked state:

status --format json includes failure_policy_mode, restart_count, max_restarts, window_seconds, max_restarts_in_window, restart_failures_in_window, and last_exit_code for each tracked service.
Text status renders the live rolling-window budget as window=<current>/<max>@<seconds>s.

Unknown keys under top-level x-slurm or per-service x-slurm cause hard errors.

`x-runtime.prepare` and `x-enroot.prepare`

x-runtime.prepare lets a service build a prepared runtime image from its base image before submission. x-enroot.prepare remains accepted as a Pyxis-only compatibility spelling.

services:
  app:
    image: python:3.11-slim
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir numpy pandas
        mounts:
          - ./requirements.txt:/tmp/requirements.txt
        env:
          PIP_CACHE_DIR: /tmp/pip-cache
        root: true

Field	Shape	Default	Notes
`commands`	list of strings	required when `prepare` is present	Each command runs through the selected backend’s writable prepare flow.
`mounts`	list of `host_path:container_path` strings	omitted	Visible only during prepare. Relative host paths resolve against the compose file directory.
`env`	mapping or list of `KEY=VALUE` strings	omitted	Passed only during prepare. Values support the same interpolation rules as `environment`.
`root`	boolean	`true`	Controls whether prepare commands request root/fakeroot behavior where the backend supports it.

Rules:

If x-runtime.prepare or x-enroot.prepare is present, commands cannot be empty.
A service may not set both spellings.
x-enroot.prepare is rejected when runtime.backend is not pyxis.
If prepare.mounts is non-empty, the service rebuilds on every prepare or up.
Remote base images are imported under cache_dir/base.
Prepared images are exported under cache_dir/prepared.
Unknown keys under x-runtime, x-enroot, or prepare cause hard errors.

Unsupported Compose keys

These keys are rejected with explicit messages:

build
ports
networks
network_mode
Compose restart (use services.<name>.x-slurm.failure_policy)
deploy

Any other unknown key at the service level is also rejected.

Files and Directories

hpc-compose writes to three independent on-disk roots, and keeping them separate is deliberate. Compose-level metadata lives next to the compose file so tracked records travel with your project; per-job runtime state lives under a per-job runtime root resolved at submit time; and the cache is a content-addressed store shared across jobs and visible from both the login node and the compute nodes. src/tracked_paths.rs is the single source of truth for every leaf name documented here, so the layout below matches what tooling reads and writes exactly.

The three roots at a glance

Root	Default location	Set with	Scope	Holds
Metadata directory	`<compose-file-dir>/.hpc-compose/`	(always next to the compose file)	Per compose file	Tracked job records, latest pointers, sweep manifests
Per-job runtime root	`<submit-dir>/.hpc-compose/<job-id>/`	`x-slurm.runtime_root`	Per job	Logs, metrics, artifacts, allocation files, state
Cache directory	`$HOME/.cache/hpc-compose/`	`x-slurm.cache_dir`	Shared across jobs	Content-addressed images, enroot caches, rendezvous records

The metadata directory and the default per-job runtime root share the same .hpc-compose/ directory name, but they are addressed independently: the metadata root is anchored to the compose file’s directory, while the runtime root is anchored to the submit directory (and is overridable). They coincide only when you submit from the directory that holds the compose file and leave x-slurm.runtime_root unset.

Metadata directory

The metadata directory sits next to the compose file (metadata_root_for joins .hpc-compose onto the compose file’s parent). It holds the durable record of every submission plus the latest-pointers that let follow-up commands reconnect without resubmitting.

<compose-file-dir>/.hpc-compose/
├── latest.json              # most recent `up` (main) submission record
├── latest-run.json          # most recent `run` submission record
├── latest-canary.json       # most recent `germinate` canary record
├── latest-notebook.json     # most recent `notebook` server record
├── jobs/
│   └── <job-id>.json        # one tracked SubmissionRecord per submitted job
└── sweeps/
    ├── latest.json          # most recent sweep manifest pointer
    └── <sweep-id>/
        └── sweep.json       # per-sweep manifest

Leaf	Kind	Contents
`latest.json`	file	`SubmissionRecord` for the most recent `up` (main-kind) submission.
`latest-run.json`	file	`SubmissionRecord` for the most recent `run` submission.
`latest-canary.json`	file	`SubmissionRecord` for the most recent `germinate` canary submission.
`latest-notebook.json`	file	`SubmissionRecord` for the most recent tracked `notebook` submission.
`jobs/<job-id>.json`	file	The authoritative `SubmissionRecord` for one job, keyed by Slurm job id.
`sweeps/latest.json`	file	Pointer to the most recent sweep manifest.
`sweeps/<sweep-id>/sweep.json`	file	Manifest describing one sweep and its trials.

A SubmissionRecord carries the paths the runtime root resolves to, including runtime_root (the resolved x-slurm.runtime_root override, present only when set), batch_log, batch_log_managed, and service_logs (the authoritative service-name to log-path map; see Log lifecycle). The current SubmissionRecord schema version is 3. Records written by schema 3 persist the runtime_root override when one was set; older records that lack the field fall back to the default <submit-dir>/.hpc-compose layout when read.

Per-job runtime root

Each job gets its own runtime root: <runtime-root>/<job-id>/, where <runtime-root> defaults to <submit-dir>/.hpc-compose (runtime_root_for) and is overridable with x-slurm.runtime_root (resolve_runtime_root). The renderer resolves this to an absolute path at submit time and bakes it into the rendered JOB_ROOT, so a running job never depends on $SLURM_SUBMIT_DIR being set or shared-visible at compute-node runtime. A relative x-slurm.runtime_root resolves against the submit directory; an absolute one is used as-is.
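
The resolution rules can be illustrated with a short Python sketch. The functions below are hypothetical analogues written for this page, not the Rust helpers named above, and they assume the submit directory is already an absolute POSIX path.

```python
from pathlib import PurePosixPath
from typing import Optional


def metadata_root(compose_file: str) -> PurePosixPath:
    """Metadata lives next to the compose file."""
    return PurePosixPath(compose_file).parent / ".hpc-compose"


def job_root(submit_dir: str, job_id: str, runtime_root: Optional[str] = None) -> PurePosixPath:
    """Resolve <runtime-root>/<job-id> at submit time."""
    base = PurePosixPath(submit_dir)
    if runtime_root is None:
        root = base / ".hpc-compose"  # default runtime root
    else:
        # Joining an absolute path replaces base; a relative one is
        # resolved against the submit directory.
        root = base / runtime_root
    return root / job_id


print(job_root("/home/u/proj", "12345"))                   # -> /home/u/proj/.hpc-compose/12345
print(job_root("/home/u/proj", "12345", "runs"))           # -> /home/u/proj/runs/12345
print(job_root("/home/u/proj", "12345", "/scratch/u/rt"))  # -> /scratch/u/rt/12345
```

With the compose file at /home/u/proj/compose.yaml and a submission from /home/u/proj, metadata_root and the default runtime root both land in /home/u/proj/.hpc-compose, which is the coinciding case described earlier.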

<runtime-root>/
├── logs/
│   ├── hpc-compose-%j.out        # default batch log (job-id, Slurm-expanded)
│   └── <service-token>.log       # one log per service (see Log lifecycle)
└── <job-id>/
    ├── state.json                # job state snapshot (latest view)
    ├── logs/
    │   └── <service-token>.log   # per-service logs, latest attempt
    ├── metrics/
    │   ├── meta.json
    │   ├── gpu.jsonl
    │   ├── gpu_processes.jsonl
    │   ├── slurm.jsonl
    │   ├── diagnostics/
    │   └── gpu-node-samples/
    ├── artifacts/
    │   ├── manifest.json
    │   └── payload/
    ├── allocation/
    │   ├── primary_node
    │   ├── nodes.txt
    │   ├── service-nodelists/
    │   ├── mpi-hostfiles/
    │   └── distributed-hostfiles/
    ├── service-exits/
    ├── hooks/
    └── attempts/                 # resume-aware runs only
        └── <n>/                  # logs/, metrics/, artifacts/, state.json per attempt

Leaf (under `<job-id>/`)	Kind	Contents
`state.json`	file	Latest-view job state snapshot used by `status` and friends.
`logs/<service-token>.log`	file	One log per service for the latest attempt; the filename is encoded (see below).
`metrics/meta.json`	file	Metrics collection metadata.
`metrics/gpu.jsonl`	file	Per-sample GPU metrics.
`metrics/gpu_processes.jsonl`	file	Per-sample GPU process attribution.
`metrics/slurm.jsonl`	file	Slurm step statistics samples.
`metrics/diagnostics/`	dir	Collected diagnostic artifacts.
`metrics/gpu-node-samples/`	dir	Per-node GPU sample files.
`artifacts/manifest.json`	file	Manifest describing exported artifacts.
`artifacts/payload/`	dir	The exported artifact payload tree.
`allocation/primary_node`	file	Hostname of the primary allocation node.
`allocation/nodes.txt`	file	The full allocation node list.
`allocation/service-nodelists/`	dir	Per-service node lists.
`allocation/mpi-hostfiles/`	dir	Generated MPI hostfiles.
`allocation/distributed-hostfiles/`	dir	Generated distributed (torchrun-style) hostfiles.
`service-exits/`	dir	Per-service exit markers (`<service>.jsonl`).
`hooks/`	dir	Materialized prologue/epilogue/event hook scripts and their manifest.
`attempts/<n>/`	dir	Per-attempt copies of `logs/`, `metrics/`, `artifacts/`, and `state.json` for resume-aware runs. These per-attempt `state.json` files are the data source for `hpc-compose checkpoints` attempt/requeue history.

The batch script keeps the root-level logs/, metrics/, artifacts/, and state.json as the “latest” view (it updates them to point at the most recent attempt) so status and export commands read the latest attempt without reconstructing shell logic.

Default batch log location

When you do not set x-slurm.output, real submissions get a baked --output directive at <runtime-root>/logs/hpc-compose-%j.out. This parent directory is job-id-free (<runtime-root>/logs/, not under <runtime-root>/<job-id>/): Slurm opens --output before the script body runs, so the CLI must pre-create the directory host-side before sbatch, at which point no job id has been assigned yet. The default basename deliberately avoids %x so a raw job name can never become a path component; %j is expanded by Slurm. Setting x-slurm.output replaces this default entirely. Dry-run previews (inspect, render) keep the portable Slurm default instead of a baked absolute path so committed example renders stay machine-independent.

Cache directory

The cache directory defaults to $HOME/.cache/hpc-compose/ and is set with x-slurm.cache_dir (resolved with the precedence documented in Spec Reference). It must be visible from both the login node and the compute nodes. Image artifacts are content-addressed: the filename embeds a short hash of the cache key, so identical inputs reuse the same artifact across jobs and machines.

<cache_dir>/
├── base/
│   ├── <hash>-<label>.sqsh        # imported base image
│   ├── <hash>-<label>.sqsh.json   # manifest sidecar
│   └── <hash>-<label>.sqsh.json.lock  # advisory-lock sidecar
├── prepared/
│   ├── <hash>-<name>.sqsh         # prepared runtime image
│   └── <hash>-<name>.sqsh.json    # manifest sidecar
├── enroot/                        # login-node shared enroot store
│   ├── cache/
│   ├── data/
│   └── tmp/
├── runtime/
│   └── <job-id>/                  # per-job compute-node enroot runtime cache
│       ├── cache/                 # ENROOT_CACHE_PATH
│       ├── data/                  # ENROOT_DATA_PATH
│       └── tmp/                   # ENROOT_TEMP_PATH
└── rendezvous/
    └── <name>/
        ├── latest.json            # current provider for this rendezvous name
        └── <token>.json           # historical per-registration records

Leaf	Kind	Contents
`base/<hash>-<label>.sqsh`	file	A base image imported from a remote reference, named by `<short-hash>-<label>`.
`base/<hash>-<label>.sqsh.json`	file	Manifest tracking the cache entry.
`base/<hash>-<label>.sqsh.json.lock`	file	Advisory-lock sidecar that serializes concurrent manifest read-modify-write.
`prepared/<hash>-<name>.sqsh`	file	A prepared runtime image derived from a base image plus prepare steps, named by `<short-hash>-<service-name>`.
`prepared/<hash>-<name>.sqsh.json`	file	Manifest tracking the prepared entry.
`enroot/cache/`, `enroot/data/`, `enroot/tmp/`	dir	The shared login-node enroot store used during host-side prepare. `enroot/tmp` is the default extraction scratch; redirect it to node-local storage with `x-slurm.enroot_temp_dir` (or `cache.enroot_temp_dir` / `HPC_COMPOSE_ENROOT_TEMP_DIR`) to avoid `Stale file handle` on shared filesystems.
`runtime/<job-id>/{cache,data,tmp}/`	dir	The per-job compute-node enroot runtime cache; the renderer exports `ENROOT_CACHE_PATH`/`ENROOT_DATA_PATH`/`ENROOT_TEMP_PATH` at these paths (`enroot_runtime_job_dir`). Namespaced by job id so removing it never touches the shared cache root.
`rendezvous/<name>/latest.json`	file	The current provider record for one rendezvous name (atomic latest pointer).
`rendezvous/<name>/<token>.json`	file	Historical per-registration records, retained until TTL expiry or owner cleanup.

Manifest .lock sidecars carry no data and only serialize writers; the manifest JSON next to each artifact is the persisted record. See Connect Jobs Across Allocations for how rendezvous records are produced and resolved.
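As a rough illustration of the `runtime/<job-id>/` rows above, the sketch below derives the three per-job enroot paths from a cache directory and a job id. The function name `enroot_runtime_env` and the sample values are made up for this example; only the path layout comes from the table.

```python
from pathlib import PurePosixPath


def enroot_runtime_env(cache_dir: str, job_id: str) -> dict[str, str]:
    """Map the enroot variables to <cache_dir>/runtime/<job-id>/{cache,data,tmp}.

    Hypothetical helper: it mirrors the documented layout, not the renderer's code.
    """
    job_dir = PurePosixPath(cache_dir) / "runtime" / job_id
    return {
        "ENROOT_CACHE_PATH": str(job_dir / "cache"),
        "ENROOT_DATA_PATH": str(job_dir / "data"),
        "ENROOT_TEMP_PATH": str(job_dir / "tmp"),
    }


# Sample values only; substitute your own cache directory and Slurm job id.
env = enroot_runtime_env("/cluster/shared/hpc-compose-cache", "123456")
for name, path in env.items():
    print(f"export {name}={path}")
```

Because every path sits under `runtime/<job-id>/`, removing that one directory cleans up a job's enroot state without touching `base/`, `prepared/`, or the shared `enroot/` store.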

Repo staging vs cluster workspace provisioning

The three roots above are written by hpc-compose itself. They are not the same as the cluster workspaces and site storage directories your job reads and writes — those you provision yourself.

When you submit from a laptop with hpc-compose up --remote, the project is first staged to a per-project directory on the login node:

~/.hpc-compose-remote/<project>/      # rsync'd copy of your settings base on the login node

The staged root is the settings base: the directory that contains .hpc-compose/settings.toml. Keep that file at the repo root so your whole source tree is staged. If your compose file sits in a subdirectory with no repo-root settings file, only that subdirectory is staged and the rest of your tree is hidden from the job (hpc-compose warns when it stages only a subdir). The stage includes project settings (.hpc-compose/settings.toml, .hpc-compose/cluster.toml) but excludes tracked job/runtime state. See Submit From Your Laptop With up --remote.

Staging copies your repo. It does not allocate cluster workspaces (for example ws_allocate) or create site storage directories. You must create cache, dataset, checkpoint, and other site storage paths yourself before the run — a missing host bind-mount or storage directory blocks preflight.

Preflight remediation reflects this boundary. For a relative or in-repo missing path it tells you to create the directory; for an absolute missing path it notes that the path may be a cluster workspace or site storage location and should be provisioned with your site’s allocation command (for example ws_allocate) or an x-slurm.setup step, because hpc-compose stages your repo but does not allocate workspaces or create site storage directories.

Bootstrapping required directories

x-slurm.setup is the declarative bootstrap phase: its commands run on the allocated node before any service starts, so it is the right place to create the cache/data/results sub-directories your bind mounts expect. Allocate (or look up) the workspace first, then create the layout declaratively:

x-slurm:
  setup:
    # $WORKSPACE is resolved on the node (e.g. exported by an earlier step or your shell rc);
    # ws_allocate / ws_find belong in your session, not here, because they allocate quota.
    - mkdir -p "$WORKSPACE"/{cache,data,results,runtime}

For in-repo directories (relative bind-mount sources such as ./results), commit them with a .gitkeep so they exist and are staged, rather than relying on them being created at runtime. Use absolute cluster paths for large/scratch data that should not be staged, and relative in-repo paths for small inputs that travel with the project.

Excluding files from staging (`.hpcignore`)

A repo-root .hpcignore adds extra excludes on top of .gitignore when the source tree is snapshotted (for up, prepare, and up --remote). It uses gitignore-style patterns, so anchoring matters:

An unanchored directory pattern like data/ matches that name at any depth — including a Python package subtree such as src/mypackage/data/. Excluding package source there causes ModuleNotFoundError at runtime.
Anchor artifact patterns to the repo root with a leading slash — /data/, /runs/, /results/ — so they only match the top-level artifact directories and never a nested package.

hpc-compose warns when .hpcignore excludes any .py file (the usual symptom of this mistake). To see exactly what an .hpcignore removes from the snapshot, set HPC_COMPOSE_DEBUG_STAGING=1, which lists every excluded path during staging.
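The anchoring difference can be sketched with a deliberately simplified matcher. It handles only single-name directory patterns (`name/` and `/name/`), not the full gitignore grammar, and the helper `dir_pattern_excludes` and the file list are made up for this example rather than taken from hpc-compose.

```python
from pathlib import PurePosixPath


def dir_pattern_excludes(pattern: str, rel_path: str) -> bool:
    """Simplified gitignore-style check for single-name directory patterns.

    'name/'  matches a directory called name at any depth.
    '/name/' matches only a top-level directory called name.
    """
    anchored = pattern.startswith("/")
    name = pattern.strip("/")
    parents = PurePosixPath(rel_path).parts[:-1]  # directory components of the file
    if anchored:
        return len(parents) > 0 and parents[0] == name
    return name in parents


files = [
    "data/train.csv",
    "src/mypackage/data/__init__.py",
    "src/mypackage/model.py",
]

for pattern in ("data/", "/data/"):
    excluded = [f for f in files if dir_pattern_excludes(pattern, f)]
    print(pattern, "->", excluded)
```

With these inputs the unanchored `data/` also removes `src/mypackage/data/__init__.py`, while `/data/` removes only the top-level artifact directory.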

Environment variables that affect paths

hpc-compose both reads some path-affecting variables from your environment and sets others into the running job. The table below consolidates the relevant ones.

Variable	Direction	Effect
`HOME`	Read from environment	Anchors the default cache directory (`$HOME/.cache/hpc-compose`) when `x-slurm.cache_dir` is unset.
`SLURM_SUBMIT_DIR`	Read from environment	Now only a preview fallback: dry-run renders use `${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose` for `JOB_ROOT`. Real submissions bake an absolute runtime root, so the running job no longer depends on it.
`SLURM_JOB_ID`	Read from environment (set by Slurm)	Selects the per-job runtime root (`JOB_ROOT/<job-id>`) and the per-job enroot runtime dir (`runtime/<job-id>`); expanded into `%j` in the default batch log.
`ENROOT_CACHE_PATH`	Set by hpc-compose	Exported to `<cache_dir>/runtime/<job-id>/cache` in the rendered batch script.
`ENROOT_DATA_PATH`	Set by hpc-compose	Exported to `<cache_dir>/runtime/<job-id>/data`.
`ENROOT_TEMP_PATH`	Set by hpc-compose	Exported to `<cache_dir>/runtime/<job-id>/tmp` at compute-node runtime; during prepare it defaults to `<cache_dir>/enroot/tmp` unless redirected (see `HPC_COMPOSE_ENROOT_TEMP_DIR`).
`HPC_COMPOSE_ENROOT_TEMP_DIR`	Read from environment	Overrides the prepare-time enroot extraction scratch (default `<cache_dir>/enroot/tmp`). Mirrors `x-slurm.enroot_temp_dir`/`cache.enroot_temp_dir`; for `up --remote` prefer the spec or settings field, because a laptop env var does not propagate over SSH.
`HPC_COMPOSE_PREPARE_GPU`	Read from environment	Opts prepare-time image building back into enroot’s NVIDIA hook. Default is off: prepare runs CPU-only on the login node (`NVIDIA_VISIBLE_DEVICES=void`) so a CUDA image’s baked GPU request does not make the hook fail where no driver is present; GPUs are injected at Slurm/Pyxis runtime instead. Set to `1`/`true`/`yes`/`on` only when the prepare host actually has a driver.
`HPC_COMPOSE_BACKEND_OVERRIDE`	Read from environment	Selects the runtime backend used by the batch script (defaults to `slurm`).
`HPC_COMPOSE_DEV_CONTROL_DIR`	Read from environment	When set, enables the dev control directory used for live restart requests during local smoke-tests.
`HPC_COMPOSE_DEBUG_STAGING`	Read from environment	When truthy, lists every path excluded from the source snapshot by `.hpcignore` during staging (a staged-file manifest aid for debugging ignore rules).
`HPC_COMPOSE_SERVICE_LOG`	Set by hpc-compose	Points each service and its hooks at the in-container path of that service’s log file.
`HPC_COMPOSE_RESUME_DIR`	Set by hpc-compose	The in-container path of the resume directory for resume-aware runs.

During login-node prepare, the same enroot variables point at the shared <cache_dir>/enroot/{cache,data,tmp} store rather than the per-job runtime/<job-id> store. The persistent layer cache (ENROOT_CACHE_PATH) always stays under cache_dir. The temporary extraction scratch (ENROOT_TEMP_PATH) can be moved to fast node-local storage, and when it is, the transient prepare rootfs (ENROOT_DATA_PATH, where enroot create unsquashes the image before the prepared .sqsh is exported) moves with it.

By default the scratch stays at <cache_dir>/enroot/tmp. Opt in to a redirect by setting x-slurm.enroot_temp_dir in the spec (interpolation-aware, e.g. /tmp/${USER}-hpc-compose-enroot), cache.enroot_temp_dir in .hpc-compose/settings.toml (project-wide default, mirroring cache.dir), or the HPC_COMPOSE_ENROOT_TEMP_DIR environment variable. Precedence is HPC_COMPOSE_ENROOT_TEMP_DIR > x-slurm.enroot_temp_dir > cache.enroot_temp_dir > the <cache_dir>/enroot/tmp default. When the scratch is left at its default, the prepare rootfs stays on the shared cache (<cache_dir>/enroot/data); redirecting the scratch moves both the extraction scratch and the transient rootfs to an hpc-compose-owned per-process subdir under the node-local path.

This matters on shared NFS/Lustre/GPFS home/work storage, where the extract-then-mksquashfs import and the unsquashfs create step are slow and can fail with Stale file handle (ESTALE). Pointing the scratch at node-local /tmp keeps the final .sqsh and layer cache on the shared cache while extraction and rootfs creation happen locally. The override applies to prepare-time import only, not the compute-node runtime. hpc-compose preflight surfaces the resolved enroot temp path, and hpc-compose context shows the settings-level value.

The full set of HPC_COMPOSE_* runtime variables injected into services (distributed, rendezvous, MPI, scratch, and hook variables) is described in Monitor a Run and the feature guides.
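The prepare-time scratch precedence described above can be read as a first-match lookup. The sketch below is illustrative only: the function `resolve_enroot_temp_dir` and its parameters are made up for this example, it assumes the spec value has already been interpolated, and it assumes an empty value counts as unset.

```python
from typing import Mapping, Optional


def resolve_enroot_temp_dir(
    cache_dir: str,
    env: Mapping[str, str],
    spec_value: Optional[str] = None,      # x-slurm.enroot_temp_dir (already interpolated)
    settings_value: Optional[str] = None,  # cache.enroot_temp_dir from settings.toml
) -> str:
    """Return the prepare-time extraction scratch using the documented order:
    environment variable > spec field > settings field > <cache_dir>/enroot/tmp.
    """
    candidates = (
        env.get("HPC_COMPOSE_ENROOT_TEMP_DIR"),
        spec_value,
        settings_value,
    )
    for candidate in candidates:
        if candidate:  # assumption: empty strings are treated as unset
            return candidate
    return f"{cache_dir}/enroot/tmp"


cache = "/cluster/shared/hpc-compose-cache"
print(resolve_enroot_temp_dir(cache, {}))
print(resolve_enroot_temp_dir(cache, {}, settings_value="/tmp/alice-settings"))
print(resolve_enroot_temp_dir(cache, {}, spec_value="/tmp/alice-spec",
                              settings_value="/tmp/alice-settings"))
print(resolve_enroot_temp_dir(cache, {"HPC_COMPOSE_ENROOT_TEMP_DIR": "/tmp/alice-env"},
                              spec_value="/tmp/alice-spec"))
```

To confirm what a real invocation resolved, rely on hpc-compose preflight rather than a re-implementation like this one.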

Cleanup scope

Different commands reap different subsets of these roots. The table is precise about what each one deletes and what it leaves intact.

Command / mechanism	Deletes	Preserves
`down` (a.k.a. `cancel`)	The job’s tracked record `jobs/<job-id>.json`, the per-job runtime root `<runtime-root>/<job-id>/`, the hpc-compose-managed default batch log when `x-slurm.output` was not set, the per-job enroot dir `<cache_dir>/runtime/<job-id>/`, and this job’s owned rendezvous records. Repairs the latest pointers afterward.	Other jobs’ records and runtime roots, user-pinned `x-slurm.output` files, the shared cache root, `base/`/`prepared/` artifacts, and other jobs’ rendezvous records.
`clean`	The same per-job state as `down` for each reaped record (tracked record, per-job runtime root, managed default batch log, per-job enroot dir, owned rendezvous records), selected by `--age DAYS` or `--all` (all except the latest).	The retained records and their runtime roots, user-pinned `x-slurm.output` files, the shared cache root, and content-addressed artifacts.
Batch teardown trap (`x-slurm.cleanup.runtime_cache`)	Only the per-job enroot runtime cache (`ENROOT_CACHE_PATH`/`DATA_PATH`/`TEMP_PATH` under `runtime/<job-id>/`), and only when the policy opts in. Default is `never`; `on_success` runs only on exit code 0; `always` runs on every clean exit.	Everything else. Because cancelled or crashed jobs never run the trap, host-side `down`/`clean` are the reliable reapers of `runtime/<job-id>`.
`cache prune` (`--age DAYS` or `--all-unused`)	Content-addressed artifacts (`base/` and `prepared/` entries plus their manifest/lock sidecars) that are expired or no longer referenced, and now-empty parent directories left behind.	The cache root itself (never removed), still-referenced artifacts, and non-empty parent directories.
`down --purge-cache`	In addition to the per-job teardown above, the cached artifacts attributed to this submission.	The shared cache root and artifacts belonging to other jobs.
`sweep` cleanup	Tracked sweep trial records and per-trial runtime state, consistent with `clean`.	The sweep manifest history under `sweeps/` unless explicitly removed, and the cache.
`rendezvous prune`	Expired rendezvous records (latest and historical) across all names.	Live `latest.json` pointers and other jobs’ unexpired records.

Two things to keep in mind: tracked metadata records live next to the compose file while the managed default batch log lives under <runtime-root>/logs/, so cleanup uses the persisted record to remove only the log hpc-compose owns; and the per-job enroot dir is namespaced by job id, so reaping it can never touch the shared cache root or another job’s runtime cache.
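To make the `clean` selection rules concrete, here is a small sketch of how `--age DAYS` and `--all` might pick records to reap. The `TrackedRecord` type and its `submitted_at` field are hypothetical stand-ins, not the real record schema. It also assumes that age is measured from submission time and that "the latest" means the most recently submitted record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class TrackedRecord:
    """Hypothetical stand-in for a tracked job record."""
    job_id: str
    submitted_at: datetime


def select_for_clean(
    records: Sequence[TrackedRecord],
    now: datetime,
    age_days: Optional[int] = None,
    reap_all: bool = False,
) -> list[str]:
    """Return the job ids whose per-job state would be reaped."""
    if reap_all:
        # --all: everything except the latest record (assumed: newest submission).
        if not records:
            return []
        latest = max(records, key=lambda r: r.submitted_at)
        return [r.job_id for r in records if r.job_id != latest.job_id]
    if age_days is not None:
        # --age DAYS: records older than the cutoff (assumed: measured from submission).
        cutoff = now - timedelta(days=age_days)
        return [r.job_id for r in records if r.submitted_at < cutoff]
    return []


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    TrackedRecord("1001", now - timedelta(days=20)),
    TrackedRecord("1002", now - timedelta(days=5)),
    TrackedRecord("1003", now - timedelta(days=1)),
]
print(select_for_clean(records, now, age_days=7))
print(select_for_clean(records, now, reap_all=True))
```

In both modes the shared cache root and the `base/` and `prepared/` artifacts are out of scope; those are handled by `cache prune`.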

Log lifecycle

The default batch log (sbatch stdout/stderr) is <runtime-root>/logs/hpc-compose-%j.out unless you set x-slurm.output (see Default batch log location).

Service logs are written one-per-service under <job-id>/logs/. The filename is produced by a reversible token encoding of the service name: each non-alphanumeric byte becomes an _x{hh}_ hex sequence. For example, db.primary (the . is byte 0x2e) becomes db_x2e_primary.log. Do not parse these filenames by hand; the authoritative service-name to log-path map is SubmissionRecord.service_logs, which logs, watch, and replay read.
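For intuition only, the encoding described above can be sketched as follows. This is an illustration, not the hpc-compose implementation: the function names are made up, and the lowercase hex digits and the byte-wise handling of non-ASCII names are assumptions extrapolated from the `db.primary` example. Tooling should keep reading `SubmissionRecord.service_logs`.

```python
import re


def encode_service_name(name: str) -> str:
    """Replace every byte that is not an ASCII letter or digit with _x{hh}_."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"_x{byte:02x}_")
    return "".join(out)


def decode_service_name(encoded: str) -> str:
    """Invert encode_service_name by turning each _x{hh}_ back into its byte."""
    raw = re.sub(
        rb"_x([0-9a-f]{2})_",
        lambda m: bytes([int(m.group(1), 16)]),
        encoded.encode("ascii"),
    )
    return raw.decode("utf-8")


for service in ("db.primary", "curl_client", "app"):
    token = encode_service_name(service)
    assert decode_service_name(token) == service  # round trip
    print(f"{service} -> {token}.log")
```

The round trip works because the underscore is itself escaped (as `_x5f_` in this sketch), so every underscore in an encoded name begins an escape sequence.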

For resume-aware runs, each attempt’s logs and state are preserved under attempts/<n>/, while the root-level logs/ directory and state.json track the latest attempt.

Automatic size-based log rotation is not yet implemented. There is no x-slurm.logs key; cap log volume from inside your service command (for example by limiting verbosity or rotating within your own process) if a long-running service can produce unbounded output.
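One way to rotate within your own process, assuming a Python service, is the standard library's `RotatingFileHandler`. The sketch below is not an hpc-compose feature: it caps a file the service writes itself, the logger name and file path are placeholders, and anything the service prints to stdout or stderr still lands in the captured service log.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_capped_logger(path: str, max_bytes: int, backups: int) -> logging.Logger:
    """Create a logger whose file rolls over before it would exceed max_bytes."""
    logger = logging.getLogger("service")  # placeholder logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records off stdout/stderr
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Demo in a temporary directory; a real service would use a bind-mounted path.
log_dir = tempfile.mkdtemp(prefix="service-logs-")
logger = make_capped_logger(os.path.join(log_dir, "train.log"), max_bytes=4096, backups=2)
for step in range(1000):
    logger.info("step %d finished", step)

log_files = sorted(os.listdir(log_dir))
print(log_files)
```

With `backupCount=2` the handler keeps the active file plus two rolled-over files and discards older output, so disk usage stays bounded at roughly three times `max_bytes`.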

See also: Spec Reference, Architecture for Contributors, Monitor a Run, Manage the Cache and Clean Up, Operate a Real Cluster Run.

Glossary

Core hpc-compose terms, in one place. The short version of this list also appears on the Overview page; this page is the fuller reference.

One-line definitions; follow the link for the owning reference section.

allocation: The single Slurm job where all of an application's services run; one spec compiles to one allocation. See Execution Model.
artifact bundle: A named group of output paths declared under x-slurm.artifacts and exported with hpc-compose artifacts. See x-slurm.artifacts.
cache directory: Shared storage for imported and prepared images, visible from both the submission host and the compute nodes. See x-slurm.cache_dir.
canary: A short, minimized probe run from hpc-compose germinate that writes latest-canary.json and leaves latest.json untouched. See germinate.
compose file / spec: The YAML file describing services, runtime backend, and Slurm settings; "spec" and "compose file" are the same thing. See Spec Reference.
context: The resolved view of settings, profile, binaries, interpolation variables, and runtime paths for an invocation. See context.
failure policy: Per-service restart behavior under services.<name>.x-slurm.failure_policy. See failure_policy.
local mode: Running a plan on the current Linux host through the local Pyxis/Enroot supervisor instead of submitting to Slurm; single-host and Pyxis-only. See up --local.
login node / submission host: The host where you run hpc-compose and from which jobs are submitted; "login node" and "submission host" name the same machine. See Operate a Real Cluster Run.
preflight: Checks of local tools, paths, backend support, and optional cluster profiles before a run. See preflight.
prepare: The login-node phase that imports base images and builds prepared runtime artifacts, reused later by up and run. See x-runtime.prepare.
profile: A named settings block in .hpc-compose/settings.toml, selected with --profile <name>. See Common Flags.
readiness: A gate that holds a dependent service until a probe passes; types are sleep, tcp, http, and log. See readiness.
rendezvous: Same-cluster service discovery through JSON records under the shared cache directory; not DNS, auth, or a service mesh. See x-slurm.rendezvous.
resume: Resume-aware reruns backed by a shared x-slurm.resume.path and attempt-aware state. See x-slurm.resume.
right-sizing: Comparing requested versus observed usage to suggest reductions (inspect --rightsize) plus the efficiency grade from score. See Tracked Runtime.
runtime backend: The mechanism used to launch services: Pyxis/Enroot, Apptainer, Singularity, or host software, selected with runtime.backend. See runtime.
service: One container or host process in the allocation, defined under services.<name> (steps is an accepted alias). See Service fields.
smoke test: A finite end-to-end run (hpc-compose test) where every service must start, pass readiness, and complete successfully. See test.
sweep: An embedded sweep block expanded by hpc-compose sweep submit into many independent tracked allocations, one per trial. See sweep.
tracked job / tracked run: Metadata under .hpc-compose/<job-id>/ that lets status, ps, watch, logs, stats, and artifacts reconnect to a run later; "tracked job" and "tracked run" are the same thing. See Tracked Runtime.
x-runtime.prepare: The spec block for image-preparation commands and mounts; x-enroot.prepare is an accepted Pyxis/Enroot alias. See x-runtime.prepare.
x-slurm: The spec section for Slurm settings and hpc-compose runtime extensions, available at the top level and per service. See x-slurm.

See also: CLI Reference, Spec Reference, Full Example Specs, Roadmap and Non-Goals.

Full Example Specs

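This appendix embeds the runnable repository example YAML files directly from examples/. Most of them read the cache location from CACHE_DIR and fall back to /cluster/shared/hpc-compose-cache when it is unset, so set it to a shared path and confirm it is writable before running them: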

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

App Redis Worker

Source: examples/app-redis-worker.yaml

name: redis-demo

x-slurm:
  job_name: redis-demo
  time: "00:15:00"
  mem: 8G
  cpus_per_task: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  redis:
    image: redis:7
    command: redis-server --save "" --appendonly no
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 6379
      timeout_seconds: 30
    x-slurm:
      cpus_per_task: 1

  worker:
    image: redis:7
    depends_on:
      redis:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        redis-cli -h 127.0.0.1 ping
        while true; do
          redis-cli -h 127.0.0.1 incr jobs
          sleep 2
        done
    x-slurm:
      cpus_per_task: 1

Canary Right Size

Source: examples/canary-right-size.yaml

name: canary-right-size

x-slurm:
  job_name: canary-right-size
  partition: gpu
  time: "04:00:00"
  mem: 64G
  gpus: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  metrics:
    enabled: true
    interval_seconds: 10

services:
  trainer:
    image: python:3.12-slim
    command:
      - /bin/sh
      - -lc
      - |
        python - <<'PY'
        import time
        data = bytearray(512 * 1024 * 1024)
        print(f"allocated {len(data)} bytes")
        time.sleep(20)
        PY
    x-slurm:
      cpus_per_task: 8

Dev Python App

Source: examples/dev-python-app.yaml

name: dev-python-app

x-slurm:
  job_name: dev-python-app
  time: "00:30:00"
  mem: 8G
  cpus_per_task: 2

services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -m
      - main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir fastapi uvicorn openai

Dev Python Smoke

Source: examples/dev-python-smoke.yaml

name: dev-python-smoke

x-slurm:
  job_name: dev-python-smoke
  time: "00:01:00"
  mem: 2G
  cpus_per_task: 1

services:
  app:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    command:
      - python
      - -c
      - "import main; print('smoke ok', flush=True)"
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir fastapi uvicorn openai

Fairseq Preprocess

Source: examples/fairseq-preprocess.yaml

name: fairseq-preprocess

x-slurm:
  job_name: fairseq-preprocess
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  preprocess:
    image: python:3.11-slim
    volumes:
      - /shared/$USER/data/raw:/data/raw
      - /shared/$USER/data/processed:/data/processed
    environment:
      INPUT_DIR: /data/raw
      OUTPUT_DIR: /data/processed
      NUM_WORKERS: "8"
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import os, json, hashlib, multiprocessing
        from pathlib import Path
        from concurrent.futures import ProcessPoolExecutor

        input_dir = Path(os.environ['INPUT_DIR'])
        output_dir = Path(os.environ['OUTPUT_DIR'])
        num_workers = int(os.environ['NUM_WORKERS'])
        output_dir.mkdir(parents=True, exist_ok=True)

        files = sorted(input_dir.glob('*.txt'))
        if not files:
            print(f'No .txt files found in {input_dir}')
            exit(1)
        print(f'Found {len(files)} input files')

        def process_file(path):
            text = path.read_text(encoding='utf-8', errors='replace')
            lines = [l.strip() for l in text.splitlines() if l.strip()]
            tokens = []
            for line in lines:
                tokens.extend(line.lower().split())
            out = output_dir / f'{path.stem}.jsonl'
            with open(out, 'w') as f:
                for i, line in enumerate(lines):
                    record = {
                        'id': f'{path.stem}_{i}',
                        'text': line,
                        'tokens': len(line.split()),
                    }
                    f.write(json.dumps(record) + '\n')
            return path.name, len(lines), len(tokens)

        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            results = list(pool.map(process_file, files))

        total_lines = sum(r[1] for r in results)
        total_tokens = sum(r[2] for r in results)
        for name, lines, tokens in results:
            print(f'  {name}: {lines} lines, {tokens} tokens')
        print(f'Total: {total_lines} lines, {total_tokens} tokens across {len(files)} files')

        manifest = {
            'files': len(files),
            'total_lines': total_lines,
            'total_tokens': total_tokens,
        }
        (output_dir / 'manifest.json').write_text(json.dumps(manifest, indent=2))
        print('Preprocessing complete')
        "
    x-slurm:
      cpus_per_task: 8

HF Stage Model

Source: examples/hf-stage-model.yaml

name: hf-stage-model

# Stage a pinned HuggingFace model into the job, then serve it.
#
# The download runs INSIDE the Slurm allocation (the compute node has network),
# never on your laptop or over SSH. hpc-compose renders a guarded
# `huggingface-cli download ... --revision <sha> --local-dir <cas-path>` step
# into the batch script and reuses the content-addressed copy on repeat runs.
#
# The revision MUST be an immutable pin (a commit SHA or an explicit immutable
# tag); floating refs like `main` are rejected at validation time so the job is
# reproducible. Set HF_TOKEN in the JOB environment for gated repos — it is
# imported at runtime by huggingface-cli and never written into the script.
x-slurm:
  job_name: hf-stage-model
  time: "02:00:00"
  gpus_per_node: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  stage_in:
    - to: /models/llama-3.1-8b
      hf:
        repo: meta-llama/Llama-3.1-8B
        revision: 0e9e39f249a16976918f6564b8830bc894c89659
        kind: model

services:
  server:
    image: vllm/vllm-openai:v0.6.3
    command:
      - /bin/sh
      - -lc
      - |
        python -m vllm.entrypoints.openai.api_server \
          --model /models/llama-3.1-8b \
          --host 0.0.0.0 \
          --port 8000
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      gpus_per_node: 1

Jupyter

Source: examples/jupyter.yaml

name: jupyter

x-slurm:
  job_name: jupyter
  time: "08:00:00"
  mem: 16G
  cpus_per_task: 4
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  notebook:
    image: jupyter/scipy-notebook:latest
    working_dir: /workspace
    volumes:
      - ./project:/workspace
    command:
      - jupyter
      - lab
      - --no-browser
      - --ip=0.0.0.0
      - --port
      - "8888"
      - --ServerApp.token
      - ${JUPYTER_TOKEN:-change-me}
      - --ServerApp.allow_remote_access
      - "True"
    readiness:
      type: log
      pattern: '/lab\?token='

Llama App

Source: examples/llama-app.yaml

name: llama-stack

x-slurm:
  job_name: llama-stack
  time: "02:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - exec /app/llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 8080
      timeout_seconds: 60
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  app:
    image: python:3.11-slim
    depends_on:
      llama:
        condition: service_healthy
    working_dir: /workspace
    volumes:
      - ./app:/workspace
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080/v1
    command:
      - python
      - -m
      - main
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir openai fastapi uvicorn
    x-slurm:
      cpus_per_task: 2

Llama UV Worker

Source: examples/llama-uv-worker.yaml

name: llama-uv-worker

x-slurm:
  job_name: llama-uv-worker
  time: "01:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    environment:
      GGUF_MODEL_PATH: /models/model.gguf
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m "$$GGUF_MODEL_PATH" --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  worker:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./llama-uv-worker:/workspace
    depends_on:
      llama:
        condition: service_healthy
    environment:
      OPENAI_BASE_URL: http://127.0.0.1:8080/v1
      MODEL_NAME: local-model
      REQUEST_DONE_PATH: /hpc-compose/job/request.done
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        UV_CACHE_DIR=/hpc-compose/job/.uv-cache uv run worker.py
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir uv
    x-slurm:
      cpus_per_task: 2

LLM Curl Workflow

Source: examples/llm-curl-workflow.yaml

name: llm-curl-workflow

x-slurm:
  job_name: llm-curl-workflow
  time: "00:30:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  curl_client:
    image: debian:bookworm-slim
    depends_on:
      llm:
        condition: service_healthy
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        cat >/tmp/request.json <<'JSON'
        {
          "model": "local-model",
          "messages": [
            {
              "role": "system",
              "content": "You are a concise assistant."
            },
            {
              "role": "user",
              "content": "Explain what readiness checks do in one sentence."
            }
          ],
          "temperature": 0.2,
          "max_tokens": 64
        }
        JSON
        echo "Sending test request to $$LLM_BASE_URL/v1/chat/completions"
        curl --fail --show-error --silent \
          -H 'Content-Type: application/json' \
          --data @/tmp/request.json \
          "$$LLM_BASE_URL/v1/chat/completions"
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends bash ca-certificates curl
          - rm -rf /var/lib/apt/lists/*
    x-slurm:
      cpus_per_task: 1

LLM Curl Workflow Workdir

Source: examples/llm-curl-workflow-workdir.yaml

name: llm-curl-workflow

x-slurm:
  job_name: llm-curl-workflow
  time: "00:30:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  # Uncomment if your cluster requires them.
  # partition: gpu
  # account: my-project
  # Set CACHE_DIR to a path visible from the submission host and compute nodes.
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    environment:
      MODEL_FILE: model.gguf
    volumes:
      - $HOME/models:/models
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        /app/llama-server -m /models/$$MODEL_FILE --host 0.0.0.0 --port 8080 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "main: model loaded"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  curl_client:
    image: debian:bookworm-slim
    depends_on:
      llm:
        condition: service_healthy
    environment:
      LLM_BASE_URL: http://127.0.0.1:8080
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        cat >/tmp/request.json <<'JSON'
        {
          "model": "local-model",
          "messages": [
            {
              "role": "system",
              "content": "You are a concise assistant."
            },
            {
              "role": "user",
              "content": "Explain what readiness checks do in one sentence."
            }
          ],
          "temperature": 0.2,
          "max_tokens": 64
        }
        JSON
        echo "Sending test request to $$LLM_BASE_URL/v1/chat/completions"
        curl --fail --show-error --silent \
          -H 'Content-Type: application/json' \
          --data @/tmp/request.json \
          "$$LLM_BASE_URL/v1/chat/completions"
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends bash ca-certificates curl
          - rm -rf /var/lib/apt/lists/*
    x-slurm:
      cpus_per_task: 1

Minimal Batch

Source: examples/minimal-batch.yaml

name: minimal-batch

x-slurm:
  job_name: minimal-batch
  time: "00:10:00"
  mem: 4G
  cpus_per_task: 2

services:
  app:
    image: python:3.11-slim
    command: python -c "print('Hello from Slurm!')"

MPI Hello

Source: examples/mpi-hello.yaml

name: mpi-hello

x-slurm:
  job_name: mpi-hello
  time: "00:15:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  mpi:
    image: debian:bookworm-slim
    command:
      - /bin/sh
      - -lc
      - /usr/local/bin/mpi_hello
    x-runtime:
      prepare:
        commands:
          - apt-get update
          - apt-get install -y --no-install-recommends libopenmpi-dev openmpi-bin gcc
          - |
            cat > /tmp/hello.c << 'EOF'
            #include <mpi.h>
            #include <stdio.h>
            int main(int argc, char **argv) {
                MPI_Init(&argc, &argv);
                int rank, size;
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
                printf("Hello from rank %d of %d\n", rank, size);
                MPI_Finalize();
                return 0;
            }
            EOF
            mpicc /tmp/hello.c -o /usr/local/bin/mpi_hello
          - rm -rf /var/lib/apt/lists/* /tmp/hello.c
    x-slurm:
      ntasks: 4
      cpus_per_task: 4
      mpi:
        type: pmix
        profile: openmpi
        implementation: openmpi

MPI PMIx v4 Host MPI

Source: examples/mpi-pmix-v4-host-mpi.yaml

name: mpi-pmix-v4-host-mpi

runtime:
  backend: pyxis

x-slurm:
  job_name: mpi-pmix-v4-host-mpi
  time: "00:20:00"
  nodes: 2
  ntasks_per_node: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  mpi:
    image: debian:bookworm-slim
    command:
      - /bin/sh
      - -lc
      - |
        echo "mpi_type=$$HPC_COMPOSE_MPI_TYPE"
        echo "hostfile=$$HPC_COMPOSE_MPI_HOSTFILE"
        cat "$$HPC_COMPOSE_MPI_HOSTFILE"
        /opt/site/openmpi/bin/mpirun --version || true
    x-slurm:
      nodes: 2
      ntasks_per_node: 2
      mpi:
        type: pmix_v4
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 4
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_HOME: /opt/site/openmpi

Multi Node MPI

Source: examples/multi-node-mpi.yaml

name: multi-node-mpi

x-slurm:
  job_name: multi-node-mpi
  time: "00:20:00"
  nodes: 2
  ntasks_per_node: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  bootstrap:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        echo "primary=$(cat /hpc-compose/job/allocation/primary_node)"
        sleep 30
    readiness:
      type: sleep
      seconds: 1
    x-slurm:
      nodes: 1

  mpi:
    image: python:3.11-slim
    depends_on:
      bootstrap:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        echo "primary=$(cat /hpc-compose/job/allocation/primary_node)"
        echo "nodes=$(tr '\n' ' ' < /hpc-compose/job/allocation/nodes.txt)"
        echo "mpi_hostfile=$$HPC_COMPOSE_MPI_HOSTFILE"
        cat "$$HPC_COMPOSE_MPI_HOSTFILE"
        python - <<'PY'
        import os
        print("mpi placeholder")
        print("node_count", os.environ["HPC_COMPOSE_NODE_COUNT"])
        print("mpi_type", os.environ["HPC_COMPOSE_MPI_TYPE"])
        PY
    readiness:
      type: sleep
      seconds: 2
    x-slurm:
      nodes: 2
      ntasks_per_node: 2
      mpi:
        type: pmix
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 4
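
The `expected_ranks` value above lines up with the service geometry: 2 nodes with 2 tasks each. As a rough sanity check, assuming `expected_ranks` is meant to equal the total task count (this page does not state the rule), the arithmetic can be sketched in Python. `expected_ranks` below is a hypothetical helper, not an hpc-compose function.

```python
def expected_ranks(nodes, ntasks_per_node):
    """Total MPI ranks when every node runs the same number of tasks."""
    return nodes * ntasks_per_node


# Values copied from the mpi service in multi-node-mpi.yaml.
print(expected_ranks(nodes=2, ntasks_per_node=2))
```

The same arithmetic gives 8 for a service with `nodes: 2` and `ntasks_per_node: 4`.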

Multi Node Partitioned

Source: examples/multi-node-partitioned.yaml

name: multi-node-partitioned

x-slurm:
  job_name: multi-node-partitioned
  time: "00:20:00"
  nodes: 8
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  service-a:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        echo "service-a nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        sleep 30
    readiness:
      type: sleep
      seconds: 1
    x-slurm:
      placement:
        node_range: "0-3"

  service-b:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        echo "service-b nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        sleep 30
    readiness:
      type: sleep
      seconds: 1
    x-slurm:
      placement:
        node_range: "4-7"

  parameter-server:
    image: alpine:3.20
    depends_on:
      service-b:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        echo "co-located with service-b on $$HPC_COMPOSE_SERVICE_NODELIST"
        sleep 30
    readiness:
      type: sleep
      seconds: 1
    x-slurm:
      placement:
        share_with: service-b

  monitor:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        echo "monitor nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        sleep 30
    x-slurm:
      placement:
        node_percent: 25
        allow_overlap: true
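
To make the placement values above easier to read, here is a small Python sketch of how the two `node_range` strings could map onto an 8-node allocation. It assumes zero-based, inclusive indices, which is inferred from the values in this example rather than stated here, and `parse_node_range` is a hypothetical helper, not part of hpc-compose.

```python
def parse_node_range(spec):
    """Parse "4-7" or "1" into a list of zero-based, inclusive node indices."""
    first, _, last = spec.partition("-")
    return list(range(int(first), int(last or first) + 1))


total_nodes = 8  # x-slurm.nodes in multi-node-partitioned.yaml
service_a = parse_node_range("0-3")
service_b = parse_node_range("4-7")
# Assumption: node_percent is a share of the allocation; 25% of 8 is exactly 2.
monitor_count = total_nodes * 25 // 100

print("service-a:", service_a)
print("service-b:", service_b)
print("disjoint:", not set(service_a) & set(service_b))
print("monitor node count:", monitor_count)
```

Under these assumptions service-a and service-b split the allocation into two disjoint halves, which is why only `monitor` needs `allow_overlap: true`.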

Multi Node Torchrun

Source: examples/multi-node-torchrun.yaml

name: multi-node-torchrun

x-slurm:
  job_name: multi-node-torchrun
  time: "04:00:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    command:
      - /bin/sh
      - -lc
      - |
        echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
        echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        echo "node_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
        torchrun \
          --nnodes="$$HPC_COMPOSE_DIST_NNODES" \
          --nproc-per-node="$$HPC_COMPOSE_DIST_NPROC_PER_NODE" \
          --node-rank="$$HPC_COMPOSE_DIST_NODE_RANK" \
          --rdzv-backend=c10d \
          --rdzv-endpoint="$$HPC_COMPOSE_DIST_RDZV_ENDPOINT" \
          train.py
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      nodes: 2
      ntasks_per_node: 1
      gpus_per_node: 4
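
The examples mix two variable styles: `${CACHE_DIR:-...}` and `$$HPC_COMPOSE_DIST_NNODES`. Assuming hpc-compose follows the Docker Compose convention (an assumption, not confirmed on this page), the first form is filled in when the spec is rendered, while `$$` is an escape that leaves a literal `$` for the shell inside the job. The Python sketch below imitates that convention for the braced form only; `interpolate` is a hypothetical helper, not the tool's implementation.

```python
import re

# Matches either the "$$" escape or a braced ${NAME} / ${NAME:-default} reference.
_PATTERN = re.compile(r"\$\$|\$\{(\w+)(?::-([^}]*))?\}")


def interpolate(text, env):
    """Expand braced references from env and turn "$$" into a literal "$"."""
    def replace(match):
        if match.group(0) == "$$":
            return "$"
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        # ":-" falls back to the default when the variable is unset or empty.
        return value if value else (default or "")
    return _PATTERN.sub(replace, text)


print(interpolate("cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}", {}))
print(interpolate('--nnodes="$$HPC_COMPOSE_DIST_NNODES"', {}))
```

Under this reading, the second line still contains `$HPC_COMPOSE_DIST_NNODES` after rendering, so the value is resolved on the compute node rather than on the submission host.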

Multi Node DeepSpeed

Source: examples/multi-node-deepspeed.yaml

name: multi-node-deepspeed

x-slurm:
  job_name: multi-node-deepspeed
  time: "04:00:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    command:
      - /bin/sh
      - -lc
      - |
        echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
        echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        echo "node_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
        deepspeed \
          --no_ssh \
          --hostfile "$$HPC_COMPOSE_DIST_HOSTFILE" \
          --num_nodes "$$HPC_COMPOSE_DIST_NNODES" \
          --num_gpus "$$HPC_COMPOSE_DIST_NPROC_PER_NODE" \
          --node_rank "$$HPC_COMPOSE_DIST_NODE_RANK" \
          --master_addr "$$HPC_COMPOSE_DIST_MASTER_ADDR" \
          --master_port "$$HPC_COMPOSE_DIST_MASTER_PORT" \
          train.py
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      nodes: 2
      ntasks_per_node: 1
      gpus_per_node: 4

Multi Node Accelerate

Source: examples/multi-node-accelerate.yaml

name: multi-node-accelerate

x-slurm:
  job_name: multi-node-accelerate
  time: "04:00:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    command:
      - /bin/sh
      - -lc
      - |
        echo "master=$$HPC_COMPOSE_DIST_MASTER_ADDR"
        echo "nodes=$$HPC_COMPOSE_SERVICE_NODELIST"
        echo "machine_rank=$$HPC_COMPOSE_DIST_NODE_RANK"
        accelerate launch \
          --multi_gpu \
          --num_machines "$$HPC_COMPOSE_DIST_NNODES" \
          --num_processes "$$HPC_COMPOSE_DIST_WORLD_SIZE" \
          --machine_rank "$$HPC_COMPOSE_DIST_NODE_RANK" \
          --main_process_ip "$$HPC_COMPOSE_DIST_MASTER_ADDR" \
          --main_process_port "$$HPC_COMPOSE_DIST_MASTER_PORT" \
          train.py
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      nodes: 2
      ntasks_per_node: 1
      gpus_per_node: 4

Multi Node Horovod

Source: examples/multi-node-horovod.yaml

name: multi-node-horovod

x-slurm:
  job_name: multi-node-horovod
  time: "04:00:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: horovod/horovod:latest
    command:
      - /bin/sh
      - -lc
      - |
        echo "rank=$$SLURM_PROCID local_rank=$$SLURM_LOCALID world=$$SLURM_NTASKS"
        python train_horovod.py
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      gpus_per_node: 4
      mpi:
        type: pmix
        profile: openmpi
        expected_ranks: 8

Multi Node JAX

Source: examples/multi-node-jax.yaml

name: multi-node-jax

x-slurm:
  job_name: multi-node-jax
  time: "04:00:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: jaxai/jax:latest
    command:
      - /bin/sh
      - -lc
      - |
        echo "coordinator=$$HPC_COMPOSE_DIST_RDZV_ENDPOINT"
        echo "process_id=$$HPC_COMPOSE_DIST_NODE_RANK processes=$$HPC_COMPOSE_DIST_NNODES"
        python train_jax.py
    readiness:
      type: sleep
      seconds: 5
    x-slurm:
      nodes: 2
      ntasks_per_node: 1
      gpus_per_node: 4

NCCL Tests

Source: examples/nccl-tests.yaml

name: nccl-tests

x-slurm:
  job_name: nccl-tests
  time: "00:30:00"
  nodes: 2
  gpus_per_node: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  all-reduce:
    image: nvcr.io/nvidia/pytorch:24.08-py3
    command:
      - /bin/sh
      - -lc
      - |
        echo "rank=$$SLURM_PROCID local_rank=$$SLURM_LOCALID world=$$SLURM_NTASKS"
        if command -v all_reduce_perf >/dev/null 2>&1; then
          all_reduce_perf -b 8 -e 4G -f 2 -g 1
        elif [ -x /workspace/nccl-tests/build/all_reduce_perf ]; then
          /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
        else
          echo "all_reduce_perf not found; use an image with nccl-tests installed" >&2
          exit 127
        fi
    readiness:
      type: sleep
      seconds: 2
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      gpus_per_node: 4
      mpi:
        type: pmix
        profile: openmpi
        expected_ranks: 8

Ray Symmetric

Source: examples/ray-symmetric.yaml

name: ray-symmetric

x-slurm:
  job_name: ray-symmetric
  time: "02:00:00"
  nodes: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  ray:
    image: rayproject/ray:2.49.0-py310
    command:
      - /bin/sh
      - -lc
      - |
        ray symmetric-run \
          --address "$$HPC_COMPOSE_DIST_RDZV_ENDPOINT" \
          --min-nodes "$$HPC_COMPOSE_DIST_NNODES" \
          -- \
          python app.py
    readiness:
      type: sleep
      seconds: 10
    x-slurm:
      nodes: 2
      ntasks_per_node: 1

Rendezvous Client

Source: examples/rendezvous-client.yaml

name: rendezvous-client

x-slurm:
  job_name: model-client
  time: "00:10:00"
  mem: 2G
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  rendezvous: model-server

services:
  client:
    image: curlimages/curl:8.10.1
    command:
      - /bin/sh
      - -lc
      - |
        curl -fsS "$${HPC_COMPOSE_RDZV_MODEL_SERVER_URL}"

Rendezvous Model Server

Source: examples/rendezvous-model-server.yaml

name: rendezvous-model-server

x-slurm:
  job_name: model-server
  partition: gpu
  time: "02:00:00"
  mem: 32G
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  model:
    image: python:3.12-slim
    command:
      - /bin/sh
      - -lc
      - |
        python -m http.server 8000
    readiness:
      type: tcp
      port: 8000
      timeout_seconds: 60
    x-slurm:
      rendezvous:
        register:
          name: model-server
          port: 8000
          protocol: http
          path: /
          ttl_seconds: 3600
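
The client example reads `HPC_COMPOSE_RDZV_MODEL_SERVER_URL`, and this server registers the name `model-server`. Judging from that pair alone, the variable name appears to be the registered name upper-cased with dashes turned into underscores. The Python sketch below encodes that inferred rule; `rendezvous_url_var` is a hypothetical helper and the rule itself is an assumption.

```python
def rendezvous_url_var(name):
    """Build the URL variable name inferred from the two rendezvous examples."""
    return "HPC_COMPOSE_RDZV_" + name.upper().replace("-", "_") + "_URL"


print(rendezvous_url_var("model-server"))
```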

Ray Head Workers

Source: examples/ray-head-workers.yaml

name: ray-head-workers

x-slurm:
  job_name: ray-head-workers
  time: "02:00:00"
  nodes: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  head:
    image: rayproject/ray:2.49.0-py310
    command:
      - /bin/sh
      - -lc
      - |
        ray start --head --node-ip-address="$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port=6379 --block
    readiness:
      type: sleep
      seconds: 10
    x-slurm:
      nodes: 1

  worker:
    image: rayproject/ray:2.49.0-py310
    command:
      - /bin/sh
      - -lc
      - |
        ray start --address="$$HPC_COMPOSE_PRIMARY_NODE:6379" --block
    depends_on:
      head:
        condition: service_healthy
    x-slurm:
      nodes: 1
      placement:
        node_range: "1"

Dask Scheduler Workers

Source: examples/dask-scheduler-workers.yaml

name: dask-scheduler-workers

x-slurm:
  job_name: dask-scheduler-workers
  time: "02:00:00"
  nodes: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  scheduler:
    image: ghcr.io/dask/dask:latest
    command:
      - /bin/sh
      - -lc
      - |
        dask scheduler --host "$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port 8786
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 8786
      timeout_seconds: 60
    x-slurm:
      nodes: 1

  workers:
    image: ghcr.io/dask/dask:latest
    command:
      - /bin/sh
      - -lc
      - |
        dask worker "tcp://$$HPC_COMPOSE_PRIMARY_NODE:8786"
    depends_on:
      scheduler:
        condition: service_healthy
    x-slurm:
      nodes: 2
      ntasks_per_node: 1

Spark Standalone

Source: examples/spark-standalone.yaml

name: spark-standalone

x-slurm:
  job_name: spark-standalone
  time: "02:00:00"
  nodes: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  master:
    image: apache/spark:3.5.3
    command:
      - /bin/sh
      - -lc
      - |
        /opt/spark/sbin/start-master.sh --host "$$HPC_COMPOSE_SERVICE_PRIMARY_NODE" --port 7077
        tail -f /opt/spark/logs/*
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 7077
      timeout_seconds: 60
    x-slurm:
      nodes: 1

  workers:
    image: apache/spark:3.5.3
    command:
      - /bin/sh
      - -lc
      - |
        /opt/spark/sbin/start-worker.sh "spark://$$HPC_COMPOSE_PRIMARY_NODE:7077"
        tail -f /opt/spark/logs/*
    depends_on:
      master:
        condition: service_healthy
    x-slurm:
      nodes: 2
      ntasks_per_node: 1

  app:
    image: apache/spark:3.5.3
    command:
      - /bin/sh
      - -lc
      - |
        spark-submit --master "spark://$$HPC_COMPOSE_PRIMARY_NODE:7077" app.py
    depends_on:
      master:
        condition: service_healthy
    x-slurm:
      nodes: 1

Flux Nested

Source: examples/flux-nested.yaml

name: flux-nested

runtime:
  backend: host

x-slurm:
  job_name: flux-nested
  time: "01:00:00"
  nodes: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  flux:
    command:
      - /bin/sh
      - -lc
      - |
        flux start bash -lc 'flux run --label-io -N "$$HPC_COMPOSE_DIST_NNODES" hostname'
    x-slurm:
      nodes: 2
      ntasks_per_node: 1

Nextflow Bridge

Source: examples/nextflow-bridge.yaml

name: nextflow-bridge

runtime:
  backend: host

x-slurm:
  job_name: nextflow-bridge
  time: "02:00:00"
  nodes: 1
  cpus_per_task: 8
  mem: 16G
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/nextflow-work/**
      - /hpc-compose/job/reports/**
      - /hpc-compose/job/logs/**

services:
  nextflow:
    command:
      - /bin/sh
      - -lc
      - |
        # Write under HPC_COMPOSE_JOB_DIR (the portable per-job scratch path) so
        # this spec runs on both the container and host backends; the artifacts
        # under x-slurm are declared with the equivalent /hpc-compose/job/**
        # convention.
        mkdir -p "$$HPC_COMPOSE_JOB_DIR/nextflow-work" "$$HPC_COMPOSE_JOB_DIR/reports"
        nextflow run "$${NEXTFLOW_PIPELINE:-main.nf}" \
          -work-dir "$$HPC_COMPOSE_JOB_DIR/nextflow-work" \
          -with-report "$$HPC_COMPOSE_JOB_DIR/reports/report.html" \
          -with-trace "$$HPC_COMPOSE_JOB_DIR/reports/trace.txt" \
          $${NEXTFLOW_ARGS:-}
    environment:
      NEXTFLOW_PIPELINE: main.nf
      NEXTFLOW_ARGS: ""
    x-slurm:
      ntasks: 1

Snakemake Bridge

Source: examples/snakemake-bridge.yaml

name: snakemake-bridge

runtime:
  backend: host

x-slurm:
  job_name: snakemake-bridge
  time: "02:00:00"
  nodes: 1
  cpus_per_task: 8
  mem: 16G
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/snakemake-work/**
      - /hpc-compose/job/reports/**
      - /hpc-compose/job/logs/**

services:
  snakemake:
    command:
      - /bin/sh
      - -lc
      - |
        # Write under HPC_COMPOSE_JOB_DIR (the portable per-job scratch path) so
        # this spec runs on both the container and host backends; the artifacts
        # under x-slurm are declared with the equivalent /hpc-compose/job/**
        # convention.
        mkdir -p "$$HPC_COMPOSE_JOB_DIR/snakemake-work" "$$HPC_COMPOSE_JOB_DIR/reports"
        snakemake \
          --snakefile "$${SNAKEMAKE_FILE:-Snakefile}" \
          --cores "$${SNAKEMAKE_CORES:-$${SLURM_CPUS_PER_TASK:-1}}" \
          --directory "$${SNAKEMAKE_WORKDIR:-$$HPC_COMPOSE_JOB_DIR/snakemake-work}" \
          --printshellcmds \
          $${SNAKEMAKE_ARGS:-}
    environment:
      SNAKEMAKE_FILE: Snakefile
      SNAKEMAKE_ARGS: ""
    x-slurm:
      ntasks: 1

Multi Stage Pipeline

Source: examples/multi-stage-pipeline.yaml

name: multi-stage-pipeline

x-slurm:
  job_name: multi-stage-pipeline
  time: "00:30:00"
  mem: 8G
  cpus_per_task: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  producer:
    image: python:3.11-slim
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import csv, random, os

        output = '/hpc-compose/job/output.csv'
        with open(output, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['id', 'value', 'category'])
            for i in range(1000):
                writer.writerow([i, round(random.gauss(50, 15), 2), random.choice(['A', 'B', 'C'])])

        print(f'Wrote 1000 rows to {output}')
        print('producer complete')
        "
    readiness:
      type: log
      pattern: "producer complete"
      timeout_seconds: 60
    x-slurm:
      cpus_per_task: 1

  consumer:
    image: python:3.11-slim
    depends_on:
      producer:
        condition: service_healthy
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import csv, collections

        with open('/hpc-compose/job/output.csv') as f:
            reader = csv.DictReader(f)
            rows = list(reader)

        by_cat = collections.defaultdict(list)
        for row in rows:
            by_cat[row['category']].append(float(row['value']))

        print(f'Read {len(rows)} rows')
        for cat in sorted(by_cat):
            vals = by_cat[cat]
            print(f'  {cat}: count={len(vals)}, mean={sum(vals)/len(vals):.2f}')

        print('consumer complete')
        "
    x-slurm:
      cpus_per_task: 1

Pipeline DAG

Source: examples/pipeline-dag.yaml

name: pipeline-dag

x-slurm:
  job_name: pipeline-dag
  time: "00:20:00"
  mem: 4G
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  preprocess:
    image: alpine:3.20
    command:
      - /bin/sh
      - -lc
      - |
        mkdir -p /hpc-compose/job/pipeline
        printf 'records=3\n' > /hpc-compose/job/pipeline/prepared.txt

  train:
    image: alpine:3.20
    depends_on:
      preprocess:
        condition: service_completed_successfully
    command:
      - /bin/sh
      - -lc
      - |
        cat /hpc-compose/job/pipeline/prepared.txt
        printf 'accuracy=0.91\n' > /hpc-compose/job/pipeline/model.txt

  postprocess:
    image: alpine:3.20
    depends_on:
      train:
        condition: service_completed_successfully
    command:
      - /bin/sh
      - -lc
      - |
        cat /hpc-compose/job/pipeline/model.txt
        printf 'done\n' > /hpc-compose/job/pipeline/report.txt
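
The three `depends_on` entries above form a simple chain. As an illustration of how a service order can be derived from such edges, the sketch below runs a topological sort with Python's standard `graphlib`. It shows the general technique only and makes no claim about how hpc-compose resolves order internally.

```python
from graphlib import TopologicalSorter

# depends_on edges copied from pipeline-dag.yaml: service -> services it waits for.
depends_on = {
    "preprocess": set(),
    "train": {"preprocess"},
    "postprocess": {"train"},
}

order = list(TopologicalSorter(depends_on).static_order())
print(" -> ".join(order))
```

Because each stage waits on `service_completed_successfully`, this order is also the order in which the stages run to completion.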

Postgres ETL

Source: examples/postgres-etl.yaml

name: postgres-etl

x-slurm:
  job_name: postgres-etl
  time: "01:00:00"
  mem: 16G
  cpus_per_task: 4
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: etl
      POSTGRES_PASSWORD: etl
      POSTGRES_DB: pipeline
    readiness:
      type: tcp
      host: 127.0.0.1
      port: 5432
      timeout_seconds: 30
    x-slurm:
      cpus_per_task: 2

  etl:
    image: python:3.11-slim
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://etl:etl@127.0.0.1:5432/pipeline
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import psycopg2, os

        conn = psycopg2.connect(os.environ['DATABASE_URL'])
        cur = conn.cursor()
        cur.execute('CREATE TABLE IF NOT EXISTS results (id SERIAL, value FLOAT)')
        for i in range(100):
            cur.execute('INSERT INTO results (value) VALUES (%s)', (i * 1.5,))
        conn.commit()
        cur.execute('SELECT count(*), avg(value) FROM results')
        count, avg = cur.fetchone()
        print(f'Inserted {count} rows, average value: {avg:.2f}')
        conn.close()
        "
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir psycopg2-binary
    x-slurm:
      cpus_per_task: 2

Restart Policy

Source: examples/restart-policy.yaml

name: restart-policy

x-slurm:
  job_name: restart-policy
  time: "00:10:00"
  mem: 4G
  cpus_per_task: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  flaky-worker:
    image: python:3.11-slim
    command:
      - /bin/sh
      - -lc
      - |
        python - <<'PY'
        import pathlib
        import sys
        import time

        state_dir = pathlib.Path("/hpc-compose/job/restart-policy")
        counter_path = state_dir / "attempts.txt"

        state_dir.mkdir(parents=True, exist_ok=True)
        attempts = int(counter_path.read_text()) if counter_path.exists() else 0
        attempts += 1
        counter_path.write_text(f"{attempts}\n")

        print(f"attempt {attempts}")
        if attempts <= 2:
            print("simulating transient failure")
            sys.exit(42)

        print("work completed after transient failures")
        time.sleep(1)
        PY
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 5
        backoff_seconds: 2
        window_seconds: 60
        max_restarts_in_window: 3
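
The worker above exits with code 42 on its first two attempts and succeeds on the third, so it needs two restarts, which fits inside `max_restarts: 5`. The Python sketch below replays that sequence with a simplified supervisor loop. It ignores `backoff_seconds` and the window limits, and `supervise` is a hypothetical stand-in rather than the real restart logic.

```python
def run_flaky_worker(attempts_so_far):
    """Mimic the worker: the first two attempts exit 42, later ones exit 0."""
    attempt = attempts_so_far + 1
    return attempt, (42 if attempt <= 2 else 0)


def supervise(max_restarts):
    """Restart on non-zero exit until success or the restart budget is spent."""
    attempts, restarts = 0, 0
    while True:
        attempts, code = run_flaky_worker(attempts)
        if code == 0 or restarts >= max_restarts:
            return attempts, restarts, code
        restarts += 1


attempts, restarts, code = supervise(max_restarts=5)
print(f"attempts={attempts} restarts={restarts} exit={code}")
```

With a budget of only one restart, the same loop would give up after the second attempt and report exit code 42.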

Training Checkpoints

Source: examples/training-checkpoints.yaml

name: training-checkpoints

x-slurm:
  job_name: training-checkpoints
  time: "04:00:00"
  mem: 64G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    volumes:
      - /shared/$USER/checkpoints:/checkpoints
    environment:
      CHECKPOINT_DIR: /checkpoints
      NUM_EPOCHS: "10"
    command:
      - /bin/sh
      - -lc
      - |
        python -c "
        import os, torch

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f'Training on {device}')

        ckpt_dir = os.environ['CHECKPOINT_DIR']
        os.makedirs(ckpt_dir, exist_ok=True)

        model = torch.nn.Linear(128, 10).to(device)
        optimizer = torch.optim.Adam(model.parameters())
        data = torch.randn(256, 128, device=device)

        for epoch in range(int(os.environ['NUM_EPOCHS'])):
            out = model(data)
            loss = out.sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            path = os.path.join(ckpt_dir, f'checkpoint_epoch_{epoch}.pt')
            torch.save({'epoch': epoch, 'model': model.state_dict()}, path)
            print(f'Epoch {epoch}: loss={loss.item():.4f}, saved {path}')

        print('Training complete')
        "
    x-slurm:
      gpus: 1
      cpus_per_task: 4

Training Resume

Source: examples/training-resume.yaml

name: training-resume

x-slurm:
  job_name: training-resume
  time: "04:00:00"
  mem: 64G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  resume:
    path: /shared/$USER/runs/training-resume
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/checkpoints/**

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    environment:
      NUM_EPOCHS: "10"
    command:
      - /bin/sh
      - -lc
      - |
        python - <<'PY'
        import json
        import os
        import pathlib
        import time

        resume_dir = pathlib.Path(os.environ["HPC_COMPOSE_RESUME_DIR"])
        attempt = os.environ["HPC_COMPOSE_ATTEMPT"]
        is_resume = os.environ["HPC_COMPOSE_IS_RESUME"] == "1"
        checkpoint_dir = pathlib.Path("/hpc-compose/job/checkpoints")
        latest_state_path = resume_dir / "latest.json"

        resume_dir.mkdir(parents=True, exist_ok=True)
        checkpoint_dir.mkdir(parents=True, exist_ok=True)

        start_epoch = 0
        if latest_state_path.exists():
            state = json.loads(latest_state_path.read_text())
            start_epoch = state["next_epoch"]
            print(f"Resuming run at epoch {start_epoch} (attempt {attempt})")
        else:
            print(f"Starting fresh run (attempt {attempt})")

        for epoch in range(start_epoch, int(os.environ["NUM_EPOCHS"])):
            state = {
                "completed_epoch": epoch,
                "next_epoch": epoch + 1,
                "attempt": int(attempt),
                "is_resume": is_resume,
            }
            latest_state_path.write_text(json.dumps(state, indent=2) + "\n")
            artifact_path = checkpoint_dir / f"checkpoint_epoch_{epoch}.json"
            artifact_path.write_text(json.dumps(state, indent=2) + "\n")
            print(f"Epoch {epoch}: wrote {artifact_path}")
            time.sleep(1)
        PY

Training Sweep

Source: examples/training-sweep.yaml

name: training-sweep

x-slurm:
  job_name: training-sweep
  time: "00:20:00"
  mem: 8G
  cpus_per_task: 2
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full

services:
  trainer:
    image: python:3.11-slim
    environment:
      LR: "${lr:-0.001}"
      BATCH_SIZE: "${batch_size:-32}"
      SWEEP_ID: "${HPC_COMPOSE_SWEEP_ID:-manual}"
      TRIAL_ID: "${HPC_COMPOSE_SWEEP_TRIAL:-manual}"
    command:
      - python
      - -c
      - |
        import os
        import random

        lr = float(os.environ["LR"])
        batch_size = int(os.environ["BATCH_SIZE"])
        random.seed(f"{lr}:{batch_size}")
        score = 0.8 + random.random() * 0.05

        print(f"sweep={os.environ['SWEEP_ID']} trial={os.environ['TRIAL_ID']}")
        print(f"lr={lr} batch_size={batch_size} score={score:.4f}")
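
The `sweep` block above lists three `lr` values and two `batch_size` values. Assuming `matrix: full` means the full Cartesian product (an interpretation, not confirmed on this page), that gives six trials. The Python sketch below enumerates them; `expand_full_matrix` is a hypothetical helper and the trial ordering is illustrative only.

```python
from itertools import product


def expand_full_matrix(parameters):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = list(parameters)
    return [
        dict(zip(names, combo))
        for combo in product(*(parameters[name] for name in names))
    ]


# Values copied from the sweep block in training-sweep.yaml.
trials = expand_full_matrix({"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]})
for index, trial in enumerate(trials):
    print(index, trial)
```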

Training TensorBoard

Source: examples/training-tensorboard.yaml

# GPU training with a live TensorBoard sidecar.
#
# The trainer writes TensorBoard event files to the in-job shared directory
# /hpc-compose/job/logs; the tensorboard sidecar serves them on port 6006 and
# is gated by an HTTP readiness probe. Reach it from your laptop with an SSH
# tunnel, e.g. `ssh -L 6006:<compute-node>:6006 <login-host>`, then open
# http://127.0.0.1:6006. The event files are exported as tracked artifacts.
name: training-tensorboard

x-slurm:
  job_name: training-tensorboard
  time: "01:00:00"
  mem: 32G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/logs/**

services:
  trainer:
    image: pytorch/pytorch:2.12.1-cuda13.2-cudnn9-runtime
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        mkdir -p /hpc-compose/job/logs
        python - <<'PY'
        import time
        from torch.utils.tensorboard import SummaryWriter

        writer = SummaryWriter("/hpc-compose/job/logs")
        for step in range(100):
            writer.add_scalar("loss", 1.0 / (step + 1), step)
            writer.flush()
            time.sleep(1)
        writer.close()
        PY
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir tensorboard
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  tensorboard:
    image: python:3.11-slim
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        mkdir -p /hpc-compose/job/logs
        tensorboard --logdir /hpc-compose/job/logs --host 0.0.0.0 --port 6006 &
        tb_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$tb_pid" 2>/dev/null; then
            wait "$$tb_pid"
            exit $$?
          fi
          sleep 5
        done
        kill "$$tb_pid" 2>/dev/null || true
        wait "$$tb_pid" || true
    readiness:
      type: http
      url: http://127.0.0.1:6006
      timeout_seconds: 300
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir tensorboard
    x-slurm:
      cpus_per_task: 2

vLLM OpenAI

Source: examples/vllm-openai.yaml

name: vllm-openai

x-slurm:
  job_name: vllm-openai
  time: "01:00:00"
  mem: 64G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL_NAME: facebook/opt-125m
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        python -m vllm.entrypoints.openai.api_server \
          --model "$$MODEL_NAME" \
          --host 0.0.0.0 \
          --port 8000 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "Uvicorn running on"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  client:
    image: python:3.11-slim
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      OPENAI_BASE_URL: http://127.0.0.1:8000/v1
      MODEL_NAME: facebook/opt-125m
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        python -c "
        import openai, os

        client = openai.OpenAI(
            base_url=os.environ['OPENAI_BASE_URL'],
            api_key='unused',
        )
        response = client.chat.completions.create(
            model=os.environ['MODEL_NAME'],
            messages=[
                {'role': 'system', 'content': 'You are a concise assistant.'},
                {'role': 'user', 'content': 'What is HPC in one sentence?'},
            ],
            max_tokens=64,
            temperature=0.2,
        )
        print(response.choices[0].message.content)
        "
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir openai
    x-slurm:
      cpus_per_task: 2

vLLM UV Worker

Source: examples/vllm-uv-worker.yaml

name: vllm-uv-worker

x-slurm:
  job_name: vllm-uv-worker
  time: "01:00:00"
  mem: 64G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL_NAME: facebook/opt-125m
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        python -m vllm.entrypoints.openai.api_server \
          --model "$$MODEL_NAME" \
          --host 0.0.0.0 \
          --port 8000 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 1
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: log
      pattern: "Uvicorn running on"
      timeout_seconds: 300
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  worker:
    image: python:3.11-slim
    working_dir: /workspace
    volumes:
      - ./vllm-uv-worker:/workspace
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      OPENAI_BASE_URL: http://127.0.0.1:8000/v1
      MODEL_NAME: facebook/opt-125m
      REQUEST_DONE_PATH: /hpc-compose/job/request.done
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        UV_CACHE_DIR=/hpc-compose/job/.uv-cache uv run worker.py
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir uv
    x-slurm:
      cpus_per_task: 2

Eval Harness

Source: examples/eval-harness.yaml

name: eval-harness

x-slurm:
  job_name: eval-harness
  time: "01:00:00"
  mem: 64G
  cpus_per_task: 8
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}
  artifacts:
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/results/**

# Sweep stub: benchmark a served model across models/tasks. The base spec stays
# runnable without sweep variables because the service env carries
# ${...:-default} fallbacks; `hpc-compose sweep submit` overrides model/tasks.
sweep:
  parameters:
    model: [facebook/opt-125m]
    tasks: [hellaswag]
  matrix: full

services:
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      MODEL_NAME: "${model:-facebook/opt-125m}"
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        rm -f /hpc-compose/job/request.done
        python -m vllm.entrypoints.openai.api_server \
          --model "$$MODEL_NAME" \
          --host 0.0.0.0 \
          --port 8000 &
        server_pid=$$!
        while [ ! -f /hpc-compose/job/request.done ]; do
          if ! kill -0 "$$server_pid" 2>/dev/null; then
            wait "$$server_pid"
            exit $$?
          fi
          sleep 5
        done
        kill "$$server_pid" 2>/dev/null || true
        wait "$$server_pid" || true
    readiness:
      type: http
      url: http://127.0.0.1:8000/health
      timeout_seconds: 600
    x-slurm:
      gpus: 1
      cpus_per_task: 4

  client:
    image: python:3.11-slim
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      OPENAI_BASE_URL: http://127.0.0.1:8000/v1
      MODEL_NAME: "${model:-facebook/opt-125m}"
      TASKS: "${tasks:-hellaswag}"
    command:
      - /bin/sh
      - -lc
      - |
        set -eu
        mkdir -p /hpc-compose/job/results
        lm_eval \
          --model local-completions \
          --model_args "base_url=$$OPENAI_BASE_URL/completions,model=$$MODEL_NAME,num_concurrent=4" \
          --tasks "$$TASKS" \
          --output_path /hpc-compose/job/results
        touch /hpc-compose/job/request.done
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir lm-eval
    x-slurm:
      cpus_per_task: 2
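
The sweep stub and the `${...:-default}` fallbacks in the spec above can be pictured with a small sketch. It assumes that `matrix: full` means the Cartesian product of all parameter values and that `:-` falls back when a variable is unset or empty, as in shell and Compose interpolation. The helper names and the second model and task values are hypothetical, chosen only for illustration; this is not hpc-compose's implementation.

```python
from itertools import product


def expand_full_matrix(parameters):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def resolve(variables, name, default):
    """Mimic ${name:-default}: fall back when the variable is unset or empty."""
    value = variables.get(name, "")
    return value if value else default


# Hypothetical sweep with two models and two tasks (illustrative values only).
runs = expand_full_matrix({
    "model": ["facebook/opt-125m", "facebook/opt-350m"],
    "tasks": ["hellaswag", "arc_easy"],
})
for run in runs:
    print(resolve(run, "model", "facebook/opt-125m"),
          resolve(run, "tasks", "hellaswag"))

# Without sweep variables, the base spec still resolves to its defaults.
print(resolve({}, "model", "facebook/opt-125m"))
```

The point of the fallback is that the same spec can be planned and submitted on its own, while a sweep only overrides the variables it names.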

CUDA Probe

Source: examples/cuda-probe.yaml

name: cuda-probe

# Fast compute-node CUDA / GPU probe.
#
# No repo install, no uv, no model files: a tiny NVIDIA CUDA base image is
# imported and run as a one-shot Slurm job. (nvidia-smi is provided by the host
# driver and injected into the container at runtime by the enroot/pyxis NVIDIA
# hook — the CUDA base image itself does not bundle it.) It isolates
# "can the cluster give me a GPU?" from any framework/Python environment, so if
# this passes but a later PyTorch/JAX/TensorFlow job fails, the problem is in the
# framework image, not in Slurm/Pyxis/GPU allocation/driver visibility.
#
# Notes:
# - The base image's CUDA version is independent of the driver's reported CUDA;
#   nvidia-smi reports the *driver's* max CUDA, not the image toolkit version.
# - GPUs are only injected at Slurm/Pyxis runtime. Prepare-time image import runs
#   CPU-only on the login node (hpc-compose disables the enroot NVIDIA hook during
#   prepare), so importing this CUDA image does not need a driver.

runtime:
  backend: pyxis

x-slurm:
  job_name: cuda-probe
  time: "00:10:00"
  cpus_per_task: 2
  mem: 8G
  # Request one GPU. Some sites (e.g. HAICORE) prefer an explicit gres such as
  # `gres: gpu:1`; set partition/account via settings, --profile, or flags.
  gpus: 1
  cache_dir: ${CACHE_DIR:-/cluster/shared/hpc-compose-cache}

services:
  probe:
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    script: |
      set -eu
      echo "hostname=$(hostname)"
      echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
      echo "SLURM_JOB_ID=${SLURM_JOB_ID:-}"
      echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-}"
      nvidia-smi
      nvidia-smi -L
      ls -l /dev/nvidia* 2>/dev/null || true

Roadmap and Non-Goals

This roadmap is intentionally short. hpc-compose is not trying to become a general-purpose orchestrator.

Authoring Ergonomics

make the supported Compose subset easier to discover from examples and docs
keep validate, inspect, config, and render as the fast path for authoring confidence
keep refining starter templates and example selection (now surfaced through examples recommend, search, and coverage) before adding more surface area

Runtime Visibility

make tracked jobs easier to reconnect to and reason about
keep improving status, ps, watch, stats, and artifact export for real cluster debugging
prefer inspectable generated state over hidden orchestration behavior

Cluster Compatibility

expand confidence on more Linux cluster environments before broadening scope
keep support policy explicit through the support matrix
improve docs and examples around shared storage, Pyxis, and Enroot expectations

If your workflow falls outside this roadmap, that is useful feedback. Open an adoption feedback issue with your cluster type, workload type, and main friction point.


Onboard a Cluster Site

Cluster profiles let validate and preflight compare a spec against site-specific Slurm, runtime, MPI, storage, and policy hints.

For HAICORE-specific resource, workspace, and container notes, see HAICORE Guide.

Generate a best-effort profile on the target login node:

hpc-compose doctor cluster-report

This writes .hpc-compose/cluster.toml by default. Use --out - to print TOML instead.

For a live advisory snapshot of current conditions, use:

hpc-compose weather

weather reads stable labels and hints from the discovered cluster profile when present, but live node, queue, fairshare, and priority data come from one-shot Slurm probes and are not persisted in .hpc-compose/cluster.toml.

What Gets Discovered

The profile generator uses available local tools and environment hints:

sinfo, scontrol, and srun --mpi=list
selected runtime binaries
shared-path environment hints
loaded MPI stack hints from PATH, MPI_HOME, MPI_DIR, I_MPI_ROOT, EBROOTOPENMPI, and EBROOTMPICH
editable distributed defaults such as rendezvous port and [distributed.env]

It does not run module avail. Module-only MPI installations can be added manually to the generated mpi_installations list.
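
As a rough illustration of what environment-based MPI hint collection can look like, the sketch below checks the variables listed above and looks for a launcher on `PATH`. It is an assumption-laden sketch, not the profile generator's code: the function name, the `mpirun` lookup, and the returned dictionary shape are hypothetical.

```python
import os
import shutil

# Environment variables named in the list above.
MPI_ENV_HINTS = ["MPI_HOME", "MPI_DIR", "I_MPI_ROOT", "EBROOTOPENMPI", "EBROOTMPICH"]


def collect_mpi_hints(env=None):
    """Return MPI-related hints found in the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    hints = {name: env[name] for name in MPI_ENV_HINTS if env.get(name)}
    # Assumption: an mpirun binary on PATH counts as a loaded-stack hint.
    launcher = shutil.which("mpirun", path=env.get("PATH"))
    if launcher:
        hints["PATH:mpirun"] = launcher
    return hints


print(collect_mpi_hints())
```

A module-only installation leaves none of these traces until the module is loaded, which is why such installations need a manual entry in the generated profile.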

Site Policy Packs

Support teams can edit optional sections such as:

[site]
[[software.modules]]
[[filesystems]]
[gpu]
[network]
[containers]
[slurm.defaults]
[slurm.required]

Policy sections warn and suggest snippets. They do not silently add modules, bind mounts, environment variables, or SBATCH directives to user specs.

hpc-compose stages your repo but does not allocate cluster workspaces or create site storage directories (see Repo staging vs cluster workspace provisioning). A future site pack could carry the site’s workspace/provisioning command so onboarding docs can point at it.

MPI Smoke Probe

For MPI services, render a small rank-count probe against the service’s real runtime path:

hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --script-out mpi-smoke.sbatch

Submit it only when you intentionally want to consume a Slurm allocation:

hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --submit

The smoke plan keeps allocation and MPI launch settings but strips application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.

Fabric Smoke Probe

For distributed GPU or fabric-sensitive services, render a broader smoke probe:

hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto --script-out fabric-smoke.sbatch

--checks auto always includes the MPI rank probe, adds NCCL when the selected service requests GPU resources, and collects UCX, OFI, and InfiniBand diagnostics when the corresponding tools are available. Pass an explicit list such as --checks mpi,nccl when a missing tool should fail the probe instead of being reported as skipped; the accepted tokens are mpi, nccl, ucx, and ofi (InfiniBand link health is collected as part of auto diagnostics, not as a separate token).
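
The selection rule for `--checks auto` can be sketched as follows. This is a minimal illustration of the behavior described above, not the actual implementation; the function name and the mapping from diagnostics to the tools `ucx_info`, `fi_info`, and `ibstat` are assumptions.

```python
import shutil

# Assumed tool for each optional diagnostic (hypothetical mapping).
DIAGNOSTIC_TOOLS = {"ucx": "ucx_info", "ofi": "fi_info", "infiniband": "ibstat"}


def tool_on_path(tool):
    """Report whether a tool is available on PATH."""
    return shutil.which(tool) is not None


def select_auto_checks(requests_gpu, tool_available=tool_on_path):
    """Return (checks to run, diagnostics skipped) for an auto probe."""
    checks = ["mpi"]                 # the MPI rank probe is always included
    skipped = []
    if requests_gpu:
        checks.append("nccl")        # added only when the service requests GPUs
    for name, tool in DIAGNOSTIC_TOOLS.items():
        if tool_available(tool):
            checks.append(name)
        else:
            skipped.append(name)     # reported as skipped, not as a failure
    return checks, skipped


print(select_auto_checks(requests_gpu=True))
```

An explicit list such as `--checks mpi,nccl` differs from this sketch in one respect: a missing tool fails the probe instead of landing in the skipped list.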


HAICORE@KIT Guide

This page collects hpc-compose configuration notes for HAICORE@KIT. It is a practical starting point, not a replacement for the official NHR@KIT HAICORE documentation.

Before long or expensive runs, re-check current HAICORE policy pages for partitions, quotas, GPU limits, container requirements, and filesystem lifetime rules.

Where Commands Run

HAICORE is accessed through the login host documented by NHR@KIT:

ssh <username>@haicore.scc.kit.edu

Use the login node for editing, Git operations, hpc-compose plan, hpc-compose preflight, image preparation, and Slurm job management. Run compute work through Slurm with hpc-compose up, sbatch, or site-approved interactive Slurm commands.

Do not treat the login node as a place for long Python training, GPU work, data conversion, or large preprocessing jobs. Those belong inside a Slurm allocation.

HAICORE Slurm Settings To Know

The current HAICORE batch-system documentation describes Slurm partitions named normal and advanced. The normal partition is the general starting point; advanced requires special permission and allows larger jobs.

Common settings you will map into hpc-compose:

HAICORE / Slurm setting	`hpc-compose` field	Notes
Partition	`x-slurm.partition`	Usually start with the site-documented general partition.
Account/project	`x-slurm.account`	Use the account string assigned by the site or project.
Wall time	`x-slurm.time`	Keep smoke tests short; request only what the run needs.
Nodes	`x-slurm.nodes`	`normal` is documented for single-node jobs; confirm before multi-node runs.
Tasks	`x-slurm.ntasks`, service `x-slurm.ntasks`	Process/rank count.
CPUs per task	`x-slurm.cpus_per_task`, service `x-slurm.cpus_per_task`	CPU threads per process/rank.
Memory	`x-slurm.mem`	Scheduler/runtime memory request, not storage.
Full GPUs	`x-slurm.gres` or service `x-slurm.gres`	Request a full A100 with `gpu:N` (e.g. `gpu:1`); the `normal` partition has no `gpu:full` GRES type.
MIG GPUs	`x-slurm.gres` or service `x-slurm.gres`	HAICORE documents MIG profiles such as `gpu:1g.5gb:1`; confirm current names.
Constraints	`x-slurm.constraint` or `x-slurm.submit_args`	HAICORE documents constraints such as `LSDF` and `BEEOND`.

Example single-node GPU starting point:

name: haicore-smoke

x-slurm:
  job_name: haicore-smoke
  partition: normal
  account: <account>
  time: "00:10:00"
  nodes: 1
  cpus_per_task: 4
  mem: 16G
  gres: gpu:1
  cache_dir: <workspace-path>/hpc-compose-cache

services:
  app:
    image: python:3.11-slim
    command: python -c "import os, socket; print(socket.gethostname()); print(os.environ.get('SLURM_JOB_ID'))"

Preview before submitting:

hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
hpc-compose preflight -f compose.yaml
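
To build intuition for what to look for in the rendered script, the sketch below maps the `x-slurm` fields from the settings table to standard `sbatch` long options. It is illustrative only: the mapping table and function name are hypothetical, and the real renderer may order, format, or extend the directives differently.

```python
# Hypothetical mapping from x-slurm field names to sbatch long options.
SBATCH_OPTIONS = {
    "job_name": "--job-name",
    "partition": "--partition",
    "account": "--account",
    "time": "--time",
    "nodes": "--nodes",
    "ntasks": "--ntasks",
    "cpus_per_task": "--cpus-per-task",
    "mem": "--mem",
    "gres": "--gres",
    "constraint": "--constraint",
}


def sbatch_lines(x_slurm):
    """Turn known x-slurm fields into '#SBATCH --option=value' lines."""
    return [
        f"#SBATCH {SBATCH_OPTIONS[key]}={value}"
        for key, value in x_slurm.items()
        if key in SBATCH_OPTIONS  # fields such as cache_dir are not sbatch options
    ]


spec = {
    "job_name": "haicore-smoke",
    "partition": "normal",
    "time": "00:10:00",
    "nodes": 1,
    "cpus_per_task": 4,
    "mem": "16G",
    "gres": "gpu:1",
    "cache_dir": "<workspace-path>/hpc-compose-cache",
}
print("\n".join(sbatch_lines(spec)))
```

Comparing a list like this against the output of `hpc-compose plan --show-script` is a quick way to confirm that the partition, wall time, and GPU request you intended are the ones that will be submitted.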

Workspaces And Storage

HAICORE documents several storage types. For hpc-compose, the most important distinction is between shared storage that persists across jobs and job-local temporary storage.

Storage	Use with `hpc-compose`	Avoid using it for
`$HOME`	Small configuration, source code, shell setup, credentials handled under site policy.	Large image caches, datasets, checkpoints, or logs from many jobs.
Workspace	`x-slurm.cache_dir`, Enroot data/cache, datasets, model files, run logs, artifacts, checkpoints.	Data that must be backed up elsewhere; workspaces are documented as not backed up and time-limited.
`$TMPDIR`	Fast node-local temporary files created and consumed within one job.	`x-slurm.cache_dir` or anything needed by login-node prepare and later compute-node runtime.
BeeOND	Job-local shared scratch across nodes when explicitly requested.	Long-term cache, persistent checkpoints, or files needed after the job unless copied out.

Create and locate a workspace with HAICORE’s workspace tools:

ws_allocate <workspace-name> <duration>
ws_find <workspace-name>
ws_list
ws_extend <workspace-name> <duration>

Use the path from ws_find for the cache:

export CACHE_DIR=<workspace-path>/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

Then set it in your spec:

x-slurm:
  cache_dir: ${CACHE_DIR}

The official HAICORE filesystem page documents workspace lifetime, extension limits, quotas, and backup policy. Treat workspace expiration as an operational risk: in long-running projects, check ws_list regularly and copy durable results to the correct long-term location.

hpc-compose (including hpc-compose up --remote from a laptop) stages your repo and reads these paths, but it does not run ws_allocate or create the cache and storage directories for you. Allocate the workspace and mkdir -p your cache_dir, dataset, and checkpoint paths first; a missing host bind-mount or storage directory blocks preflight. See Repo staging vs cluster workspace provisioning.

Containers On HAICORE

The official HAICORE container documentation says native Docker and rootless Docker are not supported on the HPC systems. The relevant paths are site-supported HPC runtimes, including Pyxis/Enroot and Apptainer.

For the default hpc-compose backend:

runtime:
  backend: pyxis

Validate Pyxis support on the login node:

srun --help | grep container-image
hpc-compose preflight -f compose.yaml

HAICORE documents Pyxis as the Slurm integration for Enroot and lists container options such as --container-image, --container-name, --container-mounts, --container-mount-home, --container-writable, and --container-remap-root.

The HAICORE docs also list site-required Pyxis mounts for Slurm integration. Because mount paths are site policy and can change, inspect the current HAICORE container page before copying them into a spec. When needed, pass site-specific Pyxis flags through service-level extra_srun_args:

services:
  app:
    image: python:3.11-slim
    command: python -c "print('hello from HAICORE')"
    x-slurm:
      extra_srun_args:
        - "--container-mounts=<site-required-mounts>"

If the cluster recommends Apptainer for your workflow or Pyxis is not available in srun, choose the corresponding backend:

runtime:
  backend: apptainer

See Runtime Backends for the backend behavior and required tools.

Enroot Cache Placement

HAICORE documents Enroot as available by default, with default data paths under the user’s home directory. For repeated container jobs, large images, or quota-sensitive projects, place runtime cache/data under a workspace-backed x-slurm.cache_dir.

hpc-compose sets per-job Enroot runtime paths below the configured cache directory. That keeps image runtime state close to the job and avoids filling $HOME accidentally.

The first time an image is imported (on a fresh cache, or after eviction) enroot downloads a multi-GB image and then extracts it and builds a squashfs, which can take several minutes. Later jobs reuse the cached .sqsh, so subsequent runs are fast.

On HAICORE the recommended opt-in is to point enroot’s extraction scratch at node-local storage so mksquashfs does not hit Stale file handle on shared home/work storage. Set x-slurm.enroot_temp_dir in the spec (or cache.enroot_temp_dir in .hpc-compose/settings.toml) to a node-local path such as /tmp/${USER}-hpc-compose-enroot; the final .sqsh and the layer cache still live on the workspace-backed cache. Prefer the spec or settings field over the HPC_COMPOSE_ENROOT_TEMP_DIR environment variable for up --remote, because a laptop env var does not propagate over SSH.

x-slurm:
  cache_dir: ${CACHE_DIR}
  enroot_temp_dir: /tmp/${USER}-hpc-compose-enroot

BeeOND And Job-Local Scratch

HAICORE documents BeeOND as a job-local filesystem requested through a Slurm constraint:

x-slurm:
  constraint: BEEOND

Use BeeOND for temporary high-throughput working data inside a job, then copy durable results back to a workspace or other approved persistent location. Do not put x-slurm.cache_dir on BeeOND because the cache must exist before the job and be reusable by later jobs.

Software Modules

HAICORE software is exposed through Lmod environment modules. For host-runtime or MPI workflows, keep module setup explicit in x-slurm.setup:

x-slurm:
  setup:
    - module purge
    - module avail
    - module load <module-name>

Do not leave module avail in production scripts if it produces too much output; it is useful while discovering the environment. Use module list in smoke tests when you need the batch log to record the active software stack.

Suggested First HAICORE Checklist

Run these on the HAICORE login node before the first real job:

ws_find <workspace-name>
scontrol show partition normal
srun --help | grep container-image
hpc-compose plan --show-script -f compose.yaml
hpc-compose preflight -f compose.yaml
hpc-compose doctor cluster-report --out .hpc-compose/haicore-cluster.toml

On HAICORE, sinfo/sinfo -N node-state queries are denied (slurm_load_node: Access/permission denied). Use scontrol show partition, squeue, sacct, or srun --test-only for partition and availability introspection instead.

Check the rendered script for:

the intended #SBATCH --partition,
the intended account/project,
a short wall time for smoke tests,
a workspace-backed cache_dir,
expected GPU or MIG request,
expected srun --container-* options when using Pyxis.
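
The first items of this checklist can be checked mechanically. The sketch below scans script text for `#SBATCH` directives and reports which required ones are missing. The function names, the required set, and the sample script are hypothetical; the sample is not real hpc-compose output.

```python
import re


def sbatch_directives(script_text):
    """Collect '#SBATCH --key=value' and '#SBATCH --key value' lines into a dict."""
    directives = {}
    for line in script_text.splitlines():
        match = re.match(r"#SBATCH\s+--([A-Za-z0-9-]+)(?:[=\s]+(.*))?$", line.strip())
        if match:
            directives[match.group(1)] = (match.group(2) or "").strip()
    return directives


def missing_directives(script_text, required=("partition", "account", "time")):
    """Return the required directive names that the script does not set."""
    found = sbatch_directives(script_text)
    return [name for name in required if name not in found]


# Illustrative script text, not real rendered output.
sample = """#!/bin/bash
#SBATCH --job-name=haicore-smoke
#SBATCH --partition=normal
#SBATCH --time=00:10:00
"""
print(missing_directives(sample))  # prints ['account']
```

A check like this only confirms that a directive is present. Whether the partition, account, or GRES value is acceptable is still a site policy question.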

Submit only after the static plan and preflight output are understandable:

hpc-compose up --detach -f compose.yaml
hpc-compose status -f compose.yaml
hpc-compose logs -f compose.yaml --follow

For a fast compute-node GPU/CUDA sanity check before your real workload, submit the cuda-probe.yaml example. It is a short GPU job that runs nvidia-smi, lists the visible GPUs with nvidia-smi -L, and lists the /dev/nvidia* device nodes inside the container.

The first up imports the image with enroot (download, extract, then squashfs build) and can take several minutes; later runs reuse the cache.

Common HAICORE Failure Modes

Symptom	Likely cause	What to check
Workspace path is missing	Workspace expired or wrong name/path was used.	`ws_list` and `ws_find <workspace-name>`.
Cache path fails preflight	Path is not shared, writable, or policy-safe.	Move `x-slurm.cache_dir` to a workspace path.
`--container-image` is unknown	Pyxis is not active in the current Slurm environment.	`srun --help | grep container-image`
Job is rejected for partition/account	Site policy or project/account mismatch.	HAICORE batch docs, `sacctmgr`/support guidance, rendered `#SBATCH` lines.
GPU request is rejected	Wrong `gres` name, too many GPUs, or partition limit.	HAICORE batch docs and a tiny smoke job.
Job starts but cannot see data	Data is on node-local storage or an unmounted path.	Use workspace paths or explicit `volumes`.
Workspace fills or expires	Container cache, datasets, checkpoints, or logs accumulated.	`ws_list`, quota tools, cache cleanup, artifact retention policy.
enroot import fails at `Creating squashfs filesystem...` (`Stale file handle`)	Extraction scratch is on shared home/work storage.	Set `x-slurm.enroot_temp_dir` (or `cache.enroot_temp_dir`) to node-local `/tmp/$USER-...`; the layer cache and final `.sqsh` stay on the workspace cache.


Set Up With an AI Agent

You can hand hpc-compose setup to an AI agent — Claude, Codex, Copilot, Cursor, or any LLM that can read a repository and run shell commands. This page is the agent-agnostic entry point: a copy-paste prompt, the safety boundary every agent must respect, and how to install the bundled skill for agents that support skills.

The machine-readable entry point is the published map llms.txt, served at https://nicolasschuler.github.io/hpc-compose/llms.txt. Point an agent at that URL first; it carries the curated doc map, the safety contract, and the canonical spec conventions in a token-lean form.

Copy-paste prompt for any agent

Help me set up hpc-compose for my Slurm cluster.

First read https://nicolasschuler.github.io/hpc-compose/llms.txt and honor its
safety contract: never submit, allocate, or cancel a Slurm job without my explicit
approval. Author the spec and verify it with the safe static checks
(validate, plan --show-script, inspect) before proposing any real run.

Then: inspect this repository, ask me what you need about my cluster (account,
partition, runtime backend, and a shared cache path visible from login and compute
nodes), and produce an hpc-compose spec plus the exact login-node commands. Stop and
ask before any command that submits or cancels a job.

For a one-line nudge once the agent has context: “Set up hpc-compose for my cluster, read the published llms.txt first, and don’t submit any Slurm job without my approval.”

The safety boundary (what an agent may run unprompted)

Safe to run unprompted (never submits, cancels, or allocates; no quota)	Requires your explicit approval (submits/cancels/allocates)
Static, no scheduler contact: `new`, `validate`, `plan`, `plan --show-script`, `inspect`, `render`, `config`	`up`, `run`, `test --submit`, `notebook`, `alloc`, `shell`, `sweep submit`, `down`, `cancel`
Read-only scheduler queries (`squeue`/`sacct`, no changes): `status`, `ps`, `stats`, `diff`, `logs` — avoid tight polling on rate-limited login nodes. `artifacts` also writes exported files to the local `export_dir`	—

A well-behaved agent authors and statically verifies a spec first, and only runs a submitting command after you approve it on a supported Linux Slurm submission host. On a login node it should prefer hpc-compose debug -f <file> --preflight and hpc-compose doctor cluster-report before a first up.
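
An agent harness could encode the table above as a simple gate. The sketch below is one hypothetical way to do it. The command sets are copied from the table, while the function name, the return labels, and the choice to treat every unlisted command as requiring a question are assumptions.

```python
# Command sets copied from the safety table above.
STATIC = {"new", "validate", "plan", "inspect", "render", "config"}
READ_ONLY = {"status", "ps", "stats", "diff", "logs", "artifacts"}
NEEDS_APPROVAL = {"up", "run", "notebook", "alloc", "shell", "down", "cancel"}


def classify(argv):
    """Classify the tokens that follow 'hpc-compose' on the command line."""
    if not argv:
        return "unlisted"
    command = argv[0]
    if command in NEEDS_APPROVAL:
        return "needs-approval"
    if command == "test" and "--submit" in argv:
        return "needs-approval"
    if list(argv[:2]) == ["sweep", "submit"]:
        return "needs-approval"
    if command in STATIC:
        return "safe-static"
    if command in READ_ONLY:
        return "safe-read-only"
    # Conservative default (assumption): ask before running anything unlisted.
    return "unlisted"


for args in (["plan", "--show-script", "-f", "compose.yaml"],
             ["logs", "-f", "compose.yaml", "--follow"],
             ["up", "-f", "compose.yaml"]):
    print(" ".join(args), "->", classify(args))
```

Defaulting to "unlisted" keeps the gate conservative: a command missing from the table is treated as a question for the user, not as permission.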

Install the bundled skill (Claude, Codex, and other skill-aware agents)

This repository ships a drop-in skill bundle at skills/hpc-compose/ — the source of truth for the setup recipe. Copy it into your agent’s skills directory and start a fresh session so skill discovery reloads:

Claude Code: ~/.claude/skills/hpc-compose (user scope) or .claude/skills/hpc-compose (project scope)
Codex: $CODEX_HOME/skills/hpc-compose or ~/.codex/skills/hpc-compose
Other runtimes: the skills location your agent documents

The bundle progressively loads detail as needed:

Path	Purpose
`skills/hpc-compose/SKILL.md`	Trigger description, the safe-first core workflow, adaptation rules, and output expectations.
`skills/hpc-compose/references/environment-setup.md`	Onboarding: installation, cluster-requirement discovery, shared-cache setup, profile/context checks, and the first safe cluster handoff.
`skills/hpc-compose/references/hpc-compose-workflow.md`	Command path, Docker Compose migration, backend selection, verification, and troubleshooting.
`skills/hpc-compose/references/haicore-kit.md`	HAICORE / NHR@KIT Slurm, GPU, filesystem, cache, and Pyxis/Enroot guidance.
`skills/hpc-compose/references/cluster-adaptation.md`	General Slurm cluster reconnaissance and portable adaptation.
`skills/hpc-compose/scripts/hpc_compose_repo_probe.py`	Heuristic repository probe for migration clues.

For local reconnaissance you (or the agent) can run the probe directly:

python3 skills/hpc-compose/scripts/hpc_compose_repo_probe.py .

The probe is intentionally heuristic — treat its output as an inventory and a set of hypotheses, then confirm against repository files, current cluster documentation, and hpc-compose static checks.

What to expect from a good agent run

An agent helping with hpc-compose should:

inspect the target repository before proposing a spec;
discover your environment (cluster, access method, workload, backend, shared filesystem, account/partition/QOS) before writing cluster-specific files;
prefer x-runtime.prepare.commands and a shared cache path (never /tmp, /var/tmp, /private/tmp, or /dev/shm);
verify with validate, plan --show-script, and inspect before any real submission;
ask before any command that submits or cancels jobs or consumes allocation quota;
leave you with the created files, the static checks it ran, the cluster assumptions still unverified, and the next safest command.
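
The cache-path rule in the list above lends itself to a quick static check. The sketch below flags paths under the node-local prefixes named there. The function name is hypothetical, and the check is purely lexical: it does not resolve symlinks or `..` segments, and it cannot prove that a path is actually shared between login and compute nodes.

```python
from pathlib import PurePosixPath

# Prefixes named in the list above as unsuitable for a shared cache.
NODE_LOCAL_PREFIXES = ("/tmp", "/var/tmp", "/private/tmp", "/dev/shm")


def is_node_local(path):
    """Return True if the path is one of the node-local prefixes or below one."""
    candidate = PurePosixPath(path)
    for prefix in NODE_LOCAL_PREFIXES:
        root = PurePosixPath(prefix)
        if candidate == root or root in candidate.parents:
            return True
    return False


for path in ("/tmp/hpc-compose-cache", "/cluster/shared/hpc-compose-cache"):
    print(path, "node-local" if is_node_local(path) else "not flagged")
```

Comparing path components instead of raw string prefixes avoids false positives such as `/tmpfoo/cache`.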


Architecture for Contributors

The library crate owns the core staged pipeline. The binary entrypoint delegates to command-family modules under src/commands/, while presentation lives under src/output/. Reusable planning, prepare, render, tracking, cache, context, and template logic stay in the library modules.

Module map

spec: parse, interpolate, and validate the supported Compose subset
planner: normalize the parsed spec into a deterministic plan
lint: run opinionated static checks over validated plans
context: resolve .hpc-compose/settings.toml, profiles, env files, interpolation variables, and binary overrides
cluster: generate and apply best-effort cluster capability profiles from doctor cluster-report
preflight: check login-node prerequisites and cluster policy issues
prepare: import base images and rebuild prepared runtime artifacts
render: generate the final sbatch script and service launch commands
job: track submissions, logs, metrics, replay, status, and artifact export
tracked_paths: centralize the .hpc-compose/ layout used by render and job tracking
cache: persist cache manifests for imported and prepared images
init: expose the shipped example templates for hpc-compose new plus the legacy init alias
schema and manpages: expose the checked-in JSON Schema and generated section-1 manpage flow
commands/spec: static authoring commands such as plan, validate, lint, render, config, inspect, prepare, and preflight
commands/runtime: submission, tracked-run, and local-development commands such as up, when, run, alloc, debug, status, ps, watch, replay, stats, logs, artifacts, down, cancel, clean, dev, tmux, and test
commands/cache: cache inspection and pruning
commands/doctor, commands/evolve, commands/examples, commands/weather: the doctor, evolve, examples, and weather command families
commands/init: new / init, setup, context, and completions
commands (mod.rs): parses the CLI and routes every command to its handler module
watch_ui: terminal UI controller and renderer for up, watch, and replay playback
output: binary-only text, JSON, CSV, and JSONL formatting helpers

Execution flow

ComposeSpec::load parses YAML, resolves authoring extends, validates supported keys, interpolates variables, and applies semantic validation.
planner::build_plan resolves paths, command shapes, dependencies, and prepare blocks into a normalized plan.
prepare::build_runtime_plan computes concrete cache artifact locations.
context and optional cluster profiles provide resolved paths, binaries, env, and compatibility warnings.
preflight::run checks cluster prerequisites before submission.
prepare::prepare_runtime_plan imports or rebuilds artifacts when needed.
render::render_script emits the batch script consumed by sbatch.
job persists tracked metadata under .hpc-compose/ and powers status, ps, watch, replay, stats, logs, cancel, and artifact export. job::replay reconstructs a best-effort timeline from existing state, service-exit, metrics, and log artifacts while reusing the watch renderer for playback.
commands/* turns CLI variants into library calls, and output formats the final presentation.

Tracked Runtime Layout

tracked_paths is the single source of truth for the tracked-job layout shared by render and job.

Compose-level metadata lives under .hpc-compose/ next to the compose file.
Per-job runtime state lives under <runtime-root>/<job-id>/, where <runtime-root> defaults to <submit-dir>/.hpc-compose and can be overridden with x-slurm.runtime_root. The renderer resolves this to an absolute path at submit time and bakes it into JOB_ROOT, so a running job does not depend on $SLURM_SUBMIT_DIR. Records persist an explicit override so later lookups address the same directory.
Root-level logs/, metrics/, artifacts/, and state.json are the latest-view paths used by status and export commands.
Resume-aware runs still write attempt-specific state under attempts/<attempt>/....
The batch script updates root-level latest symlinks so contributor-facing tooling can read the most recent attempt without reconstructing shell logic independently.
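
The path rules above can be summarized in a short sketch. It is written in Python for brevity and does not mirror the Rust `tracked_paths` module. The function names are hypothetical, and resolving a relative `runtime_root` override against the submit directory is an assumption.

```python
from pathlib import Path


def job_root(submit_dir, job_id, runtime_root=None):
    """Resolve <runtime-root>/<job-id>, defaulting to <submit-dir>/.hpc-compose."""
    root = Path(runtime_root) if runtime_root else Path(submit_dir) / ".hpc-compose"
    if not root.is_absolute():
        # Assumption: a relative override is resolved against the submit directory.
        root = Path(submit_dir) / root
    return root / str(job_id)


def layout(root, attempt=None):
    """Return the latest-view paths, plus the attempt directory when given."""
    paths = {
        "logs": root / "logs",
        "metrics": root / "metrics",
        "artifacts": root / "artifacts",
        "state": root / "state.json",
    }
    if attempt is not None:
        paths["attempt"] = root / "attempts" / str(attempt)
    return paths


root = job_root("/home/user/project", 12345)
for name, path in layout(root, attempt=1).items():
    print(f"{name}: {path}")
```

Because the root is resolved to an absolute path once, at submit time, later lookups do not need `$SLURM_SUBMIT_DIR` to find the same directory.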

Contributor commands

cargo test
cargo test --test cli_runtime
cargo test --test release_metadata
cargo doc --no-deps
mdbook build docs
cargo run --features manpage-bin --bin gen-manpages -- --check

Coverage Notes

Treat src/spec/mod.rs as high risk for broad refactors until parser and semantic-validation behavior has more focused coverage. Prefer adding behavior-first tests in tests/cli_spec.rs or spec unit tests before moving large validation blocks.
Render changes should keep generated-script assertions close to src/render.rs. just examples-check shellchecks rendered batch scripts, while local launchers are produced through up/run --local, so local launcher syntax needs focused render or local dry-run coverage.
Runtime command refactors should start with pure helpers that have deterministic unit tests and existing CLI integration filters. Submission, tracking, watching, and process orchestration should stay together until a narrower harness makes a larger move low risk.

Documentation split

Use this mdBook for user-facing workflows, examples, and reference material.
Use rustdoc for contributor-facing internals and the library module map.
Keep README short and point readers into the book instead of duplicating long-form guidance.


Brand Assets

The tracked brand kit lives in docs/brand/.

Use the full logo for README-style hero placement, the square mark for icons and compact references, and the social preview image when configuring the GitHub repository preview.

Asset	Intended use
`hpc-compose-logo.png`	Full logo and wordmark.
`hpc-compose-mark.png`	Square mark for icons and compact layouts.
`hpc-compose-wordmark-on-light.png`	Wordmark for light backgrounds.
`hpc-compose-wordmark-on-dark.png`	Wordmark for dark backgrounds.
`hpc-compose-social-preview.png`	Link previews and repository social image.

GitHub repository social previews are configured outside the tracked files. Upload docs/brand/hpc-compose-social-preview.png in the repository settings after changing the brand kit.
