Spec reference
This page describes the Compose subset that hpc-compose accepts today. Unknown or unsupported fields are rejected unless this page explicitly says otherwise.
How To Use This Reference
This page is intentionally complete. If you are new, start with Quickstart, Examples, and Runtime Backends, then use the table below to jump into the field group you need.
| Need | Section |
|---|---|
| Overall YAML shape | Top-level shape and Top-level fields |
| Shared templates and overrides | extends |
| Runtime backend choice | runtime and Runtime Backends |
| Slurm allocation settings | x-slurm |
| Hyperparameter sweeps | sweep and Hyperparameter Sweeps |
| Service command, image, env, and mounts | Service fields, Image rules, command and entrypoint, environment, volumes |
| Startup ordering | depends_on, readiness, and healthcheck |
| Multi-node placement and MPI | Multi-node placement rules, services.<name>.x-slurm.placement, and services.<name>.x-slurm.mpi |
| Prepared images | x-runtime.prepare and x-enroot.prepare |
| Metrics, artifacts, and resume | x-slurm.metrics, x-slurm.artifacts, and x-slurm.resume |
| Unsupported Compose features | Unsupported Compose keys |
Top-level shape
name: demo
version: "1"
runtime:
  backend: pyxis
x-slurm:
  time: "00:30:00"
services:
  app:
    image: python:3.11-slim
    command: python -m main
Top-level fields
| Field | Shape | Default | Notes |
|---|---|---|---|
| extends | string | omitted | Top-level authoring-only path to a base spec. The base is resolved before interpolation, validation, planning, and config output. |
| name | string | omitted | Used as the Slurm job name when x-slurm.job_name is not set. |
| version | string "1" or integer 1 | 1 | hpc-compose spec schema version. Omit for v1 or set explicitly to "1"; Docker Compose values such as "3.9" are rejected after migration. |
| runtime | mapping | backend: pyxis | Selects the service runtime backend and GPU passthrough policy. |
| services | mapping | required | Must contain at least one service. |
| steps | mapping | alias for services | Use either services or steps, not both. |
| modules | list of strings | omitted | List-only shorthand for top-level x-env.modules.load; cannot be combined with x-env.modules. |
| x-env | mapping | omitted | Structured host-side module, Spack view, and environment setup shared by all services. |
| x-slurm | mapping | omitted | Top-level Slurm settings and shared runtime defaults. |
| sweep | mapping | omitted | Embedded hyperparameter sweep metadata consumed by hpc-compose sweep submit/status/list. Normal commands treat it as metadata. |
extends
extends is an authoring feature for sharing base specs and service templates without copying large cluster-specific blocks. It is resolved before interpolation, validation, planning, rendering, tracked metadata, and hpc-compose config; the effective config no longer contains any extends keys.
Top-level extends points at a base YAML file:
extends: cluster-base.yaml
x-slurm:
  time: "02:00:00"
services:
  trainer:
    command: python train.py
Service-level extends supports three forms:
services:
  api:
    extends: base-service
  worker:
    extends: service-templates.yaml
  trainer:
    extends:
      file: ml-templates.yaml
      service: gpu-worker
Rules:
- Top-level extends must be a file path string.
- A service string that looks like a YAML file path, such as base.yaml, ../base.yml, or a path with a separator, uses the same service name from that file. Other strings refer to a service in the same file.
- A service mapping can select { file, service }; omit file to select a service from the same file.
- Extends references are recursive and cycles are rejected.
- Maps merge recursively. Sequences append base-first. Child scalars replace base scalars.
- Service volumes merge by container target, so a child mount for /data replaces the base mount for /data while unrelated base mounts are kept.
- Relative host paths in the final plan still resolve against the leaf compose file passed with -f.
- There is no delete or unset syntax in this version.
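The merge rules above can be modeled in a few lines of Python. This is a minimal sketch for reasoning about outcomes, not the hpc-compose implementation: the function names are hypothetical, any key named volumes is treated as a service volume list, and the order of the merged volume list (kept base mounts first, then child mounts) is an assumption.

```python
from copy import deepcopy


def volume_target(entry):
    # "host:container[:mode]" -> container path (the second field).
    return entry.split(":")[1]


def merge(base, child, key=None):
    """Merge a child value over a base value (hypothetical helper)."""
    if isinstance(base, dict) and isinstance(child, dict):
        # Maps merge recursively.
        merged = deepcopy(base)
        for k, v in child.items():
            merged[k] = merge(base[k], v, k) if k in base else deepcopy(v)
        return merged
    if isinstance(base, list) and isinstance(child, list):
        if key == "volumes":
            # Volumes merge by container target: a child mount replaces
            # the base mount with the same target; other mounts are kept.
            replaced = {volume_target(v) for v in child}
            kept = [v for v in base if volume_target(v) not in replaced]
            return kept + list(child)
        # Other sequences append base-first.
        return list(base) + list(child)
    # Child scalars replace base scalars.
    return deepcopy(child)
```

For example, merging a child that sets x-slurm.time and mounts /scratch/data:/data:ro over a base that mounts /shared/data:/data keeps the base's other mounts and replaces only the /data mount.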
sweep
sweep defines trial variables for hpc-compose sweep submit. It is a top-level metadata block; every generated trial is still planned, rendered, submitted, and tracked as a normal one-allocation job.
Full Cartesian product:
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix: full
Random sample without replacement:
sweep:
  parameters:
    lr: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  matrix:
    random: 5
    seed: "optional-stable-seed"
Rules:
- parameters must contain at least one key, and every value list must contain at least one scalar.
- Parameter keys must be valid interpolation variable names: [A-Za-z_][A-Za-z0-9_]*.
- Parameter keys must not use the reserved HPC_COMPOSE_SWEEP_ prefix.
- Parameter values may be strings, numbers, or booleans. They are passed to interpolation as strings.
- matrix: full expands the Cartesian product deterministically over sorted parameter names.
- matrix.random must be at least 1 and cannot exceed the total number of combinations.
- matrix.seed is optional. If omitted, sweep submit derives a seed from the new sweep id and persists it.
- sweep.spec is rejected in v1; embed the sweep in the same compose file.
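To make the matrix: full rule concrete, the following Python sketch expands a parameters mapping over sorted parameter names. It is an illustration only. The function name is hypothetical, and three details are assumptions not stated on this page: the last sorted name varies fastest, trial labels are zero-padded to three digits, and values are stringified with Python's str.

```python
import itertools


def expand_full(parameters):
    """Return one variable mapping per trial (hypothetical helper)."""
    names = sorted(parameters)  # deterministic order over sorted names
    value_lists = [parameters[name] for name in names]
    trials = []
    for index, combo in enumerate(itertools.product(*value_lists)):
        # Values reach interpolation as strings.
        variables = {name: str(value) for name, value in zip(names, combo)}
        variables["HPC_COMPOSE_SWEEP_TRIAL"] = f"t{index:03d}"  # assumed padding
        variables["HPC_COMPOSE_SWEEP_TRIAL_INDEX"] = str(index)  # zero-based
        trials.append(variables)
    return trials
```

With the example above (three lr values and two batch_size values), this produces six trials, which is also the upper bound for matrix.random.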
For each trial, sweep variables override existing interpolation variables from .env, environment, settings, or --env. These reserved variables are also available:
| Variable | Meaning |
|---|---|
| HPC_COMPOSE_SWEEP_ID | Persisted sweep id. |
| HPC_COMPOSE_SWEEP_TRIAL | Trial label such as t000. |
| HPC_COMPOSE_SWEEP_TRIAL_INDEX | Zero-based trial index. |
Normal commands do not expand the sweep matrix. If the runnable spec contains ${lr} with no default, ordinary plan, up, and render still fail unless lr is provided. Use defaults such as ${lr:-0.001} when the base spec should remain runnable, or use hpc-compose sweep submit --dry-run to validate sweep-only variables.
hpc-compose sweep submit rejects x-slurm.array, because every sweep trial is already its own allocation. See Hyperparameter Sweeps for manifests, status aggregation, and examples.
x-env
x-env is structured host-side software setup. It is available at the top level and under services.<name>.
x-env:
  modules:
    - cuda/12.4
    - openmpi/5
  spack:
    view: /shared/spack/views/ml
  env:
    HDF5_USE_FILE_LOCKING: "FALSE"
services:
  app:
    image: python:3.11-slim
    x-env:
      modules:
        purge: false
        load:
          - netcdf/4.9
      env:
        OMP_NUM_THREADS: "8"
Supported forms:
- modules: [name, ...]
- modules: { purge: bool, load: [name, ...] }
- spack: { view: /path/to/view }
- env: { KEY: VALUE }
Rules:
- Top-level x-env renders before x-slurm.setup.
- Service-level x-env renders immediately before that service’s srun.
- env entries are exported on the host and forwarded into Pyxis containers.
- Service-level x-env.env overrides top-level x-env.env when the same variable is set.
- Top-level modules: [...] and service-level modules: [...] are shorthand for the matching x-env.modules.load list. The shorthand is list-only and cannot be combined with x-env.modules at the same scope.
- spack.view prepends bin, lib, lib64, and Python site-package paths only when those directories exist.
- Modules and Spack views are host-side setup. Container filesystem visibility still requires explicit volumes, x-slurm.mpi.host_mpi.bind_paths, or other site-specific binds.
Settings-aware command table
Use these commands and global flags when you want the project-local settings file (.hpc-compose/settings.toml) to remember compose path, env files, env vars, and binary overrides.
| Command or flag | Purpose | Notes |
|---|---|---|
| --profile <NAME> | Select the profile from settings | Global flag; applies to every subcommand. |
| --settings-file <PATH> | Use an explicit settings file | Global flag; bypasses upward auto-discovery of .hpc-compose/settings.toml. |
| hpc-compose setup | Create or update the project-local settings file | Interactive by default; supports --non-interactive with --profile-name, --compose-file, --env-file, --env, --binary, --cache-dir, and --default-profile. |
| hpc-compose context | Print fully resolved execution context | Shows selected settings/profile, compose path, binaries, referenced interpolation vars, runtime paths, and value sources; supports --format json. Sensitive-looking interpolation values are redacted unless --show-values is passed. |
| hpc-compose validate --strict-env | Fail when interpolation fell back to defaults | Detects when ${VAR:-...} or ${VAR-...} consumed fallback values because VAR was missing. |
| hpc-compose lint | Run opinionated authoring checks | Builds on validation and planning, then reports stable finding codes for risky dependency, memory, and shared-write patterns. |
| hpc-compose schema | Print the checked-in JSON Schema | Useful for editor integration and authoring tools. Rust validation remains the semantic source of truth. |
x-slurm
These fields live under the top-level x-slurm block.
| Field | Shape | Default | Notes |
|---|---|---|---|
| resources | string | omitted | Name of a [resource_profiles.<name>] entry in .hpc-compose/settings.toml. Profile values are defaults only; explicit x-slurm fields win. |
| job_name | string | name when present | Rendered as #SBATCH --job-name. |
| partition | string | omitted | Passed through to #SBATCH --partition. |
| account | string | omitted | Passed through to #SBATCH --account. |
| qos | string | omitted | Passed through to #SBATCH --qos. |
| time | string | omitted | Passed through to #SBATCH --time. |
| nodes | positive integer | omitted | Slurm allocation node count. Defaults to 1 when omitted. |
| ntasks | positive integer | omitted | Passed through to #SBATCH --ntasks. |
| ntasks_per_node | positive integer | omitted | Passed through to #SBATCH --ntasks-per-node. |
| cpus_per_task | positive integer | omitted | Top-level Slurm CPU request. |
| mem | string | omitted | Passed through to #SBATCH --mem. |
| gres | string | omitted | Passed through to #SBATCH --gres. |
| gpus | positive integer | omitted | Used only when gres is not set. |
| gpus_per_node | positive integer | omitted | Passed through to #SBATCH --gpus-per-node. |
| gpus_per_task | positive integer | omitted | Passed through to #SBATCH --gpus-per-task. |
| cpus_per_gpu | positive integer | omitted | Passed through to #SBATCH --cpus-per-gpu. |
| mem_per_gpu | string | omitted | Passed through to #SBATCH --mem-per-gpu. |
| gpu_bind | string | omitted | Passed through to #SBATCH --gpu-bind. |
| cpu_bind | string | omitted | Passed through to #SBATCH --cpu-bind. |
| mem_bind | string | omitted | Passed through to #SBATCH --mem-bind. |
| distribution | string | omitted | Passed through to #SBATCH --distribution. |
| hint | string | omitted | Passed through to #SBATCH --hint. |
| constraint | string | omitted | Passed through to #SBATCH --constraint. |
| output | string | omitted | Passed through to #SBATCH --output. |
| error | string | omitted | Passed through to #SBATCH --error. |
| chdir | string | omitted | Passed through to #SBATCH --chdir. |
| array | string | omitted | Slurm array spec such as 0, 1-10, 1-10:2, 0,3,8-12, or 0-99%10. Rendered as #SBATCH --array. |
| after_job | string or mapping | omitted | Scheduler dependency on a prior job id. String shorthand means afterany:<id>; mapping supports { id, condition }. |
| dependency | string | omitted | Currently supports singleton, combined with after_job when both are set. |
| cache_dir | string | settings profile, settings defaults, then $HOME/.cache/hpc-compose | Must resolve to shared storage visible from the login node and the compute nodes. |
| scratch | mapping | omitted | Optional scratch path mounted into services and exposed as HPC_COMPOSE_SCRATCH_DIR. |
| stage_in | list of mappings | omitted | Copy or rsync host paths before services launch. |
| stage_out | list of mappings | omitted | Copy or rsync paths during teardown, optionally by outcome. |
| burst_buffer | mapping | omitted | Raw #BB / #DW directives for site-specific burst-buffer systems. |
| metrics | mapping | omitted | Enables runtime metrics sampling. |
| artifacts | mapping | omitted | Enables tracked artifact collection and export metadata. |
| resume | mapping | omitted | Enables checkpoint-aware resume semantics with a shared host path mounted into every service. |
| notify | mapping | omitted | First-class Slurm email notification settings. |
| setup | list of strings | omitted | Raw shell lines inserted into the generated batch script before service launches. |
| submit_args | list of strings | omitted | Extra raw Slurm arguments appended as #SBATCH ... lines. |
| rendezvous | string, list, or mapping | omitted | Resolve cross-job service records from the shared cache and inject HPC_COMPOSE_RDZV_* env vars. |
Resource profiles
Resource profiles are reusable settings defaults, distinct from the global --profile setting selector. Define them in .hpc-compose/settings.toml:
[resource_profiles.gpu-small]
partition = "gpu"
time = "01:00:00"
gpus = 1
cpus_per_task = 8
mem = "32G"
Reference one from the spec:
x-slurm:
  resources: gpu-small
  mem: 64G
The profile fills only omitted resource fields. In the example above, partition, time, gpus, and cpus_per_task come from the profile, while the explicit mem: 64G wins. Profiles intentionally exclude behavior such as job_name, cache_dir, arrays, dependencies, submit_args, setup hooks, scratch/staging, artifacts, resume, notify, and metrics.
Allowed profile fields are: partition, account, qos, time, nodes, ntasks, ntasks_per_node, cpus_per_task, mem, gres, gpus, gpus_per_node, gpus_per_task, cpus_per_gpu, mem_per_gpu, gpu_bind, cpu_bind, mem_bind, distribution, hint, and constraint.
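The fill-only-omitted behavior can be pictured as a dictionary merge in which explicit x-slurm fields always win. The Python below is a minimal sketch under that reading, not the tool's code: the function name is hypothetical, and dropping the resources key from the result is an assumption made for readability.

```python
ALLOWED_PROFILE_FIELDS = {
    "partition", "account", "qos", "time", "nodes", "ntasks",
    "ntasks_per_node", "cpus_per_task", "mem", "gres", "gpus",
    "gpus_per_node", "gpus_per_task", "cpus_per_gpu", "mem_per_gpu",
    "gpu_bind", "cpu_bind", "mem_bind", "distribution", "hint", "constraint",
}


def apply_resource_profile(x_slurm, resource_profiles):
    """Fill omitted x-slurm fields from the named profile (hypothetical helper)."""
    name = x_slurm.get("resources")
    if name is None:
        return dict(x_slurm)
    profile = resource_profiles[name]
    unknown = set(profile) - ALLOWED_PROFILE_FIELDS
    if unknown:
        raise ValueError(f"profile {name!r} has disallowed fields: {sorted(unknown)}")
    effective = dict(profile)  # profile values are defaults only
    # Explicit x-slurm fields win over profile defaults.
    effective.update({k: v for k, v in x_slurm.items() if k != "resources"})
    return effective
```

Applied to the gpu-small example, the result keeps partition, time, gpus, and cpus_per_task from the profile and mem: 64G from the spec.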
x-slurm.array
x-slurm:
  array: 0-99%10
  output: logs/%A_%a.out
services:
  worker:
    image: python:3.12-slim
    command: python worker.py
array accepts Slurm list, range, step, and concurrency forms such as 0, 1-10, 1-10:2, 0,3,8-12, and 0-99%10. Values with spaces, null bytes, malformed ranges, negative numbers, zero step, or zero concurrency are rejected.
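As a rough illustration of these acceptance rules, the Python sketch below checks an array string against the list, range, step, and concurrency forms. It is not the validator hpc-compose ships; the function name is hypothetical, and the sketch takes no position on whether a descending range such as 10-1 counts as malformed.

```python
import re

# One list element: N, N-M, or N-M:STEP (non-negative integers only).
_ELEMENT = re.compile(r"([0-9]+)(?:-([0-9]+)(?::([0-9]+))?)?")


def is_valid_array_spec(spec):
    """Return True when spec matches the documented forms (hypothetical helper)."""
    if not spec or any(ch.isspace() or ch == "\0" for ch in spec):
        return False  # empty values, spaces, and null bytes are rejected
    body, has_limit, limit = spec.partition("%")
    if has_limit:
        # Concurrency limit must be a positive integer.
        if not re.fullmatch(r"[0-9]+", limit) or int(limit) == 0:
            return False
    for element in body.split(","):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            return False  # malformed range or negative number
        step = match.group(3)
        if step is not None and int(step) == 0:
            return False  # zero step
    return True
```

The five forms listed above all pass this check, while values such as 1-, -5, 1-10:0, and 0-99%0 do not.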
Array jobs currently require hpc-compose up --detach; live watch/log fan-out for per-task array elements is future work. --local rejects array specs. Slurm provides SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT, SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN, and SLURM_ARRAY_TASK_STEP; for Pyxis jobs, hpc-compose forwards these names into the container when x-slurm.array is set. Prefer output patterns such as %A_%a so task logs do not overwrite each other.
x-slurm.after_job and x-slurm.dependency
x-slurm:
  after_job:
    id: "12345"
    condition: afterok
  dependency: singleton
after_job: "12345" is shorthand for afterany:12345. Mapping form accepts id plus condition, where condition is afterany, afterok, or afternotok. Job ids must be numeric Slurm ids such as 12345, or array elements such as 12345_7.
dependency: singleton is separate because Slurm’s singleton dependency does not take a job id. When both fields are set, hpc-compose submits one command-line dependency string such as --dependency=afterok:12345,singleton.
Dependencies are passed to sbatch as CLI arguments, not rendered as #SBATCH lines, because dependency job ids are commonly dynamic. --local rejects scheduler dependencies.
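The way the two fields combine into one sbatch argument can be sketched as follows. This Python is illustrative only: the function name is hypothetical, and treating a mapping without condition as afterany is an assumption, since this page only defines the default for the string shorthand.

```python
import re

_JOB_ID = re.compile(r"[0-9]+(?:_[0-9]+)?")  # 12345 or array element 12345_7
_CONDITIONS = {"afterany", "afterok", "afternotok"}


def dependency_argument(after_job=None, dependency=None):
    """Build the --dependency CLI argument, or None (hypothetical helper)."""
    parts = []
    if after_job is not None:
        if isinstance(after_job, str):
            job_id, condition = after_job, "afterany"  # string shorthand
        else:
            job_id = after_job["id"]
            condition = after_job.get("condition", "afterany")  # assumed default
        if not _JOB_ID.fullmatch(job_id):
            raise ValueError(f"invalid job id: {job_id!r}")
        if condition not in _CONDITIONS:
            raise ValueError(f"unsupported condition: {condition!r}")
        parts.append(f"{condition}:{job_id}")
    if dependency is not None:
        if dependency != "singleton":
            raise ValueError("only singleton is supported")
        parts.append("singleton")  # singleton takes no job id
    return "--dependency=" + ",".join(parts) if parts else None
```

Called with the example above, the sketch returns --dependency=afterok:12345,singleton, matching the combined string described in the text.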
x-slurm.setup
x-slurm:
  setup:
    - module load enroot
    - source /shared/env.sh
- Shape: list of strings
- Default: omitted
- Notes:
  - Each line is emitted verbatim into the generated bash script.
  - The script runs under set -euo pipefail.
  - Shell quoting and escaping are the user’s responsibility.
x-slurm.submit_args
x-slurm:
  submit_args:
    - "--mail-type=END"
    - "--mail-user=user@example.com"
    - "--reservation=gpu-reservation"
- Shape: list of strings
- Default: omitted
- Notes:
  - Each entry is emitted as #SBATCH {arg}.
  - Entries are rejected if they contain line breaks or null bytes.
  - Entries are not validated against Slurm option syntax.
  - First-class fields reject conflicting raw entries for the same option. Use x-slurm.array, x-slurm.after_job, or x-slurm.dependency instead of raw --array or --dependency.
x-slurm.notify
x-slurm:
  notify:
    email:
      to: user@example.com
      on: [end, fail]
| Field | Shape | Default | Notes |
|---|---|---|---|
| notify.email | mapping | omitted | Required when notify is present. |
| notify.email.to | string | required | Rendered as #SBATCH --mail-user. |
| notify.email.on | list of events | [end, fail] | Rendered as #SBATCH --mail-type. |
Supported events:
| Event | Slurm mail type |
|---|---|
| start | BEGIN |
| end | END |
| fail | FAIL |
| all | ALL |
Rules:
- When on is omitted or empty, it defaults to [end, fail].
- If all is present, it replaces all other events.
- notify cannot be combined with raw --mail-type or --mail-user in x-slurm.submit_args.
x-slurm.cache_dir
- Shape: string
- Default precedence: explicit x-slurm.cache_dir, then [profiles.<name>.cache].dir, then [defaults.cache].dir, then $HOME/.cache/hpc-compose.
- Notes:
  - Environment variables are expanded, and relative paths are resolved against the compose file directory.
  - Settings cache paths are resolved against the settings base directory.
  - Paths under /tmp, /var/tmp, /private/tmp, and /dev/shm are accepted by parsing and planning, but preflight reports them as unsafe because they are not valid shared-cache locations for login-node prepare plus compute-node reuse.
  - The path must be visible from both the login node and the compute nodes.
Settings example:
[defaults.cache]
dir = "/cluster/shared/hpc-compose-cache"
[profiles.dev.cache]
dir = "/cluster/shared/dev-hpc-compose-cache"
runtime
runtime:
  backend: apptainer
  gpu: auto
| Field | Shape | Default | Notes |
|---|---|---|---|
| backend | pyxis, apptainer, singularity, or host | pyxis | Selects the runtime used inside Slurm steps. |
| gpu | auto, none, or nvidia | auto | For Apptainer/Singularity, controls --nv; auto enables it when Slurm GPU resources are requested. |
Backend notes:
- pyxis uses srun --container-* flags and Enroot .sqsh artifacts.
- apptainer and singularity build or reuse .sif artifacts and launch them through apptainer exec/run or singularity exec/run inside srun.
- host runs commands directly under srun; services must set command or entrypoint, and image prepare blocks, service volumes, and x-slurm.mpi.host_mpi.bind_paths are not allowed because no container bind mount is applied.
- x-enroot.prepare is a Pyxis/Enroot compatibility spelling. Prefer x-runtime.prepare for new specs, especially with Apptainer/Singularity.
x-slurm.scratch, stage_in, stage_out, and burst_buffer
x-slurm:
  scratch:
    scope: shared
    base: /scratch/$USER/jobs
    mount: /scratch
    cleanup: on_success
  stage_in:
    - from: /shared/input
      to: /scratch/input
      mode: rsync
  stage_out:
    - from: /scratch/output
      to: /shared/results/${SLURM_JOB_ID}
      when: always
      mode: copy
  burst_buffer:
    directives:
      - "#BB create_persistent name=data capacity=100G"
- scratch.base is a host path. scratch.mount is the container-visible mount point.
- scratch.scope is node_local or shared; cluster profiles can warn when a shared scratch path does not look shared.
- scratch.cleanup is always, on_success, or never.
- stage_in runs before services launch; stage_out runs during teardown.
- mode is rsync or copy; rsync falls back to cp -R when rsync is unavailable.
- stage_out.when is always, on_success, or on_failure.
- ${SLURM_JOB_ID} is preserved in scratch and staging paths for runtime expansion.
- burst_buffer.directives entries are emitted as raw batch-script directives and must start with #BB or #DW.
Multi-node placement rules
- x-slurm.nodes > 1 reserves a multi-node allocation.
- Helper services remain single-node steps and are pinned to the allocation’s primary node.
- When a multi-node job has exactly one service, that service defaults to the distributed full-allocation step.
- Services may use services.<name>.x-slurm.placement to select explicit allocation node indices.
- Overlapping explicit placements are rejected unless one side sets allow_overlap: true or uses share_with.
- Any service spanning more than one node may use readiness.type: sleep or readiness.type: log, or TCP/HTTP readiness only with an explicit non-local host or URL.
x-slurm.metrics
x-slurm:
  metrics:
    interval_seconds: 5
    collectors: [gpu, slurm]
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables runtime metrics sampling.
  - If the block is present and enabled is omitted, metrics sampling is enabled.
  - interval_seconds defaults to 5 and must be at least 1.
  - collectors defaults to [gpu, slurm].
  - Supported collectors:
    - gpu samples device and process telemetry through nvidia-smi
    - slurm samples job-step CPU and memory data through sstat
  - In multi-node jobs, gpu sampling launches one best-effort sampler task per allocated node and writes node metadata into GPU rows; legacy rows without node remain readable as primary-node samples.
  - Sampler files are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/metrics on the host and are also visible inside containers at /hpc-compose/job/metrics.
  - Diagnostics are written under metrics/diagnostics/ when available, including nvidia-smi topo -m, nvidia-smi -q, selected fabric/GPU environment variables, and best-effort ibstat, ibv_devinfo, ucx_info -v, and fi_info output.
  - Collector failures are best-effort and do not fail the batch job.
x-slurm.rendezvous
Client-side cross-job discovery resolves records from <cache_dir>/rendezvous/<name>/latest.json before launching services:
x-slurm:
  cache_dir: /cluster/shared/hpc-compose-cache
  rendezvous: model-server
The mapping form supports multiple names and a timeout:
x-slurm:
  rendezvous:
    discover:
      - model-server
      - tokenizer
    timeout_seconds: 60
    require: true
Resolved records become generic variables such as HPC_COMPOSE_RDZV_URL and name-scoped variables such as HPC_COMPOSE_RDZV_MODEL_SERVER_URL.
x-slurm.artifacts
x-slurm:
  artifacts:
    collect: always
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/metrics/**
    bundles:
      checkpoints:
        paths:
          - /hpc-compose/job/checkpoints/*.pt
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables tracked artifact collection.
  - collect defaults to always. Supported values are always, on_success, and on_failure.
  - export_dir is required and is resolved relative to the compose file directory when hpc-compose artifacts runs.
  - ${SLURM_JOB_ID} is preserved in export_dir until hpc-compose artifacts expands it from tracked metadata.
  - paths remains supported as the implicit default bundle.
  - bundles is optional. Bundle names must match [A-Za-z0-9_-]+, and default is reserved for top-level paths.
  - At least one source path must be present in paths or bundles.
  - Every source path must be an absolute container-visible path rooted at /hpc-compose/job.
  - Paths under /hpc-compose/job/artifacts are rejected.
  - Collection happens during batch teardown and is best-effort.
  - Collected payloads and manifest.json are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/artifacts/.
  - hpc-compose artifacts --bundle <name> exports only the selected bundle or bundles.
  - hpc-compose artifacts --tarball also writes one <bundle>.tar.gz archive per exported bundle.
  - Export writes per-bundle provenance metadata under <export_dir>/_hpc-compose/bundles/<bundle>.json.
x-slurm.resume
x-slurm:
  resume:
    path: /shared/$USER/runs/my-run
- Shape: mapping
- Default: omitted
- Notes:
  - Omitting the block disables resume semantics.
  - path is required and must be an absolute host path.
  - /hpc-compose/... paths are rejected because path must point at shared host storage, not a container-visible path.
  - /tmp and /var/tmp technically validate, but preflight warns because those paths are not reliable resume storage.
  - When enabled, hpc-compose mounts path into every service at /hpc-compose/resume.
  - Services also receive HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME.
  - The canonical resume source is the shared path, not exported artifact bundles.
  - Attempt-specific runtime state moves under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/attempts/<attempt>/, and the top-level logs, metrics, artifacts, and state.json paths continue to point at the latest attempt for compatibility.
Allocation metadata inside services
Every service receives:
- HPC_COMPOSE_PRIMARY_NODE
- HPC_COMPOSE_NODE_COUNT
- HPC_COMPOSE_NODELIST
- HPC_COMPOSE_NODELIST_FILE
- HPC_COMPOSE_SERVICE_PRIMARY_NODE
- HPC_COMPOSE_SERVICE_NODE_COUNT
- HPC_COMPOSE_SERVICE_NODELIST
- HPC_COMPOSE_SERVICE_NODELIST_FILE
The allocation-wide data is also written under /hpc-compose/job/allocation/primary_node and /hpc-compose/job/allocation/nodes.txt. Service-scoped node lists are written under /hpc-compose/job/allocation/service-nodelists/.
Multi-node services also receive distributed launch helpers:
- HPC_COMPOSE_DIST_MASTER_ADDR
- HPC_COMPOSE_DIST_MASTER_PORT
- HPC_COMPOSE_DIST_RDZV_ENDPOINT
- HPC_COMPOSE_DIST_NNODES
- HPC_COMPOSE_DIST_NODE_RANK
- HPC_COMPOSE_DIST_LOCAL_RANK
- HPC_COMPOSE_DIST_GLOBAL_RANK
- HPC_COMPOSE_DIST_NPROC_PER_NODE
- HPC_COMPOSE_DIST_WORLD_SIZE
- HPC_COMPOSE_DIST_HOSTFILE
HPC_COMPOSE_DIST_NPROC_PER_NODE is derived from a service environment override, GPU requests, ntasks_per_node, then 1. The distributed hostfile is written under /hpc-compose/job/allocation/distributed-hostfiles/. When a discovered .hpc-compose/cluster.toml contains [distributed.env], those profile variables are injected only for multi-node services; explicit service environment values win on name conflicts and are still the durable config source.
Services that configure services.<name>.x-slurm.mpi also receive:
- HPC_COMPOSE_MPI_TYPE
- HPC_COMPOSE_MPI_PROFILE when x-slurm.mpi.profile is set
- HPC_COMPOSE_MPI_IMPLEMENTATION when x-slurm.mpi.implementation is set or implied by x-slurm.mpi.profile
- HPC_COMPOSE_MPI_HOSTFILE
The MPI hostfile is written under /hpc-compose/job/allocation/mpi-hostfiles/ and contains the service’s effective node list. When ntasks_per_node is known, each host line includes slots=<ntasks_per_node>. For a single-node service with ntasks but no ntasks_per_node, the hostfile uses slots=<ntasks>. Otherwise it emits one node per line without slots.
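The slot selection described above reduces to a three-way choice, sketched here in Python. The function name is hypothetical, and the exact line layout ("<host> slots=<n>", the common Open MPI hostfile style) is an assumption; this page only states that each host line includes a slots value.

```python
def mpi_hostfile_lines(nodes, ntasks_per_node=None, ntasks=None):
    """Return hostfile lines for a service's node list (hypothetical helper)."""
    if ntasks_per_node is not None:
        slots = ntasks_per_node  # known tasks per node
    elif len(nodes) == 1 and ntasks is not None:
        slots = ntasks  # single-node service with only ntasks
    else:
        slots = None  # one node per line, no slots
    if slots is None:
        return list(nodes)
    return [f"{node} slots={slots}" for node in nodes]
```

For instance, two nodes with ntasks_per_node set to 4 yield two lines ending in slots=4, while a multi-node service with only ntasks set yields bare host names.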
MPI services also forward common PMI, PMIx, and Slurm rank variables into the container through Pyxis --container-env, including PMI_RANK, PMI_SIZE, PMIX_RANK, PMIX_NAMESPACE, SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS, and SLURM_TASKS_PER_NODE.
gres and gpus
When both gres and gpus are set at the same level, gres takes priority and gpus is ignored.
Service fields
| Field | Shape | Default | Notes |
|---|---|---|---|
| extends | string or mapping | omitted | Authoring-only service template reference. See extends. |
| image | string | required unless runtime.backend: host | Can be a remote image reference, a local .sqsh / .squashfs path for Pyxis, or a local .sif path for Apptainer/Singularity. |
| command | string or list of strings | omitted | Shell form or exec form. |
| entrypoint | string or list of strings | omitted | Must use the same form as command when both are present. |
| script | string | omitted | Multi-line shell script sugar for command: ["/bin/sh", "-lc", script]; mutually exclusive with command and entrypoint. |
| environment | mapping or list of KEY=VALUE strings | omitted | Both forms normalize to key/value pairs. |
| modules | list of strings | omitted | List-only shorthand for service x-env.modules.load; cannot be combined with service x-env.modules. |
| volumes | list of host_path:container_path strings | omitted | Runtime bind mounts. Host paths resolve against the compose file directory. |
| working_dir | string | omitted | Valid only when the service also has an explicit command or entrypoint. |
| depends_on | list or mapping | omitted | Dependency list with service_started, service_healthy, or service_completed_successfully conditions. |
| readiness | mapping | omitted | Post-launch readiness gate. |
| healthcheck | mapping | omitted | Compose-compatible sugar for a subset of readiness. Mutually exclusive with readiness. |
| assert | mapping | omitted | Post-run service contract checked during batch cleanup and surfaced in status. |
| x-env | mapping | omitted | Structured host-side module, Spack view, and environment setup for this service. |
| x-slurm | mapping | omitted | Per-service Slurm overrides. |
| x-runtime | mapping | omitted | Backend-neutral image preparation rules. |
| x-enroot | mapping | omitted | Pyxis/Enroot preparation compatibility alias. |
Image rules
Remote images
- Any image reference without an explicit :// scheme is prefixed with docker://.
- Explicit schemes are allowed only for docker://, dockerd://, and podman://.
- Other schemes are rejected.
- Shell variables in the image string are expanded at plan time.
- Unset variables expand to empty strings.
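The scheme rules for remote images can be summarized in a short Python sketch. It covers only the prefixing and scheme checks, not variable expansion or local image paths, and the function name is hypothetical rather than part of hpc-compose.

```python
ALLOWED_SCHEMES = ("docker://", "dockerd://", "podman://")


def normalize_remote_image(reference):
    """Apply the remote-image scheme rules (hypothetical helper)."""
    if "://" not in reference:
        # No explicit scheme: prefix with docker://.
        return "docker://" + reference
    if reference.startswith(ALLOWED_SCHEMES):
        return reference
    raise ValueError(f"unsupported image scheme in {reference!r}")
```

Under these rules python:3.11-slim becomes docker://python:3.11-slim, podman://localhost/app:dev is kept as written, and a reference with any other scheme is rejected.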
Local images
- Pyxis local image paths must point to .sqsh or .squashfs files.
- Apptainer/Singularity local image paths must point to .sif files.
- Relative paths are resolved against the compose file directory.
- Paths that look like build contexts are rejected.
command, entrypoint, and script
command and entrypoint each accept either:
- a string, interpreted as shell form
- a list of strings, interpreted as exec form
Rules:
- If both command and entrypoint are present, they must use the same form.
- Mixed string/array combinations are rejected.
- If neither field is present, the image default entrypoint and command are used.
- If working_dir is set, at least one of command or entrypoint must also be set.
- A multi-line string-form command is automatically normalized to ["/bin/sh", "-lc", command] so YAML block scalars run as one shell script.
- Single-line string-form command remains shell form.
- script is a convenience field for multi-line shell snippets and normalizes to command: ["/bin/sh", "-lc", script].
- script cannot be combined with command or entrypoint.
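The normalization rules above can be walked through with a small Python sketch. It is a reading aid, not the planner's code: the function name is hypothetical, and "multi-line" is taken to mean a newline that remains after stripping surrounding whitespace, which is an assumption about how a one-line YAML block scalar is treated.

```python
SHELL_WRAPPER = ["/bin/sh", "-lc"]


def normalize_command(service):
    """Return the effective command for a service mapping (hypothetical helper)."""
    command = service.get("command")
    entrypoint = service.get("entrypoint")
    script = service.get("script")
    if script is not None:
        if command is not None or entrypoint is not None:
            raise ValueError("script cannot be combined with command or entrypoint")
        return SHELL_WRAPPER + [script]
    if command is not None and entrypoint is not None:
        # Both must be strings (shell form) or both lists (exec form).
        if isinstance(command, str) != isinstance(entrypoint, str):
            raise ValueError("command and entrypoint must use the same form")
    if service.get("working_dir") is not None and command is None and entrypoint is None:
        raise ValueError("working_dir requires command or entrypoint")
    if isinstance(command, str) and "\n" in command.strip():
        # Multi-line shell form runs as one shell script.
        return SHELL_WRAPPER + [command]
    return command  # single-line shell form, exec form, or None
```

A service with neither field returns None here, standing in for "use the image default entrypoint and command".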
environment
Accepted forms:
environment:
  APP_ENV: prod
  LOG_LEVEL: info

environment:
  - APP_ENV=prod
  - LOG_LEVEL=info
Rules:
- List items must use KEY=VALUE syntax.
- .env from the compose file directory is loaded automatically when present.
- Shell environment variables override .env; .env fills only missing variables.
- environment, x-runtime.prepare.env, and compatibility x-enroot.prepare.env values support $VAR, ${VAR}, ${VAR:-default}, and ${VAR-default} interpolation.
- Missing variables without defaults are errors.
- Use $$ for a literal dollar sign in interpolated fields.
- String-form shell snippets are still literal. For example, $PATH inside a string-form command is not expanded at plan time.
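The four interpolation forms and the $$ escape can be illustrated with a small Python sketch. It assumes the usual POSIX shell and Docker Compose meaning of the two default operators (":-" falls back when the variable is unset or empty, "-" only when it is unset); this page does not spell out that distinction, and the function name is hypothetical.

```python
import re

_TOKEN = re.compile(
    r"\$(?:"
    r"(?P<escaped>\$)"                          # $$ -> literal dollar sign
    r"|(?P<bare>[A-Za-z_][A-Za-z0-9_]*)"        # $VAR
    r"|\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)"      # ${VAR...}
    r"(?:(?P<op>:?-)(?P<default>[^}]*))?\}"     # optional :-default or -default
    r")"
)


def interpolate(text, variables):
    """Expand variables in text from a mapping (hypothetical helper)."""
    def replace(match):
        if match.group("escaped"):
            return "$"
        name = match.group("bare") or match.group("name")
        op = match.group("op")
        value = variables.get(name)
        if op == ":-" and not value:
            return match.group("default")  # unset or empty (assumed semantics)
        if op == "-" and value is None:
            return match.group("default")  # unset only (assumed semantics)
        if value is None:
            raise ValueError(f"missing variable without default: {name}")
        return value
    return _TOKEN.sub(replace, text)
```

In this model, ${LR:-0.001} yields 0.001 when LR is not provided, while ${MISSING} with no default raises an error, mirroring the rule that missing variables without defaults are errors.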
volumes
Accepted form:
volumes:
  - ./app:/workspace
  - /shared/data:/data
  - /shared/reference:/reference:ro
Rules:
- Host paths are resolved against the compose file directory.
- Runtime mounts accept host_path:container_path and host_path:container_path:ro|rw.
- Pyxis mounts are passed through srun --container-mounts=...; Apptainer/Singularity mounts are passed as --bind.
- Every service also gets an automatic shared mount at /hpc-compose/job, backed by ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID} on the host.
- /hpc-compose/job is reserved and cannot be used as an explicit volume destination.
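A minimal Python sketch of how a volume entry might be split and checked under these rules is shown below. The function name is hypothetical, the default mode of rw when none is given is an assumption, and only an exact /hpc-compose/job destination is rejected because this page does not say how nested destinations are handled.

```python
from pathlib import PurePosixPath

RESERVED_TARGET = "/hpc-compose/job"


def parse_volume(entry, compose_dir):
    """Split a volume string into (host, container, mode) (hypothetical helper)."""
    parts = entry.split(":")
    if len(parts) == 2:
        host, target = parts
        mode = "rw"  # assumed default when no mode is given
    elif len(parts) == 3 and parts[2] in ("ro", "rw"):
        host, target, mode = parts
    else:
        raise ValueError(f"invalid volume entry: {entry!r}")
    if target.rstrip("/") == RESERVED_TARGET:
        raise ValueError(f"{RESERVED_TARGET} is reserved")
    # Relative host paths resolve against the compose file directory;
    # absolute host paths are kept as written.
    host_path = PurePosixPath(compose_dir) / host
    return str(host_path), target, mode
```

With a compose file in /proj, the entry ./app:/workspace resolves to the host path /proj/app, and /shared/reference:/reference:ro keeps its absolute host path with mode ro.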
Warning
If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.
depends_on
Accepted forms:
depends_on:
  - redis

depends_on:
  redis:
    condition: service_started

depends_on:
  redis:
    condition: service_healthy
Rules:
- List form means condition: service_started.
- Map form accepts condition: service_started, condition: service_healthy, and condition: service_completed_successfully.
- service_healthy requires the dependency service to define readiness.
- service_started waits only for the dependency process to be launched and still alive.
- service_healthy waits for the dependency readiness check to succeed.
- service_completed_successfully waits for the dependency to exit with status 0 before launching the dependent service, which is useful for one-shot DAG stages such as preprocess -> train -> postprocess.
readiness
Supported types:
Sleep
readiness:
type: sleep
seconds: 5
secondsis required.
TCP
readiness:
type: tcp
host: 127.0.0.1
port: 6379
timeout_seconds: 30
hostdefaults to127.0.0.1.timeout_secondsdefaults to60.
Log
readiness:
type: log
pattern: "Server started"
timeout_seconds: 60
timeout_secondsdefaults to60.
HTTP
readiness:
  type: http
  url: http://127.0.0.1:8080/health
  status_code: 200
  timeout_seconds: 30
- status_code defaults to 200.
- timeout_seconds defaults to 60.
- The readiness check polls the URL through curl.
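A readiness block becomes useful once another service gates on it. The sketch below pairs an HTTP readiness check with a depends_on entry that uses condition: service_healthy. The service names, images, and commands are hypothetical.

```yaml
# Hypothetical services; only the readiness and depends_on keys are taken from this page.
services:
  api:
    image: api-server:latest        # placeholder image
    command: python -m server       # placeholder command
    readiness:
      type: http
      url: http://127.0.0.1:8080/health
      timeout_seconds: 30           # status_code is omitted and defaults to 200
  client:
    image: python:3.11-slim
    command: python client.py       # placeholder command
    depends_on:
      api:
        condition: service_healthy  # requires api to define readiness
```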
healthcheck
healthcheck is accepted as migration sugar and is normalized into the readiness model.
services:
  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "nc", "-z", "127.0.0.1", "6379"]
      timeout: 30s
Rules:
- healthcheck and readiness are mutually exclusive.
- Supported probe forms are a constrained subset:
  - ["CMD", "nc", "-z", HOST, PORT]
  - ["CMD-SHELL", "nc -z HOST PORT"]
  - recognized curl probes against http:// or https:// URLs
  - recognized wget --spider probes against http:// or https:// URLs
- timeout maps to timeout_seconds.
- disable: true disables readiness for that service.
- interval, retries, and start_period are parsed but rejected in v1.
- HTTP-style healthchecks normalize to readiness.type: http with status_code: 200.
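To make the normalization concrete, here is a hedged sketch of an HTTP-style healthcheck next to the readiness block it is expected to become. The exact curl invocation that counts as "recognized" is an assumption, as are the service name, image, and URL.

```yaml
# Migration form. Assumption: a plain curl probe like this is among the recognized forms.
services:
  web:
    image: web-app:latest           # placeholder image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:8080/health"]
      timeout: 30s
---
# Expected native equivalent, following the rules above:
# timeout maps to timeout_seconds, and HTTP probes use status_code: 200.
services:
  web:
    image: web-app:latest           # placeholder image
    readiness:
      type: http
      url: http://127.0.0.1:8080/health
      status_code: 200
      timeout_seconds: 30
```

Writing the readiness form directly avoids depending on which probe shapes the migration layer recognizes.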
assert
assert defines post-run contracts for a service. Checks run in the rendered script’s cleanup() after services are reaped and before artifact collection or stage-out. Any failed assertion marks the job failed, even when the service uses x-slurm.failure_policy.mode: ignore.
services:
  train:
    image: trainer:latest
    command: python train.py
    assert:
      exit_code: 0
      artifacts_contain: "model/*.pt"
      max_duration_seconds: 7200
| Field | Shape | Notes |
|---|---|---|
| exit_code | integer 0..255 | Expected final service exit code. |
| artifacts_contain | string | Glob that must match at least one path. Relative patterns resolve under /hpc-compose/job; absolute patterns must stay under /hpc-compose/job. |
| max_duration_seconds | positive integer | Maximum wall-clock seconds from first service launch to terminal service exit, including restart time. |
At least one assertion field is required. Assertion results are written into runtime state.json; hpc-compose status --format json includes them under each service’s assertions object.
Service-level x-slurm
These fields live under services.<name>.x-slurm.
| Field | Shape | Default | Notes |
|---|---|---|---|
| nodes | positive integer | omitted | Legacy shorthand: 1 for a helper step, or the full top-level allocation node count for a full-allocation distributed service. Partial multi-node counts require placement.node_count. |
| placement | mapping | omitted | Explicit node-index placement inside the allocation. |
| ntasks | positive integer | omitted | Adds --ntasks to that service’s srun. |
| ntasks_per_node | positive integer | omitted | Adds --ntasks-per-node to that service’s srun. |
| cpus_per_task | positive integer | omitted | Adds --cpus-per-task to that service’s srun. |
| gpus | positive integer | omitted | Adds --gpus when gres is not set. |
| gres | string | omitted | Adds --gres to that service’s srun. Takes priority over gpus. |
| gpus_per_node | positive integer | omitted | Adds --gpus-per-node to that service’s srun. |
| gpus_per_task | positive integer | omitted | Adds --gpus-per-task to that service’s srun. |
| cpus_per_gpu | positive integer | omitted | Adds --cpus-per-gpu to that service’s srun. |
| mem_per_gpu | string | omitted | Adds --mem-per-gpu to that service’s srun. |
| gpu_bind | string | omitted | Adds --gpu-bind to that service’s srun. |
| cpu_bind | string | omitted | Adds --cpu-bind to that service’s srun. |
| mem_bind | string | omitted | Adds --mem-bind to that service’s srun. |
| distribution | string | omitted | Adds --distribution to that service’s srun. |
| hint | string | omitted | Adds --hint to that service’s srun. |
| time_limit | string | omitted | Advisory per-service time limit. Validated against Slurm time formats but not passed to srun. inspect surfaces warnings when the limit exceeds allocation time or conflicts with dependencies. Accepted formats: MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS. |
| extra_srun_args | list of strings | omitted | Appended directly to the service’s srun command. |
| mpi | mapping | omitted | Adds first-class MPI launch metadata and srun --mpi=<type>. |
| failure_policy | mapping | omitted | Per-service failure handling (fail_job, ignore, restart_on_failure). |
| prologue | string or mapping | omitted | Per-service shell hook run before each launch attempt. String shorthand runs on the host. |
| epilogue | string or mapping | omitted | Per-service shell hook run after each service exit attempt. String shorthand runs on the host. |
| hooks | list of mappings | omitted | Host-side event hooks for failure-policy transitions such as accepted restarts and crash-loop window exhaustion. |
| rendezvous | mapping | omitted | Provider registration config for cross-job service discovery. |
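The sketch below combines several of the resource fields from the table on one service. The service name, image, command, and chosen values are hypothetical. The only srun flag passed through extra_srun_args is --label, a standard srun option that prefixes output lines with the task number.

```yaml
# Hypothetical GPU service; values are illustrative, not recommendations.
services:
  trainer:
    image: trainer:latest      # placeholder image
    command: python train.py   # placeholder command
    x-slurm:
      ntasks: 2                # adds --ntasks to this service's srun
      cpus_per_task: 8         # adds --cpus-per-task
      gpus: 2                  # adds --gpus because gres is not set
      time_limit: "01:30:00"   # advisory only; validated but not passed to srun
      extra_srun_args:
        - --label              # appended directly to the srun command
```

If gres were also set here, the table indicates it would take priority and gpus would not produce a --gpus flag.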
services.<name>.x-slurm.rendezvous
Provider-side registration writes an atomic shared-cache record after readiness succeeds when readiness is configured:
services:
  model:
    image: python:3.12-slim
    command: python -m http.server 8000
    readiness:
      type: tcp
      port: 8000
    x-slurm:
      rendezvous:
        register:
          name: model-server
          port: 8000
          protocol: http
          path: /
          ttl_seconds: 3600
Names are single safe path components using ASCII letters, digits, ., _, and -. Rendezvous is same-cluster shared-storage coordination only; it does not provide DNS, tunneling, or authentication.
services.<name>.x-slurm.prologue / epilogue
services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      prologue: |
        module load cuda/12.1
        nvidia-smi
      epilogue:
        context: container
        script: |
          tar czf /shared/logs-${SLURM_JOB_ID}.tar.gz /hpc-compose/job/logs
- Shape: either a block string, or a mapping with script and optional context.
- context: host (default) or container.
- Hook scripts are emitted as trusted shell and are not Compose-interpolated, so runtime variables such as ${SLURM_JOB_ID} are preserved.
- Hooks run once per service launch attempt, including restart_on_failure retries.
- Host hooks run in the generated batch supervisor on the allocation’s primary execution context. Container hooks wrap the service command inside the container and can use /hpc-compose/job.
- Hook stdout/stderr is written to the service log.
- Container hooks require an explicit command or entrypoint; image-default services cannot be wrapped.
services.<name>.x-slurm.hooks
services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      failure_policy:
        mode: restart_on_failure
      hooks:
        - on: restart
          context: host
          script: |
            echo "Service $HPC_COMPOSE_SERVICE_NAME restarted (attempt $HPC_COMPOSE_ATTEMPT)" >> /shared/restart.log
        - on: window_exhausted
          script: |
            curl -X POST "$WEBHOOK_URL" -d '{"alert": "crash loop detected"}'
- Shape: list of mappings with on, script, and optional context.
- on: restart or window_exhausted.
- context: host only. Omitted context defaults to host; container is rejected for event hooks.
- restart runs after a non-zero exit has passed the lifetime and rolling-window guards, after restart counters are recorded, and before backoff/relaunch.
- window_exhausted runs only when the rolling-window guard blocks another restart. It does not run for lifetime max_restarts exhaustion.
- Event hooks are best-effort observability hooks. A non-zero hook exit is logged to the service log and does not change the restart or failure-policy outcome.
- Event hook scripts are emitted as trusted shell and are not Compose-interpolated.
- Event hooks receive HPC_COMPOSE_HOOK_PHASE, HPC_COMPOSE_SERVICE_NAME, HPC_COMPOSE_SERVICE_LOG, HPC_COMPOSE_SERVICE_EXIT_CODE, HPC_COMPOSE_ATTEMPT, HPC_COMPOSE_RESTART_COUNT, HPC_COMPOSE_MAX_RESTARTS, HPC_COMPOSE_WINDOW_SECONDS, HPC_COMPOSE_MAX_RESTARTS_IN_WINDOW, and HPC_COMPOSE_RESTART_FAILURES_IN_WINDOW.
services.<name>.x-slurm.placement
services:
  a:
    image: app:a
    x-slurm:
      placement: { node_range: "0-3" }
  b:
    image: app:b
    x-slurm:
      placement: { node_range: "4-7" }
  ps:
    image: app:b
    x-slurm:
      placement: { share_with: b }
Exactly one selector is required:
| Field | Shape | Notes |
|---|---|---|
| node_range | string | Zero-based inclusive allocation indices, for example "0-3" or "0-3,6". |
| node_count | integer | Selects this many eligible nodes starting at start_index, default 0. |
| node_percent | integer 1..100 | Selects ceil(percent * eligible_nodes / 100), minimum one node. |
| share_with | string | Reuses another service’s resolved node set for explicit co-location. |
Optional fields:
- start_index: applies to node_count and node_percent.
- exclude: zero-based allocation indices removed from the eligible set and passed to srun --exclude.
- allow_overlap: permits intentional overlap with another explicit placement.
Node indices are resolved against the Slurm allocation order from scontrol show hostnames "$SLURM_JOB_NODELIST". At runtime, containers receive both allocation-wide metadata (HPC_COMPOSE_NODELIST) and service-scoped metadata (HPC_COMPOSE_SERVICE_NODELIST, HPC_COMPOSE_SERVICE_NODELIST_FILE, HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT).
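The node_count and node_percent selectors are not shown in the example above, so here is a minimal sketch of both. It assumes an 8-node allocation (indices 0-7) with no exclusions. The service names and images are hypothetical.

```yaml
# Assumes an 8-node allocation with indices 0-7 and no exclude entries.
services:
  workers:
    image: app:worker          # placeholder image
    x-slurm:
      placement:
        node_percent: 50       # ceil(50 * 8 / 100) = 4 nodes, from the default start_index 0
  cache:
    image: app:cache           # placeholder image
    x-slurm:
      placement:
        node_count: 2
        start_index: 4         # selection begins at allocation index 4
  sidecar:
    image: app:sidecar         # placeholder image
    x-slurm:
      placement:
        share_with: cache      # reuses the node set resolved for cache
```

Under these assumptions the workers and cache selections should not overlap, so allow_overlap is not needed.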
services.<name>.x-slurm.mpi
services:
  trainer:
    image: mpi-image:latest
    command: /usr/local/bin/train
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      mpi:
        type: pmix_v4
        profile: openmpi
        implementation: openmpi
        launcher: srun
        expected_ranks: 8
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_DIR: /opt/site/openmpi
- Shape: mapping
- Default: omitted
- type is an exact srun --mpi=<type> plugin token. Common values include pmix, pmix_v4, pmi2, pmi1, and openmpi; use srun --mpi=list or hpc-compose doctor cluster-report on the target cluster to discover site-specific values.
- Notes:
  - Rendered as --mpi=<type> on the service’s srun command.
  - profile is optional compatibility metadata used for validation, cluster-profile diagnostics, and doctor mpi-smoke output. Supported values are openmpi, mpich, and intel_mpi.
  - profile does not auto-select or rewrite type; use the exact token that your cluster reports through srun --mpi=list.
  - launcher defaults to srun; v1 rejects other launchers.
  - implementation is optional metadata for diagnostics. Supported values are openmpi, mpich, intel_mpi, mvapich2, cray_mpi, hpe_mpi, and unknown.
  - When both profile and implementation are set, they must describe the same MPI family.
  - expected_ranks, when set, must match the resolved Slurm task geometry.
  - host_mpi.bind_paths uses host_path:container_path[:ro|rw] syntax, is validated like service volumes, and is automatically mounted into the service.
  - host_mpi.env is injected into the service environment after normal service environment entries.
  - Cannot be combined with raw --mpi... entries in extra_srun_args.
  - MPI services receive HPC_COMPOSE_MPI_TYPE and HPC_COMPOSE_MPI_HOSTFILE.
  - MPI services also receive HPC_COMPOSE_MPI_PROFILE when profile is set and HPC_COMPOSE_MPI_IMPLEMENTATION when implementation is set or implied by profile.
  - hpc-compose doctor mpi-smoke -f compose.yaml --service trainer renders a smoke probe for the service; add --submit to run it through Slurm.
  - hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto extends the same pattern with NCCL, UCX, OFI, and InfiniBand diagnostics when available. Smoke plans keep allocation and MPI launch settings, but strip application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.
Profile-specific compatibility checks are intentionally conservative:
- profile: openmpi expects a PMIx-capable type such as pmix or pmix_v*, with pmi2 accepted as a fallback.
- profile: mpich expects pmi2 or a PMIx-capable setup.
- profile: intel_mpi expects pmi2; preflight and doctor warn when no I_MPI_PMI_LIBRARY or cluster-profile PMI2 library is visible.
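For comparison with the Open MPI example earlier in this section, a minimal MPICH-flavored sketch could look like the following. It assumes a 4-node allocation and that srun --mpi=list on the target cluster reports pmi2. The service name, image, and command are hypothetical.

```yaml
# Assumes the top-level allocation has 4 nodes and the cluster offers the pmi2 plugin.
services:
  solver:
    image: mpich-app:latest           # placeholder image
    command: /usr/local/bin/solver    # placeholder command
    x-slurm:
      nodes: 4
      ntasks_per_node: 2
      mpi:
        type: pmi2                    # rendered as --mpi=pmi2
        profile: mpich                # the mpich profile expects pmi2 or a PMIx-capable setup
        expected_ranks: 8             # 4 nodes x 2 tasks per node
```

Leaving out launcher keeps the default srun, and leaving out host_mpi means no extra bind paths or environment entries are added.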
services.<name>.x-slurm.failure_policy
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3
| Field | Shape | Default | Notes |
|---|---|---|---|
| mode | fail_job, ignore, or restart_on_failure | fail_job | fail_job keeps fail-fast behavior. ignore keeps the job running after non-zero exits. restart_on_failure restarts on non-zero exits only. |
| max_restarts | integer | 3 when mode=restart_on_failure | Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| backoff_seconds | integer | 5 when mode=restart_on_failure | Fixed delay between restart attempts. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| window_seconds | integer | 60 when mode=restart_on_failure | Rolling window for counting restart-triggering exits. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
| max_restarts_in_window | integer | resolved max_restarts when mode=restart_on_failure | Maximum restart-triggering exits allowed within window_seconds. Required to be at least 1 after defaults are applied. Valid only for restart_on_failure. |
Rules:
- In a multi-node allocation, implicit helper services are pinned to HPC_COMPOSE_PRIMARY_NODE.
- Explicit service placements may not overlap unless one side sets placement.allow_overlap: true or uses placement.share_with.
- max_restarts, backoff_seconds, window_seconds, and max_restarts_in_window are rejected unless mode: restart_on_failure.
- Restart attempts count relaunches after the initial launch.
- Restarts trigger only for non-zero exits.
- restart_on_failure enforces both a lifetime cap (max_restarts) and a rolling-window cap (max_restarts_in_window within window_seconds) during one live batch-script execution.
- If you omit the rolling-window fields, restart_on_failure still enables default crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
- Services configured with mode: ignore cannot be used as dependencies in depends_on.
Examples:
Use the defaults when you only need bounded retries:
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
That resolves to:
- max_restarts: 3
- backoff_seconds: 5
- window_seconds: 60
- max_restarts_in_window: 3
Use explicit fields when you need a larger lifetime budget but still want a tighter crash-loop guard:
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 8
        backoff_seconds: 10
        window_seconds: 60
        max_restarts_in_window: 3
Semantics:
- The initial launch does not count as a restart.
- restart_count counts granted relaunches after the initial launch.
- max_restarts_in_window counts restart-triggering non-zero exits whose timestamps still satisfy now - event < window_seconds.
- If a non-zero exit would exceed the rolling-window cap, the job fails immediately and that blocked exit is not recorded as a consumed restart.
- Successful exits do not trigger restarts and do not add entries to the rolling window.
- The rolling window is attempt-local to one live batch-script execution. It is not hydrated from prior state.json, resume metadata, or Slurm requeue history.
- x-slurm.hooks can observe accepted restart events and blocked window_exhausted events without changing the policy decision.
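The interaction between the lifetime cap and the rolling window is easier to follow on a timeline. The sketch below annotates a policy with one possible reading of the semantics above. The exit times are invented for illustration and are not output from hpc-compose.

```yaml
# Illustrative walk-through; the timestamps are hypothetical.
services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 8            # lifetime cap
        backoff_seconds: 10
        window_seconds: 60
        max_restarts_in_window: 3  # rolling-window cap
# Non-zero exits at t=0s, t=20s, and t=40s:
#   each is within the window cap, so each is restarted (restart_count reaches 3).
# Case A, a fourth non-zero exit at t=50s:
#   the exits at 0s, 20s, and 40s all satisfy now - event < 60, so a fourth
#   would exceed the cap of 3. The job fails and restart_count stays at 3.
# Case B, the fourth non-zero exit arrives at t=70s instead:
#   the exit at 0s has aged out (70 - 0 >= 60), leaving two entries in the
#   window, so the restart is granted and restart_count becomes 4.
```

In case A the lifetime budget of 8 is still mostly unused, which is the point of the rolling window: it stops a tight crash loop before the larger budget is spent.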
Tracked state:
- status --format json includes failure_policy_mode, restart_count, max_restarts, window_seconds, max_restarts_in_window, restart_failures_in_window, and last_exit_code for each tracked service.
- Text status renders the live rolling-window budget as window=<current>/<max>@<seconds>s.
Unknown keys under top-level x-slurm or per-service x-slurm cause hard errors.
x-runtime.prepare and x-enroot.prepare
x-runtime.prepare lets a service build a prepared runtime image from its base image before submission. x-enroot.prepare remains accepted as a Pyxis-only compatibility spelling.
services:
  app:
    image: python:3.11-slim
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir numpy pandas
        mounts:
          - ./requirements.txt:/tmp/requirements.txt
        env:
          PIP_CACHE_DIR: /tmp/pip-cache
        root: true
| Field | Shape | Default | Notes |
|---|---|---|---|
| commands | list of strings | required when prepare is present | Each command runs through the selected backend’s writable prepare flow. |
| mounts | list of host_path:container_path strings | omitted | Visible only during prepare. Relative host paths resolve against the compose file directory. |
| env | mapping or list of KEY=VALUE strings | omitted | Passed only during prepare. Values support the same interpolation rules as environment. |
| root | boolean | true | Controls whether prepare commands request root/fakeroot behavior where the backend supports it. |
Rules:
- If x-runtime.prepare or x-enroot.prepare is present, commands cannot be empty.
- A service may not set both spellings.
- x-enroot.prepare is rejected when runtime.backend is not pyxis.
- If prepare.mounts is non-empty, the service rebuilds on every prepare or up.
- Remote base images are imported under cache_dir/base.
- Prepared images are exported under cache_dir/prepared.
- Unknown keys under x-runtime, x-enroot, or prepare cause hard errors.
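For existing Pyxis specs, the compatibility spelling might look like the sketch below. It assumes x-enroot.prepare accepts the same fields as x-runtime.prepare, which this page implies but does not spell out field by field.

```yaml
# Compatibility spelling; only valid when the backend is pyxis.
# Assumption: the prepare fields mirror those of x-runtime.prepare.
runtime:
  backend: pyxis
services:
  app:
    image: python:3.11-slim
    x-enroot:
      prepare:
        commands:
          - pip install --no-cache-dir numpy pandas
```

Because no mounts are listed, the rule about rebuilding on every prepare or up does not apply to this service.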
Unsupported Compose keys
These keys are rejected with explicit messages:
- build
- ports
- networks
- network_mode
- Compose restart (use services.<name>.x-slurm.failure_policy)
- deploy
Any other unknown key at the service level is also rejected.
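When migrating a Docker Compose file that relies on restart, the closest replacement is a failure policy. The following is a rough sketch, not an exact behavioral equivalent. The service name, image, and chosen limit are illustrative.

```yaml
services:
  worker:
    image: python:3.11-slim
    # restart: on-failure          # Docker Compose key; rejected by hpc-compose
    x-slurm:
      failure_policy:
        mode: restart_on_failure   # restarts on non-zero exits only
        max_restarts: 3            # bounded, unlike an unbounded Compose restart
```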