Examples
These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.
There are two starting points:
- built-in starter templates generated by hpc-compose new
- repository example files copied directly from examples/
Before launching anything, run the safe authoring path first:
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml
If you are reading from a source checkout, you can run the same static checks directly against examples/minimal-batch.yaml.
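The two static checks can be wrapped in a small helper so they are easy to repeat for any spec. This is only a convenience sketch, not part of hpc-compose: the function name static_checks and the HPC_COMPOSE_BIN override are hypothetical, and the only hpc-compose invocations used are the plan commands shown above.

```shell
# Hypothetical helper: run the two static checks shown above against one spec.
# HPC_COMPOSE_BIN is an assumed override for illustration; it defaults to hpc-compose.
static_checks() {
    spec="${1:-compose.yaml}"
    bin="${HPC_COMPOSE_BIN:-hpc-compose}"

    if [ ! -f "$spec" ]; then
        echo "spec not found: $spec" >&2
        return 2
    fi

    # Stop at the first failing check so the error is not buried in script output.
    "$bin" plan -f "$spec" || return 1
    "$bin" plan --show-script -f "$spec" || return 1
}

# Usage (from a source checkout):
#   static_checks examples/minimal-batch.yaml
```

Neither command submits anything, so the helper should be safe to run on any machine where hpc-compose is installed.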
Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the settings/builtin cache default. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
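The three commands above can be folded into one reusable check. The sketch below is an assumption, not an hpc-compose feature: check_cache_dir is a hypothetical name, and the probe file is an extra step added because a plain test -w may not always reflect what a shared filesystem actually allows.

```shell
# Hypothetical helper: confirm a cache directory exists and accepts writes.
# It only tests the host it runs on; run it on the login node and again
# on a compute node to confirm the path is visible from both.
check_cache_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "usage: check_cache_dir <dir>" >&2
        return 2
    fi
    mkdir -p "$dir" 2>/dev/null || { echo "cannot create $dir" >&2; return 1; }

    # Create and remove a real probe file instead of relying on test -w alone.
    probe="$dir/.write-probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "ok: $dir is writable on $(uname -n)"
    else
        echo "not writable: $dir" >&2
        return 1
    fi
}

# Usage:
#   check_cache_dir "${CACHE_DIR:-/cluster/shared/hpc-compose-cache}"
```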
Start Here: The Four Promoted Examples
These four examples are the intended onboarding path.
minimal-batch.yaml
- Demonstrates: one service, no dependencies, no image prepare step
- Expected prerequisites: any machine for plan; a Linux Slurm login node plus the selected runtime backend for up
- Cluster run, Linux Slurm login node only: hpc-compose up -f examples/minimal-batch.yaml
- Success signal: the batch log prints Hello from Slurm!
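The success signal can be checked from a script rather than by eye. This is a minimal sketch under stated assumptions: log_has_text is a hypothetical helper, and the log path is a placeholder because the location of the batch log depends on your setup.

```shell
# Hypothetical helper: succeed if a log file contains the given text verbatim.
log_has_text() {
    log_file="$1"
    expected="$2"
    [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 2; }
    # -F treats the text as a fixed string, so characters such as "." match literally.
    grep -F -q -- "$expected" "$log_file"
}

# Usage (the path is a placeholder for wherever your batch log is written):
#   log_has_text path/to/batch.log 'Hello from Slurm!' && echo "run succeeded"
```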
app-redis-worker.yaml
- Demonstrates: multi-service startup ordering plus TCP readiness inside one allocation
- Expected prerequisites: a normal Slurm + Enroot submission host and shared CACHE_DIR
- Cluster run, Linux Slurm login node only: hpc-compose up -f examples/app-redis-worker.yaml
- Success signal: worker.log shows a successful Redis PING followed by repeated INCR jobs calls
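Because this success signal depends on ordering (PING first, then repeated INCR jobs calls), a plain text search is not quite enough. The sketch below is one possible check, not part of hpc-compose: redis_worker_ok is a hypothetical name, and the PING and INCR jobs patterns are assumptions about how the worker words its log lines.

```shell
# Hypothetical check: a line mentioning PING must appear before at least
# min_incr lines mentioning "INCR jobs". The patterns are assumptions about
# the log text; adjust them to match what your worker actually prints.
redis_worker_ok() {
    log_file="$1"
    min_incr="${2:-2}"
    awk -v min="$min_incr" '
        /PING/                   { seen_ping = 1 }
        seen_ping && /INCR jobs/ { incr_count++ }
        END { exit (seen_ping && incr_count + 0 >= min + 0) ? 0 : 1 }
    ' "$log_file"
}

# Usage (the path is a placeholder for wherever worker.log is written):
#   redis_worker_ok path/to/worker.log && echo "worker reached Redis"
```

INCR lines that appear before the first PING are ignored, which is what makes this an ordering check rather than a simple count.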
llm-curl-workflow-workdir.yaml
- Demonstrates: one GPU-backed LLM service plus one client service in the same job
- Expected prerequisites: a GGUF model at $HOME/models/model.gguf, a GPU-capable Slurm target, and shared CACHE_DIR
- Cluster run, Linux Slurm login node only: hpc-compose up -f examples/llm-curl-workflow-workdir.yaml
- Success signal: curl_client.log contains a JSON response from /v1/chat/completions
training-resume.yaml
- Demonstrates: checkpoint export, resume-aware reruns, and attempt-aware training state
- Expected prerequisites: shared storage for x-slurm.resume.path plus shared CACHE_DIR
- Cluster run, Linux Slurm login node only: hpc-compose up -f examples/training-resume.yaml
- Success signal: results/<job-id>/ contains exported checkpoints and later attempts resume from the previously saved epoch
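A quick way to confirm the first half of this success signal is to count what was exported. The sketch below is a hypothetical helper, not an hpc-compose command. It makes no assumption about checkpoint file names and simply counts regular files under the directory you pass in.

```shell
# Hypothetical helper: print how many regular files exist under a results
# directory. Checkpoint file names are not assumed; any file counts.
count_exported_files() {
    results_dir="$1"
    if [ ! -d "$results_dir" ]; then
        echo 0
        return 1
    fi
    find "$results_dir" -type f | wc -l | tr -d ' '
}

# Usage (replace <job-id> with the real job ID):
#   count_exported_files "results/<job-id>"
```

Whether a later attempt actually resumed from the saved epoch still has to be read from the training logs; a file count cannot show that.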
Beginner Ladder
Use this ordering when you are new to the project:
For a guided version of the first five concepts, run hpc-compose evolve --output compose.yaml. The progressive-complexity lesson walks through minimal, second-service, readiness, failure-policy, and multi-node-placement as one evolving valid spec.
| Stage | Start here | Why |
|---|---|---|
| Authoring only | minimal-batch.yaml with plan and plan --show-script | Confirms the tool understands a spec without touching Slurm. |
| First cluster run | minimal-batch.yaml on a Linux Slurm login node | Smallest real submission and log-check path. |
| Single-node multi-service | app-redis-worker.yaml | Shows depends_on plus TCP readiness. |
| GPU or LLM serving | llm-curl-workflow-workdir.yaml, llama-app.yaml, or vllm-openai.yaml | Adds accelerator resources and service/client coordination. |
| Durable training | training-checkpoints.yaml or training-resume.yaml | Adds artifacts, checkpoints, and resume semantics. |
| Distributed launch | multi-node-mpi.yaml, multi-node-torchrun.yaml, or framework-specific examples below | Adds allocation-wide or explicitly placed multi-node services. |
Built-In Starter Templates
Use built-in templates when you want hpc-compose to write a fresh compose.yaml with your application name filled in for you.
hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml
If the workflow you want is not listed by --list-templates, copy the closest repository example directly from examples/.
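Copying a repository example is a one-line cp, but a small guard helps avoid overwriting a spec you have already edited. The helper below is a sketch under assumptions: copy_example and the EXAMPLES_DIR override are hypothetical names, and it expects to run from a source checkout where examples/ exists.

```shell
# Hypothetical helper: copy a repository example to a new spec file,
# refusing to overwrite an existing one.
copy_example() {
    name="$1"
    dest="${2:-compose.yaml}"
    # EXAMPLES_DIR is an assumed override for illustration; it defaults to examples/.
    src="${EXAMPLES_DIR:-examples}/$name.yaml"

    [ -f "$src" ] || { echo "no such example: $src" >&2; return 1; }
    [ ! -e "$dest" ] || { echo "refusing to overwrite $dest" >&2; return 1; }
    cp "$src" "$dest"
}

# Usage (from a source checkout):
#   copy_example multi-node-partitioned compose.yaml
```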
Broader Example Matrix
The matrix below covers the broader set of runnable examples beyond the four promoted starts. “Built-in template” means hpc-compose new --template <name> can scaffold it; “repository file” means copy the YAML from examples/ directly.
| Example | Availability | What it demonstrates | When to start from it |
|---|---|---|---|
| dev-python-app.yaml | Built-in template | Mounted source code plus x-runtime.prepare.commands for dependencies | You want an iterative development workflow |
| dev-python-smoke.yaml | Repository file | Finite smoke-test variant of the source-mounted Python app | You want to test a development spec without a long-running process |
| llm-curl-workflow.yaml | Built-in template | Repo-local variant of the smallest concrete inference workflow | You want the same LLM stack but with models under the repository tree |
| llama-app.yaml | Built-in template | GPU-backed service, mounted model files, dependent app service | You need accelerator resources or a model-serving pattern |
| llama-uv-worker.yaml | Built-in template | llama.cpp serving plus a source-mounted Python worker executed through uv | You want the GGUF server plus mounted worker pattern |
| multi-node-mpi.yaml | Built-in template | First-class MPI launch, generated MPI hostfile, and one primary-node helper | You want a minimal multi-node MPI pattern without extra orchestration |
| mpi-pmix-v4-host-mpi.yaml | Built-in template | Versioned PMIx launch plus host MPI bind/env configuration | Your site requires a host MPI stack inside containers |
| multi-node-partitioned.yaml | Repository file | Disjoint node ranges, fractional node selection, and explicit co-location | You want multiple distributed roles inside one allocation |
| multi-node-torchrun.yaml | Built-in template | Allocation-wide torchrun launch using the primary node as rendezvous | You want a multi-node GPU training starting point |
| multi-node-deepspeed.yaml | Built-in template | DeepSpeed no-SSH launch using generated rendezvous and hostfile env | You want distributed fine-tuning without hand-written rendezvous setup |
| multi-node-accelerate.yaml | Built-in template | Hugging Face Accelerate multi-machine launch | You want an Accelerate-based training or fine-tuning starting point |
| multi-node-horovod.yaml | Built-in template | Horovod rank-per-GPU launch through Slurm MPI | You want Horovod without SSH fanout |
| multi-node-jax.yaml | Built-in template | JAX coordinator/process metadata for jax.distributed.initialize | You want a JAX distributed starting point |
| nccl-tests.yaml | Built-in template | MPI-backed NCCL all-reduce probe | You are debugging GPU fabric, NCCL, UCX, or OFI settings |
| ray-symmetric.yaml | Built-in template | Ray symmetric-run across one Slurm allocation | You want a modern Ray-on-Slurm starting point without an autoscaler |
| ray-head-workers.yaml | Built-in template | Ray head plus worker steps inside one allocation | You need explicit Ray head/worker control for an older or site-specific setup |
| dask-scheduler-workers.yaml | Built-in template | Dask scheduler on the primary node plus allocation workers | You want Dask CLI deployment inside one Slurm allocation |
| spark-standalone.yaml | Built-in template | Spark standalone master, workers, and app submission | You need a conservative Spark standalone pattern without external cluster management |
| flux-nested.yaml | Built-in template | Nested Flux launched through srun flux start | You want Flux scheduling inside an existing Slurm allocation |
| nextflow-bridge.yaml | Built-in template | Nextflow command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a workflow engine run without parsing Nextflow files |
| snakemake-bridge.yaml | Built-in template | Snakemake command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a Snakemake run without replacing Snakemake scheduling semantics |
| postgres-etl.yaml | Built-in template | PostgreSQL plus a Python data processing job | You need a database-backed batch pipeline |
| restart-policy.yaml | Built-in template | Per-service restart_on_failure with bounded retries and a rolling-window crash-loop guard | You need transient-failure retries without letting one service spin forever |
| training-checkpoints.yaml | Built-in template | GPU training with checkpoints exported to shared storage | You need durable checkpoint outputs but not automatic resume semantics |
| training-sweep.yaml | Repository file | Embedded sweep parameters with interpolation defaults for dry-run and normal render workflows | You want a small hyperparameter sweep starting point |
| vllm-openai.yaml | Built-in template | vLLM serving with an in-job Python client | You want vLLM-based inference instead of llama.cpp |
| vllm-uv-worker.yaml | Built-in template | vLLM serving plus a source-mounted Python worker executed through uv | You want a common LLM stack with mounted app code |
| mpi-hello.yaml | Built-in template | MPI hello world using service-level x-slurm.mpi | You need a small first-class MPI workload |
| multi-stage-pipeline.yaml | Built-in template | Two-stage pipeline coordinating through the shared job mount | You need file-based stage-to-stage handoff |
| pipeline-dag.yaml | Built-in template | One-shot preprocess -> train -> postprocess DAG using successful-completion dependencies | You need stage completion, not service readiness, to gate downstream work |
| fairseq-preprocess.yaml | Built-in template | CPU-heavy NLP data preprocessing with parallel workers | You need a CPU-bound data preprocessing pipeline |
| canary-right-size.yaml | Repository file | A deliberately over-requested training probe for hpc-compose germinate | You want to practice right-sizing recommendations before changing a real spec |
| rendezvous-model-server.yaml | Repository file | A provider job that registers a model-server endpoint in the shared cache | You want one Slurm allocation to publish a service for later jobs |
| rendezvous-client.yaml | Repository file | A separate client job that resolves HPC_COMPOSE_RDZV_MODEL_SERVER_URL | You want cross-job service discovery through shared storage |
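For the rendezvous pair in the last two rows, a client command can fail early when the endpoint was not resolved. The guard below is a sketch, assuming HPC_COMPOSE_RDZV_MODEL_SERVER_URL is exposed as an environment variable to the client's command. The function name require_model_server_url is hypothetical.

```shell
# Hypothetical guard for a client command: fail early with a clear message
# if the rendezvous URL is not present in the environment.
require_model_server_url() {
    if [ -z "${HPC_COMPOSE_RDZV_MODEL_SERVER_URL:-}" ]; then
        echo "HPC_COMPOSE_RDZV_MODEL_SERVER_URL is not set" >&2
        return 1
    fi
    printf '%s\n' "$HPC_COMPOSE_RDZV_MODEL_SERVER_URL"
}

# Usage inside a client command:
#   url="$(require_model_server_url)" || exit 1
#   curl -fsS "$url"
```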
Which Example Should I Start From?
- Start with minimal-batch.yaml if you are new to hpc-compose and want the smallest possible file.
- Start with app-redis-worker.yaml if your workload depends on multi-service startup ordering.
- Start with llm-curl-workflow-workdir.yaml if you want the smallest real-cluster inference workflow.
- Start with training-resume.yaml if you need resume-aware checkpoints on shared storage.
- Start with multi-node-mpi.yaml if you need one distributed step plus helper services on the primary node.
- Start with multi-node-partitioned.yaml if services need explicit node ranges or share_with co-location.
- Start with multi-node-torchrun.yaml, multi-node-deepspeed.yaml, multi-node-accelerate.yaml, or multi-node-jax.yaml if you need a launcher-style rendezvous pattern across multiple nodes.
- Start with nccl-tests.yaml when you need to debug NCCL/IB fabric before running a real training job.
- Start with ray-symmetric.yaml, dask-scheduler-workers.yaml, spark-standalone.yaml, or flux-nested.yaml if your distributed framework already fits inside one Slurm allocation.
- Start with nextflow-bridge.yaml or snakemake-bridge.yaml when you want hpc-compose submission, tracking, logs, and artifacts around a workflow-engine command. These bridge templates do not parse workflow files and do not replace the engines’ native cluster executors.
- Start with dev-python-app.yaml if you want a source-mounted development loop.
- Start with dev-python-smoke.yaml if you want a finite smoke-test companion for the source-mounted development example.
- Start with training-sweep.yaml when you want many independent trial allocations from one embedded sweep block.
- Start with restart-policy.yaml if you need a clear starting point for restart_on_failure tuning and status-visible retry budgets.
- Start with canary-right-size.yaml when your first question is whether a large GPU or memory request is justified.
- Start with rendezvous-model-server.yaml plus rendezvous-client.yaml when the provider and client should run as separate Slurm jobs.
Companion notes for the more involved examples live alongside the example assets:
- examples/llm-curl/README.md
- examples/llama-uv-worker/README.md
- examples/vllm-uv-worker/README.md
- examples/models/README.md
Development Workflow Recipe
examples/dev-python-app.yaml mounts examples/app/ and runs a long-lived Python process, so it is best for hot reload:
hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach
examples/dev-python-smoke.yaml keeps the same mounted-source shape but uses a finite command, so it is suitable for smoke tests:
hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml
Adaptation Checklist
- Copy the closest repository example to your own compose.yaml, or run hpc-compose new --template <name> --name my-app --output compose.yaml when a matching built-in template exists.
- Configure a cache path visible from both the login node and compute nodes through hpc-compose setup --cache-dir, x-slurm.cache_dir, or [defaults.cache]/[profiles.<name>.cache].
- Override CACHE_DIR before running repository examples that use ${CACHE_DIR:-...}, or replace the default cache path in your copied file.
- Replace the example image, command, environment, and volumes with your workload.
- Keep active source in volumes and keep slower-changing dependency installation in x-runtime.prepare.commands.
- Add readiness to services that must be reachable before dependents continue.
- Adjust top-level or per-service x-slurm settings for your cluster.
- Run hpc-compose plan -f compose.yaml before the first run, and hpc-compose debug -f compose.yaml --preflight if that run fails.
- Run cluster up only from a supported Linux Slurm submission host with the selected runtime backend available.
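The last two checklist items describe a small control flow: plan first, submit only if planning succeeds, and collect preflight diagnostics if the run fails. The wrapper below sketches that flow. It is not an hpc-compose feature: first_run and the HPC_COMPOSE_BIN override are hypothetical, and it assumes that up returns a non-zero exit status on failure.

```shell
# Hypothetical wrapper around the last steps of the checklist.
# HPC_COMPOSE_BIN is an assumed override for illustration; it defaults to hpc-compose.
first_run() {
    spec="${1:-compose.yaml}"
    bin="${HPC_COMPOSE_BIN:-hpc-compose}"

    # Static check first; do not submit a spec that fails to plan.
    "$bin" plan -f "$spec" || { echo "plan failed; not submitting" >&2; return 1; }

    # Assumes a failed run is reported through a non-zero exit status.
    if "$bin" up -f "$spec"; then
        return 0
    fi

    # The run failed: collect preflight diagnostics, then report failure.
    "$bin" debug -f "$spec" --preflight
    return 1
}

# Usage (on a Linux Slurm submission host):
#   first_run compose.yaml
```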