# Examples

These examples are the fastest way to understand the intended `hpc-compose` workflows and adapt them to a real application.
For almost every example, the normal run is:

```bash
hpc-compose submit --watch -f examples/<example>.yaml
```

Use the debugging flow (`validate`, `inspect`, `preflight`, `prepare`) when you are wiring up the example for the first time or isolating a failure.
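As a sketch, the debugging flow might look like the following sequence. The subcommand names come from the text above, but passing `-f` to each of them is carried over from the `submit` example and is an assumption, not documented behavior:

```
# Illustrative only: subcommand names are from this document;
# the -f flag on each step is an assumption.
hpc-compose validate  -f examples/minimal-batch.yaml   # check the file parses and is well-formed
hpc-compose inspect   -f examples/minimal-batch.yaml   # review the planned services and resources
hpc-compose preflight -f examples/minimal-batch.yaml   # confirm cluster prerequisites
hpc-compose prepare   -f examples/minimal-batch.yaml   # warm the cache before submitting
```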
If you want one of these files written straight to your working directory, use:

```bash
hpc-compose init --template dev-python-app --name my-app --cache-dir /shared/$USER/hpc-compose-cache --output compose.yaml
```
## Example matrix
| Example | What it demonstrates | When to start from it |
|---|---|---|
| `app-redis-worker.yaml` | Multiple services, `depends_on`, and TCP readiness checks | You need service startup ordering or a small multi-service stack |
| `dev-python-app.yaml` | Mounted source code plus `x-enroot.prepare.commands` for dependencies | You want an iterative development workflow |
| `llm-curl-workflow.yaml` | End-to-end LLM request flow with a login-node prepare step and a curl client | You want the smallest concrete inference workflow |
| `llm-curl-workflow-workdir.yaml` | Same LLM workflow, but anchored under `$HOME/models` for direct use on a login node | You want the lowest-overhead path from a login-node home directory |
| `llama-app.yaml` | GPU-backed service, mounted model files, dependent app service | You need accelerator resources or a model-serving pattern |
| `llama-uv-worker.yaml` | llama.cpp serving plus a source-mounted Python worker executed through uv | You want the GGUF server + mounted worker pattern |
| `minimal-batch.yaml` | Single service, no dependencies, no GPU, no prepare | You want the simplest possible starting point |
| `multi-node-mpi.yaml` | One primary-node helper plus one allocation-wide distributed CPU step | You want a minimal multi-node pattern without adding orchestration |
| `multi-node-torchrun.yaml` | Allocation-wide torchrun launch using the primary node as rendezvous | You want a multi-node GPU training starting point |
| `training-checkpoints.yaml` | GPU training with checkpoints written to shared storage | You need a batch training workflow with artifact collection |
| `training-resume.yaml` | GPU training with a shared resume directory and attempt-aware checkpoints | You need restart-safe checkpoint semantics across requeues or repeated submissions |
| `postgres-etl.yaml` | PostgreSQL plus a Python data processing job | You need a database-backed batch pipeline |
| `vllm-openai.yaml` | vLLM serving with an in-job Python client | You want vLLM-based inference instead of llama.cpp |
| `vllm-uv-worker.yaml` | vLLM serving plus a source-mounted Python worker executed through uv | You want a common LLM stack with mounted app code |
| `mpi-hello.yaml` | MPI hello world compiled and run with Open MPI | You need an MPI workload |
| `multi-stage-pipeline.yaml` | Two-stage pipeline coordinating through the shared job mount | You need file-based stage-to-stage handoff |
| `fairseq-preprocess.yaml` | CPU-heavy NLP data preprocessing with parallel workers | You need a CPU-bound data preprocessing pipeline |
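As a rough sketch of the shape these files take, a minimal single-service file in the spirit of `minimal-batch.yaml` might look like the following. Only `image`, `command`, and `x-slurm.cache_dir` are field names this document mentions; everything else follows ordinary Compose conventions and is an assumption about the schema:

```yaml
# Hypothetical minimal compose file. Field names other than image, command,
# and x-slurm.cache_dir are assumptions, not the tool's documented schema.
services:
  hello:
    image: ubuntu:24.04
    command: ["echo", "hello from the cluster"]

x-slurm:
  # Must be visible from both the login node and the compute nodes.
  cache_dir: /shared/$USER/hpc-compose-cache
```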
## Which example should I start from?
- Start with `minimal-batch.yaml` if you are new to `hpc-compose` and want the smallest possible file.
- Start with `multi-node-mpi.yaml` if you need one distributed step plus small helper services on the primary node.
- Start with `multi-node-torchrun.yaml` if you need a torchrun-style rendezvous pattern across multiple nodes.
- Start with `dev-python-app.yaml` if you want a source-mounted development loop.
- Start with `llm-curl-workflow-workdir.yaml` if you want the fastest real-cluster GPU inference example.
- Start with `training-checkpoints.yaml` if you need a GPU training job with checkpoint output.
- Start with `training-resume.yaml` if you need resume-aware checkpoints on shared storage.
- Start with `app-redis-worker.yaml` or `postgres-etl.yaml` if your workload depends on multi-service startup ordering.
Companion notes for the more involved examples live alongside the example assets:
- `examples/llm-curl/README.md`
- `examples/llama-uv-worker/README.md`
- `examples/vllm-uv-worker/README.md`
- `examples/models/README.md`
## Adaptation checklist
- Copy the closest example to your own `compose.yaml`, or run `hpc-compose init --template <name> --name my-app --cache-dir /shared/$USER/hpc-compose-cache --output compose.yaml`.
- Set `x-slurm.cache_dir` to a path visible from both the login node and the compute nodes.
- Replace the example `image`, `command`, `environment`, and `volumes` with your workload.
- Keep active source in `volumes` and keep slower-changing dependency installation in `x-enroot.prepare.commands`.
- Add `readiness` to services that must be reachable before dependents continue.
- Adjust top-level or per-service `x-slurm` settings for your cluster.
- Run the debugging flow before the first submit when you need to confirm planning, prerequisites, or cache behavior.
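The checklist items above can be pictured together in one file. The field names `volumes`, `readiness`, `x-enroot.prepare.commands`, and `x-slurm.cache_dir` come from this document, but the exact structure shown for each of them is an illustrative assumption, not the tool's documented schema:

```yaml
# Illustrative sketch only: the shape of readiness, volumes, and
# prepare.commands below is an assumption based on the checklist field names.
services:
  app:
    image: python:3.12
    command: ["python", "/src/main.py"]
    volumes:
      - ./src:/src            # active source, mounted for fast iteration
    readiness:
      tcp: 8000               # assumed readiness-check shape

x-enroot:
  prepare:
    commands:
      # Slower-changing dependency installation lives here, not in volumes.
      - pip install -r /src/requirements.txt

x-slurm:
  cache_dir: /shared/$USER/hpc-compose-cache   # visible from login and compute nodes
```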