Examples

These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.

There are two starting points:

built-in starter templates generated by hpc-compose new
repository example files copied directly from examples/

Use the CLI recommendation flow when you want a ranked starting point, or the coverage map when you want to inspect every shipped example by workflow or tag:

hpc-compose examples recommend
hpc-compose examples recommend 'vllm worker'
hpc-compose examples recommend 'multi-node training' --tag gpu
hpc-compose examples list --tag mpi
hpc-compose examples search 'vllm worker'

hpc-compose examples recommend is static and authoring-only: it uses the checked-in example registry, tags, and prerequisite notes; it does not inspect the cluster, contact Slurm, or submit jobs. Each result explains why it matched and prints safe next commands such as hpc-compose new, cp, plan, and plan --show-script.

Before launching anything, run the safe authoring path first:

hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

If you are reading from a source checkout, you can run the same static checks directly against examples/minimal-batch.yaml.

Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the settings/builtin cache default. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
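
The three commands above can be folded into one reusable check. This is a minimal sketch, not part of hpc-compose: the function name check_cache_dir is hypothetical, and it only verifies that the directory exists and is writable from the host where it runs. Confirming visibility from compute nodes still requires running the same check inside an allocation.

```shell
# Hypothetical helper (not an hpc-compose command): verify that a cache
# directory exists, or can be created, and is writable from this host.
check_cache_dir() {
  # Use the first argument if given, otherwise fall back to $CACHE_DIR.
  dir="${1:-${CACHE_DIR:-}}"
  if [ -z "$dir" ]; then
    echo "CACHE_DIR is not set" >&2
    return 2
  fi
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "cannot create $dir" >&2
    return 1
  fi
  if ! test -w "$dir"; then
    echo "$dir is not writable" >&2
    return 1
  fi
  echo "cache directory OK: $dir"
}
```

Call it as check_cache_dir /cluster/shared/hpc-compose-cache, or with no argument after exporting CACHE_DIR.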

Start Here: The Four Promoted Examples

These four examples are the recommended entry points for new users.

`minimal-batch.yaml`

Demonstrates: one service, no dependencies, no image prepare step
Expected prerequisites: any machine for plan; a Linux Slurm login node plus the selected runtime backend for up
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/minimal-batch.yaml
Success signal: the batch log prints Hello from Slurm!

`app-redis-worker.yaml`

Demonstrates: multi-service startup ordering plus TCP readiness inside one allocation
Expected prerequisites: a normal Slurm + Enroot submission host and shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/app-redis-worker.yaml
Success signal: worker.log shows a successful Redis PING followed by repeated INCR jobs calls

`llm-curl-workflow-workdir.yaml`

Demonstrates: one GPU-backed LLM service plus one client service in the same job
Expected prerequisites: a GGUF model at $HOME/models/model.gguf, a GPU-capable Slurm target, and shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/llm-curl-workflow-workdir.yaml
Success signal: curl_client.log contains a JSON response from /v1/chat/completions

`training-resume.yaml`

Demonstrates: checkpoint export, resume-aware reruns, and attempt-aware training state
Expected prerequisites: shared storage for x-slurm.resume.path plus shared CACHE_DIR
Cluster run, Linux Slurm login node only: hpc-compose up -f examples/training-resume.yaml
Success signal: results/<job-id>/ contains exported checkpoints and later attempts resume from the previously saved epoch
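
Several of the success signals above are strings to look for in a log file, so the check can be scripted. A minimal sketch, assuming you already know where the log file lives: the helper name log_has_signal is hypothetical, and this page does not specify where hpc-compose writes its logs, so the path is left to you.

```shell
# Hypothetical helper: succeed (exit 0) if a log file contains a literal
# string, exit 1 if it does not, and exit 2 if the file is unreadable.
log_has_signal() {
  log_file="$1"
  signal="$2"
  if [ ! -r "$log_file" ]; then
    echo "log not readable: $log_file" >&2
    return 2
  fi
  # -F matches the signal as a fixed string; -q suppresses output.
  grep -F -q -- "$signal" "$log_file"
}
```

For the minimal-batch example, something like log_has_signal path/to/batch.log 'Hello from Slurm!' would report success once the job has printed its greeting.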

Beginner Ladder

For a guided version of the first five concepts, run hpc-compose evolve --output compose.yaml. The progressive-complexity lesson walks through minimal, second-service, readiness, failure-policy, and multi-node-placement as one evolving valid spec.

Use this ordering when you are new to the project:

Stage	Start here	Why
Authoring only	`minimal-batch.yaml` with `plan` and `plan --show-script`	Confirms the tool understands a spec without touching Slurm.
First cluster run	`minimal-batch.yaml` on a Linux Slurm login node	Smallest real submission and log-check path.
Single-node multi-service	`app-redis-worker.yaml`	Shows `depends_on` plus TCP readiness.
GPU or LLM serving	`llm-curl-workflow-workdir.yaml`, `llama-app.yaml`, or `vllm-openai.yaml`	Adds accelerator resources and service/client coordination.
Durable training	`training-checkpoints.yaml` or `training-resume.yaml`	Adds artifacts, checkpoints, and resume semantics.
Distributed launch	`multi-node-mpi.yaml`, `multi-node-torchrun.yaml`, or framework-specific examples below	Adds allocation-wide or explicitly placed multi-node services.

Built-In Starter Templates

Use built-in templates when you want hpc-compose to write a fresh compose.yaml with your application name filled in for you.

hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml

If the workflow you want is not listed by --list-templates, copy the closest repository example directly from examples/.

Broader Example Matrix

The matrix below covers the broader set of runnable examples beyond the four promoted starts. “Built-in template” means hpc-compose new --template <name> can scaffold it; “repository file” means copy the YAML from examples/ directly. Generate the same coverage map from the CLI with hpc-compose examples coverage --format markdown.

Example	Availability	Tags	What it demonstrates	When to start from it
`minimal-batch.yaml`	Built-in template	`beginner`, `batch`, `single-service`	Smallest single-service batch job.	You are new to hpc-compose and want the smallest possible file.
`dev-python-app.yaml`	Built-in template	`dev`, `python`, `prepare`, `hot-reload`	Mounted source code plus x-runtime.prepare.commands for dependencies.	You want an iterative source-mounted development workflow.
`dev-python-smoke.yaml`	Repository file	`test`, `python`, `dev`, `finite`	Finite test variant of the source-mounted Python app.	You want to test a development spec without a long-running process.
`cuda-probe.yaml`	Repository file	`gpu`, `cuda`, `probe`, `nvidia-smi`, `diagnostics`	Lightweight compute-node GPU/CUDA probe: hostname, nvidia-smi, and device files.	You want a fast nvidia-smi check that GPU allocation works before any real training run.
`jupyter.yaml`	Built-in template	`notebook`, `jupyter`, `gpu`, `interactive`	Tracked JupyterLab notebook server with log readiness on a GPU allocation.	You want an interactive notebook on a compute node; pair with `hpc-compose notebook`.
`app-redis-worker.yaml`	Built-in template	`multi-service`, `readiness`, `redis`, `tcp`	Multiple services with startup ordering and TCP readiness.	Your workload depends on multi-service startup ordering.
`restart-policy.yaml`	Built-in template	`failure-policy`, `restart`, `resilience`	Bounded restart_on_failure with rolling-window crash-loop guards.	You need transient-failure retries without letting a service spin forever.
`llm-curl-workflow.yaml`	Built-in template	`llm`, `curl`, `inference`, `readiness`	Repo-local LLM service with a dependent curl client.	You want the smallest concrete inference workflow under the repository tree.
`llm-curl-workflow-workdir.yaml`	Built-in template	`llm`, `curl`, `inference`, `workdir`	Home-directory LLM workflow for direct login-node use.	You want the smallest real-cluster inference workflow.
`llama-app.yaml`	Built-in template	`llm`, `gpu`, `model-serving`, `readiness`	GPU-backed service, mounted model files, and dependent app service.	You need accelerator resources or a model-serving pattern.
`llama-uv-worker.yaml`	Built-in template	`llm`, `uv`, `worker`, `python`, `llama`	llama.cpp serving plus a source-mounted Python worker run through uv.	You want the GGUF server plus mounted worker pattern.
`hf-stage-model.yaml`	Repository file	`llm`, `gpu`, `model-serving`, `huggingface`, `stage-in`	Cluster-side hf:// stage_in of a pinned HuggingFace model into a GPU service.	You want hpc-compose to download a pinned model inside the allocation, not on your laptop.
`vllm-openai.yaml`	Built-in template	`llm`, `vllm`, `openai`, `gpu`	vLLM serving with an in-job Python client.	You want vLLM-based inference instead of llama.cpp.
`vllm-uv-worker.yaml`	Built-in template	`llm`, `vllm`, `uv`, `worker`, `python`	vLLM serving plus a source-mounted Python worker run through uv.	You want a common LLM stack with mounted app code.
`eval-harness.yaml`	Built-in template	`llm`, `vllm`, `eval`, `lm-eval-harness`, `openai`, `artifacts`, `sweep`, `gpu`	vLLM OpenAI server with HTTP /health readiness plus an lm-eval-harness client and a results.json artifact, including a model/tasks sweep stub.	You want to benchmark a served model with lm-eval-harness against a loopback OpenAI endpoint.
`training-checkpoints.yaml`	Built-in template	`training`, `gpu`, `checkpoints`, `artifacts`	GPU training with checkpoints exported to shared storage.	You need durable checkpoint outputs but not automatic resume semantics.
`training-resume.yaml`	Built-in template	`training`, `gpu`, `resume`, `checkpoints`	GPU training with a shared resume directory and attempt-aware checkpoints.	The run should resume from shared storage across retries or later submissions.
`training-sweep.yaml`	Repository file	`training`, `sweep`, `hyperparameters`	Embedded sweep parameters with interpolation defaults.	You want many independent trial allocations from one sweep block.
`training-tensorboard.yaml`	Repository file	`training`, `gpu`, `tensorboard`, `sidecar`, `http-readiness`, `artifacts`	GPU training writing TensorBoard events to a shared logdir with an HTTP-readiness TensorBoard sidecar.	You want a training run with a live TensorBoard sidecar and exported event-file artifacts.
`fairseq-preprocess.yaml`	Built-in template	`training`, `nlp`, `cpu`, `preprocess`	CPU-heavy NLP data preprocessing with parallel workers.	You need a CPU-bound data preprocessing pipeline.
`canary-right-size.yaml`	Repository file	`training`, `canary`, `rightsize`, `metrics`	Deliberately over-requested training probe for germinate.	Your first question is whether a large GPU or memory request is justified.
`mpi-hello.yaml`	Built-in template	`distributed`, `mpi`, `hello`	MPI hello world using service-level x-slurm.mpi.	You need a small first-class MPI workload.
`mpi-pmix-v4-host-mpi.yaml`	Built-in template	`distributed`, `mpi`, `pmix`, `host-mpi`	Versioned PMIx launch plus host MPI bind/env configuration.	Your site requires a host MPI stack inside containers.
`multi-node-mpi.yaml`	Built-in template	`distributed`, `mpi`, `multi-node`	Primary-node helper plus one allocation-wide distributed MPI step.	You want a minimal multi-node MPI pattern without extra orchestration.
`multi-node-partitioned.yaml`	Repository file	`distributed`, `multi-node`, `placement`, `partitioned`	Disjoint node ranges, fractional selection, and explicit co-location.	Multiple distributed roles need explicit node ranges or share_with co-location.
`multi-node-torchrun.yaml`	Built-in template	`distributed`, `torchrun`, `gpu`, `training`	Allocation-wide torchrun launch using the primary node as rendezvous.	You want a multi-node GPU training starting point.
`multi-node-deepspeed.yaml`	Built-in template	`distributed`, `deepspeed`, `gpu`, `training`	DeepSpeed no-SSH multi-node training with generated rendezvous env.	You want distributed fine-tuning without hand-written rendezvous setup.
`multi-node-accelerate.yaml`	Built-in template	`distributed`, `accelerate`, `hugging-face`, `training`	Hugging Face Accelerate multi-machine launch.	You want an Accelerate-based training or fine-tuning starting point.
`multi-node-horovod.yaml`	Built-in template	`distributed`, `horovod`, `mpi`, `gpu`	Horovod rank-per-GPU launch through Slurm MPI.	You want Horovod without SSH fanout.
`multi-node-jax.yaml`	Built-in template	`distributed`, `jax`, `gpu`, `training`	JAX distributed training with generated coordinator env.	You want a JAX distributed starting point.
`nccl-tests.yaml`	Built-in template	`distributed`, `nccl`, `mpi`, `gpu`, `fabric`	MPI-backed NCCL all-reduce test job for GPU fabric debugging.	You need to debug NCCL, InfiniBand, UCX, or OFI before real training.
`ray-symmetric.yaml`	Built-in template	`distributed`, `ray`, `symmetric`	Ray symmetric-run across one Slurm allocation.	You want a modern Ray-on-Slurm starting point without an autoscaler.
`ray-head-workers.yaml`	Built-in template	`distributed`, `ray`, `workers`	Ray head plus workers inside one Slurm allocation.	You need explicit Ray head/worker control for an older or site-specific setup.
`dask-scheduler-workers.yaml`	Built-in template	`distributed`, `dask`, `workers`	Dask scheduler on the primary node plus allocation workers.	You want Dask CLI deployment inside one Slurm allocation.
`spark-standalone.yaml`	Built-in template	`distributed`, `spark`, `workers`	Spark standalone master, workers, and app submission inside one allocation.	You need a conservative Spark standalone pattern without external cluster management.
`flux-nested.yaml`	Built-in template	`distributed`, `flux`, `nested`	Nested Flux instance launched inside a Slurm allocation.	You want Flux scheduling inside an existing Slurm allocation.
`postgres-etl.yaml`	Built-in template	`workflow`, `postgres`, `etl`, `python`	PostgreSQL plus a Python data processing job.	You need a database-backed batch pipeline.
`nextflow-bridge.yaml`	Built-in template	`workflow`, `nextflow`, `bridge`	Nextflow command wrapper inside one hpc-compose allocation.	You want hpc-compose tracking around a workflow-engine run.
`snakemake-bridge.yaml`	Built-in template	`workflow`, `snakemake`, `bridge`	Snakemake command wrapper inside one hpc-compose allocation.	You want hpc-compose tracking around a Snakemake run.
`multi-stage-pipeline.yaml`	Built-in template	`workflow`, `pipeline`, `artifacts`	Two-stage data pipeline coordinating through the shared job mount.	You need file-based stage-to-stage handoff.
`pipeline-dag.yaml`	Built-in template	`workflow`, `dag`, `pipeline`, `depends-on`	One-shot preprocess -> train -> postprocess DAG with completion dependencies.	You need stage completion, not service readiness, to gate downstream work.
`rendezvous-model-server.yaml`	Repository file	`workflow`, `rendezvous`, `model-serving`	Provider job that registers a model-server endpoint in the shared cache.	One Slurm allocation should publish a service for later jobs.
`rendezvous-client.yaml`	Repository file	`workflow`, `rendezvous`, `client`	Separate client job resolving HPC_COMPOSE_RDZV_MODEL_SERVER_URL.	A later job should discover a provider through shared storage.

Which Example Should I Start From?

Run hpc-compose examples recommend with no query for the default beginner path, or pass a short workflow description when you already know the shape you need:

hpc-compose examples recommend
hpc-compose examples recommend 'checkpoint resume training'
hpc-compose examples recommend 'workflow engine bridge'
hpc-compose examples recommend 'separate rendezvous jobs' --format json

The recommendation output is the maintained chooser. It reuses the same registry metadata that feeds the coverage table above, so new examples, tags, and prerequisite notes only need to be updated in one place.

Companion notes for the more involved examples live alongside the example assets.

Development Workflow Recipe

examples/dev-python-app.yaml mounts examples/app/ and runs a long-lived Python process, so it is best for hot reload:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach

examples/dev-python-smoke.yaml keeps the same mounted-source shape but uses a finite command, so it is suitable for smoke tests:

hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml

Adaptation Checklist

Copy the closest repository example to your own compose.yaml, or run hpc-compose new --template <name> --name my-app --output compose.yaml when a matching built-in template exists.
Configure a cache path visible from both the login node and compute nodes through hpc-compose setup --cache-dir, x-slurm.cache_dir, or [defaults.cache] / [profiles.<name>.cache].
Override CACHE_DIR before running repository examples that use ${CACHE_DIR:-...}, or replace the default cache path in your copied file.
Replace the example image, command, environment, and volumes with your workload.
Keep active source in volumes and keep slower-changing dependency installation in x-runtime.prepare.commands.
Add readiness to services that must be reachable before dependents continue.
Adjust top-level or per-service x-slurm settings for your cluster.
Run hpc-compose plan -f compose.yaml before the first run, and hpc-compose debug -f compose.yaml --preflight if that run fails.
Run cluster up only from a supported Linux Slurm submission host with the selected runtime backend available.
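
The static steps of this checklist can be chained so that a failing plan stops you before any submission. A minimal sketch that uses only the plan and plan --show-script invocations shown earlier on this page; the function name static_check and the HPC_COMPOSE variable, which lets you point at a different binary, are hypothetical conveniences and not part of the tool.

```shell
# Hypothetical wrapper: run the static authoring checks against one spec
# and stop at the first failure. HPC_COMPOSE is an assumed override for
# the binary name; it defaults to hpc-compose on PATH.
static_check() {
  spec="${1:-compose.yaml}"
  bin="${HPC_COMPOSE:-hpc-compose}"
  if [ ! -f "$spec" ]; then
    echo "spec not found: $spec" >&2
    return 2
  fi
  # Both commands are static: neither contacts Slurm nor submits a job.
  "$bin" plan -f "$spec" || return 1
  "$bin" plan --show-script -f "$spec" || return 1
  echo "static checks passed: $spec"
}
```

Running static_check compose.yaml before hpc-compose up keeps the cluster submission as the last step, which matches the order of the checklist.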
