Examples

These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.

There are two starting points:

  • built-in starter templates generated by hpc-compose new
  • repository example files copied directly from examples/

Before launching anything, run the safe authoring path first:

hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

If you are reading from a source checkout, you can run the same static checks directly against examples/minimal-batch.yaml.
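The static checks above can be wrapped in a small helper so they always run together before any submission. This is a minimal sketch, not part of hpc-compose: the function name static_checks and the HPC_COMPOSE override variable are hypothetical, and only the plan and plan --show-script invocations come from the commands shown above.

```shell
# Hypothetical helper: run both static checks against one spec file.
# HPC_COMPOSE is an assumed override so the binary can be swapped out.
HPC_COMPOSE="${HPC_COMPOSE:-hpc-compose}"

static_checks() {
  spec="${1:-compose.yaml}"
  if [ ! -f "$spec" ]; then
    echo "spec not found: $spec" >&2
    return 2
  fi
  # Run plan first; show the generated script only if planning succeeds.
  "$HPC_COMPOSE" plan -f "$spec" &&
    "$HPC_COMPOSE" plan --show-script -f "$spec"
}
```

From a source checkout, static_checks examples/minimal-batch.yaml would run the same two commands against the repository example.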

Some repository examples keep an explicit ${CACHE_DIR:-/cluster/shared/hpc-compose-cache} for portability, while starter examples rely on the settings/builtin cache default. Before running on a real cluster, configure a shared path visible from both the submission host and the compute nodes:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"
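The same three steps can be folded into one function that also applies the ${CACHE_DIR:-...} fallback. This is a sketch under assumptions: the name resolve_cache_dir is hypothetical, and it assumes the example files resolve ${CACHE_DIR:-default} the way a POSIX shell does (the default is used when the variable is unset or empty).

```shell
# Hypothetical helper: resolve the cache directory with the same fallback
# as the ${CACHE_DIR:-...} default, then check it exists and is writable.
resolve_cache_dir() {
  # Falls back to the default path when CACHE_DIR is unset or empty.
  dir="${CACHE_DIR:-/cluster/shared/hpc-compose-cache}"
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "cannot create $dir" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "$dir is not writable" >&2
    return 1
  fi
  # Print the resolved path so callers can capture it.
  printf '%s\n' "$dir"
}
```

The check only covers the host it runs on. Whether the compute nodes see the same path still has to be confirmed separately.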

Start Here: The Four Promoted Examples

These four examples are the recommended entry points for new users.

minimal-batch.yaml

  • Demonstrates: one service, no dependencies, no image prepare step
  • Expected prerequisites: any machine for plan; a Linux Slurm login node plus the selected runtime backend for up
  • Cluster run, Linux Slurm login node only: hpc-compose up -f examples/minimal-batch.yaml
  • Success signal: the batch log prints Hello from Slurm!
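Checking for the success signal can be scripted with grep. The sketch below is illustrative only: the function name has_success_signal is hypothetical, and the log path is passed as an argument because the location of the batch log depends on your setup.

```shell
# Hypothetical helper: succeed when a log file contains the expected text.
has_success_signal() {
  log="$1"
  signal="${2:-Hello from Slurm!}"
  # -F matches the text literally; -q prints nothing and only sets the exit status.
  grep -qF -- "$signal" "$log"
}
```

A second argument replaces the default text, so the same helper can look for a different signal in another service log.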

app-redis-worker.yaml

  • Demonstrates: multi-service startup ordering plus TCP readiness inside one allocation
  • Expected prerequisites: a normal Slurm + Enroot submission host and shared CACHE_DIR
  • Cluster run, Linux Slurm login node only: hpc-compose up -f examples/app-redis-worker.yaml
  • Success signal: worker.log shows a successful Redis PING followed by repeated INCR jobs calls

llm-curl-workflow-workdir.yaml

  • Demonstrates: one GPU-backed LLM service plus one client service in the same job
  • Expected prerequisites: a GGUF model at $HOME/models/model.gguf, a GPU-capable Slurm target, and shared CACHE_DIR
  • Cluster run, Linux Slurm login node only: hpc-compose up -f examples/llm-curl-workflow-workdir.yaml
  • Success signal: curl_client.log contains a JSON response from /v1/chat/completions

training-resume.yaml

  • Demonstrates: checkpoint export, resume-aware reruns, and attempt-aware training state
  • Expected prerequisites: shared storage for x-slurm.resume.path plus shared CACHE_DIR
  • Cluster run, Linux Slurm login node only: hpc-compose up -f examples/training-resume.yaml
  • Success signal: results/<job-id>/ contains exported checkpoints and later attempts resume from the previously saved epoch

Beginner Ladder

Use this ordering when you are new to the project:

| Stage | Start here | Why |
|---|---|---|
| Authoring only | minimal-batch.yaml with plan and plan --show-script | Confirms the tool understands a spec without touching Slurm. |
| First cluster run | minimal-batch.yaml on a Linux Slurm login node | Smallest real submission and log-check path. |
| Single-node multi-service | app-redis-worker.yaml | Shows depends_on plus TCP readiness. |
| GPU or LLM serving | llm-curl-workflow-workdir.yaml, llama-app.yaml, or vllm-openai.yaml | Adds accelerator resources and service/client coordination. |
| Durable training | training-checkpoints.yaml or training-resume.yaml | Adds artifacts, checkpoints, and resume semantics. |
| Distributed launch | multi-node-mpi.yaml, multi-node-torchrun.yaml, or framework-specific examples below | Adds allocation-wide or explicitly placed multi-node services. |

For a guided version of the first five concepts, run hpc-compose evolve --output compose.yaml. The progressive-complexity lesson walks through minimal, second-service, readiness, failure-policy, and multi-node-placement as one evolving valid spec.

Built-In Starter Templates

Use built-in templates when you want hpc-compose to write a fresh compose.yaml with your application name filled in for you.

hpc-compose new --list-templates
hpc-compose new --describe-template minimal-batch
hpc-compose new --template minimal-batch --name my-app --output compose.yaml
hpc-compose new --template minimal-batch --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml

If the workflow you want is not listed by --list-templates, copy the closest repository example directly from examples/.
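Copying a repository example can be done with a plain cp. The sketch below adds two guards around it. The function name copy_example is hypothetical, and it assumes the current directory is a source checkout that contains examples/.

```shell
# Hypothetical helper: copy a repository example to a new spec file
# without overwriting an existing one.
copy_example() {
  name="$1"                      # example name without extension, e.g. training-sweep
  dest="${2:-compose.yaml}"
  src="examples/${name}.yaml"    # assumes a source checkout as the working directory
  if [ ! -f "$src" ]; then
    echo "no such example: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "refusing to overwrite $dest" >&2
    return 1
  fi
  cp "$src" "$dest"
}
```

For instance, copy_example training-sweep would create compose.yaml from examples/training-sweep.yaml, which is listed as a repository file rather than a built-in template.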

Broader Example Matrix

The matrix below covers the broader set of runnable examples beyond the four promoted starts. “Built-in template” means hpc-compose new --template <name> can scaffold it; “repository file” means copy the YAML from examples/ directly.

| Example | Availability | What it demonstrates | When to start from it |
|---|---|---|---|
| dev-python-app.yaml | Built-in template | Mounted source code plus x-runtime.prepare.commands for dependencies | You want an iterative development workflow |
| dev-python-smoke.yaml | Repository file | Finite smoke-test variant of the source-mounted Python app | You want to test a development spec without a long-running process |
| llm-curl-workflow.yaml | Built-in template | Repo-local variant of the smallest concrete inference workflow | You want the same LLM stack but with models under the repository tree |
| llama-app.yaml | Built-in template | GPU-backed service, mounted model files, dependent app service | You need accelerator resources or a model-serving pattern |
| llama-uv-worker.yaml | Built-in template | llama.cpp serving plus a source-mounted Python worker executed through uv | You want the GGUF server plus mounted worker pattern |
| multi-node-mpi.yaml | Built-in template | First-class MPI launch, generated MPI hostfile, and one primary-node helper | You want a minimal multi-node MPI pattern without extra orchestration |
| mpi-pmix-v4-host-mpi.yaml | Built-in template | Versioned PMIx launch plus host MPI bind/env configuration | Your site requires a host MPI stack inside containers |
| multi-node-partitioned.yaml | Repository file | Disjoint node ranges, fractional node selection, and explicit co-location | You want multiple distributed roles inside one allocation |
| multi-node-torchrun.yaml | Built-in template | Allocation-wide torchrun launch using the primary node as rendezvous | You want a multi-node GPU training starting point |
| multi-node-deepspeed.yaml | Built-in template | DeepSpeed no-SSH launch using generated rendezvous and hostfile env | You want distributed fine-tuning without hand-written rendezvous setup |
| multi-node-accelerate.yaml | Built-in template | Hugging Face Accelerate multi-machine launch | You want an Accelerate-based training or fine-tuning starting point |
| multi-node-horovod.yaml | Built-in template | Horovod rank-per-GPU launch through Slurm MPI | You want Horovod without SSH fanout |
| multi-node-jax.yaml | Built-in template | JAX coordinator/process metadata for jax.distributed.initialize | You want a JAX distributed starting point |
| nccl-tests.yaml | Built-in template | MPI-backed NCCL all-reduce probe | You are debugging GPU fabric, NCCL, UCX, or OFI settings |
| ray-symmetric.yaml | Built-in template | Ray symmetric-run across one Slurm allocation | You want a modern Ray-on-Slurm starting point without an autoscaler |
| ray-head-workers.yaml | Built-in template | Ray head plus worker steps inside one allocation | You need explicit Ray head/worker control for an older or site-specific setup |
| dask-scheduler-workers.yaml | Built-in template | Dask scheduler on the primary node plus allocation workers | You want Dask CLI deployment inside one Slurm allocation |
| spark-standalone.yaml | Built-in template | Spark standalone master, workers, and app submission | You need a conservative Spark standalone pattern without external cluster management |
| flux-nested.yaml | Built-in template | Nested Flux launched through srun flux start | You want Flux scheduling inside an existing Slurm allocation |
| nextflow-bridge.yaml | Built-in template | Nextflow command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a workflow engine run without parsing Nextflow files |
| snakemake-bridge.yaml | Built-in template | Snakemake command wrapper inside one hpc-compose allocation | You want hpc-compose tracking around a Snakemake run without replacing Snakemake scheduling semantics |
| postgres-etl.yaml | Built-in template | PostgreSQL plus a Python data processing job | You need a database-backed batch pipeline |
| restart-policy.yaml | Built-in template | Per-service restart_on_failure with bounded retries and a rolling-window crash-loop guard | You need transient-failure retries without letting one service spin forever |
| training-checkpoints.yaml | Built-in template | GPU training with checkpoints exported to shared storage | You need durable checkpoint outputs but not automatic resume semantics |
| training-sweep.yaml | Repository file | Embedded sweep parameters with interpolation defaults for dry-run and normal render workflows | You want a small hyperparameter sweep starting point |
| vllm-openai.yaml | Built-in template | vLLM serving with an in-job Python client | You want vLLM-based inference instead of llama.cpp |
| vllm-uv-worker.yaml | Built-in template | vLLM serving plus a source-mounted Python worker executed through uv | You want a common LLM stack with mounted app code |
| mpi-hello.yaml | Built-in template | MPI hello world using service-level x-slurm.mpi | You need a small first-class MPI workload |
| multi-stage-pipeline.yaml | Built-in template | Two-stage pipeline coordinating through the shared job mount | You need file-based stage-to-stage handoff |
| pipeline-dag.yaml | Built-in template | One-shot preprocess -> train -> postprocess DAG using successful-completion dependencies | You need stage completion, not service readiness, to gate downstream work |
| fairseq-preprocess.yaml | Built-in template | CPU-heavy NLP data preprocessing with parallel workers | You need a CPU-bound data preprocessing pipeline |
| canary-right-size.yaml | Repository file | A deliberately over-requested training probe for hpc-compose germinate | You want to practice right-sizing recommendations before changing a real spec |
| rendezvous-model-server.yaml | Repository file | A provider job that registers a model-server endpoint in the shared cache | You want one Slurm allocation to publish a service for later jobs |
| rendezvous-client.yaml | Repository file | A separate client job that resolves HPC_COMPOSE_RDZV_MODEL_SERVER_URL | You want cross-job service discovery through shared storage |

Which Example Should I Start From?

Companion notes for the more involved examples live alongside the example assets.

Development Workflow Recipe

examples/dev-python-app.yaml mounts examples/app/ and runs a long-lived Python process, so it is best for hot reload:

hpc-compose dev -f examples/dev-python-app.yaml
hpc-compose tmux -f examples/dev-python-app.yaml --no-attach

examples/dev-python-smoke.yaml keeps the same mounted-source shape but uses a finite command, so it is suitable for smoke tests:

hpc-compose test --local -f examples/dev-python-smoke.yaml
hpc-compose test --submit --time 00:01:00 -f examples/dev-python-smoke.yaml

Adaptation Checklist

  1. Copy the closest repository example to your own compose.yaml, or run hpc-compose new --template <name> --name my-app --output compose.yaml when a matching built-in template exists.
  2. Configure a cache path visible from both the login node and compute nodes through hpc-compose setup --cache-dir, x-slurm.cache_dir, or [defaults.cache] / [profiles.<name>.cache].
  3. Override CACHE_DIR before running repository examples that use ${CACHE_DIR:-...}, or replace the default cache path in your copied file.
  4. Replace the example image, command, environment, and volumes with your workload.
  5. Keep active source in volumes and keep slower-changing dependency installation in x-runtime.prepare.commands.
  6. Add readiness to services that must be reachable before dependents continue.
  7. Adjust top-level or per-service x-slurm settings for your cluster.
  8. Run hpc-compose plan -f compose.yaml before the first run, and hpc-compose debug -f compose.yaml --preflight if that run fails.
  9. Run cluster up only from a supported Linux Slurm submission host with the selected runtime backend available.
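The last step can be guarded with a quick host check before calling up. This is a rough sketch, not an hpc-compose feature: the function name is_slurm_submit_host is hypothetical, and treating the presence of Slurm's sbatch on PATH as evidence of a submission host is an assumption. It does not verify that the selected runtime backend is available.

```shell
# Hypothetical pre-submit check: a Linux host with Slurm's sbatch on PATH.
# The optional first argument overrides the detected OS name (useful for testing).
is_slurm_submit_host() {
  os="${1:-$(uname -s)}"
  if [ "$os" != "Linux" ]; then
    echo "not a Linux host: $os" >&2
    return 1
  fi
  # Assumption: a submission host exposes sbatch to the current user.
  if ! command -v sbatch >/dev/null 2>&1; then
    echo "sbatch not found on PATH" >&2
    return 1
  fi
  return 0
}
```

One way to use it is is_slurm_submit_host && hpc-compose up -f compose.yaml, so the submission is skipped on a laptop or a non-Slurm machine.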