# Examples

These examples are the fastest way to understand the intended hpc-compose workflows and adapt them to a real application.

For almost every example, the normal run is:

```sh
hpc-compose submit --watch -f examples/<example>.yaml
```

Use the debugging flow (`validate`, `inspect`, `preflight`, `prepare`) when you are wiring up the example for the first time or isolating a failure.
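As a rough sketch, the debugging flow might look like the commands below. Only `submit --watch -f` and `init` are shown verbatim on this page; the assumption that each debugging stage is its own subcommand taking the same `-f` flag is illustrative, so check `hpc-compose --help` for the exact syntax.

```sh
# Hypothetical sketch of the debugging flow; exact subcommand syntax may differ.
hpc-compose validate  -f examples/dev-python-app.yaml   # confirm the file parses and the plan is sound
hpc-compose inspect   -f examples/dev-python-app.yaml   # review resolved services and settings
hpc-compose preflight -f examples/dev-python-app.yaml   # check login-node prerequisites and cache paths
hpc-compose prepare   -f examples/dev-python-app.yaml   # run the prepare steps ahead of submission
hpc-compose submit --watch -f examples/dev-python-app.yaml
```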

If you want one of these files written straight to your working directory, use:

```sh
hpc-compose init --template dev-python-app --name my-app --cache-dir /shared/$USER/hpc-compose-cache --output compose.yaml
```

## Example matrix

| Example | What it demonstrates | When to start from it |
|---|---|---|
| `app-redis-worker.yaml` | Multiple services, `depends_on`, and TCP readiness checks | You need service startup ordering or a small multi-service stack |
| `dev-python-app.yaml` | Mounted source code plus `x-enroot.prepare.commands` for dependencies | You want an iterative development workflow |
| `llm-curl-workflow.yaml` | End-to-end LLM request flow with a login-node prepare step and a curl client | You want the smallest concrete inference workflow |
| `llm-curl-workflow-workdir.yaml` | Same LLM workflow, but anchored under `$HOME/models` for direct use on a login node | You want the lowest-overhead path from a login-node home directory |
| `llama-app.yaml` | GPU-backed service, mounted model files, dependent app service | You need accelerator resources or a model-serving pattern |
| `llama-uv-worker.yaml` | llama.cpp serving plus a source-mounted Python worker executed through uv | You want the GGUF server + mounted worker pattern |
| `minimal-batch.yaml` | Single service, no dependencies, no GPU, no prepare | You want the simplest possible starting point |
| `multi-node-mpi.yaml` | One primary-node helper plus one allocation-wide distributed CPU step | You want a minimal multi-node pattern without adding orchestration |
| `multi-node-torchrun.yaml` | Allocation-wide torchrun launch using the primary node as rendezvous | You want a multi-node GPU training starting point |
| `training-checkpoints.yaml` | GPU training with checkpoints written to shared storage | You need a batch training workflow with artifact collection |
| `training-resume.yaml` | GPU training with a shared resume directory and attempt-aware checkpoints | You need restart-safe checkpoint semantics across requeues or repeated submissions |
| `postgres-etl.yaml` | PostgreSQL plus a Python data processing job | You need a database-backed batch pipeline |
| `vllm-openai.yaml` | vLLM serving with an in-job Python client | You want vLLM-based inference instead of llama.cpp |
| `vllm-uv-worker.yaml` | vLLM serving plus a source-mounted Python worker executed through uv | You want a common LLM stack with mounted app code |
| `mpi-hello.yaml` | MPI hello world compiled and run with Open MPI | You need an MPI workload |
| `multi-stage-pipeline.yaml` | Two-stage pipeline coordinating through the shared job mount | You need file-based stage-to-stage handoff |
| `fairseq-preprocess.yaml` | CPU-heavy NLP data preprocessing with parallel workers | You need a CPU-bound data preprocessing pipeline |
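For orientation, a file in the spirit of `minimal-batch.yaml` might look like the sketch below. Only `x-slurm.cache_dir` is quoted on this page; the service-level keys (`image`, `command`) are assumptions based on Compose conventions, not the authoritative schema, so treat the real example file as the source of truth.

```yaml
# Hypothetical minimal-batch-style sketch; keys other than
# x-slurm.cache_dir are assumed from Compose conventions.
x-slurm:
  cache_dir: /shared/$USER/hpc-compose-cache   # must be visible from login and compute nodes

services:
  hello:
    image: docker://python:3.12-slim
    command: python -c "print('hello from the cluster')"
```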

## Which example should I start from?

Use the third column of the example matrix above to match your goal; companion notes for the more involved examples live alongside the example assets.

## Adaptation checklist

  1. Copy the closest example to your own `compose.yaml`, or run `hpc-compose init --template <name> --name my-app --cache-dir /shared/$USER/hpc-compose-cache --output compose.yaml`.
  2. Set `x-slurm.cache_dir` to a path visible from both the login node and the compute nodes.
  3. Replace the example image, command, environment, and volumes with your workload.
  4. Keep active source in `volumes` and keep slower-changing dependency installation in `x-enroot.prepare.commands`.
  5. Add readiness checks to services that must be reachable before dependents continue.
  6. Adjust top-level or per-service `x-slurm` settings for your cluster.
  7. Run the debugging flow before the first submit when you need to confirm planning, prerequisites, or cache behavior.
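The checklist items can be sketched together as follows. Key names other than those quoted on this page (`x-slurm.cache_dir`, `x-enroot.prepare.commands`, `depends_on`, `volumes`) are assumptions about the schema, and the readiness configuration in particular is left as a placeholder comment, so verify each key against the shipped examples.

```yaml
# Hypothetical sketch tying the checklist steps together;
# verify key names against the shipped example files.
x-slurm:
  cache_dir: /shared/$USER/hpc-compose-cache    # step 2: shared, visible everywhere

services:
  api:
    image: docker://python:3.12-slim            # step 3: your image and command
    command: python -m my_app.server
    volumes:
      - ./src:/workspace/src                    # step 4: active source stays mounted
    x-enroot:
      prepare:
        commands:                               # step 4: slower-changing dependencies
          - pip install -r /workspace/src/requirements.txt
    # step 5: add a readiness check here so dependents wait until api is reachable
  worker:
    image: docker://python:3.12-slim
    command: python -m my_app.worker
    depends_on:
      - api                                     # step 5: ordered startup
```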