Task Guide
Use this page when you know what you want to do, but not yet which command or example should be your starting point.
First run
- Read Quickstart.
- Run `hpc-compose evolve --output compose.yaml` if you want a guided progression from `minimal` through `multi-node-placement`.
- Run `hpc-compose new --list-templates` if you want to inspect the built-in starter templates before choosing one.
- Start from `minimal-batch` with `hpc-compose new --template minimal-batch --name my-app --output compose.yaml`.
- Before running on a cluster, configure a shared cache with `hpc-compose setup --cache-dir '<shared-cache-dir>'` or an explicit `x-slurm.cache_dir`. If you copy a repository example that uses `CACHE_DIR`, override it for your cluster before running.
- Run `hpc-compose plan -f compose.yaml` before the first real run. Add `--show-script` when you want to inspect the generated launcher without writing a file.
- Run `hpc-compose up -f compose.yaml` only from a supported Linux Slurm submission host.
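
The steps above can be chained into one small helper. This is a sketch rather than a recommended workflow: the function name `first_run` and the ordering are assumptions, and it assumes `hpc-compose setup --cache-dir` runs without further prompts. Every command and flag is taken from the list above.

```shell
# Hypothetical helper: scaffold, configure the cache, plan, then submit.
# Usage: first_run '<shared-cache-dir>'
first_run() {
  cache_dir="$1"
  # Scaffold a spec from the built-in minimal-batch template.
  hpc-compose new --template minimal-batch --name my-app --output compose.yaml || return 1
  # Record a cache directory that is shared across the cluster.
  hpc-compose setup --cache-dir "$cache_dir" || return 1
  # Inspect the plan and the generated launcher without writing a file.
  hpc-compose plan -f compose.yaml --show-script || return 1
  # Submit; this only works from a supported Linux Slurm submission host.
  hpc-compose up -f compose.yaml
}
```

Because each step returns early on failure, a broken plan never reaches `up`.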
Remember directory/data/env settings once
- Run `hpc-compose setup` to create or update the project-local settings file (`.hpc-compose/settings.toml`).
- Use `hpc-compose --profile dev up` so the compose path, env files, env vars, and binary paths come from the selected profile.
- Run `hpc-compose context --format json` to inspect resolved paths plus value sources. Interpolation variables are scoped to names referenced by the compose file, and sensitive-looking values are redacted unless you add `--show-values`.
- Use `--settings-file <PATH>` when you need an explicit settings file instead of upward discovery.
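
As a sketch of how these pieces combine, the helper below prints the resolved context for a profile before submitting with it. The name `run_with_profile` is hypothetical, and it assumes that `--profile` can precede `context` the same way it precedes `up`.

```shell
# Hypothetical helper: show what a profile resolves to, then run with it.
# Usage: run_with_profile dev
run_with_profile() {
  profile="$1"
  # Print resolved paths and value sources (sensitive values stay redacted).
  hpc-compose --profile "$profile" context --format json || return 1
  # Submit using the compose path, env files, and binaries from the profile.
  hpc-compose --profile "$profile" up
}
```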
Migrate from Docker Compose
- Read Docker Compose Migration.
- Replace `build:` with `image:` plus `x-runtime.prepare.commands`.
- Replace service-name networking with `127.0.0.1` or explicit allocation metadata where appropriate.
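
To make the two replacements concrete, the sketch below writes a small migrated spec. Treat the layout as an assumption: it places `x-runtime` under the service, and the service name, image, package, and `REDIS_HOST` variable are illustrative only. Check the result against Docker Compose Migration before relying on it.

```shell
# Hypothetical helper: write an illustrative migrated spec to a file.
# Usage: write_migrated_spec compose.yaml
write_migrated_spec() {
  out="${1:-compose.yaml}"
  cat > "$out" <<'EOF'
services:
  app:
    # Was `build: .` in Docker Compose: use a base image plus prepare steps.
    image: python:3.11-slim
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir redis
    # Was REDIS_HOST=redis: address the peer service over loopback instead.
    environment:
      REDIS_HOST: 127.0.0.1
EOF
}
```

Running `hpc-compose validate -f compose.yaml` afterwards is a quick way to find out whether the assumed layout matches what the tool accepts.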
Single-node multi-service app
- Start from app-redis-worker.yaml.
- Add `depends_on` and `readiness` only where ordering really matters.
- Use Execution Model to confirm which services can rely on localhost.
Multi-node distributed training
- Start from multi-node-torchrun.yaml, multi-node-deepspeed.yaml, multi-node-accelerate.yaml, multi-node-jax.yaml, or multi-node-mpi.yaml.
- Use multi-node-horovod.yaml or nccl-tests.yaml when rank-per-GPU MPI launch is the right shape.
- Start from multi-node-partitioned.yaml when independent distributed roles need disjoint node ranges or explicit co-location.
- Start from ray-symmetric.yaml, dask-scheduler-workers.yaml, spark-standalone.yaml, or flux-nested.yaml when the framework can run inside one Slurm allocation without an autoscaler.
- Use generated distributed metadata such as `HPC_COMPOSE_DIST_RDZV_ENDPOINT`, `HPC_COMPOSE_DIST_NODE_RANK`, and `HPC_COMPOSE_DIST_NPROC_PER_NODE` instead of Docker-style service discovery.
- Put cluster-specific NCCL/UCX/OFI fabric variables in `.hpc-compose/cluster.toml` under `[distributed.env]` so specs stay portable.
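
As an illustration of consuming the generated metadata, the sketch below assembles a `torchrun` command line from it. Several parts are assumptions: the helper name `torchrun_cmd`, the use of Slurm's `SLURM_JOB_NUM_NODES` for the node count, the `c10d` rendezvous backend, and the script name `train.py`. The shipped multi-node-torchrun.yaml example may wire this differently.

```shell
# Hypothetical helper: print a torchrun command built from generated metadata.
# Usage inside a service command: eval "$(torchrun_cmd train.py)"
torchrun_cmd() {
  script="$1"
  # ${VAR:?} aborts with an error if the variable is unset or empty.
  printf 'torchrun --nnodes %s --nproc_per_node %s --node_rank %s --rdzv_backend c10d --rdzv_endpoint %s %s\n' \
    "${SLURM_JOB_NUM_NODES:?}" \
    "${HPC_COMPOSE_DIST_NPROC_PER_NODE:?}" \
    "${HPC_COMPOSE_DIST_NODE_RANK:?}" \
    "${HPC_COMPOSE_DIST_RDZV_ENDPOINT:?}" \
    "$script"
}
```

Printing the command instead of running it keeps the helper easy to inspect in a dry run.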
Checkpoint and resume workflows
- Start from training-checkpoints.yaml when you only need artifact output.
- Start from training-resume.yaml when the run should resume from shared storage across retries or later submissions.
- Keep the canonical resume source in `x-slurm.resume.path`, not in exported artifact bundles.
LLM serving workflows
- Start from llm-curl-workflow.yaml, llm-curl-workflow-workdir.yaml, llama-uv-worker.yaml, or vllm-uv-worker.yaml.
- Use `volumes` for model directories and fast-changing code.
- Use `x-runtime.prepare.commands` for slower-changing dependencies.
Debug cluster readiness
- Run `hpc-compose validate -f compose.yaml`.
- Run `hpc-compose validate -f compose.yaml --strict-env` when default interpolation fallbacks should be treated as failures.
- Run `hpc-compose plan --verbose -f compose.yaml`.
- Run `hpc-compose preflight -f compose.yaml`.
- Run `hpc-compose debug -f compose.yaml --preflight` after a failed tracked run.
- Read Troubleshooting.
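
The pre-submission checks above can be run as one gate that stops at the first failure. This is a minimal sketch: the name `check_readiness` is hypothetical, and it assumes each command exits non-zero when its check fails.

```shell
# Hypothetical helper: run the readiness checks in order, stop on first failure.
# Usage: check_readiness compose.yaml
check_readiness() {
  spec="${1:-compose.yaml}"
  # Treat default interpolation fallbacks as failures.
  hpc-compose validate -f "$spec" --strict-env || return 1
  # Show the detailed plan for the spec.
  hpc-compose plan --verbose -f "$spec" || return 1
  # Check the submission host and cluster prerequisites.
  hpc-compose preflight -f "$spec"
}
```

`hpc-compose debug -f compose.yaml --preflight` is left out on purpose, since it applies after a tracked run has already failed.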
Cache and artifact management
- Use `hpc-compose cache list` to inspect imported/prepared artifacts.
- Use `hpc-compose cache inspect -f compose.yaml` to see per-service reuse expectations.
- Use `hpc-compose --profile dev cache prune --age 14` when you want age-based cleanup to follow the active context cache dir.
- Use `hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'` when you want a direct cache cleanup that does not depend on compose resolution.
- Use `hpc-compose artifacts -f compose.yaml` after a run to export tracked payloads.
Find and clean tracked runs
- Use `hpc-compose jobs list` to scan the current repo tree for tracked runs.
- Use `hpc-compose ps -f compose.yaml` when you want a one-shot per-service runtime table.
- Use `hpc-compose watch -f compose.yaml` to reconnect to the live watch UI for the latest tracked job.
- Use `hpc-compose jobs list --disk-usage` when you need a quick size estimate before deleting old state.
- Use `hpc-compose clean -f compose.yaml --dry-run --age 7` to preview what a cleanup would remove.
- Use `hpc-compose clean -f compose.yaml --all --format json` when automation needs a stable cleanup report for one compose context, including effective latest IDs plus stale-pointer diagnostics.
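
A cautious interactive cleanup could preview first and delete only on confirmation, as sketched below. The helper name `preview_then_clean` is hypothetical, and the final command assumes that dropping `--dry-run` from the preview invocation performs the same cleanup for real.

```shell
# Hypothetical helper: size estimate, dry-run preview, then confirm.
# Usage: preview_then_clean compose.yaml 7
preview_then_clean() {
  spec="${1:-compose.yaml}"
  age="${2:-7}"
  # Quick size estimate of tracked runs in the current repo tree.
  hpc-compose jobs list --disk-usage || return 1
  # Preview what the cleanup would remove.
  hpc-compose clean -f "$spec" --dry-run --age "$age" || return 1
  printf 'Remove the runs listed above? [y/N] '
  read -r answer
  # Anything other than a literal "y" leaves tracked state untouched.
  [ "$answer" = "y" ] || return 0
  hpc-compose clean -f "$spec" --age "$age"
}
```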
Automation and scripting with JSON output
- Prefer `--format json` for machine-readable output on non-streaming commands such as `new`, `plan`, `validate`, `render`, `prepare`, `preflight`, `config`, `inspect`, `debug`, `status`, `ps`, `stats`, `score`, `artifacts`, `down`, `cancel`, `setup`, `cache`, `clean`, and `context`. For `up`, `--format json` requires `--detach` or `--dry-run`.
- Include `context --format json` when automation needs resolved compose path, binaries, referenced interpolation vars, and runtime path roots.
- Use `hpc-compose stats --format jsonl` or `--format csv` when downstream tooling wants row-oriented metrics.
- Treat `--json` as a compatibility alias on older machine-readable commands; new automation should prefer `--format json`. Streaming commands such as `logs --follow`, `watch`, and `completions` keep their native text or script output.
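
For automation, one low-risk pattern is to capture the machine-readable outputs as files and let downstream tooling parse them. The sketch below assumes nothing about the JSON schema. The helper name `snapshot_run` and the output file names are hypothetical.

```shell
# Hypothetical helper: save machine-readable reports into a directory.
# Usage: snapshot_run ./reports
snapshot_run() {
  out_dir="${1:-.}"
  # Resolved compose path, binaries, interpolation vars, runtime path roots.
  hpc-compose context --format json > "$out_dir/context.json" || return 1
  # Plan for the spec in the current directory.
  hpc-compose plan -f compose.yaml --format json > "$out_dir/plan.json" || return 1
  # Row-oriented metrics, one JSON object per line.
  hpc-compose stats --format jsonl > "$out_dir/stats.jsonl"
}
```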