Quickstart

This is the shortest safe path from an empty shell to a static plan, a first real Slurm run, and one-command failure triage.

If Slurm terms such as sbatch, srun, allocation, job step, Pyxis, or Enroot are unfamiliar, read Slurm And Container Basics before the first real cluster run.

1. Install The CLI

For normal use, install from the latest published GitHub Release and pin the tag you selected:

RELEASE_TAG=vX.Y.Z
curl -fsSL "https://raw.githubusercontent.com/NicolasSchuler/hpc-compose/${RELEASE_TAG}/install.sh" \
  | env HPC_COMPOSE_VERSION="${RELEASE_TAG}" sh

Replace vX.Y.Z with the published release tag shown on the release page.
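
Leaving the placeholder unedited is an easy mistake at this step. As a minimal sketch, assuming published tags follow the vX.Y.Z shape shown above (an assumption; confirm on the release page), a small guard can refuse to continue with a malformed tag. The helper name is_release_tag is hypothetical and not part of hpc-compose.

```shell
# Hypothetical helper: succeeds only for tags shaped like v1.2.3.
# Assumes releases use a plain vMAJOR.MINOR.PATCH tag; adjust the pattern
# if the release page shows a different scheme.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Usage: guard the installer call with the check.
# is_release_tag "$RELEASE_TAG" || { echo "set RELEASE_TAG to a published tag" >&2; exit 1; }
```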

The installer places hpc-compose in ~/.local/bin by default and verifies the release checksum sidecar before installing. Release verification, manual downloads, package-manager installs, and source-checkout builds are covered in Installation.

If your shell does not find the command immediately, add the default install directory to your PATH:

export PATH="$HOME/.local/bin:$PATH"
hpc-compose --version
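
If the export line goes into a shell startup file, PATH grows each time the file is sourced. A minimal sketch of an idempotent variant follows; ensure_on_path is a hypothetical helper name, and the directory is the default install location named above.

```shell
# Hypothetical helper: put a directory on PATH only if it is not there yet,
# so sourcing a startup file repeatedly does not grow PATH.
ensure_on_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                        # already present, nothing to do
    *) PATH="$1:$PATH"; export PATH ;;  # prepend once
  esac
}

ensure_on_path "$HOME/.local/bin"
```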

2. Learn The Safe Authoring Path First

plan is the safe authoring command. It does not call sbatch, does not import images, and does not write a script file.

Create a starter spec first:

hpc-compose new \
  --template minimal-batch \
  --name my-app \
  --output compose.yaml

If you want a guided learning path instead of a single starter template, run the Spec Metamorphosis tutorial:

hpc-compose evolve --output compose.yaml

Then inspect the static plan:

hpc-compose plan -f compose.yaml
hpc-compose plan --show-script -f compose.yaml

Expected output includes:

spec is valid
service order: app
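
In a script, the same check can be automated by looking for the validity line. This is only a sketch: plan_is_valid is a hypothetical helper, and the matched wording is taken from the expected output above, so it may differ between versions.

```shell
# Hypothetical helper: run the static plan and succeed only when the output
# contains the "spec is valid" line shown above.
# The spec path defaults to compose.yaml.
plan_is_valid() {
  hpc-compose plan -f "${1:-compose.yaml}" 2>&1 | grep -q 'spec is valid'
}

# Usage:
# plan_is_valid compose.yaml && echo "plan looks good"
```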

This is the right first path on macOS, a laptop, or any machine where you want to evaluate the authoring model before touching a real cluster. The same flow is also available as an asciinema-style demo cast, but the snippets above are the accessible reference output.

The normal workflow to remember is:

hpc-compose plan -f compose.yaml
hpc-compose up -f compose.yaml
hpc-compose debug -f compose.yaml --preflight
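
The three commands can be chained so that a failed run goes straight to triage. The wrapper below is a sketch, not part of hpc-compose: run_checked is a hypothetical name, and it assumes each command exits non-zero on failure.

```shell
# Hypothetical wrapper around the three commands above: plan first, submit
# only if planning succeeds, and collect a debug report if the run fails.
# Assumes each hpc-compose command exits non-zero on failure.
run_checked() {
  spec="${1:-compose.yaml}"
  hpc-compose plan -f "$spec" || return 1
  if ! hpc-compose up -f "$spec"; then
    hpc-compose debug -f "$spec" --preflight
    return 1
  fi
}

# Usage:
# run_checked compose.yaml
```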

3. Choose A Starting Spec

Use the built-in starter templates when you want a fresh compose.yaml with your application name filled in:

hpc-compose new \
  --template minimal-batch \
  --name my-app \
  --output compose.yaml

Add --cache-dir '<shared-cache-dir>' when you want the generated file to include an explicit x-slurm.cache_dir. Otherwise the plan uses the active settings cache default or $HOME/.cache/hpc-compose.

From a source checkout, you can also inspect a known-good repository example:

hpc-compose plan -f examples/minimal-batch.yaml

The Examples page is the single selection guide for beginner, LLM, training, distributed, and pipeline workflows.

Use Spec Metamorphosis when you want to learn those concepts progressively in one evolving valid spec.

4. Pick And Test A Cache Directory

cache_dir is optional in the spec, but real clusters usually need a site-specific shared path because image preparation happens before the job starts and compute nodes must later see those artifacts.

Ask your cluster documentation or support team for a project scratch, work, or shared filesystem path, then test it:

export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"

Persist it in project settings when you want the same value every time:

hpc-compose setup --profile-name dev --cache-dir "$CACHE_DIR" --default-profile dev --non-interactive

Alternatively, keep an explicit, environment-backed value in the spec and persist the variable in a .env file next to your spec:

printf 'CACHE_DIR=%s\n' "$CACHE_DIR" > .env

Do not use /tmp, /var/tmp, /private/tmp, or /dev/shm for x-slurm.cache_dir. Validation may accept those strings, but preflight reports them as unsafe because prepare happens before runtime and compute nodes must later see the cached artifacts.
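
The path rules in this section can be checked before preflight ever runs. The sketch below uses a hypothetical helper, check_cache_dir, that rejects the node-local locations listed above and then confirms the directory exists and is writable. It cannot prove that compute nodes see the path; preflight and your site documentation remain the authority on that.

```shell
# Hypothetical helper: reject the node-local prefixes this page lists as
# unsafe, then confirm the directory exists and is writable.
# It does not verify that the path is shared with compute nodes.
check_cache_dir() {
  case "$1" in
    /tmp|/tmp/*|/var/tmp|/var/tmp/*|/private/tmp|/private/tmp/*|/dev/shm|/dev/shm/*)
      echo "unsafe cache dir (node-local): $1" >&2
      return 1 ;;
  esac
  [ -d "$1" ] && [ -w "$1" ] || {
    echo "cache dir missing or not writable: $1" >&2
    return 1
  }
}

# Usage:
# check_cache_dir "$CACHE_DIR" || exit 1
```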

5. Before Your First Cluster Run

| Command category | Where to run it | Required tools | Notes |
|---|---|---|---|
| Authoring: new, plan, validate, inspect, render, config, schema | laptop, workstation, or login node | hpc-compose | plan is the recommended static pre-run check. |
| Prepare: prepare | Linux host with selected runtime backend | Pyxis needs Enroot; Apptainer needs apptainer; Singularity needs singularity; host backend needs no container runtime | Does not call sbatch, but needs runtime tools for image work. |
| Cluster checks: preflight, doctor cluster-report | Linux Slurm login node | Slurm client tools plus selected backend tools | Use preflight --strict when warnings should block launch. |
| Run: up, run | Linux Slurm login node | sbatch, srun, scheduler tools, selected backend tools | up is the normal cluster execution path. |
| Local launch: up --local | Linux host only | Enroot and runtime.backend: pyxis | Single-host only; not a distributed Slurm substitute. |
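
A quick way to see which of the required tools are actually on PATH is a loop over command -v. This is a sketch with a hypothetical helper name, missing_tools; the tool list in the usage line comes from the table above for a Pyxis-backed run, and the executable name enroot is assumed from the troubleshooting entries on this page.

```shell
# Hypothetical helper: print every named tool that is not on PATH and
# return non-zero if any is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; missing=1; }
  done
  return "$missing"
}

# Example for a Pyxis-backed run on a login node:
# missing_tools hpc-compose sbatch srun enroot
```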

For Pyxis, srun --help should mention --container-image.
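
That check is easy to script. The sketch below wraps it in a hypothetical helper, pyxis_available, which only searches the help text for the flag named above and makes no other claim about the Pyxis installation.

```shell
# Hypothetical helper: succeeds when srun's help text mentions
# --container-image, which this page gives as the sign of Pyxis support.
pyxis_available() {
  srun --help 2>&1 | grep -q -e '--container-image'
}

# Usage:
# pyxis_available || echo "srun does not advertise --container-image" >&2
```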

6. Submit On A Real Cluster

When you move to a supported Linux submission host, the normal run is:

hpc-compose up -f compose.yaml

up runs preflight, prepares missing artifacts, renders the batch script, submits it through sbatch, then follows scheduler state and tracked logs. On an interactive TTY it opens the full-screen watch UI; otherwise it falls back to line-oriented output.

Options that adjust this behavior:

  • --watch-queue: line-oriented queue polling until the Slurm job reaches RUNNING, before the normal watch view opens
  • --queue-warn-after <DURATION>: controls the one-time long-pending warning
  • --hold-on-exit never|failure|always: tunes whether the watch UI holds the final screen; by default it holds on failures
  • --detach: submit-and-return behavior, as in hpc-compose up --detach -f compose.yaml
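
For non-interactive use, such as a cron job or a CI step, submit-and-return can be paired with a single status snapshot. The wrapper below is a sketch: submit_detached is a hypothetical name, both commands and flags appear on this page, and chaining them with && assumes up --detach exits non-zero when submission fails.

```shell
# Hypothetical wrapper: submit without attaching, then print one status
# snapshot. The status call is skipped if submission reports failure.
submit_detached() {
  spec="${1:-compose.yaml}"
  hpc-compose up --detach -f "$spec" && hpc-compose status -f "$spec"
}

# Usage:
# submit_detached compose.yaml
```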

Success looks like:

  • the job is submitted or launched
  • a tracked job id is recorded
  • the watch UI or text follower shows scheduler progress
  • status, ps, and logs can reconnect to the tracked run later

7. If The First Cluster Run Fails

| Symptom | Best next command | Why |
|---|---|---|
| Missing sbatch, srun, enroot, apptainer, or singularity | hpc-compose debug -f compose.yaml --preflight | Reruns prerequisite checks and keeps the latest tracked context in one report. |
| srun does not advertise --container-image | hpc-compose doctor cluster-report | Pyxis support is unavailable or not loaded on that node. |
| Job submitted but no service log appeared | hpc-compose debug -f compose.yaml | Shows scheduler state, batch log tail, service log hints, and the next command. |
| Cache path warning or error | hpc-compose debug -f compose.yaml --preflight | Confirms whether x-slurm.cache_dir is shared and writable. |
| Services start in the wrong order | hpc-compose plan --explain --verbose -f compose.yaml | Shows normalized dependencies, readiness gates, and planner hints before running. |

The longer symptom guide is Troubleshooting.

8. Revisit A Tracked Run Later

hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose stats -f compose.yaml
hpc-compose logs -f compose.yaml --follow

Use jobs list first when you need to rediscover tracked runs under the current repo tree. Use ps for a stable per-service snapshot, watch to reconnect to the live UI, and logs --follow for a text-only follower.

From A Source Checkout

If you are developing from a local checkout instead of an installed binary:

cargo build --release
target/release/hpc-compose validate -f examples/minimal-batch.yaml
target/release/hpc-compose plan -f examples/minimal-batch.yaml
target/release/hpc-compose plan --show-script -f examples/minimal-batch.yaml