Cluster Profiles

Cluster profiles let the validate and preflight commands compare a spec against site-specific Slurm, runtime, MPI, storage, and policy hints.

For HAICORE-specific resource, workspace, and container notes, see HAICORE Guide.

Generate a best-effort profile on the target login node:

hpc-compose doctor cluster-report

This writes .hpc-compose/cluster.toml by default. Use --out - to print TOML instead.
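When a profile already exists, it can be useful to preview what the generator would write before replacing it. The following is a minimal sketch that relies only on the --out - option described above. The HPC_COMPOSE and PROFILE variables and the preview_profile helper are hypothetical names introduced for this example; they are not part of hpc-compose.

```shell
#!/usr/bin/env bash
# Sketch: compare a freshly generated profile with the one already on disk.
# HPC_COMPOSE, PROFILE, and preview_profile are illustrative names only.
set -u

HPC_COMPOSE="${HPC_COMPOSE:-hpc-compose}"
PROFILE="${PROFILE:-.hpc-compose/cluster.toml}"  # default output path

preview_profile() {
  if [ -f "$PROFILE" ]; then
    # --out - prints the TOML instead of writing the file, so the existing
    # profile stays untouched. diff exits non-zero when the two differ,
    # which is expected here.
    "$HPC_COMPOSE" doctor cluster-report --out - | diff -u "$PROFILE" - || true
  else
    "$HPC_COMPOSE" doctor cluster-report --out -
  fi
}

if command -v "$HPC_COMPOSE" >/dev/null 2>&1; then
  preview_profile
else
  echo "hpc-compose not found; run this on the target login node" >&2
fi
```

An empty diff suggests the site has not changed in any way the generator can detect. Manual edits to the existing profile will also show up in the diff, so review it before overwriting.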

For a live advisory snapshot of current conditions, use:

hpc-compose weather

The weather command reads stable labels and hints from the discovered cluster profile when one is present. Live node, queue, fairshare, and priority data come from one-shot Slurm probes and are not persisted in .hpc-compose/cluster.toml.

What Gets Discovered

The profile generator uses available local tools and environment hints:

  • sinfo, scontrol, and srun --mpi=list
  • selected runtime binaries
  • shared-path environment hints
  • loaded MPI stack hints from PATH, MPI_HOME, MPI_DIR, I_MPI_ROOT, EBROOTOPENMPI, and EBROOTMPICH
  • editable distributed defaults such as rendezvous port and [distributed.env]

It does not run module avail. Module-only MPI installations can be added manually to the generated mpi_installations list.
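Because the generator only sees MPI stacks that are already loaded, it can help to check the hints in the current shell before running it. The sketch below prints the environment variables named in the list above. The list_mpi_hints helper is a hypothetical name, and PATH is left out because this page does not say how the generator inspects it.

```shell
#!/usr/bin/env bash
# Sketch: report which MPI-related environment hints are exported in this shell.
# list_mpi_hints is an illustrative helper, not an hpc-compose command.

list_mpi_hints() {
  for name in MPI_HOME MPI_DIR I_MPI_ROOT EBROOTOPENMPI EBROOTMPICH; do
    # printenv only sees exported variables, which is also all that a
    # child process such as the profile generator can see.
    value="$(printenv "$name" || true)"
    if [ -n "$value" ]; then
      printf '%s=%s\n' "$name" "$value"
    else
      printf '%s is unset or empty\n' "$name"
    fi
  done
}

list_mpi_hints
```

If every variable is reported as unset, load the intended MPI module first and regenerate the profile, or add the installation to mpi_installations by hand.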

Site Policy Packs

Support teams can edit optional sections such as:

  • [site]
  • [[software.modules]]
  • [[filesystems]]
  • [gpu]
  • [network]
  • [containers]
  • [slurm.defaults]
  • [slurm.required]

Policy sections warn and suggest snippets. They do not silently add modules, bind mounts, environment variables, or SBATCH directives to user specs.

MPI Smoke Probe

For MPI services, render a small rank-count probe against the service’s real runtime path:

hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --script-out mpi-smoke.sbatch

Submit it only when you intentionally want to consume a Slurm allocation:

hpc-compose doctor mpi-smoke -f compose.yaml --service trainer --submit

The smoke plan keeps allocation and MPI launch settings but strips application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.
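The two commands above fit a render, review, submit flow. The sketch below uses only the flags already shown. The HPC_COMPOSE, SPEC, SERVICE, and SUBMIT variables and both helper functions are hypothetical names for this example. It also assumes that the rendered script carries its allocation settings as #SBATCH lines, which is conventional for Slurm batch scripts but is not confirmed on this page.

```shell
#!/usr/bin/env bash
# Sketch: render the MPI smoke probe, review it, and submit only on request.
# All variable and function names here are illustrative.
set -u

HPC_COMPOSE="${HPC_COMPOSE:-hpc-compose}"
SPEC="${SPEC:-compose.yaml}"
SERVICE="${SERVICE:-trainer}"

render_mpi_smoke() {
  # Writes the probe script without consuming an allocation.
  "$HPC_COMPOSE" doctor mpi-smoke -f "$SPEC" --service "$SERVICE" \
    --script-out mpi-smoke.sbatch
}

submit_mpi_smoke() {
  # Consumes a Slurm allocation; call only when that is intended.
  "$HPC_COMPOSE" doctor mpi-smoke -f "$SPEC" --service "$SERVICE" --submit
}

if command -v "$HPC_COMPOSE" >/dev/null 2>&1; then
  render_mpi_smoke
  # Assumption: allocation settings appear as #SBATCH lines in the script.
  grep '^#SBATCH' mpi-smoke.sbatch || true
  if [ "${SUBMIT:-no}" = "yes" ]; then
    submit_mpi_smoke
  fi
else
  echo "hpc-compose not found; skipping the smoke probe" >&2
fi
```

Keeping submission behind an explicit SUBMIT=yes makes the default run free of side effects on the queue.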

Fabric Smoke Probe

For distributed GPU or fabric-sensitive services, render a broader smoke probe:

hpc-compose doctor fabric-smoke -f compose.yaml --service trainer --checks auto --script-out fabric-smoke.sbatch

--checks auto always includes the MPI rank probe, adds NCCL when the selected service requests GPU resources, and collects UCX, OFI, and InfiniBand diagnostics when the corresponding tools are available. Use an explicit list such as --checks mpi,nccl when a missing tool should fail the probe instead of being reported as skipped.
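To illustrate the difference between the two modes, the sketch below wraps the render command so that the check list can be passed as an argument. It combines only flags that appear above. The HPC_COMPOSE variable and the render_fabric_smoke helper are hypothetical names for this example.

```shell
#!/usr/bin/env bash
# Sketch: render the fabric probe with either automatic or explicit checks.
# HPC_COMPOSE and render_fabric_smoke are illustrative names only.
set -u

HPC_COMPOSE="${HPC_COMPOSE:-hpc-compose}"

render_fabric_smoke() {
  # $1 is the value for --checks; it falls back to auto when omitted.
  "$HPC_COMPOSE" doctor fabric-smoke -f compose.yaml --service trainer \
    --checks "${1:-auto}" --script-out fabric-smoke.sbatch
}

if command -v "$HPC_COMPOSE" >/dev/null 2>&1; then
  # Explicit list: a missing MPI or NCCL tool fails the probe
  # instead of being reported as skipped.
  render_fabric_smoke mpi,nccl
else
  echo "hpc-compose not found; skipping the fabric probe" >&2
fi
```

Calling render_fabric_smoke with no argument gives the auto behavior, which suits a first exploratory run. The explicit list is better for a gate that must not pass silently when a tool is absent.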