Runtime Observability

After a submission, hpc-compose records tracked metadata under:

${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/

That directory lets follow-up commands reconnect without resubmitting.
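
The path rule above can be reproduced in a script that needs to locate tracked artifacts itself. This is a minimal sketch: `tracked_dir` is a hypothetical helper name, not an hpc-compose command, and it only mirrors the `${SLURM_SUBMIT_DIR:-$PWD}` fallback shown above.

```shell
# Hypothetical helper: print the tracked-metadata directory for a job id.
# Falls back to $SLURM_JOB_ID when no argument is given.
tracked_dir() {
  job_id=${1:-${SLURM_JOB_ID:-}}
  if [ -z "$job_id" ]; then
    echo "tracked_dir: no job id given and SLURM_JOB_ID is unset" >&2
    return 1
  fi
  # Same fallback as the documented path: submit dir if set, else current dir.
  printf '%s\n' "${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/$job_id"
}
```

Outside a Slurm job, SLURM_SUBMIT_DIR is usually unset, so the helper resolves relative to the current directory; run it from the directory where the job was submitted.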

Common Commands

hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose watch -f compose.yaml --hold-on-exit always
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --no-tui
hpc-compose logs -f compose.yaml --follow
hpc-compose logs -f compose.yaml --grep 'error|oom' --since 30m
hpc-compose stats -f compose.yaml
hpc-compose stats -f compose.yaml --accounting
hpc-compose inspect -f compose.yaml --rightsize
hpc-compose score 12345
hpc-compose germinate -f compose.yaml
hpc-compose sweep status -f compose.yaml
hpc-compose sweep list -f compose.yaml
hpc-compose diff 12345 12346 -f compose.yaml

Command | Use it for
status | Scheduler state, batch log path, runtime paths, and failure-policy state.
ps | Stable per-service snapshot with readiness, status, restart counters, and log path.
watch | Live terminal UI; falls back to line-oriented output on non-interactive terminals.
replay | Best-effort DVR for a tracked run, reconstructed from existing runtime artifacts.
logs | Text log output, optionally focused, searched, or coarsely time-filtered.
stats | Tracked metrics, Slurm step statistics, and optional accounting rollups.
inspect --rightsize | Post-run request-versus-usage recommendations for memory, CPUs, GPUs, and walltime.
score | 0-100 post-run efficiency score with GPU, memory, compute-time, and kWh components.
germinate | One-minute canary submission that writes latest-canary.json and recommends resource settings from fresh metrics.
sweep status | Aggregate persisted sweep trials into completed, failed, running, pending, unknown, missing-tracking, and submit-failed counts.
sweep list | List prior sweep manifests without querying the scheduler.
diff | Compact comparison between two tracked submissions.

Use --format json on non-streaming commands when automation needs stable fields. stats also supports --format csv and --format jsonl.

Watch UI

On an interactive terminal, watch and the default up follow mode open a live view with service state on the left and log output on the right. The UI automatically switches to a compact single-column view on narrow or short terminals. It keeps a detailed status view while the job runs and, by default, holds the final screen on failures so the failing service, final scheduler state, and next diagnostic commands stay visible.

Keybindings:

Key | Action
j, Down, Tab | Move to the next service.
k, Up | Move to the previous service.
g / G | Jump to the first or last service.
/ | Filter services by name; press Enter to apply or Esc to cancel.
Space | Pause or resume log following.
PgUp / PgDn | Scroll the visible log pane while paused.
End | Return to live-follow mode at the newest log lines.
a | Toggle between the selected service log and all tracked service logs.
? | Toggle in-UI help.
q | Leave the watch view without cancelling the job.

Use --hold-on-exit never|failure|always on up or watch to control whether the final TUI stays open after a terminal scheduler state. When the view is held, press d, l, or s to print the exact debug, logs, or stats command after leaving the alternate screen.

Use hpc-compose up --watch-queue when you want explicit queue polling before the watch view opens. It prints queue state changes, pending reason, and expected start time when Slurm exposes them; --queue-warn-after <DURATION> controls the one-time long-pending warning.

Use --watch-mode line or --no-tui when you are recording output, using a screen reader, running in CI, or working in a terminal where alternate-screen UIs are inconvenient. Line mode preserves detailed scheduler and log updates without alternate-screen control codes.
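
A wrapper script can make the line-mode choice explicit instead of relying on the automatic fallback. The sketch below is an assumption-laden illustration: `run_watch` and the `COMPOSE_CLI` variable are hypothetical names, the `CI` environment variable is a common CI-system convention rather than something this page says hpc-compose reads, and it assumes `watch` accepts `--watch-mode line` as described above.

```shell
# Hypothetical wrapper: force line mode in CI or when stdout is not a terminal.
run_watch() {
  # [ -t 1 ] is true only when stdout is attached to a terminal.
  if [ -n "${CI:-}" ] || [ ! -t 1 ]; then
    set -- "$@" --watch-mode line
  fi
  # COMPOSE_CLI is a hypothetical override so the binary can be swapped out.
  "${COMPOSE_CLI:-hpc-compose}" watch "$@"
}
```

A call such as `run_watch -f compose.yaml | tee watch.log` would then record plain line output, because the pipe makes stdout non-interactive.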

Replay

hpc-compose replay reconstructs a best-effort execution timeline after the run. It reuses the watch-style view, but reads only artifacts that already exist under the tracked job directory. This makes it useful for rewinding to the time a service failed, comparing the nearest prior metrics sample, or sharing a deterministic text/JSON summary without querying Slurm again.

hpc-compose replay -f compose.yaml
hpc-compose replay -f compose.yaml --speed 10
hpc-compose replay -f compose.yaml --job-id 12345 --service trainer
hpc-compose replay -f compose.yaml --no-tui
hpc-compose replay -f compose.yaml --format json

Replay controls:

Key | Action
Space | Pause or play the replay.
+ / - | Move between speed presets such as 1x, 10x, and 100x.
Left / Right | Seek backward or forward by five seconds.
[ / ] | Jump to the previous or next reconstructed event.
Home / End | Jump to the first or final replay frame.
/, a, PgUp, PgDn, q | Same filter, log-pane, scroll, and quit behavior as watch.

Replay data sources:

Source | What replay uses | Fidelity notes
state.json | Final per-service state, start/finish times, exit code fallback, placement metadata | This file is overwritten during the run, so intermediate readiness and scheduler transitions are not exact.
service-exits/*.jsonl | Append-only service exit markers and restart evidence | Multiple exits reconstruct failure/restart sequences, but accepted restart relaunch time is inferred.
metrics/*.jsonl | Historical GPU and Slurm sampler rows | Replay shows the latest metrics sample at or before the cursor and never displays future metrics as current.
logs/*.log | Service log tails in the replay UI | Service logs do not include guaranteed per-line timestamps, so log panes are contextual tails, not exact log-time scrubbing.
Scheduler commands | Not queried during replay | Historical queue state, pending reason changes, and accounting gaps are not reconstructed.

Use --no-tui for a static summary that exits immediately. Use --format json when notebooks, dashboards, or experiment records need the reconstructed events, frame summaries, artifact paths, and fidelity notes.

Logs

Runtime logs live under:

${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/logs/<service>.log

Slurm may also write a top-level batch log, either to a default file such as slurm-<jobid>.out or to the path configured with x-slurm.output. Check the batch log first when a job fails before any service log appears.

Service names containing non-alphanumeric characters are encoded in log filenames. Prefer [a-zA-Z0-9_-] in service names for readability.

Use --grep <pattern> to print only matching raw log lines across selected service logs. Use --since <duration> for coarse time-bounded initial output, for example 30s, 15m, 2h, 1d, or 1h30m. Because service logs do not include line timestamps, --since filters by each log file’s modification time rather than by individual line time. Follow mode still starts from the current end of each selected log and applies --grep to appended lines.
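
Because the filter works on file modification time, you can preview which log files a coarse window would cover with ordinary tools. This is only an approximation of the behavior described above, not hpc-compose's implementation; `recent_logs` is a hypothetical helper, and `-mmin` is a widely supported find extension (GNU, BSD, BusyBox) rather than strict POSIX.

```shell
# Hypothetical helper: list *.log files modified within the last N minutes.
recent_logs() {
  log_dir=$1
  minutes=$2
  # -mmin -N matches files whose modification time is less than N minutes ago.
  find "$log_dir" -type f -name '*.log' -mmin "-$minutes" | sort
}
```

For example, `recent_logs .hpc-compose/12345/logs 30` lists the service logs touched in the last half hour, which is roughly the set a 30m window would keep.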

Event Hooks

Per-service x-slurm.hooks can run host-side observability scripts when restart_on_failure accepts a restart or when the rolling restart window blocks a crash loop. Hook stdout/stderr is appended to that service’s log, and non-zero hook exits are logged without changing the restart or failure outcome.

Use on: restart for retry notifications and on: window_exhausted for crash-loop alerts. Event hooks receive service identity, exit code, Slurm attempt, and restart-window counters through HPC_COMPOSE_* environment variables; see Spec reference for the full list.
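
A hook script does not need to know the variable names in advance to be useful. The sketch below simply prints every HPC_COMPOSE_* variable it receives, so the context lands in the service log alongside the hook's other output. `log_hook_context` is a hypothetical function name, and no specific variable names are assumed.

```shell
# Hypothetical hook body: record when the hook fired and what context it got.
log_hook_context() {
  printf 'hook fired at %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # Print whatever HPC_COMPOSE_* variables are present, in a stable order.
  env | grep '^HPC_COMPOSE_' | sort
  # Always succeed: grep exits non-zero when nothing matches.
  return 0
}
```

Returning zero keeps the hook quiet in the log; as noted above, a non-zero exit would be logged but would not change the restart or failure outcome.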

Metrics

When x-slurm.metrics is enabled, sampler files are written under:

${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/metrics/
  meta.json
  gpu.jsonl
  gpu_processes.jsonl
  slurm.jsonl
  diagnostics/

The sampler can collect GPU snapshots through nvidia-smi and job-step CPU/memory snapshots through sstat. Collector failures are best-effort: missing nvidia-smi, missing sstat, or unsupported queries do not fail the batch job itself.
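
Since collectors are best-effort, a quick way to check what was actually captured is to count the rows in each sampler file. This is a minimal sketch: `count_metric_rows` is a hypothetical helper, and it assumes only that each line of a .jsonl file is one sample row.

```shell
# Hypothetical helper: print "<file> <row count>" for each sampler file.
count_metric_rows() {
  metrics_dir=$1
  for f in "$metrics_dir"/*.jsonl; do
    # Skip the unexpanded glob when the directory has no .jsonl files.
    [ -f "$f" ] || continue
    # grep -c '' counts lines, including a last line with no trailing newline.
    printf '%s %s\n' "$(basename "$f")" "$(grep -c '' "$f")"
  done
}
```

A file that is missing or has zero rows suggests the corresponding collector was unavailable on that node; the diagnostics/ directory is the next place to look.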

Add --accounting to stats when you need post-run sacct rollups for reporting. The accounting summary includes allocated CPU-hours, total CPU-hours when available, allocated GPU-hours, allocation-based memory byte-seconds, and observed maximum RSS. Memory byte-seconds are labeled as allocation-based because Slurm’s standard accounting fields do not reliably provide true usage-based memory-seconds across all clusters.

Use hpc-compose inspect --rightsize -f compose.yaml after a tracked Slurm run to convert those observations into conservative resource suggestions. The assistant requires tracked submission metadata and compares explicit requests such as x-slurm.mem, x-slurm.time, x-slurm.gpus, and service x-slurm.cpus_per_task against sacct, sstat, and nvidia-smi sampler evidence. It only reports suggestions; it does not rewrite the compose file.

Use hpc-compose score <job-id> after a tracked Slurm run when you want a compact efficiency grade. The score reuses sampler history, sacct, sstat, and right-sizing recommendations, then reports GPU utilization, memory utilization, active compute-time versus requested walltime, and a best-effort kWh estimate. The energy estimate uses sampled GPU power when available; otherwise it falls back to power limits or to the assumptions configured through --gpu-tdp-w, --cpu-watts-per-core, and --pue. It does not claim carbon intensity or emissions.
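
To get a feel for the scale of a fallback estimate, the arithmetic can be sketched by hand. The formula here is an assumption made purely for illustration: (GPU count × per-GPU TDP + CPU cores × watts per core) × hours × PUE, converted to kWh. This page does not state how hpc-compose combines these inputs, and `estimate_kwh` is a hypothetical helper.

```shell
# Hypothetical helper, illustrative formula only (not hpc-compose's method).
# Arguments: gpus gpu_tdp_w cpu_cores cpu_watts_per_core hours pue
estimate_kwh() {
  awk -v g="$1" -v t="$2" -v c="$3" -v w="$4" -v h="$5" -v p="$6" \
    'BEGIN { printf "%.3f\n", (g * t + c * w) * h * p / 1000 }'
}
```

With 4 GPUs at 300 W, 32 cores at 5 W each, 2 hours, and a PUE of 1.2, `estimate_kwh 4 300 32 5 2 1.2` works out to (1200 + 160) W × 2 h × 1.2 = 3264 Wh, or 3.264 kWh.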

Use hpc-compose germinate -f compose.yaml before a full run when you want a short canary to gather fresh evidence. Canary runs write their record to .hpc-compose/latest-canary.json, so the metadata written by a normal up still points to the latest production submission.

Sweep Manifests

hpc-compose sweep submit stores sweep state under .hpc-compose/sweeps/<sweep-id>/sweep.json and refreshes .hpc-compose/sweeps/latest.json. The manifest records the matrix mode, persisted random seed, trial ids, trial variables, rendered script paths, job ids, per-trial job record paths, submit times, and any submit error.

Each submitted trial also writes a normal job record under .hpc-compose/jobs/<job-id>.json with kind: sweep_trial and a sweep metadata block. Sweep-trial records deliberately do not replace normal latest.json or latest-run.json, so hpc-compose status, watch, and logs continue to target ordinary runs unless you pass an explicit job id.

hpc-compose sweep status -f compose.yaml --format json loads the manifest and queries the same scheduler/tracking snapshot code used for ordinary jobs. It reports per-trial state plus aggregate counts for completed, failed, running, pending, unknown, missing_tracking, and submit_failed. V1 does not parse metric files or infer the best trial; keep metric summaries in your training output or external experiment tracker.

Diffing Runs

Use hpc-compose diff <job-id-1> <job-id-2> to compare two tracked submissions. The compact text view highlights outcome, resource, and config changes; --format json returns the full uncapped diff for notebooks or experiment records. Older tracked jobs without config snapshots still compare outcome metadata and report a note that config comparison is unavailable.