Troubleshooting

Use this page when the safe authoring path worked but the first real cluster run failed.

For background on Slurm allocations, sbatch, srun, Pyxis, and Enroot, see Slurm And Container Basics. For HAICORE-specific storage and runtime checks, see HAICORE Guide.

First Triage

hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-env
hpc-compose plan --verbose -f compose.yaml
hpc-compose debug -f compose.yaml --preflight

plan --verbose can print resolved environment values and final mount mappings. Treat its output as sensitive when the spec contains secrets. debug is read-only unless --preflight is passed; with --preflight, it reruns prerequisite checks and includes those findings in the triage report.
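The four triage commands can be chained so that the first failing step stops the run. The following is a minimal sketch, not part of hpc-compose: the triage function, the HPC_COMPOSE_BIN variable, and the triage-plan.txt file name are hypothetical, and it assumes each command exits non-zero when it reports a problem.

```shell
#!/usr/bin/env bash
# Sketch: run the triage commands in order and stop at the first failure.
# HPC_COMPOSE_BIN is a hypothetical variable of this sketch, not an
# hpc-compose setting. Assumption: each command exits non-zero on a problem.

triage() {
  local spec="${1:-compose.yaml}"
  local bin="${HPC_COMPOSE_BIN:-hpc-compose}"

  echo "== validate" >&2
  "$bin" validate -f "$spec" || return 1

  echo "== validate --strict-env" >&2
  "$bin" validate -f "$spec" --strict-env || return 2

  # plan --verbose can print resolved secrets, so send its output to a file
  # instead of the terminal. The umask makes a newly created file readable
  # only by the current user.
  echo "== plan --verbose (output in triage-plan.txt)" >&2
  ( umask 077; "$bin" plan --verbose -f "$spec" > triage-plan.txt ) || return 3

  echo "== debug --preflight" >&2
  "$bin" debug -f "$spec" --preflight || return 4
}
```

Call it as triage compose.yaml. The return code (1 to 4) identifies the step that failed. Delete triage-plan.txt once you have read it if the spec contains secrets.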

Common Symptoms

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| required binary '...' was not found | Selected backend or Slurm client tool is not on PATH. | Run debug --preflight; pass --enroot-bin, --apptainer-bin, --singularity-bin, --srun-bin, or --sbatch-bin as needed. |
| srun does not advertise --container-image | Pyxis support is unavailable or not loaded. | Move to a supported login node, load the site module, or choose another backend. |
| Cache directory warning/error | The resolved cache directory is not shared, writable, or policy-safe. | Choose a shared project/work/scratch path through x-slurm.cache_dir or setup --cache-dir, then rerun debug --preflight. |
| Missing local mount or image path | Relative paths are resolved from the compose file directory. | Check paths relative to the copied compose.yaml. |
| Mounted symlink exists on the host but fails in the container | The symlink target is outside the mounted directory. | Copy the real file into the mounted directory or mount the target directory. |
| Anonymous pull or registry warning | Registry credentials are missing or rate limits apply. | Configure credentials before relying on private or rate-limited images. |
| Services start in the wrong order | Dependency condition or readiness is too weak. | Use service_healthy with readiness, or service_completed_successfully for DAG stages. |
| No service logs exist | The batch script failed before launching a service. | Use debug to see scheduler state, the tracked top-level batch log tail, and missing-log hints. |
| dev reports no watchable source directories | Services only mount files, missing paths, cache paths, or container-only paths. | Mount the source as a host directory or pass hpc-compose dev --watch-path ./src -f compose.yaml. |
| Readiness never passes | Probe target, pattern, host, or dependency timing does not match the real service. | Inspect the service log with logs --service <name> and try a finite hpc-compose test --local or short test --submit spec. |
| Smoke test times out | The spec is long-running, readiness blocks forever, or the scheduler job never reaches terminal state. | Make the smoke spec finite, lower service readiness timeouts, and use --format json to inspect the failed phase and service reason. |
| tmux is unavailable or attach fails | tmux is not installed or the shell is non-interactive. | Install tmux, pass --tmux-bin <PATH>, or create the dashboard with --no-attach. |
| Local mode is unsupported | Local workflows require a Linux host with Pyxis-compatible Enroot behavior. | Use authoring commands on non-Linux hosts, then run test --submit or up on a supported Slurm login node. |

Readiness Issues

Use depends_on with condition: service_healthy when a dependent must wait for a dependency’s readiness probe. Plain list form means service_started.

Use condition: service_completed_successfully for one-shot DAG stages where the next service should start only after the previous stage exits with status 0, such as preprocess -> train -> postprocess.
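As an illustration of the map form, the snippet below writes a small preprocess -> train -> postprocess spec to a scratch file. It is a sketch under assumptions: the file name dag-sketch.yaml, the image names, and the commands are placeholders, the layout follows the Compose-style depends_on map, and a real spec may need further fields such as x-slurm settings.

```shell
# Sketch: write a three-stage DAG spec to a scratch file.
# Images and commands are placeholders, not tested against a real cluster.
cat > dag-sketch.yaml <<'EOF'
services:
  preprocess:
    image: busybox:latest            # placeholder image
    command: ["sh", "-c", "echo preprocessing"]

  train:
    image: busybox:latest            # placeholder image
    command: ["sh", "-c", "echo training"]
    depends_on:
      preprocess:
        # Start only after preprocess exits with status 0.
        condition: service_completed_successfully

  postprocess:
    image: busybox:latest            # placeholder image
    command: ["sh", "-c", "echo postprocessing"]
    depends_on:
      train:
        condition: service_completed_successfully
EOF
```

Check the result with hpc-compose validate -f dag-sketch.yaml before building on it. For a long-running dependency with a readiness probe, the same map form would carry condition: service_healthy instead.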

When a TCP port opens before the service is fully usable, prefer HTTP or log-based readiness over TCP readiness.

For hpc-compose test, readiness failures are terminal smoke-test failures. A service with configured readiness must become healthy and then complete successfully; ignored sidecars are still expected to pass in a smoke spec.

Preview A Run

Use plan for the static preview. It never prepares images, runs preflight, calls sbatch, or writes hpc-compose.sbatch:

hpc-compose plan --show-script -f compose.yaml

Use up --dry-run only when you intentionally want to exercise preflight, prepare, and render without calling sbatch:

hpc-compose up --dry-run -f compose.yaml

Clean Old Tracked Runs

Tracked job metadata and logs accumulate in .hpc-compose/. Preview cleanup before deleting:

hpc-compose jobs list --disk-usage
hpc-compose clean -f compose.yaml --age 7 --dry-run
hpc-compose clean -f compose.yaml --age 7
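The preview-then-delete sequence can be wrapped so that deletion only happens after an explicit confirmation. This is a hedged sketch rather than an hpc-compose feature: the clean_old_runs function and the HPC_COMPOSE_BIN variable are hypothetical, and the age argument is passed to --age unchanged.

```shell
#!/usr/bin/env bash
# Sketch: show the dry-run preview first, then delete only on "y".
# HPC_COMPOSE_BIN is a hypothetical variable of this sketch.

clean_old_runs() {
  local spec="${1:-compose.yaml}"
  local age="${2:-7}"
  local bin="${HPC_COMPOSE_BIN:-hpc-compose}"
  local answer

  # Preview only; stop if the preview itself fails.
  "$bin" clean -f "$spec" --age "$age" --dry-run || return 1

  printf 'Delete the runs listed above? [y/N] ' >&2
  read -r answer || answer=""

  case "$answer" in
    y|Y) "$bin" clean -f "$spec" --age "$age" ;;
    *)   echo "Nothing deleted." >&2 ;;
  esac
}
```

Call it as clean_old_runs compose.yaml 7. Any answer other than y or Y leaves the tracked runs in place.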