LEVIATHAN SYSTEMS

Field Guide_

GPU Cluster Acceptance Testing_

A cluster that powers on is not a cluster that works. Acceptance testing proves it — node by node, link by link, then end to end under sustained load. This is the test plan: single-node validation, fabric integrity, NCCL collective bandwidth against the expected envelope, and a thermal burn-in soak — with the threshold that counts as a pass and the conditions that count as a fail.

01

Single-Node Validation (before the fabric)

Per-node baseline

  • Inventory check: every GPU enumerates (nvidia-smi shows full count), correct SKU, expected memory, and ECC enabled with zero uncorrectable errors.
  • Intra-node bandwidth: NVLink/NVSwitch verified with a bandwidth test (e.g. nvbandwidth or p2pBandwidthLatencyTest); every GPU pair must run near expected line rate, with no degraded links.
  • Thermals and power: idle and loaded temperatures within spec, no clock throttling under a short stress load, PSU/power readings nominal.
  • Gate: every node passes its own baseline before it is allowed into the fabric test — one bad node corrupts collective results.
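
The inventory bullet above can be sketched as a small parser over nvidia-smi's CSV query output. This is a minimal sketch, not a definitive implementation: the function name check_inventory, the "Example GPU" SKU string, and the memory floor are illustrative assumptions, and the query command is shown only in a comment so the example runs without a GPU.

```python
import csv
import io

# Assumed source of csv_text (run separately on the node under test):
#   nvidia-smi --query-gpu=name,memory.total,ecc.mode.current,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader,nounits
# Each output row is then: name, memory in MiB, ECC mode, uncorrected ECC error count.


def check_inventory(csv_text, expected_count, expected_name, min_memory_mib):
    """Return a list of failure strings; an empty list means the node passes."""
    rows = [
        [field.strip() for field in row]
        for row in csv.reader(io.StringIO(csv_text))
        if row  # skip blank lines
    ]
    failures = []
    if len(rows) != expected_count:
        failures.append(f"expected {expected_count} GPUs, found {len(rows)}")
    for index, row in enumerate(rows):
        if len(row) != 4:
            failures.append(f"GPU {index}: malformed row {row!r}")
            continue
        name, memory_mib, ecc_mode, uncorrected = row
        if name != expected_name:
            failures.append(f"GPU {index}: unexpected SKU {name!r}")
        if not memory_mib.isdigit() or int(memory_mib) < min_memory_mib:
            failures.append(f"GPU {index}: memory {memory_mib} MiB below {min_memory_mib}")
        if ecc_mode != "Enabled":
            failures.append(f"GPU {index}: ECC mode is {ecc_mode!r}")
        if uncorrected != "0":
            failures.append(f"GPU {index}: uncorrectable ECC errors = {uncorrected}")
    return failures


# Hypothetical two-GPU sample; the second GPU has ECC disabled.
sample = "Example GPU, 81920, Enabled, 0\nExample GPU, 81920, Disabled, 0\n"
for failure in check_inventory(sample, 2, "Example GPU", 80000):
    print(failure)
```

Returning the full list of failures rather than stopping at the first one lets a single pass record everything wrong with a node before it is pulled for repair.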

02

Fabric Health & Link Integrity

InfiniBand / Spectrum-X

  • 100% of ports link at rated speed (e.g. NDR 400G / XDR 800G on InfiniBand); no port negotiated down to a lower rate.
  • Subnet/fabric clean: no missing routes, no isolated nodes; topology matches the design (rail-optimized / fat-tree).
  • Error counters: check symbol errors, link-downed, and port-rcv-errors across a soak window — a link that is up but accumulating errors fails.
  • Gate: zero error accumulation and full topology presence before running performance tests.
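
The error-counter gate above amounts to comparing two counter snapshots taken at the start and end of the soak window. The sketch below assumes the snapshots have already been collected into dictionaries; the counter names are modeled on the InfiniBand port counters named in the list, and the exact names, the port identifiers, and the collection tooling are assumptions that will vary by fabric.

```python
# Counter names are assumptions modeled on InfiniBand port counters;
# substitute whatever the fabric tooling actually reports.
WATCHED_COUNTERS = ("SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors")


def ports_accumulating_errors(before, after, watched=WATCHED_COUNTERS):
    """Compare two snapshots shaped {port_id: {counter_name: value}}.

    Returns {port_id: {counter_name: delta}} for every port whose watched
    counters grew during the window. A port present in `before` but absent
    from `after` is reported with a "missing_after_soak" marker.
    """
    failing = {}
    for port, start in before.items():
        end = after.get(port)
        if end is None:
            failing[port] = {"missing_after_soak": 1}
            continue
        grown = {}
        for name in watched:
            delta = end.get(name, 0) - start.get(name, 0)
            # A negative delta suggests a counter reset; it is ignored here,
            # which is a simplification a real plan may not accept.
            if delta > 0:
                grown[name] = delta
        if grown:
            failing[port] = grown
    return failing


# Hypothetical snapshots: port "leaf1/p2" is up but gaining symbol errors.
start = {
    "leaf1/p1": {"SymbolErrorCounter": 0, "LinkDownedCounter": 0, "PortRcvErrors": 0},
    "leaf1/p2": {"SymbolErrorCounter": 5, "LinkDownedCounter": 0, "PortRcvErrors": 0},
}
end = {
    "leaf1/p1": {"SymbolErrorCounter": 0, "LinkDownedCounter": 0, "PortRcvErrors": 0},
    "leaf1/p2": {"SymbolErrorCounter": 9, "LinkDownedCounter": 0, "PortRcvErrors": 0},
}
print(ports_accumulating_errors(start, end))
```

Working on deltas rather than absolute values means a port with historical errors is judged only on what it accumulated during the window.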

03

Collective Performance (NCCL)

nccl-tests

  • Run nccl-tests all-reduce / all-gather / reduce-scatter across the full cluster at large message sizes (where bus bandwidth plateaus).
  • Compare measured bus bandwidth against the expected envelope for the GPU/fabric generation; sustained results must sit within the accepted range, not just peak once.
  • Scale the test (single node → rail → full domain) and confirm bandwidth scales as designed — a sharp drop at a scale boundary points to a cabling or topology fault.
  • Gate: collective bandwidth within the expected range at full scale, repeatable across runs.
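
For reference, the bus bandwidth reported by nccl-tests is derived from the algorithm bandwidth (bytes moved divided by elapsed time) scaled by a per-collective factor: 2(n-1)/n for all-reduce and (n-1)/n for all-gather and reduce-scatter, where n is the number of ranks. The sketch below applies those factors and then checks a set of runs against an envelope. The function names, envelope bounds, and spread tolerance are illustrative assumptions; the real bounds come from the project acceptance test plan.

```python
# Correction factors from algorithm bandwidth to bus bandwidth, per collective.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth_gbps(collective, size_bytes, seconds, n_ranks):
    """Bus bandwidth in GB/s for one measurement (1 GB = 1e9 bytes)."""
    algbw = size_bytes / seconds / 1e9
    return algbw * BUS_FACTORS[collective](n_ranks)


def within_envelope(samples_gbps, low, high):
    """True only if every run sits inside the accepted range."""
    return all(low <= s <= high for s in samples_gbps)


def repeatable(samples_gbps, max_relative_spread):
    """True if (max - min) / mean across runs stays under the tolerance."""
    mean = sum(samples_gbps) / len(samples_gbps)
    return (max(samples_gbps) - min(samples_gbps)) / mean <= max_relative_spread


# Hypothetical runs: 8 GB all-reduce across 8 ranks, three repetitions.
runs = [bus_bandwidth_gbps("all_reduce", 8e9, t, 8) for t in (0.050, 0.051, 0.0505)]
print(within_envelope(runs, low=270.0, high=300.0), repeatable(runs, 0.05))
```

Checking every sample rather than the best one reflects the requirement that results be sustained and repeatable, not a single peak.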

04

Thermal Burn-In / Soak Under Load

Sustained-load soak

  • Drive the cluster at sustained high utilization for the agreed soak window (commonly 24–72 h; longer for large fleets) — the point is to surface marginal hardware and cooling under real heat.
  • Watch for thermal trips, clock throttling, ECC error growth, link flap, and coolant ΔT drift; temperatures must stay stable, not climb.
  • Correlate power draw and coolant return temperature against the commissioning baseline; investigate any node that runs hot or draws out of family.
  • Gate: no thermal trips, no error growth, stable temperatures and bandwidth through the full soak.
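
One way to turn "temperatures must stay stable, not climb" into a check is to fit a straight line to the temperature samples and fail the node if the slope exceeds an agreed limit. The sketch below does that with an ordinary least-squares slope and adds the trip, throttle, and ECC-growth checks from the list above. The sample layout, field names, and all thresholds are hypothetical; samples are assumed to be in time order.

```python
def slope_per_hour(hours, values):
    """Ordinary least-squares slope of values against hours."""
    n = len(hours)
    mean_x = sum(hours) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    if sxx == 0:
        raise ValueError("need at least two samples at distinct times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, values))
    return sxy / sxx


def soak_failures(samples, trip_temp_c, max_slope_c_per_hour):
    """Evaluate one node's soak samples; an empty list means it passes.

    Each sample is a dict with keys: hours, temp_c, throttled (bool),
    ecc_errors (cumulative count). These field names are assumptions.
    """
    failures = []
    temps = [s["temp_c"] for s in samples]
    if max(temps) >= trip_temp_c:
        failures.append(f"peak temperature {max(temps)} C reached trip limit")
    if any(s["throttled"] for s in samples):
        failures.append("clock throttling observed")
    ecc_growth = samples[-1]["ecc_errors"] - samples[0]["ecc_errors"]
    if ecc_growth > 0:
        failures.append(f"ECC errors grew by {ecc_growth}")
    slope = slope_per_hour([s["hours"] for s in samples], temps)
    if slope > max_slope_c_per_hour:
        failures.append(f"temperature climbing at {slope:.2f} C/h")
    return failures


# Hypothetical node that heats steadily through the window.
climbing = [
    {"hours": h, "temp_c": t, "throttled": False, "ecc_errors": 0}
    for h, t in zip([0, 1, 2, 3], [70, 72, 74, 76])
]
print(soak_failures(climbing, trip_temp_c=90, max_slope_c_per_hour=0.5))
```

A slope test catches a node that never trips during the window but would do so given more time, which a simple peak-temperature check would miss.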

05

Document & Accept

ATP sign-off

  • Record every result against its threshold: per-node baselines, fabric error counts, NCCL bandwidth by scale, and the burn-in trend.
  • List any nodes/links replaced during testing and the re-test that cleared them.
  • Deliver the acceptance test report with the as-built so the operator inherits a known-good baseline to monitor against.
  • Owner sign-off on the acceptance test plan (ATP): the cluster is accepted into production.
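
Recording each result against its threshold, and enforcing the stage gates in order, could look like the sketch below. The GateResult record, the stage names, and the JSON layout are hypothetical illustrations of one possible report structure, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass

# Stage order mirrors the test plan; the names themselves are assumptions.
STAGE_ORDER = ["single-node", "fabric", "nccl", "burn-in"]


@dataclass
class GateResult:
    stage: str
    test: str
    measured: str
    threshold: str
    passed: bool
    retest_of: str = ""  # component replaced before this result, if any


def first_failed_stage(results, stage_order=STAGE_ORDER):
    """Return the first stage that has no results or any failing result."""
    for stage in stage_order:
        stage_results = [r for r in results if r.stage == stage]
        if not stage_results or not all(r.passed for r in stage_results):
            return stage
    return None


def render_report(results):
    """Serialize the results and the overall verdict as JSON text."""
    blocked_at = first_failed_stage(results)
    return json.dumps(
        {
            "accepted": blocked_at is None,
            "blocked_at": blocked_at,
            "results": [asdict(r) for r in results],
        },
        indent=2,
    )


# Hypothetical results: the fabric stage fails, so later stages are gated.
results = [
    GateResult("single-node", "GPU enumeration", "8 of 8", "8 of 8", True),
    GateResult("fabric", "symbol errors over soak", "+4", "+0", False),
]
print(render_report(results))
```

Treating a stage with no recorded results as not passed keeps a skipped test from being mistaken for a clean one at sign-off.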

Pass / Fail Criteria_

Test               Pass Criteria
GPU enumeration    Full count present, correct SKU/memory, ECC on, zero uncorrectable errors
Intra-node BW      All NVLink/NVSwitch pairs near expected line rate, no degraded link
Port link rate     100% ports up at rated NDR/XDR speed, no fallback
Fabric errors      No symbol-error / link-down accumulation across the soak window
NCCL bandwidth     Bus bandwidth within expected envelope at full scale, repeatable
Thermal burn-in    No trips, no throttling, stable temps + ECC through full soak
Documentation      Results vs thresholds recorded; ATP report delivered + owner sign-off

Exact bandwidth and thermal thresholds depend on the GPU and fabric generation and on the specific design; the governing documents are the project acceptance test plan and the OEM spec. This is the field method Leviathan runs on live GB200 and GB300 clusters.

Questions_

How do you acceptance-test a GPU cluster before production?

In five stages: (1) single-node validation — GPU enumeration, ECC, intra-node NVLink bandwidth; (2) fabric health — 100% ports at rated speed with zero error accumulation; (3) collective performance — nccl-tests all-reduce/all-gather measured against the expected bus-bandwidth envelope at full scale; (4) a thermal burn-in soak under sustained load (commonly 24–72 h) watching for trips, throttling, and ECC growth; and (5) documentation and owner sign-off on the acceptance test plan. Each stage has a pass gate before the next.

What makes a GPU cluster fail acceptance testing?

A link that comes up but accumulates symbol errors, a port that negotiated down to a lower rate, NCCL bandwidth that drops at a scale boundary (a cabling/topology fault), a node running hot or throttling, ECC error growth during burn-in, or a thermal trip under sustained load. Passing once at peak is not acceptance — results must be within range, repeatable, and stable through the full soak.

Who runs GPU cluster acceptance and burn-in testing?

Leviathan Systems runs GPU cluster acceptance testing across the United States — single-node validation, fabric link integrity, NCCL and per-link bandwidth testing against expected thresholds, and thermal burn-in soak — delivering the acceptance test report and a known-good baseline with owner sign-off.
