Field Guide
GPU Cluster Acceptance Testing
A cluster that powers on is not a cluster that works. Acceptance testing proves it — node by node, link by link, then end to end under sustained load. This is the test plan: single-node validation, fabric integrity, NCCL collective bandwidth against the expected envelope, and a thermal burn-in soak — with the threshold that counts as a pass and the conditions that count as a fail.
Single-Node Validation (before the fabric)
Per-node baseline
- Inventory check: every GPU enumerates (nvidia-smi shows full count), correct SKU, expected memory, and ECC enabled with zero uncorrectable errors.
- Intra-node bandwidth: NVLink/NVSwitch verified with the bandwidth test (e.g. nvbandwidth / p2pBandwidthLatencyTest) — all GPU pairs near expected line rate, no degraded links.
- Thermals and power: idle and loaded temperatures within spec, no clock throttling under a short stress load, PSU/power readings nominal.
- Gate: every node passes its own baseline before it is allowed into the fabric test — one bad node corrupts collective results.
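The inventory check above can be scripted per node. The following is a minimal sketch, assuming the node's inventory is collected with `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` using the fields listed in the comment; the function name `check_inventory`, the sample GPU name, and the memory floor are hypothetical placeholders, and the real expected values come from the project acceptance test plan.

```python
import csv
import io

# Assumed collection command, run on each node:
#   nvidia-smi --query-gpu=name,memory.total,ecc.mode.current,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader,nounits
QUERY_FIELDS = ["name", "memory.total", "ecc.mode.current",
                "ecc.errors.uncorrected.volatile.total"]


def check_inventory(csv_text, expected_count, expected_name, min_memory_mib):
    """Return a list of failure strings; an empty list means the node passes."""
    failures = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines

    if len(rows) != expected_count:
        failures.append(f"expected {expected_count} GPUs, found {len(rows)}")

    for idx, row in enumerate(rows):
        if len(row) != len(QUERY_FIELDS):
            failures.append(f"GPU {idx}: unparseable row {row!r}")
            continue
        name, memory, ecc_mode, ecc_errors = (field.strip() for field in row)
        if name != expected_name:
            failures.append(f"GPU {idx}: unexpected SKU {name!r}")
        if not memory.isdigit() or int(memory) < min_memory_mib:
            failures.append(f"GPU {idx}: memory {memory!r} below {min_memory_mib} MiB")
        if ecc_mode != "Enabled":
            failures.append(f"GPU {idx}: ECC mode is {ecc_mode!r}")
        if ecc_errors != "0":
            failures.append(f"GPU {idx}: uncorrectable ECC errors = {ecc_errors}")
    return failures


if __name__ == "__main__":
    # Hypothetical two-GPU sample; the second GPU reports ECC errors.
    sample = ("EXAMPLE-GPU, 100000, Enabled, 0\n"
              "EXAMPLE-GPU, 100000, Enabled, 2\n")
    for failure in check_inventory(sample, 2, "EXAMPLE-GPU", 100000):
        print("FAIL:", failure)
```

Returning a list of failures rather than a single boolean keeps the reason for each rejected node in the record, which is what the acceptance report needs later.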
Fabric Health & Link Integrity
InfiniBand / Spectrum-X
- 100% of ports link at rated speed (e.g. NDR 400G / XDR 800G on InfiniBand, or the equivalent 400/800 GbE rate on Spectrum-X Ethernet); no port negotiated down to a lower rate.
- Subnet/fabric clean: no missing routes, no isolated nodes; topology matches the design (rail-optimized / fat-tree).
- Error counters: check symbol errors, link-downed, and port-rcv-errors across a soak window — a link that is up but accumulating errors fails.
- Gate: zero error accumulation and full topology presence before running performance tests.
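The error-counter check above amounts to taking a snapshot of each port's counters at the start and end of the soak window and failing any port whose counters moved. A minimal sketch follows, assuming the snapshots have already been collected into dictionaries; the counter names mirror those printed by the InfiniBand `perfquery` tool, but the port labels, the snapshot format, and the function name `counter_growth` are hypothetical.

```python
# Counters watched across the soak window. Names follow perfquery output;
# treat the exact set as an assumption and adjust it to the project test plan.
WATCHED = ("SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors")


def counter_growth(before, after, watched=WATCHED):
    """Compare two snapshots shaped {port_id: {counter_name: value}}.

    Returns {port_id: {counter_name: delta}} for every port that changed.
    A negative delta means the counter was reset mid-window, which also
    invalidates the soak for that port, so any nonzero delta is reported.
    """
    failing = {}
    for port, start in before.items():
        end = after.get(port)
        if end is None:
            # Port disappeared from the fabric between snapshots.
            failing[port] = {"missing": 1}
            continue
        deltas = {name: end.get(name, 0) - start.get(name, 0) for name in watched}
        changed = {name: delta for name, delta in deltas.items() if delta != 0}
        if changed:
            failing[port] = changed
    return failing


if __name__ == "__main__":
    # Hypothetical snapshots: port p2 is up but accumulating symbol errors.
    before = {
        "leaf1/p1": {"SymbolErrorCounter": 0, "LinkDownedCounter": 0, "PortRcvErrors": 0},
        "leaf1/p2": {"SymbolErrorCounter": 3, "LinkDownedCounter": 0, "PortRcvErrors": 0},
    }
    after = {
        "leaf1/p1": {"SymbolErrorCounter": 0, "LinkDownedCounter": 0, "PortRcvErrors": 0},
        "leaf1/p2": {"SymbolErrorCounter": 17, "LinkDownedCounter": 0, "PortRcvErrors": 0},
    }
    print(counter_growth(before, after))
```

Comparing deltas rather than absolute values matters because a port can carry a nonzero count from installation and still be healthy; what fails the gate is growth during the window.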
Collective Performance (NCCL)
nccl-tests
- Run nccl-tests all-reduce / all-gather / reduce-scatter across the full cluster at large message sizes (where bus bandwidth plateaus).
- Compare measured bus bandwidth against the expected envelope for the GPU/fabric generation; sustained results must sit within the accepted range, not just peak once.
- Scale the test (single node → rail → full domain) and confirm bandwidth scales as designed — a sharp drop at a scale boundary points to a cabling or topology fault.
- Gate: collective bandwidth within the expected range at full scale, repeatable across runs.
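The collective gate above can be expressed as three small checks. This sketch assumes the bus-bandwidth figures have already been extracted from nccl-tests output; the all-reduce conversion `busbw = algbw * 2(n-1)/n` is the one documented by nccl-tests, while the function names, the envelope floor, the 5% repeatability spread, and the 10% scale tolerance are hypothetical placeholders for values set in the project acceptance test plan.

```python
def allreduce_busbw(algbw_gbs, n_ranks):
    """nccl-tests convention for all-reduce: busbw = algbw * 2 * (n - 1) / n."""
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks


def gate_collective(busbw_runs, envelope_min, max_spread=0.05):
    """Pass only if every run clears the floor AND the runs agree with each other.

    max_spread is (max - min) / mean across runs; 5% is an assumed default.
    """
    if not busbw_runs:
        return False
    mean = sum(busbw_runs) / len(busbw_runs)
    spread = (max(busbw_runs) - min(busbw_runs)) / mean
    return min(busbw_runs) >= envelope_min and spread <= max_spread


def scale_faults(results, tolerance=0.10):
    """results: ordered list of (scale_label, measured_busbw, designed_busbw).

    Returns the labels whose measured value falls more than `tolerance`
    below the designed value for that scale.
    """
    return [label for label, measured, designed in results
            if measured < designed * (1 - tolerance)]


if __name__ == "__main__":
    # All numbers below are illustrative, not reference figures for any hardware.
    print(allreduce_busbw(100.0, 8))
    print(gate_collective([370.0, 372.0, 368.0], envelope_min=350.0))
    print(scale_faults([
        ("single node", 440.0, 450.0),
        ("rail", 365.0, 370.0),
        ("full domain", 290.0, 370.0),
    ]))
```

Checking the minimum run rather than the best run is what enforces "sustained, not just peak once", and comparing each scale against its own designed value avoids flagging the expected step down from intra-node to fabric bandwidth as a fault.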
Thermal Burn-In / Soak Under Load
Sustained-load soak
- Drive the cluster at sustained high utilization for the agreed soak window (commonly 24–72 h; longer for large fleets) — the point is to surface marginal hardware and cooling under real heat.
- Watch for thermal trips, clock throttling, ECC error growth, link flap, and coolant ΔT drift; temperatures must stay stable, not climb.
- Correlate power draw and coolant return temperature against the commissioning baseline; investigate any node that runs hot or draws out of family.
- Gate: no thermal trips, no error growth, stable temperatures and bandwidth through the full soak.
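"Stable, not climbing" can be made measurable by fitting a trend line to each node's temperature over the soak and checking its slope. A minimal sketch, assuming temperature samples are already logged as (hours since start, degrees C) pairs; the function names, the one-hour warm-up exclusion, and the slope limit are hypothetical and would be set by the project acceptance test plan.

```python
def slope_per_hour(samples):
    """Ordinary least-squares slope of (hours, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def is_stable(samples, max_slope=0.05, warmup_hours=1.0):
    """True if the steady-state trend climbs no faster than max_slope (deg C per hour).

    Samples before warmup_hours are ignored so the initial ramp from idle
    to load is not mistaken for drift. Both defaults are assumptions.
    """
    steady = [(t, v) for t, v in samples if t >= warmup_hours]
    return slope_per_hour(steady) <= max_slope


if __name__ == "__main__":
    # Illustrative 24 h traces: one flat, one creeping up by 0.5 deg C per hour.
    flat = [(hour, 70.0) for hour in range(0, 25)]
    creeping = [(hour, 60.0 + 0.5 * hour) for hour in range(0, 25)]
    print(is_stable(flat), is_stable(creeping))
```

The same slope check applies to coolant return temperature or ΔT; a node whose slope stands out from the rest of the fleet is the "out of family" case worth investigating even if it never trips.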
Document & Accept
ATP sign-off
- Record every result against its threshold: per-node baselines, fabric error counts, NCCL bandwidth by scale, and the burn-in trend.
- List any nodes/links replaced during testing and the re-test that cleared them.
- Deliver the acceptance test report with the as-built so the operator inherits a known-good baseline to monitor against.
- Owner sign-off on the ATP — the cluster is accepted into production.
Pass / Fail Criteria
| Test | Pass Criteria |
|---|---|
| GPU enumeration | Full count present, correct SKU/memory, ECC on, zero uncorrectable errors |
| Intra-node BW | All NVLink/NVSwitch pairs near expected line rate, no degraded link |
| Port link rate | 100% of ports up at rated NDR/XDR speed, no fallback to a lower rate |
| Fabric errors | No symbol-error / link-down accumulation across the soak window |
| NCCL bandwidth | Bus bandwidth within expected envelope at full scale, repeatable |
| Thermal burn-in | No trips, no throttling, stable temperatures, and no ECC error growth through the full soak |
| Documentation | Results vs thresholds recorded; ATP report delivered + owner sign-off |
Exact bandwidth and thermal thresholds are GPU/fabric-generation and design specific — the governing document is the project acceptance test plan and OEM spec. This is the field method Leviathan runs on live GB200 and GB300 clusters.
Questions
How do you acceptance-test a GPU cluster before production?
In five stages: (1) single-node validation: GPU enumeration, ECC, and intra-node NVLink bandwidth; (2) fabric health: 100% of ports at rated speed with zero error accumulation; (3) collective performance: nccl-tests all-reduce, all-gather, and reduce-scatter measured against the expected bus-bandwidth envelope at full scale; (4) a thermal burn-in soak under sustained load (commonly 24–72 h), watching for trips, throttling, and ECC error growth; and (5) documentation and owner sign-off on the acceptance test plan. Each stage has a pass gate that must clear before the next stage begins.
What makes a GPU cluster fail acceptance testing?
A link that comes up but accumulates symbol errors, a port that negotiated down to a lower rate, NCCL bandwidth that drops at a scale boundary (a cabling/topology fault), a node running hot or throttling, ECC error growth during burn-in, or a thermal trip under sustained load. Passing once at peak is not acceptance — results must be within range, repeatable, and stable through the full soak.
Who runs GPU cluster acceptance and burn-in testing?
Leviathan Systems runs GPU cluster acceptance testing across the United States — single-node validation, fabric link integrity, NCCL and per-link bandwidth testing against expected thresholds, and thermal burn-in soak — delivering the acceptance test report and a known-good baseline with owner sign-off.