NVIDIA GB300 NVL72 Deployment Guide: Rack Assembly, Cabling, and Commissioning
TL;DR
Complete deployment guide for the NVIDIA GB300 NVL72 rack-scale GPU system, covering site readiness, rack assembly, cooling integration, network cabling, testing, and commissioning.
What Is the NVIDIA GB300 NVL72?
The NVIDIA GB300 NVL72 is the latest rack-scale GPU system built on the Blackwell Ultra architecture. Each rack integrates 36 NVIDIA Grace CPUs and 72 Blackwell Ultra GPUs into a single NVLink domain, delivering over one exaflop of dense AI performance. With 288GB of HBM3e memory per GPU and up to 40TB of fast memory per rack, the GB300 NVL72 is purpose-built for trillion-parameter model training, high-throughput inference, and test-time scaling workloads.
Unlike its predecessor, the GB200 NVL72, the GB300 delivers 1.5x more AI compute FLOPS, 1.5x more GPU memory, and significantly improved attention performance for reasoning-intensive AI applications. These gains come at a cost: the GB300 NVL72 demands 120kW or more per rack, 100% liquid cooling, and infrastructure precision that exceeds anything the data center industry has previously deployed at scale.
Leviathan Systems is currently deploying GB300 NVL72 infrastructure at hyperscale AI training facilities in Texas. This guide reflects direct field experience with the platform, covering the full deployment lifecycle from site readiness through commissioning and handoff.
Site Readiness Requirements
Before any GB300 NVL72 rack arrives on site, the facility must meet a set of non-negotiable infrastructure requirements. Failures in any of these areas will delay deployment by weeks or months.
Power Infrastructure
Each GB300 NVL72 rack draws approximately 120kW under full load. This is equivalent to the power consumption of roughly 80 American homes concentrated in a single 48U cabinet. The facility must provide redundant power feeds capable of sustaining this load continuously, with sufficient headroom for transient spikes during GPU initialization and stress testing.
Power distribution must be planned at the row level, not the rack level. A single row of GB300 racks can draw over a megawatt. Busway distribution systems from vendors like Starline or Legrand are standard for this density class. Traditional PDU-per-rack approaches cannot handle the cable volume or heat dissipation at 120kW.
Uninterruptible power supply systems must be sized for the aggregate GPU load plus cooling infrastructure. A common mistake is sizing UPS for IT load alone while forgetting that liquid cooling pumps, CDUs, and facility water loops also require continuous power.
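As a sanity check, the aggregate UPS target can be sketched as IT load plus a cooling support fraction plus transient headroom. The rack count, 15% cooling overhead, and 25% headroom below are illustrative assumptions, not vendor figures:

```python
def ups_capacity_kw(racks: int, rack_kw: float = 120.0,
                    cooling_overhead: float = 0.15,
                    headroom: float = 1.25) -> float:
    """Size UPS for IT load plus liquid-cooling support load.

    cooling_overhead covers CDU pumps and facility loop pumps that must
    ride through a utility event alongside the IT load; headroom covers
    transient spikes during GPU initialization. Both fractions are
    illustrative assumptions -- substitute site-specific figures.
    """
    it_load = racks * rack_kw
    support_load = it_load * cooling_overhead
    return (it_load + support_load) * headroom

# A ten-rack row: 1.2 MW of IT load alone, before pumps and headroom.
print(ups_capacity_kw(10))  # 1725.0 kW
```

The point of the exercise is the gap between the two numbers: sizing UPS for 1,200 kW of IT load when the row actually needs over 1,700 kW of protected capacity is exactly the mistake described above.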
Liquid Cooling Infrastructure
The GB300 NVL72 is a 100% liquid-cooled system. Air cooling is not an option. Every rack requires connection to a facility chilled water loop through one or more Coolant Distribution Units (CDUs).
CDU capacity must be matched to the rack thermal load. A single GB300 rack at 120kW generates approximately 409,000 BTU/hr of heat. The CDU must dissipate this heat to the facility water loop with sufficient margin for ambient temperature variation and partial CDU failure scenarios.
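The thermal figure above follows directly from the kW-to-BTU/hr conversion (1 kW ≈ 3,412.14 BTU/hr). A minimal sketch, with an assumed 20% CDU sizing margin standing in for whatever margin the CDU vendor actually specifies:

```python
KW_TO_BTU_HR = 3412.14  # 1 kW = 3412.14 BTU/hr

def rack_heat_btu_hr(rack_kw: float) -> float:
    """Heat a fully loaded rack rejects to the CDU, in BTU/hr."""
    return rack_kw * KW_TO_BTU_HR

def cdu_capacity_btu_hr(rack_kw: float, margin: float = 1.2) -> float:
    """CDU capacity with margin for ambient swing and partial-failure
    scenarios. The 20% margin is an illustrative assumption, not a
    vendor specification."""
    return rack_heat_btu_hr(rack_kw) * margin

print(round(rack_heat_btu_hr(120)))  # 409457 -- the ~409,000 BTU/hr above
```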
The facility water loop itself must deliver chilled water at the correct temperature and flow rate. Most GB300 deployments target supply water temperatures between 30°C and 40°C, depending on the CDU configuration and ambient conditions. Water quality requirements include filtration to 50 microns or finer, corrosion inhibitor treatment, and conductivity monitoring.
Rack-level manifold routing connects the CDU to individual compute trays and NVLink switch trays within the rack. Quick-disconnect fittings are used at every connection point to enable tray-level serviceability without draining the entire cooling loop. Leak detection sensors must be installed at every fitting, manifold junction, and CDU connection point, with automated alerts to the building management system.
Floor Loading and Physical Space
A fully loaded GB300 NVL72 rack weighs approximately 3,000 pounds or more, depending on configuration. The data center floor must support this concentrated load without deflection. Raised floor installations require reinforced pedestals and stringers rated for the point load. Slab-on-grade installations must verify concrete thickness and rebar density.
The NVIDIA MGX rack form factor is 48U and follows OCP standards. However, the rack is wider and deeper than standard 19-inch cabinets. Verify aisle width, overhead clearance, and door swing clearance before scheduling delivery. The rack ships as a complete unit and cannot be disassembled for tight spaces.
Network Infrastructure
Each GB300 NVL72 rack requires high-bandwidth network connectivity for both the NVLink fabric (intra-rack) and the inter-rack cluster network. The fifth-generation NVLink provides 130TB/s of aggregate bandwidth within the rack. For inter-rack communication, the system uses either NVIDIA Quantum-X800 InfiniBand or NVIDIA Spectrum-X Ethernet with ConnectX-8 SuperNICs.
Top-of-rack switching must be provisioned before rack delivery. Cable pathways from the rack to the aggregation layer must be in place, with sufficient fiber count for the required port density. OM4 or OM5 multimode fiber handles most intra-row connections. OS2 single-mode fiber is required for runs exceeding 100 meters to aggregation and spine switches.
Rack Assembly Process
Receiving and Staging
GB300 NVL72 racks typically arrive from the integrator (Dell, ASUS, Supermicro, or others) as pre-assembled units. The rack includes compute trays, NVLink switch trays, power distribution, and manifold routing already installed. However, "pre-assembled" does not mean "ready to deploy."
Upon receipt, every rack must undergo a physical inspection before positioning. Check for shipping damage to manifold fittings, verify that all tray latches are secure, confirm cable management arms are intact, and inspect power connectors for bent pins. Document any damage with photographs before moving the rack from the staging area.
Staging area requirements include a clean, temperature-controlled space with sufficient floor loading capacity and room to access all four sides of the rack. The staging area should be adjacent to the deployment row to minimize transport distance and risk.
Rack Positioning and Anchoring
Position the rack in its final location using properly rated floor tiles or slab anchors. Level the rack with its adjustable leveling feet, verifying plumb with a spirit level on both axes. Secure the rack to the floor with seismic brackets if the deployment is in a seismic zone (California, parts of the Pacific Northwest, and other regions).
Verify clearance to adjacent racks, ensuring sufficient space for side panel removal, cable management, and tray extraction for servicing. The rear of the rack must have unobstructed access to cooling manifold connections and rear-mounted network ports.
Cooling System Connection
Connect the rack manifold to the CDU using pre-measured hose assemblies with quick-disconnect fittings. Follow the manufacturer's torque specifications for all fitting connections. Improper torque is the leading cause of slow leaks in liquid-cooled deployments.
After connection, perform a pressure test on the cooling loop before introducing coolant. Pressurize to the manufacturer's specified test pressure (typically 1.5x operating pressure) and hold for a minimum of 30 minutes while monitoring for pressure drop. Any pressure drop exceeding the specified tolerance indicates a leak that must be located and repaired before proceeding.
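The hold-test pass/fail logic can be expressed as a short check over logged gauge readings. The 0.05 bar tolerance and one-minute sample interval below are illustrative assumptions; use the rack manufacturer's specified values:

```python
def pressure_test_passes(readings_bar: list[float],
                         sample_interval_min: float = 1.0,
                         min_hold_min: float = 30.0,
                         max_drop_bar: float = 0.05) -> bool:
    """Evaluate a cooling-loop pressure hold test.

    readings_bar: gauge samples logged at a fixed interval during the
    hold. The test passes only if the hold ran its full duration and
    the total pressure decay stayed within tolerance. The 0.05 bar
    tolerance is an illustrative assumption, not a vendor spec.
    """
    hold_min = (len(readings_bar) - 1) * sample_interval_min
    if hold_min < min_hold_min:
        return False  # hold was cut short -- result is not valid
    drop = readings_bar[0] - min(readings_bar)
    return drop <= max_drop_bar

# 31 one-minute samples with a 0.02 bar total decay: within tolerance.
stable = [4.50 - 0.02 * i / 30 for i in range(31)]
print(pressure_test_passes(stable))  # True
```

A slow leak shows up as a steady decay that exceeds the tolerance over the hold, which is why the check compares against the full 30-minute window rather than consecutive samples.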
Once the pressure test passes, fill the cooling loop with the specified coolant mixture, bleed all air from the system, and verify flow rates at every manifold branch. Air trapped in cooling lines is the second most common cause of deployment delays. Bleed valves at high points in the manifold routing must be operated systematically until bubble-free flow is confirmed.
Power Connection
Connect the rack power feeds to the busway or PDU infrastructure. Verify phase rotation, voltage levels, and neutral-ground bonding before energizing. A phase rotation meter should be used to confirm correct phase sequence at the rack input.
Energize the rack power in stages: first the management controllers and BMC, then the networking components, then the compute trays. Monitor current draw at each stage and compare against expected values from the integrator's documentation. Abnormal current draw at any stage indicates a potential hardware issue that must be investigated before proceeding.
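The staged energization above can be sketched as a checklist that halts on abnormal draw. The expected amperages and 10% tolerance below are placeholders for the values in the integrator's documentation:

```python
# Expected current draw per energization stage, in amps.
# These values are illustrative assumptions -- use the integrator's figures.
STAGES = [
    ("management/BMC", 2.0),
    ("networking", 15.0),
    ("compute trays", 160.0),
]

def check_stage(stage: str, measured_amps: float,
                tolerance: float = 0.10) -> bool:
    """Compare measured draw against the documented expectation.
    Raises on an out-of-tolerance reading, halting the power-on
    sequence until the anomaly is investigated."""
    expected = dict(STAGES)[stage]
    if abs(measured_amps - expected) > expected * tolerance:
        raise RuntimeError(
            f"{stage}: measured {measured_amps} A, expected ~{expected} A "
            "-- investigate before energizing the next stage")
    return True

print(check_stage("management/BMC", 2.1))  # True -- within 10%
```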
NVLink Fabric Verification
The NVLink fabric within the GB300 NVL72 rack consists of nine NVLink switch trays connecting all 72 GPUs. The cables connecting compute trays to NVLink switch trays are typically pre-installed at the factory, but every connection must be verified on site.
Use NVIDIA's diagnostic tools to validate NVLink topology. Every GPU must be visible within the NVLink domain, and bandwidth tests must confirm full link width on every connection. A single degraded NVLink lane will reduce training performance across the entire rack and must be remediated before commissioning.
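A topology check along these lines can be run over link status parsed from NVIDIA's tooling (for example, `nvidia-smi nvlink --status`). The sketch below assumes 18 NVLink 5 links per Blackwell GPU and treats any absent GPU, down link, or partial-width link as a blocking fault:

```python
EXPECTED_GPUS = 72
EXPECTED_LINKS_PER_GPU = 18  # NVLink 5: 18 links per Blackwell GPU

def validate_fabric(link_report: dict[int, list[bool]]) -> list[str]:
    """link_report maps GPU index -> per-link 'up at full width' flags,
    as parsed from NVIDIA tooling. Returns a list of faults; an empty
    list means the NVLink domain is clean and ready for commissioning."""
    faults = []
    for gpu in range(EXPECTED_GPUS):
        links = link_report.get(gpu)
        if links is None:
            faults.append(f"GPU {gpu}: missing from NVLink domain")
            continue
        down = [i for i, ok in enumerate(links) if not ok]
        if len(links) != EXPECTED_LINKS_PER_GPU or down:
            faults.append(f"GPU {gpu}: degraded links {down}")
    return faults

healthy = {g: [True] * EXPECTED_LINKS_PER_GPU for g in range(EXPECTED_GPUS)}
print(validate_fabric(healthy))  # []
```

The all-or-nothing return reflects the policy in the text: a single degraded lane anywhere in the domain blocks handoff.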
Network Cabling
Fiber Optic Infrastructure
High-density fiber is the backbone of every GB300 deployment. The cabling infrastructure must support the aggregate bandwidth requirements of the cluster while maintaining signal integrity across hundreds or thousands of connections.
For intra-row connections (rack to top-of-rack switch, rack to adjacent racks), OM4 multimode fiber with MPO/MTP connectors is standard. OM4 supports 100Gbps-per-lane signaling at distances up to roughly 100 meters, which is sufficient for most data hall layouts. OM5 multimode fiber offers additional headroom for short-wavelength division multiplexing (SWDM) applications but is not required for most GB300 deployments.
For runs to aggregation or spine switches, OS2 single-mode fiber provides virtually unlimited distance within a data center campus. Single-mode is also required for wavelength-division multiplexing (WDM) applications used in some large-scale InfiniBand fabrics.
MPO/MTP trunk cables are used for high-density connections between patch panels. Trunk cables are typically pre-terminated at the factory and must be tested for insertion loss and return loss upon receipt. Do not assume factory-terminated cables meet specification. We have encountered insertion loss failures on approximately 3-5% of factory-terminated MPO trunks, which would cause link errors in production if not caught during testing.
Direct Attach Copper (DAC) cables are used for short connections within the rack, typically between the compute tray network ports and the top-of-rack switch. Active Optical Cables (AOC) and Active Electrical Cables (AEC) serve intermediate distances where DAC cannot reach but fiber is unnecessary.
Cable Management
At 120kW per rack with hundreds of connections per cabinet, cable management is not cosmetic — it is functional. Poorly managed cables obstruct airflow paths, prevent tray removal for servicing, and create fire hazards. Every cable must follow a defined pathway, be properly supported with cable management arms or trays, and have sufficient bend radius to avoid signal degradation.
Label every cable at both ends with a machine-readable label (barcode or QR code) that maps to the cable management database. After deployment, the ability to trace any connection from port to port in under 60 seconds is a commissioning requirement, not a nice-to-have.
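The port-to-port trace requirement reduces to a fast lookup keyed by the scanned label. A minimal sketch of such a database, using hypothetical label IDs and port names purely for illustration:

```python
# Minimal cable database keyed by label ID, assuming labels like
# "CAB-00413" printed as QR codes at both ends of every cable.
# All IDs and port names here are hypothetical.
cable_db = {
    "CAB-00413": {
        "a_end": "rack R12 / tray 3 / port 1",
        "b_end": "TOR-R12 / port 17",
        "type": "MPO-12 OM4 trunk",
    },
}

def trace(label: str) -> str:
    """Resolve a scanned label to its full port-to-port path."""
    rec = cable_db[label]
    return f"{rec['a_end']} <-> {rec['b_end']} ({rec['type']})"

print(trace("CAB-00413"))
```

Scanning either end of the cable yields the same record, which is what makes the 60-second trace achievable regardless of which end a technician finds first.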
Testing and Commissioning
Cable Certification
Every fiber connection must be tested before the rack is placed into production. Testing is not optional and cannot be deferred to "after we see if there are problems."
OTDR (Optical Time Domain Reflectometer) testing provides a complete characterization of each fiber link, identifying connector losses, splice losses, macro bends, and fiber breaks along the entire path. OTDR testing catches problems that simple power meter tests miss, including marginal connectors that pass at low data rates but fail under high-speed signaling.
Insertion loss testing using a calibrated light source and power meter verifies that the total optical loss of each link falls within the transceiver's optical budget. For OM4 fiber at 850nm, typical maximum insertion loss budgets range from 1.5 dB to 3.0 dB depending on link length and connector count.
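A worst-case budget for a given link can be composed from standard component values, roughly 3.0 dB/km of fiber attenuation for OM4 at 850nm and 0.75 dB per mated connector pair per TIA-568; treat both as nominal figures and defer to the transceiver's published optical budget:

```python
# Nominal TIA-568 component values for OM4 at 850 nm.
FIBER_DB_PER_KM = 3.0       # fiber attenuation
CONNECTOR_PAIR_DB = 0.75    # max loss per mated connector pair

def link_loss_budget_db(length_m: float, connector_pairs: int) -> float:
    """Worst-case insertion loss a measured link must not exceed."""
    return (length_m / 1000.0) * FIBER_DB_PER_KM \
        + connector_pairs * CONNECTOR_PAIR_DB

# A 100 m OM4 run with a mated pair at the patch panel on each end:
print(link_loss_budget_db(100, 2))  # 1.8 dB
```

A measured insertion loss above the computed budget fails the link even if the light still carries traffic at commissioning time, since margin erodes as connectors age and accumulate contamination.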
Return loss testing measures the amount of light reflected back toward the source at each connector interface. Poor return loss indicates contaminated or damaged connector end faces and causes bit errors at high data rates. Minimum return loss requirements are typically 20 dB for multimode and 26 dB for single-mode connections.
Copper cables (DAC, Cat6A management network) must be certified using a cable certifier that tests to the appropriate standard (ANSI/TIA-568.2-D for Cat6A). Certification includes testing for insertion loss, NEXT, PSNEXT, return loss, propagation delay, and delay skew.
GPU Validation
After all physical infrastructure is connected and tested, the GPU subsystem must be validated. This includes NVIDIA's built-in POST (Power-On Self-Test) verification, which checks GPU memory, compute units, NVLink connectivity, and thermal sensor functionality.
Following POST, run NVIDIA's Data Center GPU Manager (DCGM) diagnostics to perform extended health checks on every GPU. DCGM tests include memory stress tests, compute stress tests, PCIe bandwidth verification, and NVLink bandwidth verification.
Finally, run a sustained GPU stress test (typically NCCL all-reduce benchmarks) across all 72 GPUs in the rack for a minimum of 24 hours. This burn-in test identifies thermal throttling, intermittent hardware failures, and cooling system inadequacies that shorter tests miss. Any GPU that fails to maintain expected performance during the burn-in must be replaced before the rack is handed off to the customer.
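A post-burn-in screen along these lines can flag GPUs whose sustained bandwidth sagged below baseline during the run. The per-GPU bandwidth figures and the 5% threshold below are illustrative assumptions; the baseline should come from the integrator's reference data for the platform:

```python
def flag_underperformers(busbw_gbps: dict[int, float],
                         baseline_gbps: float,
                         tolerance: float = 0.05) -> list[int]:
    """Flag GPUs whose sustained bus bandwidth (e.g. per-GPU figures
    from pairwise or subgroup NCCL runs) fell more than `tolerance`
    below baseline -- a typical symptom of thermal throttling or a
    marginal part. The 5% threshold is an illustrative assumption."""
    floor = baseline_gbps * (1.0 - tolerance)
    return sorted(g for g, bw in busbw_gbps.items() if bw < floor)

results = {g: 370.0 for g in range(72)}
results[41] = 290.0  # one GPU sagging under sustained load
print(flag_underperformers(results, baseline_gbps=370.0))  # [41]
```

Any GPU the screen flags is a replacement candidate under the policy above, not a tuning candidate.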
Documentation and Handoff
The deliverable at commissioning is not just a working rack — it is a complete documentation package. This package includes cable maps showing every connection from source to destination, test reports for every fiber and copper link, GPU validation reports from DCGM, thermal performance data from the burn-in test, and as-built drawings showing the final physical configuration.
The documentation package becomes the baseline for ongoing operations. Without it, troubleshooting future issues requires re-testing every connection from scratch, which can take days on a single GB300 rack.
Common Deployment Challenges
Cooling Loop Air Entrapment
Air trapped in liquid cooling loops is the single most common cause of GB300 deployment delays. Symptoms include inconsistent GPU temperatures, thermal throttling under load, and intermittent thermal alarms. The solution is systematic bleeding of every branch of the manifold, starting from the lowest point and working upward. In some installations, this process must be repeated multiple times over several days as dissolved air comes out of solution during initial thermal cycling.
Fiber Contamination
Contaminated fiber end faces cause more link failures than any other single factor. At the data rates used in GB300 deployments (100Gbps, 200Gbps, 400Gbps per lane), even microscopic contamination on a connector end face causes bit errors. Every connector must be inspected with a fiber microscope and cleaned with appropriate tools before mating. "Clean and inspect every time" is not a suggestion — it is a deployment standard.
Power Quality Issues
GPU power supplies are sensitive to power quality. Voltage sags, harmonic distortion, and transient events can cause GPU resets, training job failures, and in severe cases, hardware damage. Power quality monitoring should begin before the first rack is energized and continue throughout the deployment. Any power quality event that falls outside the GPU manufacturer's specification must be investigated and resolved at the facility level.
NVLink Cable Seating
NVLink cables within the GB300 rack use high-density connectors that require precise insertion force and alignment. Under-seated connectors pass initial continuity tests but fail under thermal expansion as the rack reaches operating temperature. Every NVLink connection must be verified for proper seating using the manufacturer's insertion force gauge and confirmed with bandwidth testing under load.
Planning Your GB300 Deployment
The GB300 NVL72 is not a plug-and-play system. Successful deployment requires months of site preparation, coordinated procurement of power, cooling, and network infrastructure, and a deployment team with direct experience on the platform. Shortcuts in any phase compound downstream, turning a planned two-week deployment into a two-month remediation.
Leviathan Systems provides end-to-end GB300 NVL72 deployment services, from initial site assessment through commissioning and handoff. Our team is currently deploying GB300 infrastructure at hyperscale facilities in Texas, giving us direct, current experience with every aspect of the platform.
Contact us to discuss your GB300 deployment requirements. We respond within 48 hours with a scope assessment and timeline estimate.