NVLink Architecture Explained: GPU Interconnect from DGX to NVL72 Rack-Scale Systems
TL;DR
NVLink is NVIDIA's proprietary high-bandwidth interconnect for direct GPU-to-GPU communication. In the GB200/GB300 NVL72, NVLink 5.0 links all 72 GPUs across 18 compute trays into a single NVLink domain with 130TB/s of aggregate bandwidth.
What NVLink Is and Why It Matters
NVLink is NVIDIA's proprietary high-bandwidth interconnect that links GPUs directly to each other, bypassing the PCIe bus that traditionally sits between the CPU and GPU subsystems. In AI training workloads, GPUs must exchange billions of gradient values every training step through collective operations like all-reduce. The speed of this exchange directly determines training throughput.
PCIe Gen5 provides approximately 64GB/s per direction on an x16 slot (roughly 128GB/s bidirectional). NVLink 4.0 (Hopper generation) provides 900GB/s of bidirectional bandwidth per GPU in an 8-GPU system. NVLink 5.0 (Blackwell generation) provides 1,800GB/s per GPU, with rack-scale systems achieving 130TB/s aggregate bandwidth across 72 GPUs. Comparing bidirectional totals, that is roughly a 14x bandwidth advantage over PCIe for GPU-to-GPU communication: the difference between a training job completing in days versus weeks.
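The arithmetic behind these figures is easy to check. A minimal sketch, using NVIDIA's published link counts and per-link rates (peak numbers, not measured throughput):

```python
# Per-GPU NVLink bandwidth by generation: link count x per-link bidirectional rate.
# All values are NVIDIA's published peak figures, not measured throughput.
PCIE_GEN5_X16 = 128  # GB/s bidirectional (~64 GB/s per direction)

generations = {
    "NVLink 3.0 (A100)":      (12, 50),   # 12 links x 50 GB/s  = 600 GB/s
    "NVLink 4.0 (H100)":      (18, 50),   # 18 links x 50 GB/s  = 900 GB/s
    "NVLink 5.0 (Blackwell)": (18, 100),  # 18 links x 100 GB/s = 1,800 GB/s
}

for name, (links, per_link) in generations.items():
    total = links * per_link
    print(f"{name}: {total} GB/s per GPU ({total / PCIE_GEN5_X16:.0f}x PCIe Gen5 x16)")
```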
For deployment teams, NVLink determines the physical cabling architecture inside GPU servers and rack-scale systems, the failure modes that must be tested during commissioning, and the troubleshooting methodology when training performance falls below expectations.
NVLink Generations
NVLink 3.0 (Ampere: A100)
NVLink 3.0 connects 8 A100 GPUs on an HGX A100 baseboard, with 12 NVLink links per GPU. Each link provides 25GB/s per direction, giving each GPU 600GB/s of total bidirectional NVLink bandwidth (12 links × 25GB/s × 2 directions).
The A100 NVLink topology uses NVSwitch to create a fully connected mesh where every GPU can communicate directly with every other GPU at full bandwidth. The DGX A100 system contains 6 NVSwitch chips that implement this all-to-all connectivity.
From a physical perspective, NVLink 3.0 connections in the DGX A100 are made through a baseboard PCB. There are no user-serviceable NVLink cables — the NVLink topology is fixed at the board level. Deployment teams interact with NVLink 3.0 only through diagnostic tools that verify link health and bandwidth.
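In practice that verification can be done with `nvidia-smi nvlink --status` or programmatically through NVML. A minimal sketch, assuming the nvidia-ml-py (pynvml) package, that checks every A100 reports its expected 12 active links:

```python
# Sketch: verify that every A100 reports its expected 12 active NVLink links.
# Assumes the nvidia-ml-py (pynvml) package; roughly equivalent to running
# `nvidia-smi nvlink --status` and counting active links per GPU.
import pynvml

EXPECTED_LINKS = 12  # NVLink 3.0: 12 links per A100

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
            except pynvml.NVMLError:
                break  # link index beyond what this GPU exposes
            if state == pynvml.NVML_FEATURE_ENABLED:
                active += 1
        status = "OK" if active == EXPECTED_LINKS else "DEGRADED"
        print(f"GPU {i}: {active}/{EXPECTED_LINKS} NVLink links active [{status}]")
finally:
    pynvml.nvmlShutdown()
```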
NVLink 4.0 (Hopper: H100, H200)
NVLink 4.0 in the HGX H100 keeps the per-link bandwidth at 50GB/s bidirectional but raises the link count from 12 to 18 per GPU, providing 900GB/s of aggregate NVLink bandwidth per GPU. The HGX H100 baseboard connects 8 GPUs through 4 third-generation NVSwitch chips.
Like the A100 generation, NVLink 4.0 connections in standard HGX H100 systems are baseboard-level interconnects with no external cabling. The DGX H100 is an 8U system built around a single HGX H100 baseboard carrying 8 GPUs.
NVLink 4.0 introduced the ability to extend NVLink across multiple nodes through NVLink Switch systems, but this capability was not widely deployed in the H100 generation. Most H100 deployments use NVLink within a single server and InfiniBand or Ethernet for inter-server communication.
NVLink 5.0 (Blackwell: GB200, GB300)
NVLink 5.0 is the transformative generation that takes NVLink from an intra-server interconnect to a rack-scale fabric. In the GB200 NVL72 and GB300 NVL72, NVLink 5.0 connects all 72 GPUs across 18 compute trays through 9 NVLink switch trays, creating a single NVLink domain with 130TB/s of aggregate bandwidth.
This is the first NVLink generation with external cabling that deployment teams must install, verify, and maintain. Each compute tray connects to multiple NVLink switch trays through high-density cables that run within the rack. The cables use proprietary connectors designed for the specific signal integrity requirements of NVLink 5.0.
The physical NVLink cabling in a GB200/GB300 rack is pre-installed by the system integrator (Dell, ASUS, Supermicro), but verification at the deployment site is mandatory. Shipping, handling, and rack positioning can dislodge or partially unseat NVLink connectors, and a single degraded NVLink lane impacts performance across the entire 72-GPU domain.
NVLink Topologies
All-to-All (Fully Connected Mesh)
In the DGX/HGX baseboard topology, every GPU can communicate directly with every other GPU at equal bandwidth. This is implemented through NVSwitch chips that provide non-blocking crossbar switching between all NVLink ports.
The all-to-all topology is optimal for training workloads that use collective operations (all-reduce, all-gather) where every GPU must exchange data with every other GPU. No GPU pair has a bandwidth disadvantage, which simplifies workload placement and eliminates topology-aware scheduling requirements.
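This property is straightforward to confirm during bring-up: `nvidia-smi topo -m` prints the connectivity matrix, and on an NVSwitch baseboard every GPU-to-GPU cell should read NV# (NVLink) rather than a PCIe path. A crude automated check follows (a sketch; the parsing is deliberately naive and also sweeps up NIC columns, so treat its warnings as prompts for manual review):

```python
# Sketch: confirm all-to-all NVLink connectivity via `nvidia-smi topo -m`.
# On an NVSwitch baseboard every GPU-to-GPU cell should read NV# (NVLink);
# SYS/NODE/PHB/PXB/PIX entries indicate PCIe or CPU paths.
import subprocess

out = subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout
print(out)

for line in out.splitlines():
    if not line.startswith("GPU"):
        continue
    cells = line.split()
    # Naive scan: flags any PCIe-class entry in the row, including NIC
    # columns, so a hit here means "inspect by hand", not "confirmed fault".
    suspect = [c for c in cells[1:] if c in ("SYS", "NODE", "PHB", "PXB", "PIX")]
    if suspect:
        print(f"review {cells[0]}: non-NVLink paths reported: {suspect}")
```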
Rack-Scale NVLink Domain (NVL72)
The GB200/GB300 NVL72 extends the all-to-all topology to 72 GPUs across 18 compute trays. The 9 NVLink switch trays implement the crossbar function at rack scale, providing non-blocking, full-bandwidth paths between any pair of GPUs in the rack.
This rack-scale NVLink domain is architecturally equivalent to the baseboard-level NVLink in a DGX system, but at 9x the GPU count. From the software perspective, CUDA sees all 72 GPUs as a single NUMA-like domain with uniform NVLink bandwidth between any GPU pair.
The physical implication is that the NVLink switch trays are single points of failure for the rack-scale fabric. A failed NVLink switch tray reduces the available bandwidth for all 72 GPUs, not just the GPUs directly connected to that tray. NVLink switch tray health monitoring is a critical operational function for NVL72 systems.
Multi-Rack NVLink (NVLink Spine)
For clusters that require NVLink bandwidth between racks (as opposed to using InfiniBand or Ethernet for inter-rack communication), NVIDIA offers NVLink spine switches that extend the NVLink domain across multiple NVL72 racks. This creates super-pods of 576 GPUs (8 racks × 72 GPUs) or larger, all connected by NVLink.
Multi-rack NVLink requires external NVLink cables between racks, routed through overhead cable trays or under-floor pathways. These cables must meet the same signal integrity requirements as intra-rack NVLink cables, with additional constraints on maximum cable length (typically 2-5 meters depending on the cable type).
Deploying multi-rack NVLink adds significant cabling complexity. Each NVL72 rack has multiple NVLink spine ports that must be connected to the spine switch, and the cable count scales with the number of racks in the NVLink domain. Leviathan manages this complexity through careful pre-planning of cable pathways, pre-labeling of all cables, and systematic verification of every connection.
NVLink Deployment Considerations
Connector Seating
NVLink connectors in the GB200/GB300 NVL72 use high-density, precision-aligned connectors that require specific insertion force for proper electrical contact. Under-seated connectors are the most common NVLink failure mode that Leviathan encounters in the field.
Symptoms of under-seated NVLink connectors include intermittent link errors visible in DCGM or nvidia-smi, reduced NVLink bandwidth on specific GPU pairs, training job hangs or crashes during collective operations, and NVLink link retraining events logged in the system event log.
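The counters behind these symptoms can be read programmatically through NVML. A minimal sketch, assuming the nvidia-ml-py (pynvml) package and NVML's NvLink error-counter API:

```python
# Sketch: dump non-zero per-link NVLink error counters via NVML (pynvml).
# Replay and recovery counts correspond to the retransmission and retraining
# symptoms described above; CRC counts indicate corrupted flits/data.
import pynvml

COUNTERS = {
    "replay":   pynvml.NVML_NVLINK_ERROR_DL_REPLAY,
    "recovery": pynvml.NVML_NVLINK_ERROR_DL_RECOVERY,
    "crc_flit": pynvml.NVML_NVLINK_ERROR_DL_CRC_FLIT,
    "crc_data": pynvml.NVML_NVLINK_ERROR_DL_CRC_DATA,
}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                counts = {name: pynvml.nvmlDeviceGetNvLinkErrorCounter(handle, link, ctr)
                          for name, ctr in COUNTERS.items()}
            except pynvml.NVMLError:
                continue  # link not present on this GPU
            if any(counts.values()):
                print(f"GPU {i} link {link}: {counts}")
finally:
    pynvml.nvmlShutdown()
```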
The diagnostic challenge is that under-seated connectors may appear fully functional during cold testing (room temperature, no load) and only fail when the rack reaches operating temperature and thermal expansion changes the connector gap by fractions of a millimeter. This is why burn-in testing under sustained full load is essential for NVLink validation.
Prevention requires verifying connector seating using the manufacturer's insertion force gauge on every NVLink connection, not just visual inspection. Visual inspection cannot distinguish a connector that is 95% seated (which may fail under thermal cycling) from one that is 100% seated.
NVLink Lane Degradation
Each NVLink link consists of multiple lanes. A partially degraded link (some lanes functional, some failed) will operate at reduced bandwidth rather than failing completely. This degraded state is insidious because the system continues to function and training jobs complete, but at reduced throughput.
DCGM and nvidia-smi report NVLink link width for every GPU pair. During commissioning, every NVLink link must show full link width. Any link operating at reduced width indicates a connector issue, cable defect, or component failure that must be resolved before the rack enters production.
Ongoing monitoring of NVLink link width should be integrated into the cluster management system. A link that degrades during production operation indicates a developing hardware problem that will worsen over time.
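A minimal watchdog along these lines, again assuming pynvml (this sketch counts active links per GPU rather than per-lane width, so it catches links that drop entirely; DCGM's diagnostics cover finer-grained lane degradation, and the `alert()` hook is a placeholder):

```python
# Sketch: periodic watchdog that alerts when a GPU's active NVLink link
# count drops below the baseline recorded at commissioning.
import time
import pynvml

def active_links(handle):
    """Count NVLink links currently in the enabled state."""
    n = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                n += 1
        except pynvml.NVMLError:
            break
    return n

def alert(msg):
    print("ALERT:", msg)  # placeholder: wire into your alerting pipeline

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
baseline = [active_links(h) for h in handles]  # captured at commissioning

while True:
    for i, h in enumerate(handles):
        now = active_links(h)
        if now < baseline[i]:
            alert(f"GPU {i}: active NVLink links dropped {baseline[i]} -> {now}")
    time.sleep(60)
```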
Thermal Effects on NVLink Signal Integrity
NVLink signaling operates at extremely high data rates where signal integrity margins are tight. Temperature changes affect connector contact resistance, cable impedance, and transceiver performance. The NVLink system is designed to operate within specification across the GPU's rated temperature range, but operation near the thermal limits reduces margin.
Ensuring consistent cooling across all NVLink switch trays and compute trays is essential for reliable NVLink operation. Hot spots caused by uneven coolant flow (in liquid-cooled systems) or blocked airflow (in air-cooled components within liquid-cooled racks) can push NVLink signals out of specification on the affected trays while the rest of the rack operates normally.
NVLink Error Monitoring
NVIDIA provides several tools for monitoring NVLink health in production. nvidia-smi reports NVLink bandwidth, link state, and error counters for every GPU. DCGM provides extended diagnostics including NVLink replay counters (indicating retransmissions due to errors), NVLink recovery counters (indicating link retraining events), and NVLink CRC error counts (indicating data corruption).
Any non-zero NVLink error counter in production warrants investigation. Unlike PCIe, which tolerates some level of correctable errors, NVLink errors indicate a hardware or signal integrity issue that will degrade training performance and may worsen over time. The remediation is typically reseating or replacing the affected NVLink cable, cleaning connectors, or replacing a compute tray or NVLink switch tray.
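For commissioning and fleet automation, these checks can be wrapped around DCGM's own diagnostic runner. A sketch; the run level and the pass/fail parsing are illustrative, not a recommended policy:

```python
# Sketch: gate rack release on a DCGM diagnostic run. `dcgmi diag -r 2`
# runs DCGM's medium diagnostic level; parsing here is deliberately crude.
import subprocess

result = subprocess.run(["dcgmi", "diag", "-r", "2"],
                        capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0 or "Fail" in result.stdout:
    raise SystemExit("DCGM diagnostics reported a failure; hold the rack")
```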
NVLink vs. PCIe: When It Matters
NVLink provides its largest advantage for workloads that require frequent, high-volume GPU-to-GPU communication. Large-scale model training with data parallelism and model parallelism benefits enormously from NVLink because every training step requires all-reduce operations that exchange gradient data across all GPUs. The difference between NVLink bandwidth (900GB/s to 1,800GB/s per GPU) and PCIe bandwidth (64GB/s) translates directly to training throughput.
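A back-of-envelope calculation shows the mechanism. In a ring all-reduce each GPU sends and receives roughly 2(N-1)/N times the gradient payload per step; the sketch below uses illustrative numbers (fp16 gradients of a 70B-parameter model) and ignores latency, overlap, and protocol overhead:

```python
# Back-of-envelope: ring all-reduce moves ~2*(N-1)/N * S bytes per GPU,
# where S is the gradient payload. Illustrative only; not a performance model.
N = 8                 # GPUs in the NVLink domain
S = 70e9 * 2          # 70B parameters in fp16 -> ~140 GB of gradients
traffic = 2 * (N - 1) / N * S  # bytes sent (and received) per GPU per step

# Per-direction bandwidths, since sends and receives overlap on full-duplex links.
for name, bw in [("PCIe Gen5 x16", 64e9),
                 ("NVLink 4.0", 450e9),    # 900 GB/s bidirectional / 2
                 ("NVLink 5.0", 900e9)]:   # 1,800 GB/s bidirectional / 2
    print(f"{name}: ~{traffic / bw:.2f} s per all-reduce")
```

At thousands of steps per day, that per-step gap compounds into the days-versus-weeks difference described earlier.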
For inference workloads that process independent requests on individual GPUs, NVLink provides less benefit because there is minimal GPU-to-GPU communication. A single GPU processes each inference request independently, and the bottleneck is typically memory bandwidth (for loading model weights) rather than inter-GPU bandwidth.
However, for inference of very large models that are sharded across multiple GPUs (tensor parallelism), NVLink bandwidth determines the latency of each inference request. As model sizes continue to grow and test-time scaling (longer reasoning chains) becomes more common, NVLink bandwidth becomes increasingly important for inference as well.
NVLink Integration Services
Leviathan Systems verifies NVLink connectivity on every GPU rack we deploy, from single DGX systems to multi-rack NVL72 clusters. Our commissioning process includes physical inspection of all NVLink connectors (with insertion force verification on rack-scale systems), DCGM NVLink diagnostic testing on every GPU, NVLink bandwidth benchmarking across all GPU pairs, 24-72 hour burn-in testing with NVLink error monitoring, and documentation of NVLink topology and test results.
For GB200/GB300 NVL72 deployments, we also verify intra-rack NVLink cabling between compute trays and switch trays, and for multi-rack NVLink configurations, we install and certify the inter-rack NVLink spine cabling.
Contact us to discuss your GPU deployment.