Putting GPUDirect RDMA to Work in AI Training

As distributed AI training scales, the gap between raw compute and communication efficiency keeps widening. Once a cluster reaches thousands of GPUs and petaFLOP-class throughput, the limits of traditional data-transfer architectures become hard to ignore. Shuttling data between CPU memory and GPU memory burns CPU cycles and inflates communication latency, and both drag down the efficiency of the entire training job. GPUDirect RDMA (GDR) was designed to remove that bottleneck, and it has since become a core acceleration technology for high-end training clusters.

This article walks through GDR end to end: the underlying mechanism, why it fits AI training so well, how to deploy it, how to tune it, and what the results look like in practice — so your team can get the most out of a GPU cluster.

I. How GDR Works: Removing the CPU From the Data Path

GPUDirect RDMA is a high-speed data-transfer technology from NVIDIA. The idea is simple: let network adapters talk directly to GPU memory (or to other GPUs) and skip CPU forwarding entirely. With hardware-level protocol support, the NIC accesses GPU memory directly, the CPU drops out of the transfer path, and you get lower latency and higher effective bandwidth.

1. What the traditional path costs you

Without GDR, cross-node GPU communication — or any data exchange between GPUs and storage — has to relay through CPU memory:

GPU memory → PCIe → CPU memory → NIC → network → remote NIC → remote CPU memory → remote PCIe → remote GPU memory

That path has two costly problems:

Latency. Every transfer is staged through CPU memory, and the CPU's handling overhead adds delay. In small-message, high-frequency traffic like gradient synchronization, that overhead accumulates and visibly slows training.

CPU load. Moving data between GPU and network requires the CPU's full involvement, which consumes compute cycles and turns the CPU into a bottleneck. When GPUs are running flat out, a CPU saturated with transfer work can't keep up with the tasks it should be doing.

2. The GDR data path

GDR enables direct GPU-memory-to-NIC communication through three components working together: GPUDirect, RDMA (Remote Direct Memory Access), and the PCIe ATU (Address Translation Unit). The path collapses to:

GPU memory → PCIe → NIC → network → remote NIC → remote GPU memory

(Figure 1: GDR data path — red is the sending GPU, blue is the receiving GPU. CPU memory is bypassed.)

Three things make it work:

Address mapping. The PCIe ATU handles translation between GPU virtual and physical addresses, so the NIC can resolve GPU-memory addresses without the CPU.
Direct access. The NIC reads from and writes to GPU memory over RDMA. No CPU-issued read/write instructions, and no CPU memory buffer in the loop.
Protocol integration. GDR is tightly integrated with high-speed fabrics like InfiniBand and RoCE, and supports zero-copy transfers to cut copy overhead and push bandwidth utilization higher.
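
One practical consequence of the direct-access step: on Linux, the NIC driver can only reach GPU memory when a peer-memory kernel module is loaded. Here is a minimal sketch of that check. It assumes the module is named nvidia_peermem (bundled with current NVIDIA drivers) or nv_peer_mem (the older standalone package). The helper name has_peermem and its optional file argument are illustrative and not part of any NVIDIA tool.

```bash
#!/usr/bin/env bash
# has_peermem [MODULES_FILE]
# Succeeds if a GPUDirect RDMA peer-memory module appears in the module list.
# MODULES_FILE defaults to /proc/modules; the argument exists so the check
# can also be run against a saved copy of that file.
has_peermem() {
    local modules_file="${1:-/proc/modules}"
    # /proc/modules lists one module per line, module name first.
    grep -Eq '^(nvidia_peermem|nv_peer_mem) ' "$modules_file" 2>/dev/null
}

if has_peermem; then
    echo "peer-memory module loaded"
else
    echo "peer-memory module not loaded (try: sudo modprobe nvidia-peermem)"
fi
```

Run it on every node before launching a job; a node without the module quietly falls back to staging through CPU memory.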

3. Why it fits AI training

AI training is defined by large data volumes, frequent communication, and sensitivity to latency. GDR maps cleanly onto all three:

Low latency. Bypassing the CPU removes the staging copies through host memory, which can cut transfer latency by more than half (the case study in Section V measures gradient sync falling from 120 μs to 45 μs). That suits the high-frequency, small-message traffic of distributed training.
High bandwidth. Removing the CPU-memory ceiling lets you use the full bandwidth of PCIe 4.0/5.0 and InfiniBand. A single link can exceed 100 Gb/s, which is enough to keep TB-scale parameter synchronization moving.
Low CPU usage. With the CPU out of the transfer path, those cycles go back to scheduling and data preprocessing instead of becoming a bottleneck.
Drop-in compatibility. GDR is integrated with CUDA and NCCL (NVIDIA's collective communications library), so it works with PyTorch, TensorFlow, and other mainstream frameworks without code changes — deployment cost stays low.
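
To put the bandwidth figure in concrete terms, here is a back-of-the-envelope sketch. It assumes 2 bytes per parameter (FP16) and an ideal link with no protocol overhead; the function name and the 175B-parameter example input are illustrative.

```bash
# transfer_seconds PARAMS BYTES_PER_PARAM LINK_GBPS
# Ideal time to move PARAMS parameters across one link of LINK_GBPS gigabits/s.
transfer_seconds() {
    awk -v p="$1" -v b="$2" -v g="$3" 'BEGIN {
        bytes_per_sec = g * 1e9 / 8          # gigabits/s -> bytes/s
        printf "%.1f\n", (p * b) / bytes_per_sec
    }'
}

transfer_seconds 175000000000 2 100   # 350 GB over 100 Gb/s: prints 28.0
transfer_seconds 175000000000 2 200   # 350 GB over 200 Gb/s: prints 14.0
```

Real transfers are split across many links and overlap with compute, so treat these numbers as a lower bound per link rather than a prediction.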

II. Where GDR Pays Off in Training

The defining requirement of AI training is efficient multi-GPU collaboration. GDR's job is to speed up communication between GPUs — within a node and across nodes — and to accelerate data movement between GPUs and storage. In practice, the wins concentrate in four scenarios that span the full pipeline from data loading to model training.

1. Multi-node distributed training (the primary use case)

As large models like Llama and the GPT series have grown, a single node's GPUs are no longer enough, and multi-node clusters have become standard. That makes cross-node gradient sync and parameter updates the key efficiency bottleneck — and GDR is the main answer.

Here, GDR works hand-in-hand with NCCL to let GPUs across nodes communicate directly. Take a 64-GPU GPT-3 cluster (8 nodes × 8 GPUs). In the traditional model, each GPU's gradients first land in local CPU memory, then go out over the network — high latency plus CPU strain. With GDR enabled, GPUs push gradients straight into remote GPU memory over InfiniBand, no CPU in the middle.

(Figure 2: GDR in multi-node training — direct cross-node gradient synchronization.)

The payoff:

Lower gradient-sync latency. Cross-node sync latency drops sharply (from 120 μs to 45 μs in the Section V case study). For large models with constant gradient exchange, that compounds into hours or even days saved.
Better scalability. As you add nodes (say, 8 → 32), traditional-mode latency climbs roughly linearly, while GDR holds its low-latency profile and keeps large clusters working together efficiently.
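
For a sense of how much data gradient synchronization moves, the sketch below uses the standard ring all-reduce cost model, in which each of N GPUs sends 2(N-1)/N times the buffer size. The choice of model, the function name, and the 1 GB example buffer are assumptions for illustration, not figures from this article.

```bash
# ring_allreduce_bytes NUM_GPUS BUFFER_BYTES
# Bytes each GPU sends during one ring all-reduce of BUFFER_BYTES.
ring_allreduce_bytes() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", 2 * (n - 1) / n * s }'
}

ring_allreduce_bytes 8 1000000000     # prints 1750000000
ring_allreduce_bytes 64 1000000000    # prints 1968750000
```

Per-GPU traffic approaches twice the buffer size as the cluster grows, which is why every copy removed from the path matters at scale.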

2. Single-node multi-GPU training (bandwidth-bound)

Communication bandwidth matters even inside a single node, especially with high-end GPUs like the A100 and H100. Intra-node communication over PCIe alone is bandwidth-limited (PCIe 4.0 x16 is about 32 GB/s per direction), and with eight GPUs talking at once you hit a wall. NVLink carries the GPU-to-GPU traffic inside the box, while GDR over InfiniBand or RoCE keeps NIC traffic off the CPU path, so the two complement each other.
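
The 32 GB/s figure follows from the PCIe signaling rate. As a quick worked check, the sketch below assumes 16 GT/s per lane for PCIe 4.0, 32 GT/s for PCIe 5.0, and 128b/130b line encoding; the function name is illustrative.

```bash
# pcie_gbytes_per_sec GT_PER_SEC LANES
# Usable one-direction bandwidth in GB/s after 128b/130b encoding.
pcie_gbytes_per_sec() {
    awk -v gt="$1" -v lanes="$2" 'BEGIN { printf "%.1f\n", gt * lanes * 128 / 130 / 8 }'
}

pcie_gbytes_per_sec 16 16   # PCIe 4.0 x16: prints 31.5
pcie_gbytes_per_sec 32 16   # PCIe 5.0 x16: prints 63.0
```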

3. Large-scale data loading (GPU ↔ storage)

Training runs chew through petabytes of data. Traditionally, loading from storage into GPU memory relays through CPU memory (storage → CPU memory → GPU memory), which adds latency and CPU load and can make data loading the bottleneck.

GDR lets GPUs and storage exchange data directly, shortening the path to storage → NIC → GPU memory:

Faster loading. PB-scale load times drop by more than 30%, with the biggest gains on large media like images and video.
Lower CPU overhead. With the CPU out of the loading path, it stays free for preprocessing and scheduling.

4. Hybrid model + data parallelism

For models whose parameters run to terabytes, like GPT-4, pure data parallelism isn't enough, and you need a hybrid of model and data parallelism. GDR optimizes both at once: low-latency gradient sync on the data-parallel side, high-bandwidth shard-parameter transfer on the model-parallel side. On a 1,024-GPU GPT-4 run, enabling GDR cut model-shard transfer latency by 60% and shortened the training cycle by more than 40%.

III. Deploying and Configuring GDR

GDR has specific hardware and software prerequisites, and it has to be configured alongside NCCL and CUDA to perform. Below are the prerequisites, the configuration steps, and the gotchas worth knowing before you start.

1. Prerequisites

Hardware

GPU: Must support NVIDIA GPUDirect. A100/H100-class GPUs are recommended; entry-level cards (e.g., GTX series) don't support GDR.
NIC: Must support RDMA — an InfiniBand HCA (e.g., Mellanox ConnectX-6 or ConnectX-7) or a RoCE-capable Ethernet adapter, ideally 100 Gb/s or higher.
Motherboard: PCIe 4.0 or newer, with slots that support ATU address translation, so the PCIe link between NIC and GPU is unobstructed.
Cluster fabric: Single-node multi-GPU setups want NVLink; multi-node clusters need an InfiniBand or RoCE fabric with enough inter-node bandwidth.

Software

CUDA Toolkit ≥ 11.4. GDR depends on CUDA's GPUDirect API; older versions don't support all of it.
NCCL ≥ 2.10. NCCL is the core collective library for distributed training and is tightly integrated with GDR — make sure the NCCL and CUDA versions are compatible.
OS: Linux recommended (e.g., Ubuntu 20.04, CentOS 8). Windows support for GDR is incomplete and isn't recommended for training.
Drivers: NVIDIA GPU driver ≥ 470.xx and NIC driver (e.g., Mellanox OFED) ≥ 5.0. Keep driver and software versions in sync.
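
The minimum versions above are easy to check in a preflight script. This is a minimal sketch, assuming GNU coreutils (for sort -V); the helper names and the "installed" version strings in the example calls are placeholders, so substitute whatever your nodes actually report.

```bash
# version_ge HAVE NEED
# Succeeds when HAVE >= NEED under version ordering (sort -V, GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# check NAME HAVE NEED: print a one-line verdict for a component.
check() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok"
    else
        echo "$1 $2: needs >= $3"
    fi
}

# Minimums come from the list above; the middle values are placeholders.
check CUDA 11.8 11.4
check NCCL 2.9.6 2.10
check driver 470.57.02 470
```

Plain string comparison would rank 2.9 above 2.10, which is why the sketch leans on version-aware sorting.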

2. Configuration (multi-node InfiniBand example)

The two core tasks are enabling GPUDirect RDMA and setting NCCL environment variables so GDR and NCCL work together.

(1) Check the environment and install dependencies

bash
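# List each GPU's model, PCI bus ID, and driver version
# (compare the model and driver against the prerequisites above)
nvidia-smi --query-gpu=gpu_name,gpu_bus_id,driver_version --format=csv,noheader,nounits

# Check whether the NIC supports RDMA
ibstat # InfiniBand info in the output means RDMA is available

# Install dependencies (Ubuntu, with NVIDIA's CUDA apt repository configured)
# libnccl2 and libnccl-dev are NCCL's package names in that repository;
# infiniband-diags provides ibstat
sudo apt-get install -y cuda-toolkit-11-8 libnccl2 libnccl-dev libibverbs-dev infiniband-diags

# Install the Mellanox OFED driver (run from the unpacked MLNX_OFED directory)
sudo ./mlnxofedinstall --add-kernel-support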

(2) Enable GDR

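Enable GPUDirect RDMA by editing the NVIDIA driver's module configuration file:

bash
# Edit the NVIDIA module options file (the troubleshooting table below calls it
# nvidia.conf; modprobe option files live under /etc/modprobe.d/)
# Add the following content (enable GPUDirect RDMA)
# Option names vary by driver release; confirm with: modinfo nvidia | grep parm
options nvidia_uvm uvm_perf_prefetch_enable=1 uvm_enable_peer_memory_access=1
options nvidia NVreg_EnablePCIeRdma=1

# Reload the driver (stop all GPU workloads first)
sudo rmmod nvidia_uvm nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm

# Load the peer-memory module that connects the NIC driver to GPU memory
# (bundled with NVIDIA driver 470 and later; install MLNX_OFED first)
sudo modprobe nvidia-peermem

# Verify whether GDR is enabled
nvidia-smi -q | grep "GPU Direct RDMA" # "Enabled" means activation succeeded
lsmod | grep nvidia_peermem            # the module should be listed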

(3) Set NCCL environment variables (the key step)

Configure NCCL through environment variables so that cross-node GPU communication uses GDR:

bash
# Add these to the training launch script (or to /etc/profile)
export NCCL_NET_GDR_LEVEL=SYS  # Allow GDR at any NIC-to-GPU topology distance
export NCCL_IB_DISABLE=0       # Enable the InfiniBand transport
export NCCL_SOCKET_IFNAME=ib0  # IP interface NCCL uses for bootstrap traffic
export NCCL_IB_HCA=mlx5_0      # InfiniBand HCA device (list devices with ibstat)
export NCCL_DEBUG=INFO         # Enable NCCL debug logs (optional)

(4) Verify performance

Use NCCL's official nccl-tests suite to measure bandwidth and latency with GDR enabled and disabled:

bash
# Clone and build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda

# GDR enabled (2 nodes × 8 GPUs, AllReduce)
mpirun -np 16 -N 8 -H node1:8,node2:8 \
    -x NCCL_NET_GDR_LEVEL=SYS -x NCCL_IB_DISABLE=0 \
    ./build/all_reduce_perf -b 256M -e 8G -f 2 -g 1

# GDR disabled (comparison)
mpirun -np 16 -N 8 -H node1:8,node2:8 \
    -x NCCL_NET_GDR_LEVEL=LOC -x NCCL_IB_DISABLE=0 \
    ./build/all_reduce_perf -b 256M -e 8G -f 2 -g 1

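Pass criteria: the configuration is working if, relative to the GDR-off run, the busbw column rises by more than 30% and the time column (latency) drops by more than 50%. Latency differences show most clearly at small message sizes, so lower -b when that is the number you care about.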
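
The comparison against the pass criteria can be scripted once both runs are saved to files. This is a sketch under one stated assumption: that in nccl-tests output the out-of-place busbw value is the 8th whitespace-separated field on each data row (rows that start with a numeric size). Check that against the header row of your own output; the function names and log file names are illustrative.

```bash
# avg_busbw FILE
# Mean of the assumed busbw column (field 8) over all data rows, in GB/s.
avg_busbw() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 8 { sum += $8; n++ }
         END { if (n) printf "%.2f\n", sum / n; else print "0.00" }' "$1"
}

# pct_gain OFF ON
# Percentage increase of ON over OFF.
pct_gain() {
    awk -v off="$1" -v on="$2" 'BEGIN { printf "%.1f\n", (on - off) / off * 100 }'
}

# Example usage, with each run redirected to a file beforehand:
#   pct_gain "$(avg_busbw gdr_off.log)" "$(avg_busbw gdr_on.log)"
```

Averaging across sizes is a blunt summary; for a closer look, compare the two runs row by row at the message sizes your training job actually uses.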

3. Deployment notes and pitfalls

Version compatibility. Keep CUDA, NCCL, the GPU driver, and the NIC driver compatible, or GDR may fail to enable or behave erratically. Install an NCCL build made for your CUDA version (the case study in Section V, for example, pairs CUDA 11.8 with NCCL 2.18).
Network config. Set a sensible MTU on the InfiniBand fabric (2048 is a good default) to avoid fragmentation-induced latency; confirm node-to-node connectivity and disable firewalls.
GPU memory. GDR requires peer-to-peer access to GPU memory — leave enough headroom so transfers don't fail on a full device.
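Debugging. If GDR won't enable, set NCCL_DEBUG=INFO and read the logs. Channels that use GDR are tagged GDRDMA in their transport line (for example, via NET/IB/0/GDRDMA). Watch for "GPU Direct RDMA not enabled" and "HCA device not found," and check your driver config and physical connections.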
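
Reading the logs by eye gets tedious on a large job, so a small filter helps. The sketch below counts log lines tagged GDRDMA, the marker NCCL prints on channels that use GPUDirect RDMA. The function name is illustrative, and the two sample log lines are a made-up excerpt in the general shape of NCCL INFO output, not captured from a real run.

```bash
# gdr_channel_count LOGFILE
# Number of log lines that mention a GDRDMA transport.
gdr_channel_count() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c 'GDRDMA' "$1" || true
}

# Illustrative log excerpt: one channel with GDR, one without.
log="$(mktemp)"
cat > "$log" <<'EOF'
node1:101:140 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
node1:101:140 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/0
EOF
echo "GDR channels: $(gdr_channel_count "$log")"
rm -f "$log"
```

A count of zero on a multi-node run is a strong hint that traffic is still being staged through CPU memory.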

IV. Tuning and Troubleshooting

Turning GDR on isn't the whole job. You'll want to tune for your workload and clear common issues to get full value.

1. Tuning

(1) Match the protocol to the message size

NCCL offers several transport protocols; pick by message size to balance latency and bandwidth:

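Small messages (<64 KB, e.g., gradient sync): use NCCL's low-latency protocols, LL or LL128, which minimize per-message overhead. LL128 moves data in 128-byte units and keeps most of the link bandwidth.
Large messages (>1 MB, e.g., parameter sync): use Simple, which has more per-message overhead but the highest peak bandwidth.

NCCL chooses a protocol per operation by default. To pin one while benchmarking, export NCCL_PROTO=LL128 for small messages or export NCCL_PROTO=Simple for large ones.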
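
Rather than guessing at thresholds, you can measure each protocol on your own fabric. The sketch below only prints one benchmark command per NCCL_PROTO value (LL, LL128, Simple) so you can review them before running anything. The function name, the default binary path, and the 8-byte to 1 GB size range are illustrative choices.

```bash
# proto_sweep_cmds [BENCH_BINARY]
# Print (do not run) one all_reduce_perf command line per NCCL protocol.
proto_sweep_cmds() {
    local bin="${1:-./build/all_reduce_perf}"
    local proto
    for proto in LL LL128 Simple; do
        echo "NCCL_PROTO=$proto $bin -b 8 -e 1G -f 2 -g 1"
    done
}

proto_sweep_cmds
```

For multi-node runs, pass the same variable through your launcher (for example with mpirun's -x option) instead of prefixing the command.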

(2) Pick the right NCCL algorithm

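Bandwidth-bound traffic (large messages): Ring AllReduce, which keeps every link busy and approaches peak bandwidth. Its latency grows linearly with GPU count, so it fits best when messages are large.
Latency-bound traffic (small messages, or clusters beyond a few dozen GPUs): the Tree algorithm, whose latency grows logarithmically with GPU count.

To pin a choice, export NCCL_ALGO=Ring or export NCCL_ALGO=Tree; when unset, NCCL selects per operation.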
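
A simplified step-count model shows how the two algorithms scale. The sketch assumes a ring all-reduce takes 2(N-1) sequential communication steps and a tree takes about 2·ceil(log2 N) (reduce up, broadcast down). Real NCCL behavior depends on topology and pipelining, so read this as intuition rather than a performance prediction; the function name is illustrative.

```bash
# allreduce_steps NUM_GPUS
# Sequential communication steps under the simplified ring and tree models.
allreduce_steps() {
    awk -v n="$1" 'BEGIN {
        depth = 0
        for (m = 1; m < n; m *= 2) depth++      # depth = ceil(log2(n))
        printf "%d GPUs: ring %d steps, tree %d steps\n", n, 2 * (n - 1), 2 * depth
    }'
}

allreduce_steps 8      # prints "8 GPUs: ring 14 steps, tree 6 steps"
allreduce_steps 64     # prints "64 GPUs: ring 126 steps, tree 12 steps"
allreduce_steps 1024   # prints "1024 GPUs: ring 2046 steps, tree 20 steps"
```

Each step carries a fixed latency cost, so the gap between the two columns matters most when messages are small.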

2. Common issues

Symptom	Likely Cause	Fix
GDR won't enable; log shows "GPU Direct RDMA not enabled"	GPU doesn't support GDR, bad driver config, or CUDA too old	1. Check whether the GPU model supports GPUDirect RDMA; 2. Recheck the nvidia.conf module options and reload the driver; 3. Upgrade CUDA to 11.4 or newer
GDR on, but bandwidth barely improves	NIC bandwidth too low, poor PCIe link config, or bad NCCL env vars	1. Check the NIC link speed (100 Gb/s or higher); 2. Check that the GPU and NIC sit close together on the PCIe topology and that the link runs at full width and speed; 3. Confirm the NCCL GDR setting (NCCL_NET_GDR_LEVEL) isn't disabling GDR
Cross-node communication fails; log shows "HCA device not found"	InfiniBand driver missing or wrong device name	1. Reinstall the Mellanox OFED driver; 2. Look up the HCA name with ibstat and set NCCL_IB_HCA to match; 3. Confirm that the interface named in NCCL_SOCKET_IFNAME exists
CPU usage stays high during training	GDR not actually active, preprocessing on CPU, or bad MPI config	1. Verify whether GDR is enabled (nvidia-smi -q | grep "GPU Direct RDMA"); 2. Optimize the data preprocessing pipeline and move it to the GPU where possible; 3. Configure MPI to bind CPU cores (--bind-to core)

V. Case Study: 64-GPU GPT-3 Training

Environment: 8 nodes × 8 A100 GPUs, 200 Gb/s InfiniBand (HDR), CUDA 11.8, NCCL 2.18, GDR enabled.
Workload: GPT-3 (175B parameters), data-parallel, batch size 128, 10 TB of training data.

GDR on vs. off:

• Gradient synchronization latency: reduced from 120 μs to 45 μs, a decrease of 62.5%.
• Single-step training time: reduced from 1.8 s to 1.1 s, a decrease of 38.9%.
• CPU usage: dropped from 75% to 20%, freeing CPU resources for data preprocessing.
• Training cycle: shortened from 12 days to 7 days, a reduction of 41.7% (about a 1.7× speedup).
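
The percentages follow directly from the before and after values. As a quick worked check (the function name is illustrative):

```bash
# pct_drop BEFORE AFTER
# Percentage reduction from BEFORE to AFTER.
pct_drop() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

pct_drop 120 45    # sync latency, μs: prints 62.5
pct_drop 1.8 1.1   # step time, s: prints 38.9
pct_drop 75 20     # CPU usage, percentage points relative to 75: prints 73.3
pct_drop 12 7      # training cycle, days: prints 41.7
```

A 41.7% shorter cycle is the same thing as a 12/7 ≈ 1.71× speedup, which is the more natural figure when comparing throughput.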

VI. Takeaways and What's Next

GDR is the communication accelerator for distributed training. By taking the CPU out of the data path, it gives GPUs, networks, and storage a direct high-speed channel and addresses the classic problems of the traditional model — high latency, heavy CPU load, and bandwidth ceilings. For large models and large clusters, GDR working alongside NCCL and InfiniBand meaningfully lifts efficiency and shortens training cycles, which is why it's become a fixture in high-end clusters.

Its appeal isn't only performance — it's also low cost to adopt. There's no training code to change; sort out the hardware and a handful of environment variables and you're running, with compatibility across mainstream frameworks. For engineers, knowing how to deploy, tune, and debug GDR is the most direct way to clear communication bottlenecks and unlock a cluster's full compute potential.

Looking ahead, as GPUs get faster and models keep growing, GDR will push toward still higher bandwidth and lower latency — riding hardware like PCIe 5.0 and 400 Gb/s InfiniBand (NDR), and integrating more deeply with AI frameworks so GDR can switch on and self-tune automatically, lowering the barrier to entry and supporting larger, more efficient training.
