Motivation

Every optimization technique in the LLM infrastructure stack ultimately comes down to two things: memory and communication. Whether you are sharding model weights across GPUs with FSDP, overlapping computation with gradient synchronization in DDP, or designing a KV cache eviction policy for an inference server, you are wrestling with the same two constraints.

Memory determines what fits on a single device. Communication determines how fast multiple devices can cooperate. If you do not understand these two fundamentals at a hardware level, every distributed training paper and every inference system will feel like a black box.

This article builds that foundation. Part A dissects the GPU memory hierarchy — from registers to HBM — and explains why memory bandwidth, not compute, is usually the bottleneck for LLM workloads. Part B covers the complete set of NCCL collective communication primitives, the algorithms that implement them, and the hardware interconnects that constrain them. By the end, you will have the mental models needed to reason about every parallelism strategy and inference optimization we cover in later articles.

Prerequisites

  • Basic PyTorch experience (training loops, nn.Module, CUDA tensors)
  • Familiarity with the Transformer architecture (attention, FFN, residual connections)

Part A: GPU Memory Model

GPU vs CPU: Different Design Philosophies

Before diving into GPU memory, it helps to understand why GPUs are built the way they are.

A modern CPU like an Intel Xeon or AMD EPYC has a small number of powerful cores (typically 32-128). Each core has deep pipelines, sophisticated branch predictors, large private caches (often 1-2 MB of L2 per core), and out-of-order execution engines. The design philosophy is latency optimization: make a single thread of execution as fast as possible.

A GPU takes the opposite approach. An NVIDIA A100 has 108 Streaming Multiprocessors (SMs), each capable of running thousands of threads concurrently. Individual threads are simple — no branch prediction, no out-of-order execution, modest cache per SM. The design philosophy is throughput optimization: keep thousands of threads in flight to hide memory latency through massive parallelism.

flowchart LR
  subgraph CPU["CPU: Latency-Oriented"]
    direction TB
    C0["Few powerful cores"] ~~~ C1["Large caches"]
    C2["Branch prediction"] ~~~ C3["Out-of-order exec"]
  end
  subgraph GPU["GPU: Throughput-Oriented"]
    direction TB
    G0["Many simple cores"] ~~~ G1["Small caches per SM"]
    G2["No branch prediction"] ~~~ G3["In-order execution"]
  end

This distinction matters because it explains the GPU memory hierarchy. A CPU can afford to stall on a cache miss — the out-of-order engine finds other work to do. A GPU cannot afford to stall 108 SMs simultaneously; instead, it relies on having enough threads ready to execute that the hardware can switch to a different group of threads while the first group waits for data from memory.

Streaming Multiprocessors, Warps, and the Execution Model

The fundamental execution unit of an NVIDIA GPU is the Streaming Multiprocessor (SM). Understanding SM architecture is essential for reasoning about memory access patterns and occupancy.

SM Architecture. Each SM contains:

  • Multiple CUDA cores (FP32/FP64 units) — 64 FP32 cores per SM on A100
  • Tensor Cores for matrix multiply-accumulate operations (crucial for Transformer training)
  • A register file (typically 256 KB per SM on A100)
  • Shared memory / L1 cache (192 KB combined per SM on A100, up to 164 KB configurable as shared memory)
  • Warp schedulers that manage thread execution

Warps. Threads on a GPU do not execute individually. They execute in groups of 32 called warps. All 32 threads in a warp execute the same instruction at the same time (SIMT — Single Instruction, Multiple Threads). If threads within a warp take different branches of an if statement, both branches execute serially (called warp divergence), wasting cycles. This is why GPU code avoids divergent control flow.

Thread Hierarchy. CUDA organizes threads into a three-level hierarchy:

Grid (the entire kernel launch)
  └── Block (a.k.a. thread block or CTA — assigned to one SM)
        └── Warp (32 threads — the actual execution unit)
              └── Thread (individual thread within a warp)

A thread block is assigned to a single SM and cannot migrate. Multiple blocks can run on the same SM if resources (registers, shared memory) allow. The number of active warps per SM relative to the maximum is called occupancy — higher occupancy generally means better latency hiding, though the relationship is not always linear.
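The occupancy arithmetic is simple enough to sketch in a few lines. The A100 figures below (65,536 registers and a 64-warp maximum per SM) come from this section; the helper name is ours, and real occupancy also depends on shared memory usage and block-size limits:

```python
# Back-of-envelope occupancy estimate, considering only register pressure
# (a sketch -- the CUDA Occupancy Calculator accounts for all limits).

REGISTERS_PER_SM = 65_536   # 32-bit registers (256 KB register file, A100)
MAX_WARPS_PER_SM = 64       # A100 hardware limit
THREADS_PER_WARP = 32

def occupancy(regs_per_thread: int) -> float:
    """Fraction of the SM's maximum resident warps that fit, given
    per-thread register usage."""
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    warps_that_fit = REGISTERS_PER_SM // regs_per_warp
    return min(warps_that_fit, MAX_WARPS_PER_SM) / MAX_WARPS_PER_SM

print(occupancy(32))    # 1.0  -- 64 warps fit, full occupancy
print(occupancy(64))    # 0.5  -- register pressure halves resident warps
print(occupancy(128))   # 0.25
```

Doubling per-thread register usage halves the number of resident warps: this is the register-pressure trade-off in action.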

Why this matters for LLMs. Transformer operations like matrix multiplications and attention computations map well to the GPU’s SIMT model because they are highly data-parallel. However, operations like layer normalization, softmax, and token-by-token autoregressive decoding have less parallelism, and understanding the warp/SM model helps explain why these operations become bottlenecks.

Memory Hierarchy: From Registers to HBM

The GPU memory hierarchy is the single most important concept for understanding LLM performance. Each level trades off capacity for speed:

flowchart TD
  REG["<b>Registers</b>\n256 KB per SM | ~19 TB/s | ~0 cycles\nScope: per-thread"]
  SMEM["<b>Shared Memory / L1 Cache</b>\n192 KB per SM (A100) | ~19 TB/s\nScope: per-block"]
  L2["<b>L2 Cache</b>\n40 MB (A100) | ~5 TB/s\nScope: entire GPU"]
  HBM["<b>HBM (Global Memory)</b>\n80 GB (A100) | ~2 TB/s\nScope: entire GPU"]

  REG -->|"on-chip"| SMEM -->|"~4x bandwidth drop"| L2 -->|"~2.5x"| HBM

  style REG fill:#2d6a4f,color:#fff
  style SMEM fill:#40916c,color:#fff
  style L2 fill:#74c69d,color:#000
  style HBM fill:#b7e4c7,color:#000

Let us walk through each level with concrete numbers for the NVIDIA A100 (80 GB variant):

Registers are the fastest storage, private to each thread. The A100 has 256 KB of register file per SM (65,536 32-bit registers). Registers have effectively zero latency — they are read in the same cycle as the instruction that uses them. However, if a kernel uses too many registers per thread, fewer warps can be active on the SM, reducing occupancy. This tension is called register pressure.

Shared Memory / L1 Cache is a fast on-chip SRAM shared among all threads in a block. On the A100, each SM has 192 KB of combined shared memory and L1 cache, of which up to 164 KB can be configured as shared memory. Shared memory is explicitly managed by the programmer (in CUDA C) and provides roughly 19 TB/s of aggregate bandwidth across all SMs. It is the key to writing fast GPU kernels — FlashAttention, for instance, keeps attention tiles in shared memory to avoid repeated round trips to HBM.

L2 Cache is shared across all SMs. The A100 has 40 MB of L2. It automatically caches HBM accesses with ~5 TB/s bandwidth. While useful, 40 MB is tiny compared to model sizes (a 7B parameter model in FP16 is ~14 GB), so L2 hit rates for LLM workloads are often low.

HBM (High Bandwidth Memory) is the main GPU memory — what people mean when they say “GPU memory” or “VRAM.” The A100 has 80 GB of HBM2e with ~2 TB/s bandwidth. This is where model parameters, activations, gradients, optimizer states, and KV caches live. Despite the name “High Bandwidth Memory,” HBM is by far the slowest level of the hierarchy, and its bandwidth is the primary bottleneck for LLM workloads.


Key Insight. Bandwidth drops steeply at each level: registers and shared memory provide ~19 TB/s, L2 provides ~5 TB/s, and HBM provides ~2 TB/s — roughly an order of magnitude between on-chip SRAM and HBM. Effective GPU programming is fundamentally about maximizing data reuse at higher levels of the hierarchy to minimize HBM accesses. This is exactly what techniques like FlashAttention and kernel fusion accomplish.

Why HBM Bandwidth Is THE Bottleneck for LLMs

To understand whether a workload is limited by memory or compute, we use the concept of arithmetic intensity — the ratio of compute operations to bytes transferred from memory.

Arithmetic Intensity = FLOPs / Bytes Accessed

The NVIDIA A100 has:

  • ~312 TFLOPS of BF16 Tensor Core compute
  • ~2 TB/s of HBM bandwidth

The ratio of these two numbers gives the ridge point of the Roofline model:

Ridge Point = 312 TFLOPS / 2 TB/s = 156 FLOPs/byte

If your operation performs fewer than 156 FLOPs per byte of data it reads from HBM, it is memory-bound — the compute units sit idle waiting for data. If it performs more than 156 FLOPs per byte, it is compute-bound — the memory system can keep up.
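The ridge-point test is easy to apply numerically. In the sketch below, the shapes and the `classify` helper are illustrative (not from the companion code); the GEMM byte count assumes each operand is read once and the output written once, in BF16:

```python
# Roofline-style classification for an A100: is an op memory- or compute-bound?
PEAK_FLOPS = 312e12           # BF16 Tensor Core FLOP/s
HBM_BW = 2e12                 # bytes/s
RIDGE = PEAK_FLOPS / HBM_BW   # 156 FLOPs/byte

def classify(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > RIDGE else "memory-bound"

# Large training GEMM: (4096 x 4096) @ (4096 x 4096) in BF16
M = K = N = 4096
gemm_flops = 2 * M * K * N                    # multiply-accumulate = 2 FLOPs
gemm_bytes = 2 * (M * K + K * N + M * N)      # read A, B; write C (2 B/elem)
print(classify(gemm_flops, gemm_bytes))       # compute-bound (~1365 FLOPs/byte)

# Decode-step GEMV: (1 x 4096) @ (4096 x 4096) -- batch-1 inference
gemv_flops = 2 * 4096 * 4096
gemv_bytes = 2 * (4096 * 4096 + 4096 + 4096)  # the weight matrix dominates
print(classify(gemv_flops, gemv_bytes))       # memory-bound (~1 FLOP/byte)
```

The same weight matrix yields ~1365 FLOPs/byte when multiplied by a 4096-row batch but only ~1 FLOP/byte when multiplied by a single vector — the entire batching argument in one comparison.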

Where do LLM operations fall?

Operation                             Arithmetic Intensity   Bound
──────────────────────────────────────────────────────────────────
Large GEMM (training)                 100-1000+ FLOPs/byte   Compute-bound
Small GEMM (inference, batch=1)       ~1-10 FLOPs/byte       Memory-bound
Softmax                               ~5 FLOPs/byte          Memory-bound
Layer Normalization                   ~5 FLOPs/byte          Memory-bound
Attention (naive)                     ~10 FLOPs/byte         Memory-bound
Elementwise ops (GELU, residual add)  ~1 FLOP/byte           Memory-bound

During training with large batch sizes, the big matrix multiplications (linear layers, attention projections) can be compute-bound because the matrices are large enough to amortize memory access costs. But many other operations (normalization, softmax, elementwise) remain memory-bound.

During inference, especially autoregressive decoding with batch size 1, almost everything is memory-bound. Each decoding step generates one token, which means the weight matrices must be read from HBM for a single vector-matrix multiply — catastrophically low arithmetic intensity. This is why inference optimization focuses so heavily on:

  1. Batching — processing more tokens per HBM read (continuous batching, dynamic batching)
  2. Quantization — reducing bytes per parameter (INT8, INT4, FP8)
  3. KV cache management — avoiding redundant computation/memory for prefix tokens
  4. Kernel fusion — combining multiple memory-bound operations into one kernel to reduce HBM round trips
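The memory-bound nature of decoding gives a useful rule of thumb: per-token latency is bounded below by (weight bytes) / (HBM bandwidth), because every weight byte must cross the HBM bus once per token at batch size 1. A hypothetical helper (ignoring the KV cache and activation traffic, which only add to the bound):

```python
# Lower bound on batch-1 decode latency: all weights read once per token.
def decode_step_ms(n_params: float, bytes_per_param: int,
                   hbm_bw: float = 2e12) -> float:
    """Minimum milliseconds per generated token on a ~2 TB/s HBM GPU."""
    return n_params * bytes_per_param / hbm_bw * 1e3

print(round(decode_step_ms(7e9, 2), 1))   # 7B in FP16 -> 7.0 ms (~143 tok/s ceiling)
print(round(decode_step_ms(7e9, 1), 1))   # 7B in INT8 -> 3.5 ms (quantization halves it)
```

This is why quantization translates almost directly into decode speedup: halving bytes per parameter halves the mandatory HBM traffic per token.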

Hands-On: GPU Memory Profiling

Theory becomes concrete when you measure it. The companion script memory_profiling.py profiles GPU memory usage through a complete training step of a Transformer model. Here is what it reveals.

The Setup. A 6-layer TransformerEncoder with d_model=512, 8 attention heads, and dim_feedforward=2048. This gives ~18.9M parameters, roughly 75.6 MB at FP32.

# code/01-gpu-memory-distributed/memory_profiling.py (excerpt)
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

D_MODEL = 512
N_HEADS = 8
N_LAYERS = 6
DIM_FFN = 2048

encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL, nhead=N_HEADS,
    dim_feedforward=DIM_FFN, batch_first=True,
)
model = nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS).to(device)
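You can reproduce the ~18.9M figure by hand, counting the parameters a standard encoder layer allocates (four attention projections, two FFN linears, two LayerNorms, all with biases). This arithmetic assumes the default PyTorch layer layout:

```python
# Back-of-envelope parameter count for the 6-layer encoder above.
d, ffn, layers = 512, 2048, 6

attn  = 4 * (d * d + d)                       # Q, K, V, output projections + biases
ff    = (d * ffn + ffn) + (ffn * d + d)       # two FFN linears + biases
norms = 2 * (2 * d)                           # two LayerNorms (weight + bias each)

per_layer = attn + ff + norms                 # 3,152,384 per layer
total = layers * per_layer
print(f"{total:,} params ~ {total * 4 / 1e6:.1f} MB in FP32")
```

The total, 18,914,304 parameters, matches the ~18.9M / ~75.6 MB figures quoted above.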

Memory at each stage. The script tracks torch.cuda.memory_allocated() at each step:

Stage                               Allocated (MB)   Delta (MB)
─────────────────────────────────────────────────────────────────
1. Baseline (empty GPU)                       0.00        +0.00
2. Model loaded to GPU                       75.60       +75.60   ← Parameters
3. Input tensors created                     83.60        +8.00   ← Input data
4. After forward pass                       155.00       +71.40   ← Activations
5. After backward pass                      151.20        -3.80   ← Gradients created,
                                                                     some activations freed
6. After optimizer.step()                   302.40      +151.20   ← Adam state (2x params)
7. After zero_grad(set_to_none=True)        226.80       -75.60   ← Gradients freed

(Exact numbers depend on your hardware and PyTorch version. Run the script to see your own results.)

The 4x Rule with Adam. The most important takeaway is the memory multiplier for training with the Adam optimizer:

Component              Size              Multiplier
───────────────────────────────────────────────────
Parameters             P bytes           1x
Gradients              P bytes           1x
Adam momentum (m)      P bytes           1x
Adam variance (v)      P bytes           1x
───────────────────────────────────────────────────
Total                  4P bytes          4x

For a 7B parameter model in FP32, that is 7B * 4 bytes * 4 = 112 GB just for parameters and optimizer state — already exceeding a single A100’s 80 GB. This is the fundamental reason why distributed training (and techniques like FSDP that shard optimizer state) exists.

But wait — we haven’t even counted activations yet. Activation memory scales with batch_size * seq_len * hidden_dim * num_layers and can easily exceed parameter memory during training. Techniques like gradient checkpointing (recomputing activations during backward instead of storing them) trade compute for memory here.

Mixed precision changes the arithmetic. With BF16 parameters and FP32 optimizer state (the standard mixed-precision recipe):

Component              Bytes per Param
──────────────────────────────────────
BF16 parameters        2
BF16 gradients         2
FP32 master weights    4   (for numerical stability)
FP32 momentum          4
FP32 variance          4
──────────────────────────────────────
Total                  16 bytes per parameter

For a 70B model: 70B * 16 bytes = 1.12 TB. This is why large-scale training requires hundreds of GPUs.
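Both recipes can be folded into a small estimator. The helper names are ours, and the figures cover only parameters, gradients, and optimizer state (no activations or KV cache):

```python
# Training-memory estimator for the FP32 and mixed-precision recipes above.
def train_memory_gb(n_params: float, recipe: dict) -> float:
    """GB needed for parameters + gradients + optimizer state."""
    bytes_per_param = sum(recipe.values())
    return n_params * bytes_per_param / 1e9

FP32  = {"params": 4, "grads": 4, "momentum": 4, "variance": 4}      # 16 B/param
MIXED = {"bf16_params": 2, "bf16_grads": 2, "fp32_master": 4,
         "fp32_momentum": 4, "fp32_variance": 4}                     # 16 B/param

print(train_memory_gb(7e9, FP32))     # 112.0 GB -- the 4x rule for a 7B model
print(train_memory_gb(70e9, MIXED))   # 1120.0 GB -- ~1.12 TB for a 70B model
```

Note that both recipes land on 16 bytes per parameter — mixed precision does not reduce steady-state training memory for these components; its wins come from faster compute and smaller activations.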

Try it yourself. Run python memory_profiling.py on any CUDA-capable GPU. Modify N_LAYERS, BATCH_SIZE, and SEQ_LEN to see how each factor affects memory usage. The full code is at code/01-gpu-memory-distributed/memory_profiling.py.


Part B: Distributed Communication

Why Multi-GPU?

The arithmetic from Part A makes the case clearly: a 70B parameter model needs ~140 GB in FP16 just for the weights — already exceeding the 80 GB capacity of an A100. Add optimizer state, gradients, activations, and KV caches, and you need multiple GPUs even for inference, let alone training.

But distributing computation across GPUs introduces a new challenge: communication. Every parallelism strategy (DDP, FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel) is defined by what data it communicates, when, and how. All of these strategies are built on a small set of collective communication primitives provided by NVIDIA’s NCCL library.

NCCL Primitives: Derived from Training Scenarios

The best way to understand communication primitives is not to memorize definitions, but to start from real training scenarios and see what data movement each one naturally requires. NCCL (NVIDIA Collective Communications Library, pronounced “nickel”) provides 8 communication primitives, each corresponding to a concrete engineering need.

The companion script nccl_allreduce.py demonstrates all eight primitives with concrete before/after values. Run it with:

torchrun --nproc_per_node=2 nccl_allreduce.py

Below we assume 4 GPUs (Rank 0-3, $P = 4$) and data size $N$, and derive the primitives from 5 scenarios.


Scenario 1: DDP — Each GPU Has Different Data, How to Sync Gradients?

DDP (DistributedDataParallel) is the simplest data-parallel strategy: every GPU holds a full copy of the model, processes a different mini-batch, and then synchronizes gradients so parameter updates stay identical.

The requirement is clear: each GPU has computed local gradients $g_i$, and we need every GPU to end up with $\bar{g} = \frac{1}{P}\sum_i g_i$. This is exactly the definition of All-Reduce.

All-Reduce

What it does. Reduce (typically SUM) across all GPUs, result delivered to every GPU.

Before:   GPU 0: [a0, a1, a2, a3]     GPU 1: [b0, b1, b2, b3]
          GPU 2: [c0, c1, c2, c3]     GPU 3: [d0, d1, d2, d3]

After:    Every GPU: [Σ0, Σ1, Σ2, Σ3]     where Σi = ai + bi + ci + di
# PyTorch DDP uses this under the hood:
dist.all_reduce(gradient_tensor, op=dist.ReduceOp.SUM)
gradient_tensor /= world_size  # average

Communication volume. Each GPU sends and receives ~$2N \cdot \frac{P-1}{P}$ bytes (Ring algorithm). For large $P$ this approaches $2N$ — virtually independent of GPU count. This is the elegance of the Ring algorithm.


Scenario 2: FSDP — Parameters Are Sharded, How to Compute Forward/Backward?

DDP’s problem is that every GPU stores the full model — a 70B model simply does not fit. FSDP (Fully Sharded Data Parallel) shards parameters, gradients, and optimizer states across GPUs, each storing only $1/P$.

But how do you compute with sharded parameters?

  • Forward pass: Computing a layer requires its full parameters. Each GPU only has one shard, so they must temporarily reassemble all shards → All-Gather
  • Backward pass: Each GPU computes full gradients, then needs to reduce and re-shard so each GPU only keeps its own slice → Reduce-Scatter

All-Gather

What it does. Each GPU contributes its shard; the concatenated full result is delivered to every GPU.

Before:   GPU 0: [a0]   GPU 1: [b1]   GPU 2: [c2]   GPU 3: [d3]
After:    Every GPU: [a0, b1, c2, d3]

Communication volume. Each GPU receives $N \cdot \frac{P-1}{P}$ data.

Reduce-Scatter

What it does. Reduce (sum) across all GPUs, then scatter different chunks to different GPUs — GPU i gets the i-th chunk.

Before:   GPU 0: [a0, a1, a2, a3]     GPU 1: [b0, b1, b2, b3]
          GPU 2: [c0, c1, c2, c3]     GPU 3: [d0, d1, d2, d3]

After:    GPU 0: [Σ0]   GPU 1: [Σ1]   GPU 2: [Σ2]   GPU 3: [Σ3]

Communication volume. Each GPU sends $N \cdot \frac{P-1}{P}$ data.

The FSDP Communication Loop
Forward:  All-Gather → reassemble full params → compute → discard full params
Backward: All-Gather → reassemble full params → compute gradients → Reduce-Scatter → keep only gradient shard
Update:   Each GPU updates its parameter shard with its gradient shard

Key insight. All-Reduce is conceptually equivalent to Reduce-Scatter + All-Gather. DDP uses All-Reduce because every GPU needs the full gradient; FSDP uses Reduce-Scatter because each GPU only needs its own shard. NCCL internally often decomposes All-Reduce into these two steps.
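The equivalence is easy to verify without any GPUs. A pure-Python sketch of the three collectives on four simulated ranks (each `bufs[i]` plays the role of GPU i's buffer):

```python
# All-Reduce == Reduce-Scatter + All-Gather, on four simulated ranks.

def all_reduce(bufs):
    total = [sum(col) for col in zip(*bufs)]     # elementwise sum across ranks
    return [list(total) for _ in bufs]           # every rank gets the full result

def reduce_scatter(bufs):
    total = [sum(col) for col in zip(*bufs)]
    return [[total[rank]] for rank in range(len(bufs))]   # rank i keeps chunk i

def all_gather(shards):
    full = [shard[0] for shard in shards]        # concatenate the shards
    return [list(full) for _ in shards]

bufs = [[1, 2, 3, 4], [10, 20, 30, 40],
        [100, 200, 300, 400], [1000, 2000, 3000, 4000]]

assert all_reduce(bufs) == all_gather(reduce_scatter(bufs))
print(all_reduce(bufs)[0])   # [1111, 2222, 3333, 4444] on every rank
```

Reading the assertion right to left traces the FSDP pattern: reduce-scatter leaves each rank with its own reduced shard, and an all-gather of those shards reconstitutes exactly what all-reduce would have produced.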


Scenario 3: Pipeline Parallel — Model Split by Layers, How to Pass Activations?

Pipeline Parallelism (PP) partitions the model into stages by layer, each stage on a different GPU. During forward, stage 0 must send its output activations to stage 1; during backward, stage 1 must send gradients back to stage 0.

This does not require collective communication — it is just direct data transfer between two GPUs → Send / Recv.

Send / Recv (Point-to-Point)

What it does. Direct communication between exactly two GPUs.

Forward:  GPU 0 ──send(activations)──→ GPU 1
Backward: GPU 0 ←──recv(gradients)──── GPU 1

Communication volume. $O(N)$, involving only two GPUs.

Send/Recv is the only non-collective primitive. Pipeline scheduling algorithms (1F1B, Zero Bubble, etc.) are essentially careful orchestrations of these Send/Recv operations, timing them so different stages stay busy and pipeline bubbles are minimized.


Scenario 4: MoE Expert Parallel — Tokens Routed to Different Experts, How to Move Them?

In MoE (Mixture-of-Experts) models, a gating network routes each token to one or more experts. Under Expert Parallelism (EP), different experts live on different GPUs.

The problem: tokens on any GPU may need to go to experts on any other GPU. This is not one-to-many or many-to-one — it is every GPU sending different data to every other GPU → All-to-All.

All-to-All

What it does. Every GPU sends a different chunk to every other GPU and receives a different chunk from each.

Before:   GPU 0: [a→0, a→1, a→2, a→3]     GPU 1: [b→0, b→1, b→2, b→3]
          GPU 2: [c→0, c→1, c→2, c→3]     GPU 3: [d→0, d→1, d→2, d→3]

After:    GPU 0: [a→0, b→0, c→0, d→0]     GPU 1: [a→1, b→1, c→1, d→1]
          GPU 2: [a→2, b→2, c→2, d→2]     GPU 3: [a→3, b→3, c→3, d→3]

An MoE layer’s communication pattern: All-to-All (dispatch: tokens → experts) → expert computation → All-to-All (combine: expert outputs → original GPUs) — two All-to-All operations, one before and one after.

Communication volume. Each GPU sends and receives $N \cdot \frac{P-1}{P}$ data. All-to-All is the heaviest collective because it requires full-mesh communication, making MoE training particularly sensitive to network topology and bandwidth.
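Functionally, All-to-All is a transpose of the (source rank, destination chunk) matrix: chunk j on rank i ends up as chunk i on rank j. A minimal simulation using the same 4-GPU layout as above:

```python
# All-to-All as a matrix transpose over (rank, chunk) indices.

def all_to_all(chunks):
    P = len(chunks)
    # after[dst][src] = before[src][dst]
    return [[chunks[src][dst] for src in range(P)] for dst in range(P)]

# chunks[i][j] = token batch on GPU i destined for the expert on GPU j
before = [["a0", "a1", "a2", "a3"],
          ["b0", "b1", "b2", "b3"],
          ["c0", "c1", "c2", "c3"],
          ["d0", "d1", "d2", "d3"]]

after = all_to_all(before)
print(after[0])                      # ['a0', 'b0', 'c0', 'd0']
assert all_to_all(after) == before   # dispatch + combine round-trips tokens home
```

The final assertion mirrors the MoE layer's structure: the combine All-to-All is the inverse of the dispatch All-to-All, returning expert outputs to the GPUs that own the original tokens.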


Scenario 5: Initialization and Data Flow — “Glue” Primitives

The four scenarios above cover core training communication needs. A few more primitives handle initialization and data management:

Broadcast

What it does. One GPU (the “root”) sends its data to all other GPUs.

Before:   GPU 0: [A, A, A, A]   GPU 1: [., ., ., .]
          GPU 2: [., ., ., .]   GPU 3: [., ., ., .]

After:    Every GPU: [A, A, A, A]

Typical use. Before training begins, Rank 0 initializes model parameters and Broadcasts them so all GPUs start identical. DDP internally calls Broadcast during setup.

Communication volume. Each GPU sends or receives $O(N)$ data. Global total depends on implementation: naive (root sends one by one) costs $O(N \cdot P)$; in practice, NCCL uses a tree-based broadcast where multiple GPUs relay in parallel, completing in $\log P$ steps for a global total of $O(N \log P)$.

Perspective                  Volume
───────────────────────────────────
Per-GPU (send or receive)    O(N)
Global (naive)               O(N · P)
Global (tree broadcast)      O(N log P)
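The log P behavior follows from the set of data holders doubling each round: every GPU that already has the data forwards it to one that does not. A small sketch (simplified — real NCCL trees are pipelined and topology-aware, and the pairing below is arbitrary):

```python
# Count tree-broadcast rounds: holders double each round until all P GPUs
# have the data, so ceil(log2(P)) rounds suffice.
import math

def tree_broadcast_rounds(P: int) -> int:
    have = {0}                    # rank 0 is the root
    rounds = 0
    while len(have) < P:
        for sender in list(have):                 # every holder forwards once
            needy = next((r for r in range(P) if r not in have), None)
            if needy is not None:
                have.add(needy)
        rounds += 1
    return rounds

for P in (2, 4, 8, 16):
    print(P, tree_broadcast_rounds(P))   # 1, 2, 3, 4 rounds respectively
```

Compare with a naive root-sends-to-everyone broadcast, which would take P-1 sequential sends from a single link.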
Scatter / Gather

Scatter distributes different data chunks from one GPU to all GPUs. Gather collects data from all GPUs onto one (the inverse of Scatter).

Scatter (from GPU 0):
  GPU 0: [d0, d1, d2, d3]  →  GPU 0: [d0]  GPU 1: [d1]  GPU 2: [d2]  GPU 3: [d3]

Gather (to GPU 0):
  GPU 0: [d0]  GPU 1: [d1]  GPU 2: [d2]  GPU 3: [d3]  →  GPU 0: [d0, d1, d2, d3]

Typical use. During data loading, Rank 0 reads a large batch and Scatters it across GPUs; during evaluation, Gather collects predictions onto Rank 0 for aggregation.


The Full Picture: From Scenario to Primitive

Training Scenario                 Communication Need                                Primitive         Per-GPU Volume
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
DDP gradient sync                 Sum gradients, result to all GPUs                 all_reduce        2N·(P-1)/P
FSDP forward (reassemble params)  Each GPU contributes shard, all get full result   all_gather        N·(P-1)/P
FSDP backward (shard gradients)   Reduce gradients, each GPU keeps its shard        reduce_scatter    N·(P-1)/P
Pipeline Parallel                 Pass activations/gradients between stages         send / recv       N
Expert Parallel (MoE)             Full-permutation token routing                    all_to_all        N·(P-1)/P
Parameter initialization          Copy one GPU's params to all                      broadcast         N (global N·log P)
Data distribution / collection    One-to-many or many-to-one                        scatter / gather  N·(P-1)/P

Collective Communication Algorithms

The primitives above describe what communication needs to happen. The algorithms determine how the data physically moves through the network. The choice of algorithm affects bandwidth utilization, latency, and scalability.

Ring All-Reduce

The Ring algorithm is the most widely used algorithm for All-Reduce and is the default in NCCL for large messages.

Setup. Arrange P GPUs in a logical ring: GPU 0 → GPU 1 → … → GPU P-1 → GPU 0.

Phase 1: Reduce-Scatter. Each GPU splits its data into P chunks. Over P-1 steps, chunks are passed around the ring and reduced (summed) along the way. After this phase, each GPU holds the fully reduced version of one chunk.

Phase 2: All-Gather. The reduced chunks are passed around the ring again for P-1 steps. After this phase, every GPU has the complete reduced result.

Ring All-Reduce with 4 GPUs (simplified):

Step 0:  GPU0:[a0,a1,a2,a3]  GPU1:[b0,b1,b2,b3]  GPU2:[c0,c1,c2,c3]  GPU3:[d0,d1,d2,d3]
         Each GPU splits data into 4 chunks

Phase 1 (Reduce-Scatter) — 3 steps:
  Step 1: GPU0 sends a0→GPU1, GPU1 sends b1→GPU2, GPU2 sends c2→GPU3, GPU3 sends d3→GPU0
          Recipients sum the received chunk with their own
  Step 2: Continue rotating and summing...
  Step 3: Continue rotating and summing...
  Result: each GPU holds one fully reduced chunk — GPU0:[Σ1]  GPU1:[Σ2]  GPU2:[Σ3]  GPU3:[Σ0]

Phase 2 (All-Gather) — 3 steps:
  Rotate the reduced chunks around the ring.
  Result: Every GPU has [sum0, sum1, sum2, sum3]


Bandwidth analysis. Each GPU sends and receives 2 * (P-1)/P * data_size bytes total across both phases. As P grows, this approaches 2 * data_size. The critical insight is that the per-GPU communication volume is nearly independent of the number of GPUs — it scales with data size, not GPU count. This makes Ring All-Reduce highly scalable for large messages.

Latency. The ring requires 2 * (P-1) sequential steps, so latency grows linearly with P. For very large GPU counts, this becomes a problem.
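Both phases can be simulated directly. The sketch below follows the chunk-rotation pattern described above, with rank r forwarding a different chunk each step (NCCL's real implementation pipelines and overlaps transfers; this serial version only checks the data movement):

```python
# Step-by-step Ring All-Reduce on P simulated ranks.

def ring_all_reduce(bufs):
    P = len(bufs)                         # P GPUs, buffer split into P chunks
    bufs = [list(b) for b in bufs]
    # Phase 1: Reduce-Scatter. In step s, rank r forwards chunk (r - s) mod P
    # to its right neighbor, which adds it to its own partial sum.
    for step in range(P - 1):
        for rank in range(P):
            c = (rank - step) % P
            bufs[(rank + 1) % P][c] += bufs[rank][c]
    # Phase 2: All-Gather. Rank r now owns fully reduced chunk (r + 1) mod P;
    # rotate the finished chunks around the ring for P-1 more steps.
    for step in range(P - 1):
        for rank in range(P):
            c = (rank + 1 - step) % P
            bufs[(rank + 1) % P][c] = bufs[rank][c]
    return bufs

bufs = [[1, 2, 3, 4], [10, 20, 30, 40],
        [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result = ring_all_reduce(bufs)
print(result[0])   # [1111, 2222, 3333, 4444] on every rank
assert all(r == result[0] for r in result)
```

Counting the loop iterations confirms the analysis: 2·(P-1) steps, with each rank moving one chunk of size N/P per step, for 2·(P-1)/P·N total per GPU.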

Tree All-Reduce

Tree All-Reduce organizes GPUs in a binary tree. Data is reduced up the tree (children → parent) and then broadcast down (parent → children).

graph TD
  R0["GPU 0 (root)"]
  R1["GPU 1"] 
  R2["GPU 2"]
  R3["GPU 3"]
  R4["GPU 4"]
  R5["GPU 5"]
  R6["GPU 6"]

  R0 --- R1 & R2
  R1 --- R3 & R4
  R2 --- R5 & R6

  style R0 fill:#d4a574,color:#000

Latency. Tree All-Reduce completes in O(log P) steps — much better than Ring’s O(P) for large GPU counts.

Bandwidth. However, bandwidth utilization is worse: the root node becomes a bottleneck because it must receive data from all children and send the result back down. Only the leaves can fully utilize their bandwidth.

When to use it. Tree All-Reduce is preferred for small messages where latency dominates, or for very large GPU counts. NCCL automatically chooses between Ring and Tree based on message size and topology.

Recursive Halving-Doubling

This algorithm combines the best of Ring (bandwidth) and Tree (latency). It works in two phases:

  1. Recursive Halving (Reduce-Scatter). Pairs of GPUs exchange half their data and reduce. In each step, the active group halves and the data chunk doubles, achieving both O(log P) latency and near-optimal bandwidth.

  2. Recursive Doubling (All-Gather). The reverse process: groups double in size, exchanging reduced chunks until every GPU has the full result.

Complexity. Both phases take log P steps. Total communication volume per GPU is 2 * (P-1)/P * data_size — the same as Ring. But it achieves this in O(log P) latency instead of O(P).

Trade-off. Recursive Halving-Doubling requires more complex routing and works best when P is a power of 2. NCCL uses variants of this algorithm internally.

Algorithm Selection in Practice

NCCL automatically selects the best algorithm based on:

  • Message size: Small messages → Tree (latency-sensitive); Large messages → Ring (bandwidth-sensitive)
  • GPU count: More GPUs favor lower-latency algorithms
  • Topology: NVLink vs PCIe vs cross-node affects optimal chunk sizes

You rarely need to choose algorithms manually, but understanding them helps you reason about performance. When profiling distributed training, the key question is: “Is my communication bandwidth-limited or latency-limited?” The answer determines which algorithm (and which interconnect) matters.

Communication Topology: The Hardware Layer

The choice of communication algorithm is constrained by the physical interconnects between GPUs. The interconnect determines the raw bandwidth available for each link, and the topology determines how many hops a message must traverse.

NVLink is NVIDIA’s high-speed GPU-to-GPU interconnect. On the A100, each NVLink connection provides 25 GB/s per direction (50 GB/s bidirectional), and each GPU has 12 NVLink connections, totaling 600 GB/s bidirectional bandwidth.

NVLink is a direct point-to-point connection between two GPUs. Within a single server (e.g., a DGX A100 with 8 GPUs), all GPUs are interconnected via NVLink, enabling All-Reduce at nearly the full 600 GB/s aggregate bandwidth.

NVSwitch

In a DGX A100, the 8 GPUs are connected through NVSwitch — a switch fabric that provides full bisection bandwidth. Any GPU can communicate with any other GPU at the full NVLink rate without contention. This means that collective operations within a single node are extremely fast.

flowchart LR
  subgraph DGX["DGX A100 — Any GPU pair: 600 GB/s bidirectional"]
    direction LR
    subgraph LEFT[" "]
      direction TB
      G0["GPU 0"] ~~~ G1["GPU 1"] ~~~ G2["GPU 2"] ~~~ G3["GPU 3"]
    end
    NVS["NVSwitch\n× 6"]
    subgraph RIGHT[" "]
      direction TB
      G4["GPU 4"] ~~~ G5["GPU 5"] ~~~ G6["GPU 6"] ~~~ G7["GPU 7"]
    end
    LEFT <-->|"NVLink"| NVS <-->|"NVLink"| RIGHT
  end

PCIe

PCIe (Peripheral Component Interconnect Express) is the standard bus connecting GPUs to the CPU and to each other in consumer and some server setups. PCIe 4.0 x16 provides ~32 GB/s per direction (~64 GB/s bidirectional) — roughly an order of magnitude below NVLink’s 600 GB/s aggregate. PCIe 5.0 doubles this to ~64 GB/s per direction, but still far below NVLink speeds.

In systems without NVLink (e.g., consumer GPUs, some cloud instances), GPU-to-GPU communication must go through PCIe, often via the CPU (PCIe → CPU → PCIe), further increasing latency. This is why high-end training clusters always use NVLink.

RDMA and InfiniBand (Cross-Node Communication)

Within a single server, NVLink handles GPU communication. But large training runs span hundreds or thousands of GPUs across many servers. Cross-node communication uses the network:

  • InfiniBand (IB) is the dominant high-performance network for GPU clusters. A single HDR InfiniBand link provides 200 Gbps (~25 GB/s), and servers typically have 4-8 IB links, giving 100-200 GB/s per node.
  • RDMA (Remote Direct Memory Access) allows GPUs to read/write memory on remote GPUs without involving the CPU. NVIDIA’s GPUDirect RDMA enables direct NIC-to-GPU transfers, bypassing CPU memory entirely. This minimizes latency and maximizes bandwidth for cross-node communication.

The bandwidth hierarchy:

NVLink (intra-node):         600 GB/s bidirectional (A100)
InfiniBand (inter-node):     100-200 GB/s per node
PCIe (fallback):             32-64 GB/s

NVLink is 3-6x faster than InfiniBand
InfiniBand is 2-5x faster than PCIe
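Plugging these bandwidths into the Ring All-Reduce volume formula from earlier gives rough gradient-sync times. The helper is illustrative: it treats the quoted figures as achievable link bandwidth and ignores latency, protocol overhead, and overlap with compute:

```python
# Rough time to all-reduce a 7B model's BF16 gradients (14 GB) over
# different interconnects, using the Ring volume 2*(P-1)/P * N per GPU.

def allreduce_seconds(n_bytes: float, bw: float, P: int = 8) -> float:
    """Idealized Ring All-Reduce time: per-GPU traffic / link bandwidth."""
    return 2 * (P - 1) / P * n_bytes / bw

GRAD_BYTES = 7e9 * 2   # 7B parameters in BF16
for name, bw in [("NVLink (~600 GB/s)", 600e9),
                 ("InfiniBand (~200 GB/s)", 200e9),
                 ("PCIe (~64 GB/s)", 64e9)]:
    ms = allreduce_seconds(GRAD_BYTES, bw) * 1e3
    print(f"{name:24s} {ms:7.1f} ms")
```

Even under these ideal assumptions, the same gradient sync that takes tens of milliseconds over NVLink takes hundreds over PCIe — which is why communication-heavy strategies are pinned to the fastest interconnect available.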

This hierarchy is why distributed training systems are designed with topology awareness:

  • Operations within a node (e.g., Tensor Parallel) use NVLink
  • Operations across nodes (e.g., Data Parallel, Pipeline Parallel) use InfiniBand
  • The most communication-intensive strategies (TP) are always placed intra-node

The H100 and B200 generations further increase NVLink bandwidth (900 GB/s and 1.8 TB/s respectively) and introduce NVLink Network for multi-node NVLink, blurring the intra/inter-node boundary.


Summary and What’s Next

This article established the two foundational pillars of LLM infrastructure:

Memory. The GPU memory hierarchy (Registers → Shared Memory → L2 → HBM) creates a bandwidth pyramid where HBM, despite its name, is the bottleneck at ~2 TB/s. LLM workloads, especially inference, are overwhelmingly memory-bound. The 4x memory multiplier with Adam (parameters + gradients + 2 optimizer states) means a 70B model needs over 1 TB for training — far exceeding any single GPU.

Communication. Eight NCCL primitives (All-Reduce, All-Gather, Reduce-Scatter, Broadcast, Scatter, Gather, Send/Recv, All-to-All) form the vocabulary of distributed training. Each primitive maps to specific training strategies: All-Reduce for DDP, All-Gather and Reduce-Scatter for FSDP, Send/Recv for Pipeline Parallel, All-to-All for Expert Parallel. Ring, Tree, and Recursive Halving-Doubling algorithms implement these primitives, with NCCL selecting the best algorithm automatically based on message size and topology.

In the next article, we will build on this foundation to explore the full landscape of distributed parallelism strategies: DDP, FSDP/FSDP2, Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Expert Parallelism, and Context Parallelism. Every strategy is a specific answer to the question: “How do we split memory and coordinate communication across GPUs?” With the primitives from this article, you will be able to understand each strategy at the protocol level, not just the high-level concept.


References

  1. NVIDIA A100 Tensor Core GPU Architecture Whitepaper. NVIDIA, 2020.
  2. Jia, Z., Maggioni, M., Staiger, B., & Scarpazza, D. P. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv:1804.06826, 2018.
  3. Williams, S., Waterman, A., & Patterson, D. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM, 52(4):65-76, 2009.
  4. NCCL Documentation. NVIDIA. https://docs.nvidia.com/deeplearning/nccl/
  5. Thakur, R., Rabenseifner, R., & Gropp, W. “Optimization of Collective Communication Operations in MPICH.” International Journal of High Performance Computing Applications, 19(1):49-66, 2005.
  6. Patarasuk, P. & Yuan, X. “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.” Journal of Parallel and Distributed Computing, 69(2):117-124, 2009.
  7. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC20, 2020.
  8. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS, 2022.
  9. NVIDIA DGX A100 System Architecture Whitepaper. NVIDIA, 2020.
  10. Li, S., Zhao, Y., Varma, R., et al. “PyTorch Distributed: Experiences on Accelerating Data Parallel Training.” VLDB, 2020.