Motivation

If you have been following this series, you now understand GPU memory hierarchies, collective communication primitives, distributed parallelism strategies, and inference serving architectures. RLHF (Reinforcement Learning from Human Feedback) brings all of these concerns together in a single training pipeline — and adds several new ones.

A common misconception is that RLHF is “just add a reward model to your training loop.” In practice, RLHF requires running four large neural networks simultaneously, orchestrating a complex data flow between them, managing a generation phase that is fundamentally an inference problem embedded inside a training loop, and keeping weight copies synchronized across GPU groups. The algorithmic challenge of PPO is real, but the dominant engineering difficulty is systems design.

A 7B-parameter RLHF setup consumes roughly 112 GB of GPU memory for weights and optimizer states alone — roughly 4x what supervised fine-tuning (SFT) requires for the same model. At 70B parameters, you need a minimum of 14 A100-80GB GPUs just to hold the weights and optimizer states, and realistically 64-128 GPUs for reasonable training throughput. This is not an algorithm you can prototype on a single GPU; it is a distributed systems problem from day one.

This article takes a systems perspective on RLHF. We start with the four-model architecture, walk through the PPO data flow step by step, quantify the memory and compute costs, and then examine how production frameworks — particularly verl (Volcano Engine RL) — solve these challenges through hybrid parallelism strategies. By the end, you will understand not just what RLHF does, but why it is hard to run efficiently at scale.

The Four Models of RLHF

RLHF is unique among training methods in that it requires four distinct models to be loaded and executed during every training iteration. Understanding their roles, update rules, and memory footprints is the first step to understanding why RLHF is a systems problem.

Actor (Policy Model)

The Actor is the LLM you are actually training. It generates responses to prompts and is updated via PPO to maximize reward while staying close to its original behavior. In production, this is your LLaMA, Qwen, or Mistral checkpoint.

The Actor participates in two fundamentally different phases:

  1. Generation phase: autoregressive sampling — this is an inference workload (memory-bound, benefits from KV cache and tensor parallelism).
  2. Training phase: forward pass to compute new log probabilities, followed by a backward pass to update weights via PPO — this is a standard training workload (compute-bound, benefits from FSDP).

This dual nature is one of the core reasons RLHF is a systems challenge: the Actor needs different parallelism strategies for different phases.
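
The two phases can be sketched against a toy stand-in for the Actor (a minimal illustration with made-up dimensions, not the companion code):

```python
import torch
import torch.nn as nn

# Toy stand-in for the Actor: embedding + LM head over a 100-token vocab.
actor = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))
optim = torch.optim.Adam(actor.parameters(), lr=1e-4)
prompt = torch.randint(0, 100, (1, 4))

# Phase 1 — generation: inference-only, token-by-token autoregressive sampling.
ids = prompt
with torch.no_grad():
    for _ in range(8):
        logits = actor(ids)[:, -1, :]                       # next-token logits
        next_id = torch.multinomial(logits.softmax(-1), 1)  # sample one token
        ids = torch.cat([ids, next_id], dim=1)

# Phase 2 — training: a single forward + backward over the whole sequence.
logits = actor(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), ids[:, 1:].reshape(-1))
loss.backward()
optim.step()
```

Generation never builds an autograd graph and touches the weights once per token; training touches them once per sequence — which is why, at scale, the two phases want different parallelism layouts.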

Reference Model

The Reference Model is a frozen copy of the Actor’s initial weights, taken before any RLHF training begins. It is never updated, but it must run a full forward pass on every batch to produce per-token log probabilities.

Its purpose is to compute a KL divergence penalty that prevents the Actor from drifting too far from its original behavior. Without this constraint, the Actor quickly learns to “hack” the Reward Model — generating degenerate outputs that achieve high reward scores but are incoherent or repetitive. The KL penalty acts as an anchor, keeping the Actor’s distribution close to a known-good baseline.

Despite being frozen, the Reference Model still consumes the same memory as a full model copy. It requires no optimizer states, but its parameters must be resident in GPU memory for forward passes.

Reward Model

The Reward Model takes a (prompt, response) pair and outputs a scalar reward score. It is trained separately on human preference data — given pairs of responses where a human annotator has indicated which one is better, the model learns to assign higher scores to preferred outputs.

Architecturally, the Reward Model typically shares the same Transformer backbone as the base LLM, but replaces the language modeling head with a value head — a linear projection from the final hidden state to a single scalar:

class RewardModel(nn.Module):
    def __init__(self, vocab_size, d_model, ...):
        super().__init__()
        # Same Transformer backbone as the Actor
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList([...])
        self.ln_f = nn.LayerNorm(d_model)

        # Value head replaces the language modeling head
        self.value_head = nn.Linear(d_model, 1, bias=False)

    def forward(self, input_ids):
        # ... Transformer forward pass producing x: (batch, seq_len, d_model) ...
        last_hidden = x[:, -1, :]  # Final-token representation of the sequence
        reward = self.value_head(last_hidden).squeeze(-1)  # Scalar per sequence
        return reward

The Reward Model is frozen during RLHF training — it only performs inference.
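
Freezing is a one-liner in PyTorch; a generic sketch (stand-in module, not the companion code):

```python
import torch.nn as nn

reward_model = nn.Linear(16, 1)     # stand-in for the full RewardModel
reward_model.requires_grad_(False)  # parameters receive no gradients
reward_model.eval()                 # inference mode: disables dropout, etc.
```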

Critic (Value Model)

The Critic estimates $V(s)$ — the expected future reward at each token position in the response. This is essential for computing advantages via GAE (Generalized Advantage Estimation), which tell the PPO algorithm “how much better (or worse) was this action compared to what we expected?”

Unlike the Reward Model, which produces one scalar per sequence, the Critic produces a per-token value estimate:

class CriticModel(nn.Module):
    def __init__(self, vocab_size, d_model, ...):
        super().__init__()
        # Same Transformer backbone as the Actor (elided), but per-token output
        self.value_head = nn.Linear(d_model, 1, bias=False)

    def forward(self, input_ids):
        # ... Transformer forward pass producing x: (batch, seq_len, d_model) ...
        values = self.value_head(x).squeeze(-1)  # (batch, seq_len)
        return values

The Critic is trained alongside the Actor during RLHF — it has its own optimizer and receives gradient updates. It is often initialized from the Reward Model’s weights, since both models estimate a form of “how good is this partial response.”
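
Initializing the Critic from the Reward Model amounts to a state-dict copy when the two share module names. A toy sketch (stand-in modules with hypothetical shapes, not the companion code):

```python
import torch.nn as nn

# Tiny stand-ins that share a backbone layout; in practice these would be
# the RewardModel / CriticModel classes sketched above.
reward_model = nn.ModuleDict(
    {"backbone": nn.Linear(8, 8), "value_head": nn.Linear(8, 1, bias=False)}
)
critic = nn.ModuleDict(
    {"backbone": nn.Linear(8, 8), "value_head": nn.Linear(8, 1, bias=False)}
)

# Copy every matching parameter; strict=False tolerates keys that differ
# between the two models (e.g. if one head is shaped differently).
critic.load_state_dict(reward_model.state_dict(), strict=False)
```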

Memory Breakdown

The memory cost of running all four models simultaneously is what makes RLHF so resource-intensive. Here is the breakdown for FP16 weights:

Model              Weights    Optimizer States    Total
─────────────────────────────────────────────────────────
Actor              2 * P      2 * 2 * P (Adam)    6P
Critic             2 * P      2 * 2 * P (Adam)    6P
Reference          2 * P      0 (frozen)          2P
Reward Model       2 * P      0 (frozen)          2P
─────────────────────────────────────────────────────────
TOTAL              8 * P      8 * P               16P

P = parameter count. FP16 weights occupy 2P bytes; Adam keeps two moment tensors (counted here at 2 bytes each) for the trained models, adding 2 * 2 * P bytes.

For a 7B model:

Component                          Memory
─────────────────────────────────────────
Per model (FP16 weights)           14 GB
4 models (weights only)            56 GB
+ Adam states for Actor + Critic   112 GB
vs SFT (1 model + optimizer)       ~28 GB

RLHF requires roughly 4x the memory of SFT at the same model scale. At 70B, you need over 1,100 GB just for weights and optimizer states — a minimum of 14 A100-80GB GPUs, and realistically 64+ for reasonable throughput with activation memory and communication buffers.
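
The table's arithmetic can be reproduced in a few lines, using the same simplified accounting as above (FP16 weights, FP16 Adam moments, decimal GB; the helper name is ours):

```python
GB = 1e9  # decimal gigabytes, matching the figures above

def rlhf_memory_gb(param_count: int) -> dict:
    weights = 2 * param_count   # FP16 weights, per model
    adam = 2 * 2 * param_count  # two Adam moment tensors, trained models only
    return {
        "actor": (weights + adam) / GB,          # 6P
        "critic": (weights + adam) / GB,         # 6P
        "reference": weights / GB,               # 2P
        "reward": weights / GB,                  # 2P
        "total": (4 * weights + 2 * adam) / GB,  # 16P
    }

print(rlhf_memory_gb(7_000_000_000)["total"])  # 112.0
```

Note that this excludes activations, KV cache, and communication buffers, which is why the realistic GPU counts quoted above are well beyond the bare minimum.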


PPO Data Flow

Every PPO iteration in RLHF consists of two distinct phases: a rollout phase (experience collection) and a training phase (policy update). Understanding this data flow is critical for reasoning about where system bottlenecks arise.

Phase 1: Rollout (Experience Collection)

The rollout phase collects “experience” — the data that the PPO algorithm will learn from. It involves all four models and proceeds in a strict sequential order:

flowchart TD
    subgraph Rollout["Rollout Phase (all inference, no gradients)"]
        S1["**Step 1: GENERATION**<br/>Prompts → Actor (autoregressive) → Responses"]
        S1 --> S2 & S3a & S3b & S4
        S2["**Step 2: REWARD SCORING**<br/>(Prompt+Response) → Reward Model → Scalar reward"]
        S3a["**Step 3a: ACTOR LOG PROBS**<br/>(Prompt+Response) → Actor → log probs π_θ"]
        S3b["**Step 3b: REFERENCE LOG PROBS**<br/>(Prompt+Response) → Reference → log probs π_ref"]
        S4["**Step 4: VALUE ESTIMATION**<br/>(Prompt+Response) → Critic → per-token V(s)"]
        S2 --> S5
        S3a --> S5
        S3b --> S5
        S4 --> S5
        S5["**Step 5: ADVANTAGE COMPUTATION**<br/>(Rewards, KL penalties, Values) → GAE → Advantages"]
    end

    style S1 fill:#cce5ff,stroke:#007bff
    style S2 fill:#fff3cd,stroke:#ffc107
    style S3a fill:#fff3cd,stroke:#ffc107
    style S3b fill:#fff3cd,stroke:#ffc107
    style S4 fill:#fff3cd,stroke:#ffc107
    style S5 fill:#d4edda,stroke:#28a745

Notice that every step in the rollout phase is an inference workload. No gradients are computed; all four models run in torch.no_grad() mode. The generation step (Step 1) is particularly expensive because it is autoregressive — each token requires a full forward pass through the Actor.

The companion code demonstrates this pipeline:

# Step 1: Generate responses (inference — no gradients)
with torch.no_grad():
    full_ids = generate_responses(actor, prompt_ids, response_len, temperature)

# Step 2-3: Score with Reward Model + compute KL
with torch.no_grad():
    rewards = reward_model(full_ids)
    old_log_probs = actor.get_log_probs(full_ids)[:, prompt_len - 1:]
    ref_log_probs = reference.get_log_probs(full_ids)[:, prompt_len - 1:]
    kl_penalties = kl_coeff * (old_log_probs - ref_log_probs)

# Step 4-5: Compute values and GAE advantages
with torch.no_grad():
    values = critic(full_ids)[:, prompt_len - 1:-1]
    advantages, returns = compute_advantages_gae(rewards, kl_penalties, values)

The KL Penalty

The KL divergence between the Actor and Reference distributions is the critical safety mechanism of RLHF. Without it, the Actor degenerates within a few hundred iterations. The per-token KL penalty is computed as:

$$D_{KL}(\pi_\theta \,\|\, \pi_\text{ref}) \approx \log \pi_\theta(a_t | s_t) - \log \pi_\text{ref}(a_t | s_t)$$

This is scaled by a coefficient $\beta$ (typically 0.01 - 0.2) and subtracted from the reward:

$$r_t^{\text{adjusted}} = r_t - \beta \cdot D_{KL}$$

The coefficient $\beta$ is often adaptively tuned during training: if the KL divergence exceeds a target threshold, $\beta$ increases to pull the Actor back; if KL is below the target, $\beta$ decreases to allow more exploration.
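
One common form of this adaptive scheme is the proportional controller from Ziegler et al.'s RLHF work; a minimal sketch (class and argument names are our own):

```python
class AdaptiveKLController:
    """Adjust beta so the observed KL tracks a target value."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, batch_size: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] for stability
        error = min(max(observed_kl / self.target - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * batch_size / self.horizon
        return self.beta

ctl = AdaptiveKLController()
ctl.update(observed_kl=12.0, batch_size=512)  # KL above target -> beta rises
```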

Generalized Advantage Estimation (GAE)

GAE computes per-token advantages that tell PPO how to update the policy. The advantage $\hat{A}_t$ at token position $t$ answers: “how much better was the action taken here compared to the Critic’s expectation?”

$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

The $\lambda$ parameter controls a bias-variance tradeoff:

  • $\lambda = 1$: high variance, low bias (equivalent to Monte Carlo returns)
  • $\lambda = 0$: low variance, high bias (one-step TD)
  • $\lambda = 0.95$: the standard choice, a good balance

In practice, only the last token of the response receives the Reward Model’s scalar reward. All other tokens receive only the KL penalty. GAE then propagates this terminal reward backward through the sequence:

# Per-token rewards: KL penalty at every position, reward at the last token
per_token_rewards = -kl_penalties.clone()
per_token_rewards[:, -1] += rewards  # Sequence reward at final token

# GAE: propagate rewards backward through time
advantages = torch.zeros_like(values)
last_gae = torch.zeros(B, device=device)

for t in reversed(range(T)):
    next_value = values[:, t + 1] if t < T - 1 else torch.zeros(B, device=device)
    delta = per_token_rewards[:, t] + gamma * next_value - values[:, t]
    last_gae = delta + gamma * lam * last_gae
    advantages[:, t] = last_gae

returns = advantages + values  # Training target for the Critic

Phase 2: PPO Update (Training)

With advantages computed, the training phase updates the Actor and Critic. PPO’s key innovation is that you can reuse the same rollout data for multiple gradient updates, as long as you clip the policy ratio to prevent the Actor from changing too drastically in one step.

Actor Loss (Clipped Surrogate Objective):

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)} = \exp(\log \pi_\theta - \log \pi_{\theta_\text{old}})$ is the probability ratio.

The clipping mechanism is elegant in its simplicity:

  • When $\hat{A}_t > 0$ (good action): we want to increase $r_t$, but clipping caps it at $1 + \epsilon$, preventing overcommitment.
  • When $\hat{A}_t < 0$ (bad action): we want to decrease $r_t$, but clipping caps it at $1 - \epsilon$, preventing overcorrection.
def ppo_actor_loss(actor, full_ids, prompt_len, old_log_probs, advantages, clip_eps):
    new_log_probs = actor.get_log_probs(full_ids)[:, prompt_len - 1:]
    ratio = torch.exp(new_log_probs - old_log_probs.detach())

    # Normalize advantages for stability
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped surrogate loss
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    loss = -torch.min(surr1, surr2).mean()
    return loss, ratio

Critic Loss: a straightforward MSE between the Critic’s value predictions and the GAE returns (advantages + old values):

$$L_\text{critic} = \frac{1}{T}\sum_t \left(V_\phi(s_t) - \hat{R}_t\right)^2$$
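
For completeness, a sketch of what `ppo_critic_loss` (called in the loop below) might look like — an assumption mirroring the slicing convention of the rollout code above, not the companion file's exact text:

```python
import torch.nn.functional as F

def ppo_critic_loss(critic, full_ids, prompt_len, returns):
    # Re-score the rollout sequences; keep only response-token values,
    # matching the slicing used during the rollout phase
    values = critic(full_ids)[:, prompt_len - 1:-1]
    # MSE against the GAE returns (advantages + old values)
    return F.mse_loss(values, returns.detach())
```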

Training Loop: for each PPO iteration, we typically run 2-4 gradient update epochs on the same rollout data:

for epoch in range(ppo_epochs):
    # Actor update
    actor_optim.zero_grad()
    a_loss, ratio = ppo_actor_loss(actor, full_ids, prompt_len,
                                    old_log_probs, advantages, clip_eps)
    a_loss.backward()
    torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
    actor_optim.step()

    # Critic update
    critic_optim.zero_grad()
    c_loss = ppo_critic_loss(critic, full_ids, prompt_len, returns)
    c_loss.backward()
    torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
    critic_optim.step()

Only the Actor and Critic receive gradient updates. The Reference and Reward Model remain frozen throughout.


Why RLHF Is a Systems Problem

With the data flow clear, let us now enumerate the specific systems challenges that distinguish RLHF from standard SFT training.

Challenge 1: 4x Memory Footprint

As quantified earlier, RLHF needs roughly 4x the GPU memory of SFT at the same model scale. This is not just a matter of buying more GPUs — it fundamentally changes the parallelism strategy. For a 7B model:

  • SFT: fits on a single A100-80GB with basic mixed precision. No sharding required.
  • RLHF: requires at minimum 4 A100-80GB GPUs with FSDP sharding across all four models.

At 70B, the gap widens further. SFT can work with 4-8 GPUs using FSDP; RLHF needs 64+ GPUs with careful placement and scheduling.

Challenge 2: Compute Heterogeneity

The rollout phase and training phase have fundamentally different compute characteristics:

Phase               Nature           Bottleneck        Best Parallelism
─────────────────────────────────────────────────────────────────────────
Generation          Inference        Memory bandwidth  Tensor Parallelism
  (autoregressive)  (sequential)     (KV cache I/O)    (low latency)

Reward scoring      Inference        Memory bandwidth  Tensor Parallelism
Reference log probs Inference        Memory bandwidth  Tensor Parallelism

Actor training      Training         Compute (matmul)  FSDP / DDP
Critic training     Training         Compute (matmul)  FSDP / DDP

A naive approach — using the same parallelism strategy for both phases — leaves significant performance on the table. Generation with FSDP requires an all-gather of full weights before every layer, which is wasteful for the small batch sizes typical of autoregressive decoding. Training with TP wastes communication bandwidth on all-reduce operations that are unnecessary when you can shard optimizer states instead.

The optimal strategy is to switch parallelism modes between phases: use TP for generation (low latency per token) and FSDP for training (memory-efficient parameter sharding). This is exactly what verl’s hybrid engine does, and it is the key architectural insight of modern RLHF systems.

Challenge 3: Complex Data Dependencies

The data flow through four models creates a strict dependency chain:

flowchart LR
    Gen["Generation"] --> Score["Reward Scoring"]
    Gen --> |"needs Actor weights"| Gen
    Score --> KL["KL Computation"]
    KL --> |"needs Ref weights"| KL
    KL --> GAE
    GAE --> |"needs Critic values"| GAE
    GAE --> PPO["PPO Update"]

    style Gen fill:#cce5ff,stroke:#007bff
    style Score fill:#fff3cd,stroke:#ffc107
    style KL fill:#fff3cd,stroke:#ffc107
    style GAE fill:#d4edda,stroke:#28a745
    style PPO fill:#d4edda,stroke:#28a745

You cannot score responses before they are generated. You cannot compute advantages before you have rewards, KL penalties, and Critic values. And you cannot start the next PPO iteration until the Actor weights have been updated and synchronized to the generation engine.

In a distributed setting, this means some GPU groups are idle while others work. If the Reward Model runs on separate GPUs from the Actor, those GPUs sit idle during generation and training. If the Actor generates on the same GPUs it trains on, you need to manage memory carefully — generation may require gathering full model weights, which temporarily doubles the memory requirement.

Challenge 4: Weight Synchronization

After each PPO update to the Actor, the generation engine must use the new weights for the next rollout. In a colocated setup (all models on the same GPUs), this happens automatically. But in a separated setup, or when using a dedicated inference engine like vLLM for generation, the updated weights must be explicitly transferred.

For a 7B model in FP16, this is a 14 GB transfer — trivial over NVLink but costly over PCIe or network interconnects. For 70B models, it is 140 GB, which at 25 GB/s (PCIe Gen4 x16) takes over 5 seconds. Systems must overlap this transfer with other computation or use architectural choices (colocation) to avoid it entirely.
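
These transfer times follow from a one-line calculation (the bandwidth figures are rough effective rates, not vendor peaks):

```python
def sync_seconds(param_count: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    # Time to move one full weight copy at a given effective bandwidth
    return param_count * bytes_per_param / (bandwidth_gb_s * 1e9)

print(f"{sync_seconds(70e9, 2, 25):.1f} s")  # 70B FP16 over PCIe Gen4 x16: 5.6 s
print(f"{sync_seconds(7e9, 2, 300):.2f} s")  # 7B FP16 over NVLink (~300 GB/s): 0.05 s
```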

Compute Cost Comparison

Putting it all together, here is how the compute cost of one RLHF iteration compares to one SFT step:

Operation                           SFT      RLHF       Ratio
──────────────────────────────────────────────────────────────
Forward pass (Actor)                1x       2x         2x
Backward pass (Actor)              1x       1x         1x
Autoregressive generation          0        N tokens   Nx
Forward pass (Reference)           0        1x         +1x
Forward pass (Reward Model)        0        1x         +1x
Forward pass (Critic)              0        2x         +2x
Backward pass (Critic)             0        1x         +1x
GAE computation                    0        1x         +1x
──────────────────────────────────────────────────────────────
TOTAL (approximate)                ~2x      ~10-16x    5-8x

The autoregressive generation step dominates. Without KV caching, generating $T$ response tokens requires $T$ sequential forward passes through the Actor. Even with KV caching, each step is memory-bandwidth-bound with poor compute utilization. This single phase can account for 50-70% of total RLHF iteration time.


The verl Architecture

verl (Volcano Engine RL) is an open-source RLHF framework that addresses the systems challenges above through a hybrid engine design. Its core insight is that you can colocate all four models on the same GPU group and dynamically switch between parallelism strategies depending on the current phase.

Design Principles

verl’s architecture rests on three key ideas:

  1. Colocation over separation: all four models share the same GPUs, eliminating cross-group data transfer.
  2. Hybrid parallelism: FSDP for training phases, TP for generation phases, with seamless switching between modes.
  3. Worker-based scheduling: a central controller dispatches work to model-specific workers, managing the phase transitions.

flowchart TD
    subgraph GPU["GPU Group (N GPUs) — verl Colocated Mode"]
        subgraph Models["All 4 Models (FSDP sharded)"]
            direction LR
            Actor["Actor<br/>(FSDP sharded)"]
            Critic["Critic<br/>(FSDP sharded)"]
            Reward["Reward<br/>(FSDP sharded)"]
            Reference["Reference<br/>(FSDP sharded)"]
        end

        subgraph Phases["Phase Transitions"]
            Train["**Training mode:**<br/>FSDP (all-gather params,<br/>reduce-scatter grads)"]
            Gen["**Generation mode:**<br/>Gather full weights →<br/>Tensor Parallelism"]
        end

        Models --> Phases
    end

    style Actor fill:#d4edda,stroke:#28a745
    style Critic fill:#d4edda,stroke:#28a745
    style Reward fill:#fff3cd,stroke:#ffc107
    style Reference fill:#fff3cd,stroke:#ffc107
    style Train fill:#d4edda,stroke:#28a745
    style Gen fill:#cce5ff,stroke:#007bff

The Hybrid Engine: FSDP-TP Mode Switching

The most important architectural decision in verl is how it switches between FSDP and TP modes for the Actor model. The challenge: FSDP shards parameters across GPUs by flattening and partitioning them, while TP shards parameters by slicing weight matrices along specific dimensions (column-parallel for the first linear layer, row-parallel for the second).

verl handles this by:

  1. FSDP mode (training): parameters are sharded using PyTorch’s FSDP (ZeRO Stage 3). Each GPU holds $1/N$ of the flattened parameter tensor. During forward/backward passes, parameters are all-gathered on demand and gradients are reduce-scattered after the backward pass.

  2. TP mode (generation): FSDP shards are gathered to reconstruct full parameters, then resharded along TP dimensions. For a weight matrix $W \in \mathbb{R}^{d \times 4d}$ in an FFN layer, GPU $i$ holds columns $W[:, i \cdot 4d/N : (i+1) \cdot 4d/N]$. Generation proceeds with standard TP communication (all-reduce after each layer).

  3. Transition cost: the FSDP-to-TP reshape requires an all-gather of the full parameters followed by a local slice. For a 7B model, this is a 14 GB all-gather — a few tens of milliseconds over NVLink, which is negligible compared to the seconds-long generation phase.
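
The FSDP→TP round trip can be illustrated single-process with plain tensors (a toy sketch; a real implementation would use torch.distributed collectives on FSDP's flat parameters):

```python
import torch

world_size = 4
W = torch.randn(64, 256)  # full FFN weight, d x 4d

# FSDP view: each rank holds 1/N of the flattened parameter.
flat_shards = torch.chunk(W.flatten(), world_size)

# Generation-time transition: all-gather the shards, rebuild the full matrix...
full_W = torch.cat(flat_shards).view(64, 256)

# ...then each rank keeps only its column slice for tensor parallelism.
cols = 256 // world_size
tp_slices = [full_W[:, r * cols:(r + 1) * cols] for r in range(world_size)]

assert torch.equal(torch.cat(tp_slices, dim=1), W)  # lossless round trip
```

After generation, the TP slices are discarded and the FSDP shards remain the single source of truth for the weights.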

flowchart LR
    subgraph FSDP["FSDP Shards"]
        direction TB
        S0["GPU 0: shard_0"]
        S1["GPU 1: shard_1"]
        S2["GPU 2: shard_2"]
        S3["GPU 3: shard_3"]
    end

    FSDP --> |"all-gather"| Full["full_W"]
    Full --> |"slice"| TP

    subgraph TP["TP Slices (for generation)"]
        direction TB
        T0["GPU 0: W_col_0"]
        T1["GPU 1: W_col_1"]
        T2["GPU 2: W_col_2"]
        T3["GPU 3: W_col_3"]
    end

    Note["After generation, discard TP slices<br/>and return to FSDP shards"]

    TP --> Note

    style FSDP fill:#d4edda,stroke:#28a745
    style Full fill:#fff3cd,stroke:#ffc107
    style TP fill:#cce5ff,stroke:#007bff

Weight Update Flow

The weight update cycle in verl for one PPO iteration looks like this:

  1. Gather Actor weights (FSDP all-gather → TP slice) for generation
  2. Generate responses using TP-sharded Actor
  3. Discard TP slices, return to FSDP shards
  4. Forward passes through Reward Model, Reference, Critic (FSDP mode, no TP needed for non-autoregressive inference)
  5. Compute advantages (local computation, no communication)
  6. PPO update on Actor and Critic (FSDP mode: all-gather for forward, reduce-scatter for backward)
  7. Actor weights are now updated in FSDP shards — no explicit sync needed because generation will re-gather them next iteration

The key optimization here is that weight synchronization is free in the colocated design. The generation engine and training engine share the same parameter storage, so after the PPO update, the next generation phase simply re-gathers the (now updated) FSDP shards.

Resource Scheduling

verl uses a single-controller, multi-worker architecture. The controller orchestrates the PPO iteration by sending commands to workers:

flowchart TD
    subgraph Controller["Controller (single process) — per PPO iteration"]
        S1["1. Send 'generate' to Actor workers"]
        S2["2. Send 'score' to Reward workers"]
        S3["3. Send 'log_probs' to Reference workers"]
        S4["4. Send 'value' to Critic workers"]
        S5["5. Compute advantages (local)"]
        S6["6. Send 'train' to Actor workers<br/>(multiple PPO epochs)"]
        S7["7. Send 'train' to Critic workers<br/>(multiple PPO epochs)"]
        S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7
    end

    style S1 fill:#cce5ff,stroke:#007bff
    style S2 fill:#fff3cd,stroke:#ffc107
    style S3 fill:#fff3cd,stroke:#ffc107
    style S4 fill:#fff3cd,stroke:#ffc107
    style S5 fill:#d4edda,stroke:#28a745
    style S6 fill:#d4edda,stroke:#28a745
    style S7 fill:#d4edda,stroke:#28a745

In colocated mode, each worker manages multiple model roles on the same GPU. The controller ensures that only one model is active at a time, preventing memory contention. In separated mode, each worker owns a specific model and the controller handles data routing between groups.

verl supports micro-batching within each phase to handle cases where the full batch does not fit in GPU memory. For example, if the generation batch size is 512 but each GPU can only hold 64 sequences for generation (due to KV cache memory), verl splits the batch into 8 micro-batches and processes them sequentially.
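
The micro-batching idea is simple to sketch (a toy illustration; verl's actual scheduler API differs):

```python
import torch

def run_in_microbatches(fn, batch, micro_batch_size):
    # Split along the batch dimension and process chunks sequentially,
    # trading wall-clock time for lower peak memory.
    outputs = [fn(chunk) for chunk in torch.split(batch, micro_batch_size)]
    return torch.cat(outputs)

batch = torch.randn(512, 16)
out = run_in_microbatches(lambda x: x * 2, batch, micro_batch_size=64)  # 8 chunks
assert out.shape == (512, 16)
```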


Distributed RLHF Strategies

At scale, RLHF requires distributed execution across many GPUs. There are two fundamental approaches to placing the four models, each with distinct tradeoffs.

Colocated Strategy

In the colocated approach, all four models reside on the same set of GPUs, each sharded via FSDP:

flowchart TD
    subgraph Colocated["Colocated Deployment (8 GPUs)"]
        Config["GPU 0-7: Actor (FSDP) + Critic (FSDP) + Reward (FSDP) + Reference (FSDP)"]

        subgraph Timeline["Timeline of one PPO iteration"]
            direction LR
            P1["Generation<br/>(Actor, TP)"]
            P2["Score<br/>(RM)"]
            P3["Ref LP<br/>(Ref)"]
            P4["Values<br/>(Crit)"]
            P5["Actor PPO<br/>(FSDP)"]
            P6["Critic PPO<br/>(FSDP)"]
            P1 --> P2 --> P3 --> P4 --> P5 --> P6
        end

        Config --> Timeline
        Note["All 8 GPUs active in every phase<br/>(different model active each phase)"]
    end

    style P1 fill:#cce5ff,stroke:#007bff
    style P2 fill:#fff3cd,stroke:#ffc107
    style P3 fill:#fff3cd,stroke:#ffc107
    style P4 fill:#fff3cd,stroke:#ffc107
    style P5 fill:#d4edda,stroke:#28a745
    style P6 fill:#d4edda,stroke:#28a745

Advantages:

  • No cross-group data transfer — responses, rewards, and log probabilities stay on the same GPUs
  • All GPUs are utilized in every phase
  • Simple scheduling — one sequential pipeline

Disadvantages:

  • Peak memory pressure is high — all four models must have their FSDP shards resident simultaneously
  • Cannot independently tune parallelism per model (e.g., more TP for Actor, less for Critic)
  • If one model is much larger than the others, memory allocation is unbalanced

Separated Strategy

In the separated approach, each model gets its own dedicated GPU group:

flowchart TD
    subgraph Separated["Separated Deployment (32 GPUs)"]
        subgraph Groups["GPU Allocation"]
            direction LR
            AG["GPUs 0-15: Actor<br/>(FSDP + TP hybrid)"]
            CG["GPUs 16-23: Critic<br/>(FSDP)"]
            RG["GPUs 24-27: Reward<br/>(TP, inference only)"]
            RefG["GPUs 28-31: Reference<br/>(TP, inference only)"]
        end

        subgraph Timeline["Timeline of one PPO iteration"]
            direction TB
            AT["Actor GPUs: Generation → idle → PPO Training"]
            CT["Critic GPUs: idle → Value compute → PPO Training"]
            RT["Reward GPUs: idle → Score → idle"]
            RefT["Ref GPUs: idle → Log probs → idle"]
        end

        Groups --> Timeline
    end

    style AG fill:#d4edda,stroke:#28a745
    style CG fill:#d4edda,stroke:#28a745
    style RG fill:#fff3cd,stroke:#ffc107
    style RefG fill:#fff3cd,stroke:#ffc107
    style AT fill:#cce5ff,stroke:#007bff

Advantages:

  • Each model has full GPU memory available — no sharing pressure
  • Can use different parallelism strategies optimized for each model’s workload
  • The Actor can use more GPUs for faster generation

Disadvantages:

  • Significant GPU idle time — Reward and Reference GPUs sit idle during generation and training
  • Data must be transferred between GPU groups (responses, rewards, log probabilities)
  • More complex scheduling and synchronization

Communication Patterns

The communication requirements differ significantly between phases:

Phase                     Communication                    Pattern
─────────────────────────────────────────────────────────────────────
Generation (TP)           All-reduce per layer             Latency-bound
Generation (FSDP→TP)     All-gather for weight reshape    One-time cost
Actor training (FSDP)    All-gather + reduce-scatter      Bandwidth-bound
Critic training (FSDP)   All-gather + reduce-scatter      Bandwidth-bound
Reward scoring            Broadcast prompts+responses      One-time cost
Reference log probs       Broadcast prompts+responses      One-time cost
Weight sync (separated)   Broadcast/all-gather new params  After each iter

In the colocated strategy, the dominant communication cost is the FSDP all-gather during generation (to reconstruct full weights for TP) and the all-gather + reduce-scatter during training. In the separated strategy, the rollout data itself (token IDs, per-token log probabilities and values) is only a few megabytes per batch — for 512 sequences of length 2048, the FP16 log-prob tensor is about 2 MB — so the dominant cross-group cost is broadcasting the updated Actor weights to the generation group after each iteration.
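
Back-of-envelope sizes for the separated strategy (our own arithmetic, assuming int32 token IDs and FP16 per-token tensors):

```python
batch, seq_len = 512, 2048
token_ids_mb = batch * seq_len * 4 / 1e6  # int32 token IDs
logprobs_mb = batch * seq_len * 2 / 1e6   # FP16 per-token log probs
weights_gb = 7e9 * 2 / 1e9                # 7B Actor weights in FP16

print(f"token IDs: {token_ids_mb:.0f} MB")  # 4 MB
print(f"log probs: {logprobs_mb:.0f} MB")   # 2 MB
print(f"weight sync: {weights_gb:.0f} GB")  # 14 GB
```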

Hybrid Approaches

Modern RLHF systems increasingly use hybrid strategies that combine elements of both approaches:

  • verl’s default: colocated with FSDP↔TP switching. All models share GPUs but use different parallelism modes per phase.
  • Partial separation: the Actor gets its own GPU group for generation (where it needs maximum memory for KV cache), but shares GPUs with other models during training.
  • Asymmetric allocation: allocate more GPUs to the Actor (which dominates compute time) and fewer to the frozen models (which only perform inference).

The right strategy depends on model size, cluster topology, and the relative cost of generation vs. training. For models up to 13B, colocation is usually sufficient on 8-16 GPUs. For 70B+, some form of separation or asymmetric allocation becomes necessary to manage memory.


Key Takeaways

  1. RLHF needs four models simultaneously: Actor (generates responses, trained via PPO), Critic (estimates value, trained alongside Actor), Reward Model (scores responses, frozen), Reference (KL anchor, frozen). This is roughly 4x the memory of SFT when including optimizer states for Actor and Critic.

  2. The generation phase is an inference problem inside a training loop. Autoregressive sampling needs KV caching, tensor parallelism, and continuous batching for efficiency — but it happens between gradient updates. This dual nature is why RLHF requires a “hybrid engine” that can switch between inference and training parallelism modes.

  3. PPO stabilizes training through two mechanisms: the KL penalty prevents reward hacking by keeping the Actor close to the Reference distribution, and the clipping objective prevents catastrophic policy updates by bounding the probability ratio to $[1-\epsilon, 1+\epsilon]$.

  4. The core systems challenge is data flow orchestration. Each PPO iteration passes data through all four models in a strict dependency chain. In distributed settings, this means careful model placement, data routing between GPU groups, and scheduling to minimize idle time.

  5. verl’s key insight: colocate models and reshape parallelism per phase. Use FSDP (ZeRO-3) for memory-efficient training, but gather weights and switch to TP for fast generation. This avoids the tradeoff between separated (idle GPUs) and naive colocated (memory pressure) strategies.

  6. At scale (70B+), RLHF requires 64-128 GPUs with careful parallelism choices. The generation phase dominates wall-clock time (50-70% of each iteration), making inference optimization — KV caching, continuous batching, efficient TP — just as important as training optimization.


Companion Code

The companion code for this article is located at code/04-rlhf-system/:

  • minimal_rlhf.py — A self-contained RLHF training loop implementing all four models, the PPO rollout pipeline, GAE advantage estimation, and the clipped surrogate objective. Uses small model dimensions (d_model=256) so everything runs on CPU, but the architecture and data flow are identical to production RLHF systems.

Run it with:

cd code/04-rlhf-system
python minimal_rlhf.py

The script runs four demonstrations:

  1. Four-model architecture: instantiates all models and displays parameter counts and memory estimates at production scales.
  2. PPO data generation: walks through the rollout phase step by step, showing data shapes and model interactions.
  3. PPO training: runs multiple PPO iterations, reporting reward, loss, KL divergence, and clip fraction.
  4. System challenges: prints the complete data flow diagram, compute cost comparison, and distributed strategy analysis.

(Figure: RLHF data flow)


References

  1. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” NeurIPS 2022. — The InstructGPT paper that established the RLHF pipeline.
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv:1707.06347, 2017. — The PPO algorithm.
  3. Sheng, Y., et al. “HybridFlow: A Flexible and Efficient RLHF Framework.” arXiv:2409.19256v2, 2024. — The verl system paper.
  4. Zheng, L., et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv:2312.07104, 2023. — Inference engine used by verl for generation.
  5. Schulman, J., et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” ICLR 2016. — The GAE algorithm.
  6. Rajbhandari, S., et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC 2020. — The FSDP/ZeRO foundation used by verl.
  7. Stiennon, N., et al. “Learning to summarize from human feedback.” NeurIPS 2020. — Early RLHF application to summarization.