An Introduction to RLHF System Design

Motivation

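Ask an engineer who works on RLHF what the hardest part is, and the answer is probably not the PPO algorithm itself: PPO is a mature algorithm in reinforcement learning. The real difficulty is that RLHF has to drive four large models at once, orchestrate a complex data flow among them, and embed a complete inference system inside the training loop.

Some quick arithmetic makes this concrete. Suppose you run SFT on a 7B model. In FP16 the parameters take 14 GB. Add the gradients and the Adam optimizer state (one momentum and one variance copy), counting each as another FP16-sized copy, and the total is about $14 \times 4 = 56$ GB, which just fits on one A100-80GB. (This is a simplification; FP32 optimizer state would take more.)

RLHF, however, needs four models:

- Actor (the LLM being trained): 14 GB parameters + 42 GB optimizer state and gradients = 56 GB
- Critic (value function): 14 GB parameters + 42 GB optimizer state and gradients = 56 GB
- Reward Model (frozen): 14 GB
- Reference Model (frozen): 14 GB

Total: 140 GB, which needs at least two A100-80GB GPUs, and that is before counting activations and the KV cache.

Beyond memory, the RLHF training loop nests an inference process: the Actor has to generate responses autoregressively. Generation is a memory-bound inference problem, yet it runs inside a compute-bound training loop, so both computation patterns have to be optimized at once.

This article takes a systems view of RLHF training: the roles and interactions of the four models, the full PPO data flow, why this is a systems problem, and how the verl framework addresses these challenges with a "hybrid engine".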

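Prerequisites

- GPU memory model and distributed communication basics (Part 1): memory bottlenecks and communication primitives
- Overview of distributed parallelism strategies (Part 2): especially FSDP and Tensor Parallelism
- LLM inference system architecture (Part 3): autoregressive generation and the KV cache
- Basic reinforcement learning concepts (policy, reward, value function); mastery is not required

Here is the overall architecture, which the following sections unpack one piece at a time: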

flowchart TD
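    subgraph RLHF["RLHF training system overview"]
        A["**Actor (Policy)**<br/>Generates responses<br/>PPO update<br/>✓ trainable"]
        R["**Reference (Frozen)**<br/>KL anchor<br/>Prevents drift<br/>✗ frozen"]
        RM["**Reward Model (Frozen)**<br/>Scores responses<br/>Scalar reward<br/>✗ frozen"]
        C["**Critic (Value)**<br/>Estimates value<br/>Feeds advantage estimation<br/>✓ trainable"]

        A --> Flow
        R --> Flow
        RM --> Flow
        C --> Flow

        Flow["PPO data flow: generate → score → estimate advantages → update"]
    end

    subgraph Challenges["System challenges"]
        CH1["4 models, ~2.5x SFT memory"]
        CH2["Inference nested in training"]
        CH3["Complex data dependencies"]
        CH4["Weight synchronization"]
        CH5["Heterogeneous compute patterns"]
        CH6["Model placement strategy"]
    end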

    RLHF --> Challenges

    style A fill:#d4edda,stroke:#28a745
    style C fill:#d4edda,stroke:#28a745
    style R fill:#fff3cd,stroke:#ffc107
    style RM fill:#fff3cd,stroke:#ffc107
    style Flow fill:#cce5ff,stroke:#007bff

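The Four-Model RLHF Architecture

RLHF is a systems problem at root because it needs four models working together. Understanding each model's role is the prerequisite for understanding the whole system.

Actor (Policy Model)

The Actor is the LLM we are training; its job is to generate high-quality responses to prompts. Before RLHF it has usually gone through SFT (Supervised Fine-Tuning) and can already follow instructions at a basic level.

Architecturally, the Actor is a standard causal language model: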

import torch.nn as nn

class CausalLM(nn.Module):
    """
    Causal Language Model (GPT-style).
    In production this would be LLaMA, Qwen, or a similar 7B-70B+ model.
    TransformerBlock is assumed to be defined elsewhere.
    """
    def __init__(self, vocab_size, d_model=256, n_heads=8,
                 n_layers=4, d_ff=512, max_seq_len=256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.token_embed.weight  # Weight tying

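The Actor works in two modes during RLHF, and this is exactly where the system design gets hard:

- Generation mode (inference): samples tokens autoregressively; memory-bound; benefits from KV cache and TP
- Training mode: computes gradients of the PPO objective and updates parameters; compute-bound; benefits from FSDP

Reference Model

The Reference Model is a frozen copy of the Actor taken before RLHF training starts. Its only job is to supply the log probabilities for the KL-divergence penalty, which keeps the Actor from drifting too far from its original behavior while it chases high reward.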

# Reference = frozen copy of the Actor (copied once before training, never updated)
reference = CausalLM(vocab_size, d_model, n_heads, n_layers,
                     d_ff, max_seq_len)
reference.load_state_dict(actor.state_dict())
reference.eval()  # disable dropout so reference log probs are deterministic
for param in reference.parameters():
    param.requires_grad = False  # fully frozen

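Why is a Reference needed? The answer involves a well-known RLHF failure mode: reward hacking. The Reward Model is not perfect; it is a learned approximation. With no constraint, the Actor readily finds the Reward Model's loopholes, producing responses that score highly but are actually meaningless, repetitive, or excessively long. The KL penalty mitigates this by keeping the Actor close to the Reference.

System cost: the Reference is frozen and needs no optimizer state, but it still needs a full forward pass to compute log probabilities. For a 7B model that means an extra 14 GB of memory and one full forward pass per batch.

Reward Model

The Reward Model is trained in advance on human preference data: given (prompt, response_A, response_B), it learns to give the response humans prefer a higher score. During RLHF training it acts as the judge that scores the Actor's responses.

Architecturally it uses the same Transformer backbone as the Actor (the same architecture, with its own weights), but the final language modeling head is replaced by a value head that outputs a scalar reward: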

class RewardModel(nn.Module):
    """
    Reward Model: maps (prompt, response) to a scalar reward.
    Uses the hidden state of the last token as the sequence representation
    (this assumes no right-padding; with padding, index the last real token).
    """
    def __init__(self, vocab_size, d_model=256, n_heads=8,
                 n_layers=4, d_ff=512, max_seq_len=256):
        super().__init__()
        # ... same Transformer backbone as CausalLM ...
        # Key difference: a value head replaces lm_head
        self.value_head = nn.Linear(d_model, 1, bias=False)

    def forward(self, input_ids):
        # ... Transformer forward pass ...
        last_hidden = x[:, -1, :]   # hidden state of the last token
        reward = self.value_head(last_hidden).squeeze(-1)  # one scalar per sequence
        return reward  # (batch,)

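The Reward Model is fully frozen during RLHF training and only runs inference. Its system cost is still significant: every PPO iteration needs a forward pass over a whole batch of (prompt + response) sequences.

Critic (Value Model)

The Critic estimates the expected future reward $V(s_t)$ at every token position. These value estimates feed GAE (Generalized Advantage Estimation), which reduces the variance of the policy gradient.

Without a Critic, the advantage estimate has no learned baseline and PPO falls back to a REINFORCE-style policy gradient. That can work in principle, but its variance is much higher and training tends to be unstable.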

class CriticModel(nn.Module):
    """
    Critic: estimates V(s) at every token position.
    Unlike the Reward Model, it outputs one scalar per token
    rather than one scalar for the whole sequence.
    """
    def __init__(self, vocab_size, d_model=256, n_heads=8,
                 n_layers=4, d_ff=512, max_seq_len=256):
        super().__init__()
        # ... same Transformer backbone as CausalLM ...
        self.value_head = nn.Linear(d_model, 1, bias=False)

    def forward(self, input_ids):
        # ... Transformer forward pass ...
        values = self.value_head(x).squeeze(-1)  # (batch, seq_len)
        return values  # one scalar value per position

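Like the Actor, the Critic is trainable. It is updated alongside the Actor, usually with an MSE loss that fits the returns computed by GAE. In practice the Critic is often initialized from the Reward Model's weights, since the two have similar goals: estimating how good a response is.

Four-Model Overview

Putting the four models side by side: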

flowchart TD
    subgraph Compare["The four RLHF models compared"]
        direction LR
        subgraph Trainable["✓ Trainable"]
            Actor["**Actor**<br/>Adam optimizer<br/>Forward + backward<br/>56 GB (7B FP16)"]
            Critic["**Critic**<br/>Adam optimizer<br/>Forward + backward<br/>56 GB (7B FP16)"]
        end
        subgraph Frozen["✗ Frozen"]
            Reward["**Reward Model**<br/>No optimizer<br/>Forward only<br/>14 GB (7B FP16)"]
            Reference["**Reference**<br/>No optimizer<br/>Forward only<br/>14 GB (7B FP16)"]
        end
    end
    Compare --> Total["**TOTAL: 140 GB**<br/>For comparison: SFT needs 1 model + Adam ≈ 56 GB<br/>RLHF needs 2.5x the memory of SFT"]

    style Actor fill:#d4edda,stroke:#28a745
    style Critic fill:#d4edda,stroke:#28a745
    style Reward fill:#fff3cd,stroke:#ffc107
    style Reference fill:#fff3cd,stroke:#ffc107
    style Total fill:#cce5ff,stroke:#007bff

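Memory requirements grow quickly with model size:

| Model size | One model (FP16) | Four models, weights only | With gradients and optimizer state | Min. GPUs (A100-80GB) |
| --- | --- | --- | --- | --- |
| 7B | 14 GB | 56 GB | 140 GB | 2 |
| 13B | 26 GB | 104 GB | 260 GB | 4 |
| 70B | 140 GB | 560 GB | 1400 GB | 18 |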
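
The arithmetic behind this table can be sketched in a few lines of Python. This is a minimal estimate under the article's simplifying assumption that weights, gradients, and both Adam moments each take 2 bytes per parameter. The function names `model_gb` and `rlhf_memory` are illustrative, and activations, KV cache, and communication buffers are ignored.

```python
import math

BYTES_PER_PARAM = 2  # assumption: FP16 for weights, gradients, and both Adam moments


def model_gb(params_billion, trainable):
    """Memory in GB for one model; 1e9 params at 2 bytes each is 2 GB."""
    weights = params_billion * BYTES_PER_PARAM
    # A trainable model also holds gradients, Adam momentum, and Adam variance.
    return weights * 4 if trainable else weights


def rlhf_memory(params_billion, gpu_gb=80):
    """Actor and Critic are trainable; Reward Model and Reference are frozen."""
    trainable = model_gb(params_billion, trainable=True)
    frozen = model_gb(params_billion, trainable=False)
    total = 2 * trainable + 2 * frozen
    return {
        "single_model_gb": frozen,
        "four_models_weights_gb": 4 * frozen,
        "rlhf_total_gb": total,
        "sft_total_gb": trainable,
        "min_gpus": math.ceil(total / gpu_gb),
    }


for size in (7, 13, 70):
    print(size, rlhf_memory(size))
```

Under these assumptions the ratio `rlhf_total_gb / sft_total_gb` is 2.5 at every size, because both totals scale linearly with the parameter count.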

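PPO Data Flow

With each model's role understood, the key question is how data flows between them. Each PPO iteration has two phases: a Rollout Phase (experience collection) and a Training Phase (gradient updates).

Phase 1: Rollout (Experience Collection)

The rollout phase collects a batch of "experience": the responses generated by the Actor, the Reward Model's scores, and the Critic's value estimates. The core data flow looks like this: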

flowchart TD
    subgraph Rollout["ROLLOUT PHASE (inference, no gradients)"]
        S1["**Step 1: Generation**<br/>Prompts → Actor → autoregressive sampling → Responses"]
        S1 --> S2a & S2b & S3
        S2a["**Step 2a: Reward scoring**<br/>Reward Model → scalar Rewards"]
        S2b["**Step 2b: KL computation**<br/>Reference Model → log_probs → KL Penalties"]
        S3["**Step 3: Value estimation**<br/>Critic → per-token Values"]
        S2a --> S4
        S2b --> S4
        S3 --> S4
        S4["**Step 4: Advantage computation**<br/>GAE(Rewards, KL, Values) → Advantages"]
    end

    style S1 fill:#cce5ff,stroke:#007bff
    style S2a fill:#fff3cd,stroke:#ffc107
    style S2b fill:#fff3cd,stroke:#ffc107
    style S3 fill:#fff3cd,stroke:#ffc107
    style S4 fill:#d4edda,stroke:#28a745

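Let's take it step by step.

Step 1: Autoregressive Generation

The Actor takes a batch of prompts and generates responses autoregressively. This is essentially an inference problem: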

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_responses(actor, prompt_ids, max_new_tokens, temperature=1.0):
    """
    Autoregressive response generation by the Actor.
    Production systems use KV cache, continuous batching, and tensor parallelism.
    """
    generated = prompt_ids.clone()

    for _ in range(max_new_tokens):
        logits = actor(generated)            # full forward pass over the whole sequence
        next_logits = logits[:, -1, :]       # keep only the last position
        probs = F.softmax(next_logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)

    return generated  # (batch, prompt_len + max_new_tokens)

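System insight: without a KV cache, every step of this loop is a full forward pass over the sequence so far, so generating $T$ tokens takes $T$ forward passes. Generation is typically the most time-consuming single stage of RLHF. Production systems such as verl speed it up with KV cache, Tensor Parallelism, and continuous batching, which in effect means embedding an inference engine inside the training loop.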
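
To get a feel for how much redundant work the cache-free loop does, one can count how many token positions pass through the model. The sketch below is a rough proxy rather than a FLOP count: it ignores the way attention cost grows with context length, and the function names and the 128/256 lengths are illustrative assumptions.

```python
def positions_without_cache(prompt_len, new_tokens):
    """Step k (0-based) re-processes the prompt plus the k tokens generated so far."""
    return sum(prompt_len + k for k in range(new_tokens))


def positions_with_cache(prompt_len, new_tokens):
    """Prefill processes the prompt once and yields the first new token;
    every later step feeds only the single newest token."""
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 128, 256  # hypothetical lengths
no_cache = positions_without_cache(prompt_len, new_tokens)
cached = positions_with_cache(prompt_len, new_tokens)
print(no_cache, cached, round(no_cache / cached, 1))
```

The gap widens quadratically with the response length, which is one reason rollout engines treat the KV cache as mandatory rather than as an optimization.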

Step 2: Reward Scoring and KL Computation

Once the responses are generated, three models (the Reward Model, the Actor, and the Reference) each process them:

def compute_rewards_and_kl(full_ids, prompt_len, actor, reference,
                            reward_model, kl_coeff):
    # 1. Reward Model scoring (frozen, no gradients)
    with torch.no_grad():
        rewards = reward_model(full_ids)   # (batch,) scalar rewards

    # 2. Actor log probabilities of the sampled tokens
    #    (get_log_probs is assumed to return (batch, seq_len - 1):
    #     entry i is the log prob of token i + 1)
    actor_log_probs = actor.get_log_probs(full_ids)
    response_actor_lp = actor_log_probs[:, prompt_len - 1:]

    # 3. Reference log probabilities (frozen, no gradients)
    with torch.no_grad():
        ref_log_probs = reference.get_log_probs(full_ids)
        response_ref_lp = ref_log_probs[:, prompt_len - 1:]

    # 4. KL penalty: per-token estimate of KL(Actor || Reference) ≈ actor_lp - ref_lp
    kl_penalties = kl_coeff * (response_actor_lp.detach() - response_ref_lp)

    return rewards, response_actor_lp, response_ref_lp, kl_penalties

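The KL penalty is key to RLHF stability. In practice the KL divergence is estimated per token from the sampled tokens:

$$D_{KL}(\pi_\theta \,\|\, \pi_\text{ref}) \approx \sum_t \left[\log \pi_\theta(a_t | s_t) - \log \pi_\text{ref}(a_t | s_t)\right]$$

A positive per-token term means the Actor assigns the sampled token a higher probability than the Reference does, that is, the Actor is moving away from its original behavior on that token. kl_coeff controls the strength of the penalty:

- Too large: the Actor learns almost nothing (it is pinned near the Reference)
- Too small: reward hacking becomes likely (the Actor finds the Reward Model's loopholes)
- Typical values: 0.01 - 0.2; some systems adjust the coefficient adaptively during training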
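
The code above calls `get_log_probs` without showing it. As a minimal sketch of what such a helper computes, here is a pure-Python version for a single sequence. The names `log_softmax`, `token_log_probs`, and `kl_penalties` and the toy logits are assumptions made for illustration; a real implementation would operate on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_log_probs(logits_per_pos, token_ids):
    """Log prob of each token that was actually sampled.

    logits_per_pos[i] is the model output at position i, which predicts
    token_ids[i + 1], so the result has len(token_ids) - 1 entries.
    """
    return [log_softmax(logits_per_pos[i])[token_ids[i + 1]]
            for i in range(len(token_ids) - 1)]


def kl_penalties(actor_lp, ref_lp, kl_coeff):
    """Per-token penalty kl_coeff * (log pi_theta - log pi_ref)."""
    return [kl_coeff * (a - r) for a, r in zip(actor_lp, ref_lp)]


# Toy example: vocabulary of 3 tokens, sequence [0, 2, 1].
tokens = [0, 2, 1]
ref_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]              # uniform reference
actor_logits = [[0.0, 0.0, 0.0], [0.0, math.log(2.0), 0.0]]  # favors token 1 at step 2

actor_lp = token_log_probs(actor_logits, tokens)
ref_lp = token_log_probs(ref_logits, tokens)
penalties = kl_penalties(actor_lp, ref_lp, kl_coeff=0.1)
print(penalties)
```

The first penalty is zero because both models agree on that token. The second is positive because the Actor gives the sampled token probability 1/2 where the Reference gives 1/3.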

Step 3: GAE Advantage Estimation

With the rewards, KL penalties, and the Critic's value estimates in hand, we can compute GAE (Generalized Advantage Estimation). The advantage function $A(s_t, a_t)$ answers the question: how much better is this action than the expected level?

$$A^{GAE(\gamma,\lambda)}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}$$

where the TD error is $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.

@torch.no_grad()
def compute_advantages_gae(rewards, kl_penalties, values,
                           gamma=1.0, lam=0.95):
    """
    GAE trades bias against variance through lambda:
      - lam=1: high variance, low bias (Monte Carlo)
      - lam=0: low variance, high bias (1-step TD)
      - lam=0.95: a common compromise
    """
    B, T = values.shape

    # Build per-token rewards:
    # - every token pays the KL penalty
    # - only the last token receives the Reward Model's sequence-level reward
    per_token_rewards = -kl_penalties.clone()
    per_token_rewards[:, -1] += rewards  # sequence reward goes to the last token

    # Compute GAE backwards from the last token
    advantages = torch.zeros_like(values)
    last_gae = torch.zeros_like(values[:, 0])
    zero_value = torch.zeros_like(values[:, 0])  # V = 0 past the end, same device and dtype

    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t < T - 1 else zero_value
        delta = per_token_rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae

    returns = advantages + values  # training target for the Critic
    return advantages, returns

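Note one design choice here: the sequence-level reward is assigned to the last token, while the KL penalty is applied per token. This is a common convention in RLHF implementations.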
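
As a sanity check on the backward recursion, the sketch below recomputes GAE for one short sequence in plain Python and compares it with the direct sum from the formula. The function names and the three-token numbers are made up for illustration; the rewards stand for two KL penalties followed by a final sequence reward minus its own KL penalty.

```python
def gae_recursive(rewards, values, gamma=1.0, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}."""
    T = len(rewards)
    advantages = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t < T - 1 else 0.0  # V = 0 past the end
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def gae_direct(rewards, values, gamma=1.0, lam=0.95):
    """Direct sum of discounted TD errors (0-based t, so l runs to T - 1 - t)."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * (values[t + 1] if t < T - 1 else 0.0) - values[t]
              for t in range(T)]
    return [sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
            for t in range(T)]


rewards = [-0.01, -0.02, 0.99]  # hypothetical per-token rewards
values = [0.5, 0.6, 0.7]        # hypothetical Critic outputs

adv, ret = gae_recursive(rewards, values)
print(adv, ret)
```

With `gamma=1` and `lam=1` the TD errors telescope, so the advantage at each position is the sum of remaining rewards minus the value estimate, which is the Monte Carlo case named in the docstring above.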

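Phase 2: Training (PPO Update)

After the experience is collected, the PPO update phase begins. PPO's core idea is to take several gradient steps on the same batch of experience while clipping keeps the policy from changing too much.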

flowchart TD
    subgraph Training["TRAINING PHASE (gradient updates)"]
        Input["**Input: rollout experience**<br/>full_ids, old_log_probs, advantages, returns"]
        Input --> Loop

        subgraph Loop["for epoch in range(ppo_epochs): usually 2-4 epochs"]
            direction LR
            ActorUpdate["**Actor PPO update**<br/>new_lp = Actor(full_ids)<br/>ratio = exp(new - old)<br/>loss = -min(r*A, clip)<br/>backward + step"]
            CriticUpdate["**Critic update**<br/>values = Critic(full_ids)<br/>loss = MSE(values, returns)<br/>backward + step"]
        end

        Note["Reference and Reward Model: not involved in this phase"]
    end

    style ActorUpdate fill:#d4edda,stroke:#28a745
    style CriticUpdate fill:#d4edda,stroke:#28a745
    style Input fill:#cce5ff,stroke:#007bff
    style Note fill:#fff3cd,stroke:#ffc107

PPO Clipped Surrogate Loss

At the heart of PPO is the clipped surrogate objective, its key innovation over vanilla policy gradient:

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta),\; 1-\epsilon,\; 1+\epsilon)\hat{A}_t\right)\right]$$

where the policy ratio is $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} = \exp(\log\pi_\theta - \log\pi_{\theta_{old}})$.

def ppo_actor_loss(actor, full_ids, prompt_len, old_log_probs,
                   advantages, clip_eps):
    """
    PPO clipped surrogate loss.
    ratio > 1: the action is now more likely to be chosen
    ratio < 1: the action is now less likely to be chosen
    Clipping removes the incentive to push ratio outside [1-eps, 1+eps]
    """
    new_log_probs = actor.get_log_probs(full_ids)[:, prompt_len - 1:]

    # Policy ratio
    ratio = torch.exp(new_log_probs - old_log_probs.detach())

    # Normalize advantages (standard practice for training stability)
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped surrogate loss
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    loss = -torch.min(surr1, surr2).mean()

    return loss, ratio

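The intuition behind clipping:

- When $\hat{A}_t > 0$ (a good action): we want to raise its probability, but the objective stops rewarding further increases once $r_t$ exceeds $1+\epsilon$
- When $\hat{A}_t < 0$ (a bad action): we want to lower its probability, but the objective stops rewarding further decreases once $r_t$ falls below $1-\epsilon$

This limits how far each update can usefully move the policy, which is what lets PPO take several updates on the same batch of data.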
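
The two cases are easiest to see on single numbers. The sketch below evaluates the per-token objective for scalar inputs. The function name `clipped_surrogate` and the choice `clip_eps=0.2` are illustrative assumptions, and the value returned is the objective being maximized, so the loss is its negative.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-token PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): the objective is flat once the ratio passes 1 + eps.
for r in (0.5, 1.1, 1.5, 2.0):
    print("A=+1", r, clipped_surrogate(r, 1.0))

# Bad action (A < 0): the objective is flat once the ratio drops below 1 - eps.
for r in (0.2, 0.5, 0.9, 1.5):
    print("A=-1", r, clipped_surrogate(r, -1.0))
```

Note the asymmetry: when the ratio has moved in the wrong direction (for example a ratio of 1.5 with a negative advantage), the `min` keeps the unclipped term, so the gradient that corrects the mistake is not cut off.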

Critic Loss

The Critic is fit to the returns computed by GAE with an MSE loss:

$$L^{VF} = \frac{1}{2}\mathbb{E}\left[(V_\phi(s_t) - R_t)^2\right]$$

def ppo_critic_loss(critic, full_ids, prompt_len, returns):
    """Critic value loss: MSE(predicted_value, returns).
    F.mse_loss omits the 1/2 factor of L^VF, which only rescales the gradient."""
    values = critic(full_ids)[:, prompt_len - 1:-1]   # values at the response positions
    min_len = min(values.shape[1], returns.shape[1])  # guard against a length mismatch
    loss = F.mse_loss(values[:, :min_len], returns[:, :min_len].detach())
    return loss

The Full PPO Training Loop

Putting the Rollout and Training phases together gives the full PPO training loop:

# Only the Actor and the Critic need optimizers
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-5)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-5)

for iteration in range(num_iterations):
    # ===== Phase 1: Rollout (inference mode) =====
    with torch.no_grad():
        full_ids = generate_responses(actor, prompt_ids,
                                      response_len, temperature)
        rewards = reward_model(full_ids)
        old_log_probs = actor.get_log_probs(full_ids)[:, prompt_len-1:]
        ref_log_probs = reference.get_log_probs(full_ids)[:, prompt_len-1:]
        kl_penalties = kl_coeff * (old_log_probs - ref_log_probs)
        values = critic(full_ids)[:, prompt_len-1:-1]
        advantages, returns = compute_advantages_gae(
            rewards, kl_penalties, values, gamma, lam)

    # ===== Phase 2: PPO Update (training mode) =====
    for epoch in range(ppo_epochs):  # several updates on the same batch
        # Actor update
        actor_optim.zero_grad()
        a_loss, ratio = ppo_actor_loss(actor, full_ids, prompt_len,
                                        old_log_probs, advantages, clip_eps)
        a_loss.backward()
        torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
        actor_optim.step()

        # Critic update
        critic_optim.zero_grad()
        c_loss = ppo_critic_loss(critic, full_ids, prompt_len, returns)
        c_loss.backward()
        torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
        critic_optim.step()

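A few key details:

- Very low learning rate (1e-5): the Actor's learning rate in RLHF is usually well below the SFT learning rate, often by an order of magnitude or more, because we only want small adjustments
- Gradient clipping (max_norm=1.0): guards against exploding gradients and is close to mandatory in RLHF training
- Multiple PPO epochs: the same rollout batch is used for 2-4 updates, with clipping guarding against over-updating

Why RLHF Is a Systems Problem

By now the complexity of RLHF should be apparent. Let's summarize the four major system challenges it brings.

Challenge 1: Memory Pressure (Four Models, About 2.5x SFT)

Memory is the most obvious challenge. As computed earlier, RLHF on a 7B model needs ~140 GB, while SFT needs only ~56 GB.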

flowchart TD
  subgraph ACTOR["Actor: 56 GB"]
    direction TB
    A1["Parameters: 14 GB"]
    A2["Gradients: 14 GB"]
    A3["Adam momentum: 14 GB"]
    A4["Adam variance: 14 GB"]
  end
  subgraph CRITIC["Critic: 56 GB"]
    direction TB
    C1["Parameters: 14 GB"]
    C2["Gradients: 14 GB"]
    C3["Adam momentum: 14 GB"]
    C4["Adam variance: 14 GB"]
  end
  subgraph REWARD["Reward Model: 14 GB"]
    R1["Parameters: 14 GB (frozen, inference only)"]
  end
  subgraph REF["Reference: 14 GB"]
    RF1["Parameters: 14 GB (frozen, inference only)"]
  end
  ACTOR & CRITIC & REWARD & REF --> TOTAL["**Total: 140 GB**<br/>(SFT: ~56 GB)<br/>+ activations + KV cache + communication buffers"]

  style ACTOR fill:#d4edda,stroke:#28a745
  style CRITIC fill:#d4edda,stroke:#28a745
  style REWARD fill:#fff3cd,stroke:#ffc107
  style REF fill:#fff3cd,stroke:#ffc107
  style TOTAL fill:#cce5ff,stroke:#007bff

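In production you also have to budget for activations (which activation checkpointing can shrink) and communication buffers. RLHF on a 70B model needs 1400 GB, so even 18 A100-80GB GPUs (1440 GB) hold little more than the parameters, gradients, and optimizer state.

Challenge 2: Heterogeneous Compute (Inference Nested in Training)

RLHF's most distinctive system challenge is that a complete inference process is embedded inside the training loop.

In SFT the whole training loop is compute-bound: forward pass → backward pass → parameter update, one uniform computation pattern. The RLHF Rollout Phase, however, requires autoregressive generation, a typical memory-bound inference task.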

flowchart LR
    subgraph SFT["SFT training loop (all compute-bound)"]
        direction LR
        SF["Forward"] --> SB["Backward"] --> SU["Update"]
        SStrategy["Best strategy: FSDP / DDP"]
    end

    subgraph RLHF["RLHF training loop (heterogeneous compute)"]
        direction LR
        RG["Autoregressive generation<br/>⚡ memory-bound<br/>Best: TP + KV Cache"] --> RS["Scoring"] --> RGAE["GAE"] --> RPPO["PPO update<br/>⚡ compute-bound<br/>Best: FSDP"]
    end

    style RG fill:#fff3cd,stroke:#ffc107
    style RPPO fill:#d4edda,stroke:#28a745

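This means switching between two very different parallelism strategies on the same set of GPUs:

| Stage | Compute profile | Best parallelism strategy | Bottleneck |
| --- | --- | --- | --- |
| Autoregressive generation | Memory-bound | Tensor Parallelism | HBM bandwidth |
| Reward scoring | Compute-bound | Data Parallelism | Compute |
| PPO update | Compute-bound | FSDP (ZeRO-3) | Compute + memory |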
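
The memory-bound versus compute-bound split can be made concrete with a rough arithmetic-intensity estimate for one weight matrix. This is a back-of-the-envelope sketch, not a measurement. The hardware figures (about 312 TFLOP/s dense BF16 and about 2.0 TB/s HBM bandwidth for an A100-80GB) are approximate assumptions, as are the 4096-wide layer, the batch of 8 sequences, and the 2048-token training length.

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_el=2):
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out
    # Bytes read and written: the weight matrix, the inputs, and the outputs.
    bytes_moved = bytes_per_el * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


# Assumed A100-80GB figures; the ridge point is where compute and bandwidth balance.
PEAK_FLOPS = 312e12
HBM_BYTES_PER_S = 2.0e12
RIDGE = PEAK_FLOPS / HBM_BYTES_PER_S  # about 156 FLOPs per byte

decode = matmul_intensity(tokens=8, d_in=4096, d_out=4096)        # 8 sequences, 1 token each
train = matmul_intensity(tokens=8 * 2048, d_in=4096, d_out=4096)  # 8 sequences, 2048 tokens each

print(f"ridge={RIDGE:.0f}, decode={decode:.1f}, train={train:.0f}")
```

A decode step sits far below the ridge point, so it is limited by how fast weights stream from HBM. A training step over full sequences sits far above it, so it is limited by arithmetic throughput.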

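Challenge 3: Complex Data Flow (Strict Dependencies Among Four Models)

The RLHF data flow is not a simple "input → output" pipeline but a directed acyclic graph (DAG) across four models. The dependency order within each PPO iteration is fixed: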

flowchart LR
    P["Prompts"] --> Actor
    Actor --> resp["responses"]
    resp --> RM["Reward Model"]
    resp --> Ref["Reference"]
    resp --> Crit["Critic"]

    RM --> rewards
    Ref --> KL["KL penalties"]
    Crit --> values

    rewards --> GAE
    KL --> GAE
    values --> GAE

    GAE --> advantages
    advantages --> ActorPPO["Actor PPO update<br/>← old_log_probs"]
    advantages --> CriticMSE["Critic MSE update<br/>← returns"]

    style Actor fill:#d4edda,stroke:#28a745
    style GAE fill:#cce5ff,stroke:#007bff
    style ActorPPO fill:#d4edda,stroke:#28a745
    style CriticMSE fill:#d4edda,stroke:#28a745
    style RM fill:#fff3cd,stroke:#ffc107
    style Ref fill:#fff3cd,stroke:#ffc107
    style Crit fill:#fff3cd,stroke:#ffc107

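The key constraints of this DAG:

- Generation must finish first: the Reward Model, Reference, and Critic all depend on the responses the Actor generates
- GAE needs all three inputs: rewards, KL penalties, and values must all be ready before it can run
- The PPO update depends on GAE: advantages and returns are the inputs to the Actor and Critic updates
- Weights must be synchronized after the Actor update: the next generation has to use the updated weights

In a distributed setting, if the four models sit on different GPU groups, these dependencies become cross-device data transfers and scheduling gets much more complex.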
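
These constraints can be written down as a dependency graph and scheduled mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to group stages into waves whose members do not depend on each other. The stage names are illustrative labels, not identifiers from any framework, and the Actor's own log-prob pass is left out for brevity.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
DEPS = {
    "generate": set(),
    "reward_score": {"generate"},
    "reference_logprobs": {"generate"},
    "critic_values": {"generate"},
    "gae": {"reward_score", "reference_logprobs", "critic_values"},
    "actor_update": {"gae"},
    "critic_update": {"gae"},
    "weight_sync": {"actor_update"},
}


def schedule_waves(deps):
    """Return lists of stages; all stages in one list may run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for i, wave in enumerate(schedule_waves(DEPS)):
    print(i, wave)
```

The second wave contains the three scoring passes, which shows where a scheduler has freedom: they must all wait for generation, but not for each other.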

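Challenge 4: Weight Synchronization

At the end of each PPO iteration the Actor's weights have changed, and the generation phase of the next iteration has to run inference with the updated weights. If training and inference use different parallelism strategies (say FSDP for training and TP for inference), the weights have to be converted between the two layouts, a step called weight resharding: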

flowchart TD
    A["PPO update finished<br/>(FSDP layout: each GPU holds a 1/N shard of the parameters)"]
    A --> B["Weight resharding"]
    B --> C["Next generation round starts<br/>(TP layout: each GPU holds one slice of every layer)"]

    style A fill:#d4edda,stroke:#28a745
    style B fill:#fff3cd,stroke:#ffc107
    style C fill:#cce5ff,stroke:#007bff

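Resharding uses all-gather communication to collect the shards, then redistributes them according to the TP partitioning. For large models this is a non-trivial communication cost.

Compute Cost Comparison

Comparing the compute cost of RLHF and SFT side by side: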

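| Operation | SFT | RLHF | Extra vs SFT |
| --- | --- | --- | --- |
| Actor forward pass | 1x | 2x | +1x |
| Actor backward pass | 1x | 1x | +0 |
| Autoregressive generation | 0 | $N$x | +$N$x |
| Reference forward pass | 0 | 1x | +1x |
| Reward Model forward pass | 0 | 1x | +1x |
| Critic forward pass | 0 | 2x | +2x |
| Critic backward pass | 0 | 1x | +1x |
| GAE computation | 0 | 1x | +1x |
| Total (approximate) | ~2x | ~10-16x | 5-8x |

Here $N$ is the number of generated tokens. Generating 256 tokens takes 256 forward passes in the generation phase alone when there is no KV cache. The total row is a rough estimate; it is only consistent with the per-row figures if generation is counted at far less than $N$ full forward passes, as is the case with a KV cache. On that basis a single RLHF training step costs roughly 5-8 times as much compute as an SFT step.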

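A Closer Look at the verl Architecture

Several solutions to these challenges have been proposed. verl (Volcano Engine Reinforcement Learning) is an RLHF training framework open-sourced by ByteDance; it addresses them with a Hybrid Engine and a Colocated Strategy.

Core Design Idea

verl's core observation is that the two phases of RLHF need different parallelism strategies, while conventional setups use either a Separated strategy (which wastes GPUs) or a Colocated strategy (which runs short of memory). verl's answer is to switch parallelism modes dynamically on the same set of GPUs.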

flowchart LR
    subgraph HybridEngine["verl hybrid engine: one set of GPUs, parallelism switched dynamically"]
        subgraph Gen["GENERATION phase"]
            G1["Tensor Parallel<br/>(low-latency inference)"]
            G2["Each GPU holds<br/>one slice of every layer"]
            G3["all-reduce communication"]
        end
        subgraph Train["TRAINING phase"]
            T1["FSDP (ZeRO-3)<br/>(efficient training)"]
            T2["Each GPU holds<br/>a 1/N parameter shard"]
            T3["all-gather +<br/>reduce-scatter"]
        end
        Gen <-- "weight<br/>resharding" --> Train
    end

    Adv["**Key advantages**<br/>TP for generation: low latency<br/>FSDP for training: memory-efficient<br/>No idle GPUs"]

    HybridEngine --> Adv

    style Gen fill:#cce5ff,stroke:#007bff
    style Train fill:#d4edda,stroke:#28a745
    style Adv fill:#fff3cd,stroke:#ffc107

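Colocated vs. Separated Placement

In distributed RLHF, where to place the four models is a central decision. verl supports two strategies:

Strategy 1: Colocated

All four models share the same set of GPUs.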

flowchart TD
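    subgraph Colocated["Colocated: all models share GPUs 0-7"]
        Models["Actor(FSDP) + Critic(FSDP) + Reward(TP) + Reference(TP)"]

        Pro1["(+) No cross-group data transfer"]
        Pro2["(+) Simple scheduling: run stages in order"]
        Pro3["(+) High GPU utilization"]

        Con1["(-) High memory pressure: four models share the GPUs"]
        Con2["(-) Parallelism cannot be tuned per model"]
        Con3["(-) Needs careful memory management"]
    end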

    style Models fill:#cce5ff,stroke:#007bff
    style Pro1 fill:#d4edda,stroke:#28a745
    style Pro2 fill:#d4edda,stroke:#28a745
    style Pro3 fill:#d4edda,stroke:#28a745
    style Con1 fill:#fff3cd,stroke:#ffc107
    style Con2 fill:#fff3cd,stroke:#ffc107
    style Con3 fill:#fff3cd,stroke:#ffc107

Strategy 2: Separated

Each model gets its own group of GPUs.

flowchart TD
    subgraph Separated["Separated: a dedicated GPU group per model"]
        direction LR
        subgraph AG["Actor GPUs 0-3"]
            A1["Actor (FSDP + TP)"]
        end
        subgraph CG["Critic GPUs 4-5"]
            C1["Critic (FSDP)"]
        end
        subgraph RG["Reward GPU 6"]
            R1["Reward (TP)"]
        end
        subgraph RefG["Ref GPU 7"]
            Ref1["Reference (TP)"]
        end
    end

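    Pros["(+) Ample memory for each model<br/>(+) Parallelism chosen per model<br/>(+) Models are decoupled"]
    Cons["(-) Data must cross GPU groups<br/>(-) Long GPU idle time<br/>(-) Inflexible resource allocation"]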

    Separated --> Pros
    Separated --> Cons

    style Pros fill:#d4edda,stroke:#28a745
    style Cons fill:#fff3cd,stroke:#ffc107

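verl's hybrid-engine setup uses the Colocated strategy because it avoids the biggest problems of the Separated strategy: idle GPUs and data-transfer overhead. To relieve the memory pressure, verl shards parameters with FSDP and switches each model's working mode dynamically between phases.

Weight Update and Resharding

The most critical operation in verl's hybrid engine is weight resharding: converting between the FSDP sharded layout and the TP sliced layout.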

flowchart TD
    subgraph FSDP_Format["FSDP layout (after training)"]
        direction LR
        G0s["GPU 0<br/>shard 0"]
        G1s["GPU 1<br/>shard 1"]
        G2s["GPU 2<br/>shard 2"]
        G3s["GPU 3<br/>shard 3"]
    end

    FSDP_Format --> |"all-gather"| Full["Full Model"]
    Full --> |"split by dimension"| TP_Format

    subgraph TP_Format["TP layout (during generation)"]
        direction LR
        G0t["GPU 0<br/>col 0<br/>all layers"]
        G1t["GPU 1<br/>col 1<br/>all layers"]
        G2t["GPU 2<br/>col 2<br/>all layers"]
        G3t["GPU 3<br/>col 3<br/>all layers"]
    end

    style FSDP_Format fill:#d4edda,stroke:#28a745
    style TP_Format fill:#cce5ff,stroke:#007bff
    style Full fill:#fff3cd,stroke:#ffc107

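The communication cost of this step is $O(P)$, where $P$ is the number of parameters; for a 7B model that is about 14 GB of data movement. Since the generation phase that follows runs for hundreds of steps, each with its own communication, this once-per-iteration cost is entirely acceptable.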
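
The layout change itself can be illustrated on a toy weight matrix with plain Python lists. This is a conceptual sketch only: the helper names are hypothetical, the "all-gather" is a simple concatenation, and the toy version materializes the full weight in one place. It makes no claim about how verl implements the step.

```python
def fsdp_shards(flat, world):
    """FSDP-style layout: each rank holds a contiguous 1/world chunk of the flat weight."""
    n = len(flat) // world  # assumes len(flat) is divisible by world
    return [flat[r * n:(r + 1) * n] for r in range(world)]


def all_gather(shards):
    """Stand-in for the collective: every rank ends up with the concatenation."""
    return [x for shard in shards for x in shard]


def tp_column_slices(flat, rows, cols, world):
    """TP-style layout: each rank holds a block of columns from every row."""
    width = cols // world  # assumes cols is divisible by world
    matrix = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    return [[row[k * width:(k + 1) * width] for row in matrix] for k in range(world)]


rows, cols, world = 2, 4, 2
weight = list(range(rows * cols))  # [0, 1, ..., 7], row-major

shards = fsdp_shards(weight, world)
full = all_gather(shards)
tp = tp_column_slices(full, rows, cols, world)
print(shards)
print(tp)
```

Rank 0's TP slice contains elements 4 and 5, which lived in rank 1's FSDP shard. That is why the conversion needs communication and cannot be done by relabeling local memory.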

verl's Training Flow

Putting all the pieces together, one PPO iteration in verl looks like this:

flowchart TD
    subgraph Iter["verl PPO Iteration"]
        subgraph Phase1["1. Rollout Phase"]
            R1["a) FSDP → TP resharding<br/>(reshard Actor weights)"]
            R2["b) Actor (TP) generates responses autoregressively"]
            R3["c) Reward Model (TP) scores responses"]
            R4["d) Reference (TP) computes log probs"]
            R5["e) Critic (FSDP) estimates values"]
            R6["f) Compute KL penalties + GAE advantages"]
            R1 --> R2 --> R3 --> R4 --> R5 --> R6
        end
        subgraph Phase2["2. Training Phase"]
            T1["a) TP → FSDP resharding<br/>(reshard Actor weights back to FSDP)"]
            T2["b) Actor (FSDP) PPO update x ppo_epochs"]
            T3["c) Critic (FSDP) value update x ppo_epochs"]
            T1 --> T2 --> T3
        end
        Phase1 --> Phase2
        Phase2 --> Repeat["3. Repeat"]
    end

    style Phase1 fill:#cce5ff,stroke:#007bff
    style Phase2 fill:#d4edda,stroke:#28a745

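Distributed RLHF Strategies

Once models are large enough to need dozens or even hundreds of GPUs, RLHF communication patterns get quite complex. Let's go through the communication needs of each stage.

Communication Patterns by Stage

| Stage | Communication | Characteristic |
| --- | --- | --- |
| Generation (TP) | All-reduce (after every layer) | Latency-bound |
| Generation (PP) | Point-to-point (pipeline) | Bubble overhead |
| Actor training (FSDP) | All-gather + reduce-scatter | Bandwidth-bound |
| Critic training (FSDP) | All-gather + reduce-scatter | Bandwidth-bound |
| Reward scoring | Broadcast prompts + responses | One-time cost |
| Reference log probs | Broadcast prompts + responses | One-time cost |
| Weight sync (Actor) | All-gather / broadcast | Once per iteration |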

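A few key observations:

- Generation is latency-bound: with TP, every autoregressive decoding step needs all-reduces in every layer, so $T$ tokens mean $T$ rounds of them. This is why generation suits TP better than FSDP: a TP all-reduce is small (activations of hidden-dimension size), whereas an FSDP all-gather is large (entire parameter shards).
- Training is bandwidth-bound: FSDP's all-gather and reduce-scatter move full parameter and gradient shards, so the volume is large but the number of calls is small (one all-gather and one reduce-scatter per layer).
- Weight synchronization is a fixed cost: it happens once per PPO iteration and scales with the parameter count.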
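
The size gap between the two communication patterns can be estimated with simple payload arithmetic. The sketch below assumes a 7B-class model with hidden size 4096 and 32 layers, two all-reduces per layer in TP (one after attention, one after the MLP), 2-byte elements, and a decode batch of 32. All of these are illustrative assumptions, and the numbers are payload sizes, not wire traffic under any particular collective algorithm.

```python
def tp_allreduce_bytes_per_step(batch, hidden, layers,
                                allreduces_per_layer=2, bytes_per_el=2):
    """Activation payload all-reduced during one decode step (one token per sequence)."""
    return batch * hidden * layers * allreduces_per_layer * bytes_per_el


def fsdp_allgather_bytes_per_forward(num_params, bytes_per_el=2):
    """Parameter payload gathered to run one full forward pass under FSDP."""
    return num_params * bytes_per_el


tp_bytes = tp_allreduce_bytes_per_step(batch=32, hidden=4096, layers=32)
fsdp_bytes = fsdp_allgather_bytes_per_forward(num_params=7_000_000_000)

print(f"TP per decode step:   {tp_bytes / 2**20:.0f} MiB")
print(f"FSDP per forward:     {fsdp_bytes / 1e9:.0f} GB")
print(f"ratio:                {fsdp_bytes / tp_bytes:.0f}x")
```

Under these assumptions a decode step moves a few megabytes in many small messages, which is why its cost is dominated by latency, while an FSDP forward moves gigabytes in a few large messages, which is why its cost is dominated by bandwidth.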

Production-Scale Configuration Example

Take RLHF training of a 70B model on 64 A100-80GB GPUs as an example:

flowchart TD
    subgraph Config["70B RLHF production configuration (colocated)"]
        HW["**Hardware** 8 nodes x 8 GPUs = 64 A100-80GB<br/>Interconnect: NVLink within a node, RDMA across nodes"]

        subgraph Models["Model parallelism strategy"]
            AR["**Actor + Reference**<br/>Training: FSDP across 64 GPUs<br/>Generation: TP=8 + PP=8"]
            CR["**Critic**<br/>Training: FSDP across 64 GPUs"]
            RW["**Reward Model**<br/>Inference: TP=8 within node"]
        end

        subgraph Mem["Memory budget (per GPU): ~73 GB total"]
            M1["Actor FSDP shard: ~2.2 GB"]
            M2["Actor optimizer: ~4.4 GB"]
            M3["Critic shard + optimizer: ~6.6 GB"]
            M4["Reward (TP=8): ~17.5 GB"]
            M5["Reference (TP=8): ~17.5 GB"]
            M6["Activations + KV Cache: ~20 GB"]
            M7["Communication buffers: ~5 GB"]
        end

        HW --> Models --> Mem
    end

    style HW fill:#cce5ff,stroke:#007bff
    style AR fill:#d4edda,stroke:#28a745
    style CR fill:#d4edda,stroke:#28a745
    style RW fill:#fff3cd,stroke:#ffc107
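
The per-GPU budget in the diagram can be re-derived from the model size. The sketch below is a consistency check under the diagram's apparent assumptions: 2-byte weights, Adam state stored as two more 2-byte copies, and the activation and buffer figures taken as given. The dictionary keys are illustrative names.

```python
GPUS = 64
TP = 8
MODEL_GB = 70 * 2  # 70B parameters at 2 bytes each

shard = MODEL_GB / GPUS  # one FSDP weight shard per GPU

budget_gb = {
    "actor_fsdp_shard": shard,
    "actor_optimizer": 2 * shard,             # Adam momentum + variance
    "critic_shard_and_optimizer": 3 * shard,  # weights + both Adam moments
    "reward_tp": MODEL_GB / TP,
    "reference_tp": MODEL_GB / TP,
    "activations_and_kv_cache": 20.0,         # figure taken from the diagram
    "communication_buffers": 5.0,             # figure taken from the diagram
}

total = sum(budget_gb.values())
for name, gb in budget_gb.items():
    print(f"{name:28s} {gb:6.2f} GB")
print(f"{'total':28s} {total:6.2f} GB of 80 GB")
```

The listed items carry no separate line for gradient shards. If FP16 gradient shards for the Actor and Critic were added, they would take about two more weight shards per GPU, which still fits under 80 GB but leaves little headroom.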

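A real deployment also has to consider:

- Activation checkpointing: trades compute for memory, especially important when training the Actor and Critic
- Mixed precision: BF16 training with FP32 optimizer state
- Gradient accumulation: raises the effective batch size and reduces communication frequency
- Prompt sorting: sort prompts by length to cut padding waste

Frontier Directions in RLHF

RLHF systems have continued to evolve quickly in recent years:

- GRPO (Group Relative Policy Optimization): proposed by DeepSeek, it drops the Critic and estimates advantages from the rewards of a group of responses to the same prompt, each measured relative to the rest of the group. Removing the Critic cuts memory by roughly 25-40% and removes the corresponding compute.
- Online DPO: combines DPO (Direct Preference Optimization) with online generation and can replace PPO in some settings.
- Asynchronous PPO: decouples generation from training to raise GPU utilization; while the Actor trains, the next batch is generated with an older version of the weights.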
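
To show how an advantage can be formed without a Critic, here is a minimal sketch of the group-relative idea behind GRPO: sample several responses to one prompt, then normalize each reward by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration rather than details taken from a specific implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to the other responses in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to the same prompt, scored 1.0 (good) or 0.0 (bad).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Every token of a response shares that response's advantage, so no per-token value estimate is needed. The price is that several responses have to be generated per prompt, which shifts cost from Critic training to rollout.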

Key Takeaways

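- RLHF needs four models: Actor (generates responses), Critic (estimates value), Reward Model (scores), and Reference (KL anchor). With optimizer state included, total memory is about 2.5x that of SFT.
- The PPO data flow is strictly ordered: generate → score → KL computation → GAE advantage estimation → PPO update. The four models form a DAG of dependencies, so no stage can be skipped or run before its inputs are ready; only the independent scoring passes (Reward Model, Reference, Critic) can overlap with one another.
- The core system challenge of RLHF is inference nested in training: generation is a memory-bound inference problem (it benefits from TP + KV cache), while training is a compute-bound optimization problem (it benefits from FSDP). The same set of GPUs has to switch between the two modes.
- PPO stabilizes training through two mechanisms:
  - KL penalty: $D_{KL}(\pi_\theta \,\|\, \pi_\text{ref})$ guards against reward hacking
  - Clipping: $\text{clip}(r_t, 1-\epsilon, 1+\epsilon)$ limits the size of each update
- verl's core innovation is the "hybrid engine": switching dynamically between FSDP (training) and TP (inference) on the same set of GPUs. Colocation avoids idle GPUs, and weight resharding converts between the two parallel layouts.
- Scale: RLHF on a 70B model needs at least 1400 GB of memory (18 A100-80GB GPUs); real deployments typically use 64-128 GPUs to reach reasonable training throughput.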

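Companion Code

The companion code for this article is in code/04-rlhf-system/:

minimal_rlhf.py: a complete RLHF training loop implemented from scratch. It includes the four model definitions, the PPO data-generation pipeline, the PPO training step, and visualizations of the system challenges. It uses small models (d_model=256) and runs on CPU, but the architecture and data flow mirror those of a production system.

To run it:

cd code/04-rlhf-system
python minimal_rlhf.py

The code has four parts, matching the four core topics of this article:

- Part 1, Four-Model Architecture: creates the four models and shows parameter counts and memory estimates
- Part 2, PPO Data Generation: the full rollout pipeline (generate → score → KL → GAE)
- Part 3, PPO Training Step: an end-to-end PPO training loop with Actor and Critic updates
- Part 4, System Challenges: data-flow diagram, compute cost comparison, and distributed strategy analysis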


References

- Ouyang et al., 2022. Training language models to follow instructions with human feedback. (InstructGPT, a pioneering RLHF work)
- Schulman et al., 2017. Proximal Policy Optimization Algorithms. (The original PPO paper)
- Sheng et al., 2024. HybridFlow: A Flexible and Efficient RLHF Framework. (The technical paper behind verl)
- Zheng et al., 2023. Secrets of RLHF in Large Language Models. (Practical lessons from RLHF training)
- Rafailov et al., 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (DPO, an alternative to RLHF)
- DeepSeek-AI, 2025. DeepSeek-R1. (Trained with GRPO, a simplified form of RLHF without a Critic)
- verl GitHub Repository. https://github.com/volcengine/verl (ByteDance's open-source RLHF training framework)
