Starting from the llama.py Source: A Deep Dive into vLLM's Distributed Inference

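The previous two vLLM articles covered PagedAttention and the fused_moe kernel. One core question was left open: when a model is too large for a single GPU, how does vLLM spread the computation across multiple GPUs?

This article starts directly from vLLM's vllm/model_executor/models/llama.py and walks layer by layer through the full forward path LlamaForCausalLM → LlamaModel → LlamaDecoderLayer → LlamaAttention → LlamaMLP, showing at each step how Tensor Parallel shards the weights, where it communicates, and how it cooperates with PagedAttention and Pipeline Parallel.

All code snippets are taken from the real source on vLLM's main branch rather than pseudocode; they are abridged, with ... marking elided arguments and lines.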

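1. The Big Picture: Distributed Operations in One Transformer Layer

Before diving into the code, let's build a mental model. The data flow of a standard Llama decoder layer under Tensor Parallel looks like this: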

Input hidden_states [batch, seq, hidden_dim] (each GPU holds a full copy)
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│  RMSNorm (runs independently on each GPU, no communication) │
└─────────────────┬───────────────────────────────────────────┘
                  │
     ┌────────────▼────────────┐
     │    QKV Projection       │
     │  QKVParallelLinear      │  ← Column Parallel: split along the head dimension
     │  each GPU computes Q/K/V│     output: per-GPU head shards of q, k, v
     │  for its own heads only │     ★ no communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │    Attention + RoPE     │
     │  PagedAttention kernel  │  ← each GPU computes its own heads independently
     │  KV cache split by head │     ★ no communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │    Output Projection    │
     │  RowParallelLinear      │  ← Row Parallel: split along the input dimension
     │  each GPU computes a    │     followed by an AllReduce on the output
     │  partial result         │     ★ AllReduce communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │  RMSNorm + Residual     │  ← no communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │    Gate+Up Projection   │
     │  MergedColumnParallel   │  ← Column Parallel: split along the output dimension
     │  gate and up merged     │     ★ no communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │    SiLU Activation      │  ← no communication
     └────────────┬────────────┘
                  │
     ┌────────────▼────────────┐
     │    Down Projection      │
     │  RowParallelLinear      │  ← Row Parallel: AllReduce
     │                         │     ★ AllReduce communication
     └────────────┬────────────┘
                  │
Output hidden_states [batch, seq, hidden_dim] (each GPU holds a full copy)

Each Transformer layer performs exactly 2 AllReduce operations: one after the attention output projection and one after the MLP down projection. This is the standard pattern of Megatron-LM-style Tensor Parallel.

2. Starting from the Outermost Layer: LlamaForCausalLM

# vllm/model_executor/models/llama.py

class LlamaForCausalLM(nn.Module, SupportsLoRA, SupportsPP, SupportsEagle, SupportsEagle3):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "", ...):
        config = vllm_config.model_config.hf_config

        # Core model
        self.model = LlamaModel(vllm_config=vllm_config, prefix="model", ...)

        # LM head: created only on the last Pipeline Parallel rank
        if get_pp_group().is_last_rank:
            self.lm_head = ParallelLMHead(
                config.vocab_size, config.hidden_size, ...)
            self.logits_processor = LogitsProcessor(config.vocab_size, ...)
        else:
            self.lm_head = PPMissingLayer()  # placeholder, allocates no GPU memory

    def forward(self, input_ids, positions, intermediate_tensors=None, ...):
        model_output = self.model(input_ids, positions, intermediate_tensors, ...)
        return model_output

    def compute_logits(self, hidden_states):
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits

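Two key distributed design decisions:

① Vocabulary parallelism in ParallelLMHead. The LM head is sharded along the vocabulary dimension (vocab_size) across the TP group. Llama 3 70B has a vocabulary of 128,256 tokens, so with 4-way TP each GPU holds the LM head weights for only 32,064 tokens.

② Conditional creation under Pipeline Parallel. get_pp_group().is_last_rank ensures the LM head is created only on the last PP stage. The other stages use PPMissingLayer() as a placeholder: it allocates no GPU memory and simply returns its input in forward.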

3. LlamaModel: Layer Assignment under Pipeline Parallel

class LlamaModel(nn.Module, EagleModelMixin):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "", ...):
        config = vllm_config.model_config.hf_config

        # Embedding: created only on the first PP rank
        if get_pp_group().is_first_rank:
            self.embed_tokens = VocabParallelEmbedding(
                config.vocab_size, config.hidden_size, ...)
        else:
            self.embed_tokens = PPMissingLayer()

        # ★ Core: assign decoder layers by PP rank
        self.start_layer, self.end_layer, self.layers = make_layers(
            config.num_hidden_layers,
            lambda prefix: LlamaDecoderLayer(vllm_config=vllm_config, prefix=prefix),
            prefix=f"{prefix}.layers",
        )

        # Final norm: created only on the last PP rank
        if get_pp_group().is_last_rank:
            self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        else:
            self.norm = PPMissingLayer()

make_layers is the core layer-assignment function for Pipeline Parallel:

# vllm/model_executor/models/utils.py

def make_layers(num_hidden_layers, layer_fn, prefix):
    start_layer, end_layer = get_pp_indices(
        num_hidden_layers,
        get_pp_group().rank_in_group,
        get_pp_group().world_size
    )
    # Only the layers owned by the current rank are actually built; the rest are PPMissingLayer placeholders
    modules = torch.nn.ModuleList(
        [PPMissingLayer() for _ in range(start_layer)]                    # leading placeholders
        + [layer_fn(prefix=f"{prefix}.{idx}")
           for idx in range(start_layer, end_layer)]                      # real layers
        + [PPMissingLayer() for _ in range(end_layer, num_hidden_layers)] # trailing placeholders
    )
    return start_layer, end_layer, modules

get_pp_indices tries to distribute the layers evenly. For example, Llama 3 70B has 80 layers, so with PP=4:

PP Rank 0: layers 0-19   (20 layers) + Embedding
PP Rank 1: layers 20-39  (20 layers)
PP Rank 2: layers 40-59  (20 layers)
PP Rank 3: layers 60-79  (20 layers) + RMSNorm + LM Head
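
The even split can be sketched in a few lines. This is a minimal illustration and not vLLM's get_pp_indices: the helper name even_pp_partition and the choice to hand any leftover layers to the earliest ranks are assumptions made for the example.

```python
def even_pp_partition(num_layers: int, pp_size: int) -> list[tuple[int, int]]:
    """Return one [start, end) layer range per pipeline rank."""
    base, extra = divmod(num_layers, pp_size)
    ranges = []
    start = 0
    for rank in range(pp_size):
        # Assumption: the first `extra` ranks each take one additional layer.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


for rank, (start, end) in enumerate(even_pp_partition(80, 4)):
    print(f"PP rank {rank}: layers {start}-{end - 1} ({end - start} layers)")
```

With 80 layers and 3 stages the same helper yields 27 + 27 + 26 layers, which is the uneven case a real implementation also has to handle.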

The PP data flow in forward:

def forward(self, input_ids, positions, intermediate_tensors=None, ...):
    if get_pp_group().is_first_rank:
        # PP rank 0: start from the token embedding
        hidden_states = self.embed_tokens(input_ids)
        residual = None
    else:
        # PP ranks 1-3: start from the intermediate tensors sent by the previous rank
        hidden_states = intermediate_tensors["hidden_states"]
        residual = intermediate_tensors["residual"]

    # Run only the layers owned by the current rank
    for layer in islice(self.layers, self.start_layer, self.end_layer):
        hidden_states, residual = layer(positions, hidden_states, residual)

    if not get_pp_group().is_last_rank:
        # Not the last rank: hand the intermediate tensors to the next rank
        return IntermediateTensors({
            "hidden_states": hidden_states,
            "residual": residual
        })

    # Last PP rank: apply the final RMSNorm
    hidden_states, _ = self.norm(hidden_states, residual)
    return hidden_states

PP communication pattern: stages exchange IntermediateTensors via P2P (point-to-point) transfers. For each micro-batch, hidden_states and residual flow stage by stage from Rank 0 to Rank 3.

4. LlamaDecoderLayer: Residual + Pre-Norm

class LlamaDecoderLayer(nn.Module):

    def __init__(self, vllm_config, prefix="", ...):
        config = vllm_config.model_config.hf_config

        self.self_attn = LlamaAttention(
            config=config,
            hidden_size=config.hidden_size,        # 8192
            num_heads=config.num_attention_heads,   # 64
            num_kv_heads=config.num_key_value_heads,# 8 (GQA)
            ...)
        self.mlp = LlamaMLP(
            hidden_size=config.hidden_size,        # 8192
            intermediate_size=config.intermediate_size, # 28672
            hidden_act="silu", ...)
        self.input_layernorm = RMSNorm(config.hidden_size, ...)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, ...)

    def forward(self, positions, hidden_states, residual):
        # Pre-Norm + Attention
        if residual is None:
            residual = hidden_states
            hidden_states = self.input_layernorm(hidden_states)
        else:
            hidden_states, residual = self.input_layernorm(hidden_states, residual)

        hidden_states = self.self_attn(positions=positions,
                                       hidden_states=hidden_states)

        # Pre-Norm + MLP
        hidden_states, residual = self.post_attention_layernorm(hidden_states,
                                                                residual)
        hidden_states = self.mlp(hidden_states)
        return hidden_states, residual

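Fused residual add in RMSNorm. Note that self.input_layernorm(hidden_states, residual) takes two arguments: a single kernel performs both the residual addition and the RMSNorm, saving one extra round trip to HBM. This fused operation involves no communication, because RMSNorm normalizes each token over the hidden dimension and every GPU holds a full copy of both hidden_states and residual.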
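
The semantics of the fused call can be sketched in plain Python for a single token. This is a reference sketch under simplifying assumptions (no batching, ordinary float math, and a hypothetical function name fused_add_rms_norm); it is not the CUDA kernel that vLLM actually dispatches to.

```python
import math


def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Return (normalized, new_residual) for one token vector.

    new_residual = x + residual
    normalized   = new_residual / sqrt(mean(new_residual^2) + eps) * weight
    """
    new_residual = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normalized = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normalized, new_residual


out, res = fused_add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0], eps=0.0)
print(out, res)
```

The mean is taken over the whole hidden dimension of one token, which is why the operation needs the full vector locally but nothing from other GPUs.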

5. LlamaAttention: The Core of Tensor Parallel

This is the centerpiece of the article.

class LlamaAttention(nn.Module):

    def __init__(self, config, hidden_size, num_heads, num_kv_heads, ...):
        tp_size = get_tensor_model_parallel_world_size()

        self.total_num_heads = num_heads        # 64 (Llama 3 70B)
        self.num_heads = num_heads // tp_size   # 64/4 = 16 per GPU

        # GQA: there may be fewer KV heads than the TP size
        self.total_num_kv_heads = num_kv_heads  # 8 (Llama 3 70B)
        if self.total_num_kv_heads >= tp_size:
            self.num_kv_heads = num_kv_heads // tp_size  # 8/4 = 2 per GPU
        else:
            self.num_kv_heads = 1  # too few to split: each GPU holds one replicated KV head

        self.head_dim = hidden_size // num_heads  # 8192/64 = 128
        self.q_size = self.num_heads * self.head_dim      # 16*128 = 2048 per GPU
        self.kv_size = self.num_kv_heads * self.head_dim  # 2*128 = 256 per GPU
        self.scaling = self.head_dim ** -0.5              # softmax scale

        # ★ Merged QKV projection: Column Parallel
        self.qkv_proj = QKVParallelLinear(
            hidden_size=hidden_size,           # 8192
            head_size=self.head_dim,           # 128
            total_num_heads=num_heads,         # 64
            total_num_kv_heads=num_kv_heads,   # 8
            ...)

        # ★ Output projection: Row Parallel
        self.o_proj = RowParallelLinear(
            input_size=num_heads * self.head_dim,  # 64*128 = 8192
            output_size=hidden_size,                # 8192
            ...)

        self.rotary_emb = get_rope(self.head_dim, ...)
        self.attn = Attention(self.num_heads, self.head_dim, self.scaling,
                              num_kv_heads=self.num_kv_heads, ...)

    def forward(self, positions, hidden_states):
        qkv, _ = self.qkv_proj(hidden_states)       # Column Parallel, no communication
        q, k, v = qkv.split([self.q_size,
                              self.kv_size,
                              self.kv_size], dim=-1)
        q, k = self.rotary_emb(positions, q, k)      # no communication
        attn_output = self.attn(q, k, v)              # PagedAttention, no communication
        output, _ = self.o_proj(attn_output)          # Row Parallel, AllReduce!
        return output

5.1 QKVParallelLinear: Sharding by Head

QKVParallelLinear inherits from ColumnParallelLinear. It merges the Q, K, and V projection weights into one matrix and shards it across GPUs along the head dimension.

# vllm/model_executor/layers/linear.py

class QKVParallelLinear(ColumnParallelLinear):
    def __init__(self, hidden_size, head_size, total_num_heads,
                 total_num_kv_heads, ...):
        tp_size = get_tensor_model_parallel_world_size()

        # Number of heads assigned to each GPU
        self.num_heads = divide(total_num_heads, tp_size)
        if tp_size >= total_num_kv_heads:
            self.num_kv_heads = 1  # too few KV heads to split: each GPU holds one replicated KV head
        else:
            self.num_kv_heads = divide(total_num_kv_heads, tp_size)

        # Output size = Q heads + K heads + V heads (per GPU)
        # Multiplied by tp_size again because the parent ColumnParallelLinear divides by tp_size
        self.output_sizes = [
            self.num_heads * head_size * tp_size,      # Q
            self.num_kv_heads * head_size * tp_size,   # K
            self.num_kv_heads * head_size * tp_size,   # V
        ]

Taking Llama 3 70B (TP=4) as an example:

Original weights:
  W_Q: [8192, 8192]    = [hidden_dim, 64 heads × 128 head_dim]
  W_K: [8192, 1024]    = [hidden_dim, 8 heads × 128 head_dim]
  W_V: [8192, 1024]    = [hidden_dim, 8 heads × 128 head_dim]
  Merged: [8192, 10240] = [hidden_dim, (64+8+8) × 128]

After TP=4 sharding, each GPU holds:
  W_Q_shard: [8192, 2048]  = [hidden_dim, 16 heads × 128]
  W_K_shard: [8192, 256]   = [hidden_dim, 2 heads × 128]
  W_V_shard: [8192, 256]   = [hidden_dim, 2 heads × 128]
  Merged shard: [8192, 2560] = [hidden_dim, (16+2+2) × 128]

Sharding (Column Parallel):
┌──────────── 10240 (output_dim) ──────────────┐
│     Q (8192)      │  K (1024)  │  V (1024)   │  full weight
├───────────────────────────────────────────────┤
│Q_0 (2048)|Q_1|Q_2|Q_3|K_0|K_1|K_2|K_3|V_0|..│  split by head across 4 GPUs
│  GPU 0   |GP1|GP2|GP3|G0 |G1 |G2 |G3 |G0 |  │
└───────────────────────────────────────────────┘

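The key property of Column Parallel: the input is not sharded, the output is sharded, and no communication is needed. Each GPU takes the full hidden_states, multiplies it by its own weight shard, and obtains the Q/K/V of the heads it owns.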
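
The per-GPU widths used above follow directly from the head counts. The helper below is a small sketch of that arithmetic; the function name qkv_shard_sizes is hypothetical and only mirrors the q_size / kv_size bookkeeping described in the text.

```python
def qkv_shard_sizes(hidden_size, total_heads, total_kv_heads, tp_size):
    """Return (q_size, kv_size, merged_width) of one GPU's qkv_proj output."""
    head_dim = hidden_size // total_heads
    heads_per_gpu = total_heads // tp_size
    # With fewer KV heads than GPUs, every GPU still keeps one KV head.
    kv_heads_per_gpu = max(1, total_kv_heads // tp_size)
    q_size = heads_per_gpu * head_dim
    kv_size = kv_heads_per_gpu * head_dim
    return q_size, kv_size, q_size + 2 * kv_size


# Llama 3 70B with TP=4: 16 Q heads and 2 KV heads per GPU.
print(qkv_shard_sizes(8192, 64, 8, 4))
```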

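5.2 How GQA (Grouped Query Attention) Interacts with TP

Llama 3 70B uses GQA: 64 Q heads but only 8 KV heads, so every 8 Q heads share 1 KV head.

With TP=4:

Each GPU gets 16 Q heads and 2 KV heads
Every 8 Q heads share 1 KV head, so each GPU's 16 Q heads map exactly onto its 2 local KV heads

But with TP=16 (fewer KV heads than the TP size):

Each GPU gets 4 Q heads, but 8 KV heads cannot be split across 16 GPUs
In this case num_kv_heads = 1: each GPU holds one full KV head, and each KV head is replicated on 16/8 = 2 GPUs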

# KV head handling logic in QKVParallelLinear
if tp_size >= self.total_num_kv_heads:
    self.num_kv_heads = 1
    self.num_kv_head_replicas = divide(tp_size, self.total_num_kv_heads)
else:
    self.num_kv_heads = divide(self.total_num_kv_heads, tp_size)
    self.num_kv_head_replicas = 1
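
To make the two branches concrete, the sketch below computes which KV heads a given TP rank ends up holding. The helper names are hypothetical, and the rule that consecutive ranks share a replicated KV head (rank // replicas) is an assumption of this example.

```python
def kv_head_layout(total_kv_heads, tp_size):
    """Return (kv_heads_per_gpu, replicas_per_kv_head)."""
    if tp_size >= total_kv_heads:
        assert tp_size % total_kv_heads == 0
        return 1, tp_size // total_kv_heads
    assert total_kv_heads % tp_size == 0
    return total_kv_heads // tp_size, 1


def kv_heads_for_rank(tp_rank, total_kv_heads, tp_size):
    """List the global KV head indices stored on one TP rank."""
    per_gpu, replicas = kv_head_layout(total_kv_heads, tp_size)
    # Assumption: ranks that share a replicated KV head are consecutive.
    first = (tp_rank // replicas) * per_gpu
    return list(range(first, first + per_gpu))


print(kv_head_layout(8, 4), kv_heads_for_rank(3, 8, 4))
print(kv_head_layout(8, 16), kv_heads_for_rank(5, 8, 16))
```

Replication trades memory for simplicity: with TP=16 the KV cache of each head is stored twice, but attention still runs without any cross-GPU traffic.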

5.3 RowParallelLinear: AllReduce Communication

The output projection (o_proj) uses RowParallelLinear: the weight is sharded along the input dimension and the output is AllReduced.

class RowParallelLinear(LinearBase):
    def __init__(self, input_size, output_size, ...):
        tp_size = get_tensor_model_parallel_world_size()
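        # Per-GPU input dimension = total input dimension / TP
        self.input_size_per_partition = divide(input_size, tp_size)
        # The output dimension is not sharded
        self.output_size_per_partition = output_size

    def forward(self, input_):
        # The input is already sharded (it is the output of the preceding Column Parallel layer)
        input_parallel = input_

        # Local matmul: each GPU uses its own weight shard.
        # The bias is fused only on rank 0 so the AllReduce does not add it tp_size times.
        bias_ = None if (self.tp_rank > 0 or self.skip_bias_add) else self.bias
        output_parallel = self.quant_method.apply(self, input_parallel, bias_)

        # ★ AllReduce: sum the partial results from all GPUs
        if self.reduce_results and self.tp_size > 1:
            output = tensor_model_parallel_all_reduce(output_parallel)
        else:
            output = output_parallel

        output_bias = self.bias if self.skip_bias_add else None
        return output, output_bias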

Taking o_proj as an example (Llama 3 70B, TP=4):

Original weight W_O: [8192, 8192]  = [input_dim, output_dim]

TP=4 sharding (Row Parallel, split along input_dim):
  GPU 0: W_O_0 = [2048, 8192]  (input_dim 0:2048)
  GPU 1: W_O_1 = [2048, 8192]  (input_dim 2048:4096)
  GPU 2: W_O_2 = [2048, 8192]  (input_dim 4096:6144)
  GPU 3: W_O_3 = [2048, 8192]  (input_dim 6144:8192)

Computation:
  GPU 0: y_0 = x_0 × W_O_0   (x_0 is GPU 0's attention output, [B, 2048])
  GPU 1: y_1 = x_1 × W_O_1
  GPU 2: y_2 = x_2 × W_O_2
  GPU 3: y_3 = x_3 × W_O_3

AllReduce:
  y = y_0 + y_1 + y_2 + y_3   (every GPU ends up with the full y)

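Why do Column Parallel and Row Parallel pair up? Mathematically, Y = X × [A_0, A_1, ..., A_{p-1}] is the same as splitting A by columns, computing each X × A_i separately, and concatenating the results, so each Column Parallel output is simply a different slice of the full result. The following Row Parallel layer takes those slices as they are, treats them as the slices of its own input dimension, multiplies each by the matching row shard of its weight, and sums the partial products with an AllReduce to obtain the full result. Chained together, the two layers need no communication in between: the sharded output of Column Parallel is exactly the sharded input of Row Parallel.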
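
The pairing can be checked numerically on a toy example. The sketch below uses tiny integer matrices and plain Python lists; the helper names are made up for the illustration, and the final summation loop stands in for the AllReduce.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, 2], [3, 4]]                    # [tokens, hidden]
A = [[1, 0, 2, 1], [0, 1, 1, 3]]        # first projection  [hidden, inner]
B = [[1, 2], [0, 1], [3, 0], [1, 1]]    # second projection [inner, hidden]

reference = matmul(matmul(x, A), B)     # single-GPU result

tp = 2
partials = []
for a_shard, b_shard in zip(split_cols(A, tp), split_rows(B, tp)):
    local = matmul(x, a_shard)                # Column Parallel: no communication
    partials.append(matmul(local, b_shard))   # Row Parallel: partial sum per GPU

all_reduced = partials[0]
for partial in partials[1:]:
    all_reduced = add(all_reduced, partial)   # stand-in for AllReduce

assert all_reduced == reference
```

In the real layers something sits between the two matmuls (per-head attention, or SiLU-and-multiply in the MLP), but it acts independently on each shard, so the same argument carries over.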

6. LlamaMLP: The Same Sharding Pattern

class LlamaMLP(nn.Module):

    def __init__(self, hidden_size, intermediate_size, hidden_act, ...):
        # Gate and up are merged into a single Column Parallel layer
        self.gate_up_proj = MergedColumnParallelLinear(
            input_size=hidden_size,                    # 8192
            output_sizes=[intermediate_size] * 2,      # [28672, 28672]
            ...)
        # Down projection: Row Parallel
        self.down_proj = RowParallelLinear(
            input_size=intermediate_size,               # 28672
            output_size=hidden_size,                    # 8192
            ...)
        self.act_fn = SiluAndMul()

    def forward(self, x):
        x, _ = self.gate_up_proj(x)   # Column Parallel, no communication
        x = self.act_fn(x)            # SiLU(gate) * up, no communication
        x, _ = self.down_proj(x)      # Row Parallel, AllReduce!
        return x

6.1 MergedColumnParallelLinear

MergedColumnParallelLinear is a variant of ColumnParallelLinear that packs several logically independent Column Parallel layers into one physical weight matrix. In Llama's MLP, gate_proj and up_proj take the same input, so they can be merged.

class MergedColumnParallelLinear(ColumnParallelLinear):
    def __init__(self, input_size, output_sizes, ...):
        self.output_sizes = output_sizes  # [28672, 28672]
        tp_size = get_tensor_model_parallel_world_size()

        # Each sub-layer's output dimension is divided by TP
        assert all(size % tp_size == 0 for size in output_sizes)
        # Total output = sum(output_sizes) = 57344
        super().__init__(input_size=input_size, output_size=sum(output_sizes), ...)

Taking Llama 3 70B (TP=4) as an example:

Original weights:
  W_gate: [8192, 28672]
  W_up:   [8192, 28672]
  Merged: [8192, 57344]

TP=4 sharding (Column Parallel, split along output_dim):
  GPU 0: [8192, 14336]  = gate_shard[8192,7168] + up_shard[8192,7168]
  GPU 1: [8192, 14336]
  GPU 2: [8192, 14336]
  GPU 3: [8192, 14336]
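
One consequence of this layout is that each GPU's merged output is [gate_shard | up_shard], so SiLU(gate) * up can be applied locally. The sketch below checks this on a toy vector; silu_and_mul here is a plain-Python stand-in written for the example, not vLLM's SiluAndMul op.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def silu_and_mul(row):
    """Treat the first half as gate and the second half as up."""
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]


gate = [0.5, -1.0, 2.0, 0.0]   # full gate_proj output for one token
up = [1.0, 2.0, -1.0, 3.0]     # full up_proj output for the same token

full = silu_and_mul(gate + up)  # single-GPU result

tp = 2
width = len(gate) // tp
shards = []
for rank in range(tp):
    lo, hi = rank * width, (rank + 1) * width
    local = gate[lo:hi] + up[lo:hi]      # per-GPU layout: [gate_shard | up_shard]
    shards.append(silu_and_mul(local))   # purely local, no communication

assert shards[0] + shards[1] == full
```

If gate and up were instead sharded as one contiguous 57344-wide block, some GPUs would hold only gate columns and others only up columns, and the element-wise product would need communication.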

6.2 Shard Mapping at Weight-Loading Time

HuggingFace checkpoints store gate_proj and up_proj separately, whereas vLLM merges them internally into gate_up_proj. The stacked_params_mapping in load_weights handles this mapping:

# LlamaModel.load_weights()
stacked_params_mapping = [
    # (vLLM internal name, HF weight name, shard_id)
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]

# At load time:
# HF weight "layers.0.mlp.gate_proj.weight" → vLLM "layers.0.mlp.gate_up_proj.weight", shard_id=0
# HF weight "layers.0.mlp.up_proj.weight"   → vLLM "layers.0.mlp.gate_up_proj.weight", shard_id=1
# weight_loader uses shard_id and tp_rank to write the weight into the right location
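
The name rewrite and the slice selection can be sketched as two small functions. Both helpers (resolve_stacked_param and gate_up_copy_plan) are hypothetical names written for this illustration; the real logic lives in load_weights and in each layer's weight_loader, and the sketch assumes the output dimension is the first weight dimension, as in torch.nn.Linear.

```python
STACKED_PARAMS_MAPPING = [
    # (vLLM internal name, HF weight name, shard_id)
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def resolve_stacked_param(hf_name):
    """Map an HF weight name to (vLLM parameter name, shard_id or None)."""
    for param_name, weight_name, shard_id in STACKED_PARAMS_MAPPING:
        if weight_name in hf_name:
            return hf_name.replace(weight_name, param_name), shard_id
    return hf_name, None   # not a stacked parameter: loaded under its own name


def gate_up_copy_plan(shard_id, tp_rank, intermediate_size, tp_size):
    """Return ((src_start, src_end), (dst_start, dst_end)) row ranges.

    src: rows of the HF gate_proj/up_proj tensor that belong to this TP rank.
    dst: rows of the local merged gate_up_proj weight they are copied into.
    """
    shard = intermediate_size // tp_size
    src = (tp_rank * shard, (tp_rank + 1) * shard)
    dst = (shard_id * shard, (shard_id + 1) * shard)
    return src, dst


print(resolve_stacked_param("model.layers.0.mlp.up_proj.weight"))
print(gate_up_copy_plan(shard_id=1, tp_rank=2, intermediate_size=28672, tp_size=4))
```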

7. Concrete Numbers for Llama 3 70B on 4×A100

Let's work out the per-layer parameter distribution with concrete numbers.

7.1 Model Configuration

Llama 3 70B configuration:
  hidden_size = 8192
  num_attention_heads = 64
  num_key_value_heads = 8 (GQA)
  head_dim = 128
  intermediate_size = 28672
  num_hidden_layers = 80
  vocab_size = 128256

7.2 Per-GPU Weights with TP=4

Attention weights per GPU:
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer               │ Full shape       │ Per-GPU shape  │ Params per GPU │
├──────────────────────────────────────────────────────────────────────────┤
│ qkv_proj (Column)   │ [8192, 10240]    │ [8192, 2560]   │ 20.97M         │
│   Q part            │ [8192, 8192]     │ [8192, 2048]   │ 16.78M         │
│   K part            │ [8192, 1024]     │ [8192, 256]    │ 2.10M          │
│   V part            │ [8192, 1024]     │ [8192, 256]    │ 2.10M          │
│ o_proj (Row)        │ [8192, 8192]     │ [2048, 8192]   │ 16.78M         │
├──────────────────────────────────────────────────────────────────────────┤
│ Attention total/GPU │                  │                │ 37.75M         │
└──────────────────────────────────────────────────────────────────────────┘

MLP weights per GPU:
┌────────────────────────────────────────────────────────────────────────────┐
│ Layer                 │ Full shape       │ Per-GPU shape  │ Params per GPU │
├────────────────────────────────────────────────────────────────────────────┤
│ gate_up_proj (Column) │ [8192, 57344]    │ [8192, 14336]  │ 117.44M        │
│   gate part           │ [8192, 28672]    │ [8192, 7168]   │ 58.72M         │
│   up part             │ [8192, 28672]    │ [8192, 7168]   │ 58.72M         │
│ down_proj (Row)       │ [28672, 8192]    │ [7168, 8192]   │ 58.72M         │
├────────────────────────────────────────────────────────────────────────────┤
│ MLP total/GPU         │                  │                │ 176.16M        │
└────────────────────────────────────────────────────────────────────────────┘

Per decoder layer, per GPU: ~213.91M parameters
80 layers, per GPU: ~17.11B parameters

Plus Embedding + LM Head (vocabulary parallel):
  Embedding: 128256 × 8192 / 4 ≈ 262.67M per GPU
  LM Head:   8192 × 128256 / 4 ≈ 262.67M per GPU (shared with the embedding only when tie_word_embeddings is set; counted separately here)

Total parameters per GPU: ~17.64B (~35.3 GB in BF16)
Total over 4 GPUs: ~70.6B (consistent with the 70B parameter count)
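
The totals above can be reproduced with a few lines of arithmetic. The sketch below counts only the large matrices (the small RMSNorm weight vectors are ignored, as in the tables); the function name per_gpu_params is made up for the example.

```python
def per_gpu_params(hidden, heads, kv_heads, intermediate, layers, vocab, tp):
    """Return (params per decoder layer per GPU, total params per GPU)."""
    head_dim = hidden // heads
    qkv = hidden * (heads + 2 * kv_heads) * head_dim // tp
    o_proj = heads * head_dim * hidden // tp
    gate_up = hidden * 2 * intermediate // tp
    down = intermediate * hidden // tp
    per_layer = qkv + o_proj + gate_up + down
    embed = vocab * hidden // tp          # same size for the LM head
    return per_layer, per_layer * layers + 2 * embed


per_layer, total = per_gpu_params(
    hidden=8192, heads=64, kv_heads=8, intermediate=28672,
    layers=80, vocab=128256, tp=4,
)
print(f"per layer / GPU: {per_layer / 1e6:.2f}M")
print(f"total / GPU:     {total / 1e9:.2f}B  ({total * 2 / 1e9:.1f} GB in BF16)")
print(f"all 4 GPUs:      {4 * total / 1e9:.2f}B")
```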

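7.3 Communication Volume

AllReduce volume per decoder layer:
  after o_proj: hidden_states [batch × seq, 8192] → 2 transfer phases (reduce-scatter + all-gather)
  after down_proj: same as above

  Data per AllReduce = batch × seq × 8192 × 2 bytes (BF16)

  Assuming batch=32, seq=1 (decode phase):
  one AllReduce = 32 × 8192 × 2 = 512 KB
  2 per layer = 1 MB
  80 layers = 80 MB

  A100 NVLink bandwidth: 600 GB/s (bidirectional)
  Bandwidth-only transfer time ≈ 80 MB / 600 GB/s ≈ 0.13 ms
  Far below the compute time, so NVLink bandwidth is not the bottleneck; at 512 KB per call, what remains is mostly the fixed latency of the 160 AllReduce calls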
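
The same estimate as a small calculation. The 10 µs per-call latency below is a purely hypothetical placeholder added to show how a fixed per-call cost compares with the bandwidth term; it is not a measured value, and real numbers depend on the GPU, topology, and communication library.

```python
def allreduce_payload_bytes(tokens, hidden_size, dtype_bytes=2):
    """Size of the tensor being all-reduced (BF16 by default)."""
    return tokens * hidden_size * dtype_bytes


HIDDEN, LAYERS, ALLREDUCE_PER_LAYER = 8192, 80, 2

per_call = allreduce_payload_bytes(tokens=32, hidden_size=HIDDEN)
calls = LAYERS * ALLREDUCE_PER_LAYER
per_step = per_call * calls

NVLINK_BYTES_PER_S = 600 * 2**30   # 600 GB/s, binary units as in the text
ASSUMED_LATENCY_S = 10e-6          # hypothetical fixed cost per AllReduce call

bandwidth_time_ms = per_step / NVLINK_BYTES_PER_S * 1e3
latency_time_ms = calls * ASSUMED_LATENCY_S * 1e3

print(f"per call: {per_call // 1024} KB, per step: {per_step // 2**20} MB")
print(f"bandwidth term: {bandwidth_time_ms:.2f} ms")
print(f"latency term (assumed): {latency_time_ms:.2f} ms")
```

For small decode batches the fixed per-call cost can outweigh the bandwidth term, which is one reason low-latency AllReduce implementations matter for decoding.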

7.4 KV Cache Allocation

The KV cache is also sharded by TP (each GPU caches only the KV heads it owns):

KV cache per token (per GPU):
  K: num_kv_heads_per_gpu × head_dim = 2 × 128 = 256 elements
  V: same = 256 elements
  K+V per layer: 512 × 2 bytes (BF16) = 1024 bytes = 1 KB
  80 layers: 80 KB / token / GPU

Compared with the full (unsharded) KV cache:
  K: 8 × 128 = 1024 elements
  V: same = 1024 elements
  K+V per layer: 2048 × 2 = 4096 bytes = 4 KB
  80 layers: 320 KB / token

With TP=4 the KV cache is spread evenly over 4 GPUs, 80 KB/token on each.
Assuming a maximum context of 8192 tokens:
  KV cache per GPU = 8192 × 80 KB = 640 MB

PagedAttention block allocation (block_size=16):
  each block stores the KV of 16 tokens
  block size = 16 × 80 KB = 1280 KB = 1.25 MB (per GPU, summed over all 80 layers)
  8192 tokens need 512 blocks
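
These KV cache figures can be checked with a short calculation. The helper name kv_bytes_per_token is made up for the example, and sizes are reported in binary units (1 KB = 1024 bytes) to match the text.

```python
import math


def kv_bytes_per_token(kv_heads_per_gpu, head_dim, layers, dtype_bytes=2):
    """K and V bytes stored per token on one GPU, summed over all layers."""
    return 2 * kv_heads_per_gpu * head_dim * dtype_bytes * layers


BLOCK_SIZE = 16
CONTEXT = 8192

per_token = kv_bytes_per_token(kv_heads_per_gpu=2, head_dim=128, layers=80)
per_context = per_token * CONTEXT
per_block = per_token * BLOCK_SIZE
num_blocks = math.ceil(CONTEXT / BLOCK_SIZE)

print(f"per token:  {per_token / 1024:.0f} KB")
print(f"8192 ctx:   {per_context / 2**20:.0f} MB")
print(f"per block:  {per_block / 2**20:.2f} MB")
print(f"blocks:     {num_blocks}")
```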

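8. Combining PagedAttention with Distributed Execution

PagedAttention's paged KV cache management has an elegant property in the distributed setting: the Block Table is global, while the KV data is sharded by TP.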

                        BlockManager (CPU side, global)
                        ┌─────────────────────────┐
                        │ Block Table:            │
                        │ Seq 0 → [B3, B7, B1]    │ ← all GPUs share the same Block Table
                        │ Seq 1 → [B5, B2]        │
                        │ Seq 2 → [B0, B4, B6]    │
                        └────────────┬────────────┘
                                     │
              ┌──────────────────────┼──────────────────────┐
              ▼                      ▼                      ▼
         GPU 0 KV Cache         GPU 1 KV Cache         GPU 2 KV Cache
   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
   │ Block 0: KV shard│   │ Block 0: KV shard│   │ Block 0: KV shard│
   │ (heads 0-1)      │   │ (heads 2-3)      │   │ (heads 4-5)      │
   │ Block 1: ...     │   │ Block 1: ...     │   │ Block 1: ...     │
   │ ...              │   │ ...              │   │ ...              │
   └──────────────────┘   └──────────────────┘   └──────────────────┘

   Each block on each GPU stores only the KV heads that GPU owns
   Block numbers are consistent across GPUs (Block 3 refers to the same span of tokens on every GPU)

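Why this design?

The Block Table needs no GPU-to-GPU communication. The scheduler maintains the global Block Table on the CPU and sends it to every GPU worker before each forward pass as part of the attention metadata (InputMetadata in early vLLM versions). Because the Block Table is just integer indices (tens to a few hundred per sequence), the data volume is tiny.
KV data is naturally sharded by head. The output of QKVParallelLinear is already split by head: GPU 0 computes Q/K/V for Q heads 0-15 and therefore only needs to cache K/V for KV heads 0-1 (under GQA, 16 Q heads share 2 KV heads).
The PagedAttention kernel needs no communication. Each GPU independently computes attention for the heads it owns, reading its own KV cache blocks. The attention results are combined later by the AllReduce in o_proj.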
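
The three points above can be sketched as a lookup. Every rank resolves a token position to the same (block, offset) pair from the shared table and then reads only its own heads from its local cache. The table contents mirror the diagram; the function names are hypothetical and the layout is deliberately simplified.

```python
BLOCK_SIZE = 16

# Shared by all TP ranks: sequence id -> list of physical block numbers.
block_table = {0: [3, 7, 1], 1: [5, 2], 2: [0, 4, 6]}


def locate(seq_id, token_pos):
    """Map a token position to (physical block number, offset inside the block)."""
    block = block_table[seq_id][token_pos // BLOCK_SIZE]
    return block, token_pos % BLOCK_SIZE


def local_kv_heads(tp_rank, total_kv_heads=8, tp_size=4):
    """Global KV head indices stored in this rank's copy of every block."""
    per_gpu = total_kv_heads // tp_size
    return list(range(tp_rank * per_gpu, (tp_rank + 1) * per_gpu))


block, offset = locate(seq_id=0, token_pos=20)
for rank in range(4):
    print(f"rank {rank}: block {block}, offset {offset}, KV heads {local_kv_heads(rank)}")
```

The block number and offset are identical on every rank; only the heads stored behind them differ.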

9. Vocabulary Parallelism in the Embedding and LM Head

9.1 VocabParallelEmbedding

The embedding layer is sharded along the vocabulary dimension:

class VocabParallelEmbedding(CustomOp):
    def __init__(self, num_embeddings, embedding_dim, ...):
        tp_rank = get_tensor_model_parallel_rank()
        self.tp_size = get_tensor_model_parallel_world_size()
        # The vocabulary is sharded by TP
        # For example, vocab=128256, TP=4:
        #   GPU 0: tokens 0-32063
        #   GPU 1: tokens 32064-64127
        #   GPU 2: tokens 64128-96191
        #   GPU 3: tokens 96192-128255

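The forward pass:

Each GPU checks which tokens in input_ids fall into its own vocabulary shard
Tokens in its shard are looked up in the embedding table as usual
Tokens outside its shard produce a zero vector
An AllReduce sums the results (zero vectors + the correct embedding = the correct embedding)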
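
The four steps above can be sketched with a toy vocabulary of 8 tokens split over 4 ranks. All names are hypothetical, the table values are arbitrary, and the element-wise sum stands in for the AllReduce.

```python
def vocab_range(rank, vocab_size, tp_size):
    """Return the [start, end) token-id range owned by one TP rank."""
    per_rank = vocab_size // tp_size
    return rank * per_rank, (rank + 1) * per_rank


def local_lookup(token_id, rank, shard, vocab_size, tp_size):
    """Embedding row if this rank owns the token, otherwise a zero vector."""
    lo, hi = vocab_range(rank, vocab_size, tp_size)
    if lo <= token_id < hi:
        return list(shard[token_id - lo])
    return [0.0] * len(shard[0])


VOCAB, TP = 8, 4
full_table = [[float(i), float(10 * i)] for i in range(VOCAB)]
shards = [full_table[slice(*vocab_range(r, VOCAB, TP))] for r in range(TP)]


def embed(token_id):
    partials = [local_lookup(token_id, r, shards[r], VOCAB, TP) for r in range(TP)]
    # Stand-in for AllReduce: element-wise sum over all ranks.
    return [sum(values) for values in zip(*partials)]


print(embed(5), full_table[5])
print(vocab_range(1, 128256, 4))
```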

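9.2 ParallelLMHead

The LM head is sharded along the vocabulary dimension as well. The per-GPU logits must then be gathered (an AllGather, or a gather to a single rank depending on the configuration) and concatenated into the full vocabulary distribution:

GPU 0: logits_0 [batch, 32064]   (scores for tokens 0-32063)
GPU 1: logits_1 [batch, 32064]
GPU 2: logits_2 [batch, 32064]
GPU 3: logits_3 [batch, 32064]
    ↓ AllGather
Full logits [batch, 128256]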
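
As a toy illustration of the gather step, the sketch below concatenates per-rank logit shards in rank order and maps the winning index back to the rank that produced it. The shard values are arbitrary and gather_logits is a hypothetical stand-in for the collective.

```python
def gather_logits(shards):
    """Concatenate per-rank logit shards in rank order (stand-in for AllGather)."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


# 4 ranks, 2 vocabulary entries each, for a single sequence.
shards = [[0.1, 0.3], [2.0, -1.0], [0.5, 0.4], [1.5, 0.0]]
full = gather_logits(shards)

best_token = max(range(len(full)), key=full.__getitem__)
owner_rank, local_index = divmod(best_token, len(shards[0]))
print(best_token, owner_rank, local_index)
```

Concatenating in rank order matters: global token id = rank × shard size + local index only holds if the shards are laid out contiguously by rank.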

10. The Complete Data Flow

Taking Llama 3 70B with TP=4, PP=1 as an example, the complete data flow of one token through one decoder layer:

hidden_states [B, 8192] (each GPU holds a full copy)
│
├─ RMSNorm ─────────────────────────────── independent per GPU, no communication
│
├─ QKVParallelLinear ────────────────────── Column Parallel
│   │  GPU 0: [B, 8192] × [8192, 2560] → [B, 2560]  (16 Q + 2 K + 2 V heads)
│   │  GPU 1: [B, 8192] × [8192, 2560] → [B, 2560]
│   │  GPU 2: [B, 8192] × [8192, 2560] → [B, 2560]
│   │  GPU 3: [B, 8192] × [8192, 2560] → [B, 2560]
│   │  ★ no communication
│   │
│   ├─ split → Q[B, 2048], K[B, 256], V[B, 256] per GPU
│   ├─ RoPE(Q, K) ───────────────────────── independent per GPU
│   │
│   ├─ PagedAttention(Q, K, V, KV_Cache) ── independent per GPU
│   │  │  GPU 0: 16 Q heads attend to 2 KV heads (GQA ratio 8:1)
│   │  │  GPU 1: same
│   │  │  GPU 2: same
│   │  │  GPU 3: same
│   │  │  KV cache blocks sharded by head, Block Table shared globally
│   │  │  ★ no communication
│   │  │
│   │  └→ attn_output [B, 2048] per GPU
│   │
│   └─ RowParallelLinear (o_proj) ────────── Row Parallel
│      │  GPU 0: [B, 2048] × [2048, 8192] → [B, 8192] (partial)
│      │  GPU 1: [B, 2048] × [2048, 8192] → [B, 8192] (partial)
│      │  GPU 2: [B, 2048] × [2048, 8192] → [B, 8192] (partial)
│      │  GPU 3: [B, 2048] × [2048, 8192] → [B, 8192] (partial)
│      │
│      └─ ★ AllReduce: output = sum(partial results) → [B, 8192] per GPU
│
├─ residual + RMSNorm ──────────────────── independent per GPU
│
├─ MergedColumnParallelLinear (gate_up) ── Column Parallel
│   │  GPU 0: [B, 8192] × [8192, 14336] → [B, 14336] (gate[7168] + up[7168])
│   │  ★ no communication
│   │
│   ├─ SiLU(gate) * up → [B, 7168] per GPU
│   │
│   └─ RowParallelLinear (down_proj) ───── Row Parallel
│      │  GPU 0: [B, 7168] × [7168, 8192] → [B, 8192] (partial)
│      │
│      └─ ★ AllReduce → [B, 8192] per GPU
│
└→ hidden_states [B, 8192] (each GPU holds a full copy)

Communication summary: 2 AllReduce operations per layer, each over B × 8192 × 2 bytes

11. Pipeline Parallel vs Tensor Parallel

11.1 Core Differences

Tensor Parallel (TP):                Pipeline Parallel (PP):
┌───────┐ ┌───────┐ ┌───────┐        ┌───────┐ ┌───────┐ ┌───────┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │        │ GPU 0 │→│ GPU 1 │→│ GPU 2 │
│ Layer │ │ Layer │ │ Layer │        │ Layer │ │ Layer │ │ Layer │
│  0    │ │  0    │ │  0    │        │ 0-26  │ │27-53  │ │54-79  │
│ (1/3) │ │ (1/3) │ │ (1/3) │        │ (full)│ │(full) │ │(full) │
├───────┤ ├───────┤ ├───────┤        └───────┘ └───────┘ └───────┘
│ Layer │ │ Layer │ │ Layer │
│  1    │ │  1    │ │  1    │        each layer's full parameters live on one GPU
│ (1/3) │ │ (1/3) │ │ (1/3) │        P2P transfer of hidden_states between stages
│  ...  │ │  ...  │ │  ...  │        pipeline bubbles (idle waiting)
└───────┘ └───────┘ └───────┘

Each layer is split along a dimension across multiple GPUs
AllReduce communication inside each layer
No pipeline bubble

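11.2 Comparison

Dimension	Tensor Parallel (TP)	Pipeline Parallel (PP)
Sharding	Each layer's weights are split along a dimension	Different layers are assigned to different GPUs
Communication pattern	AllReduce (2 per layer)	P2P (hidden_states passed between stages)
Volume per transfer	O(batch × seq × hidden_dim) per AllReduce	O(batch × seq × hidden_dim) per stage boundary
Communication frequency	2 per layer	Once per micro-batch per stage boundary
Interconnect sensitivity	Needs a high-bandwidth interconnect (NVLink)	Tolerates higher latency
Pipeline bubble	None	Yes (mitigated with micro-batching)
Per-GPU memory	Reduced to about 1/TP	Reduced to about 1/PP
Suitable interconnect	Intra-node (NVLink, 600 GB/s)	Inter-node (InfiniBand, ~400 Gb/s ≈ 50 GB/s)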

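11.3 When to Use Which

Decision tree for choosing a strategy:

Does the model fit on a single GPU?
├── Yes → no parallelism needed
└── No
    ├── Does it fit on a single node (multiple GPUs)?
    │   ├── Yes → pure TP (best performance, ample NVLink bandwidth)
    │   └── No → TP + PP hybrid
    │       ├── TP within a node (uses NVLink)
    │       └── PP across nodes (uses InfiniBand/RoCE)
    │
    └── Very large model (e.g. MoE with 256 experts)?
        → TP + PP + EP (Expert Parallel)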

Typical configurations:

Model	Parameters	Recommended configuration	Notes
Llama 3 8B	8B	TP=1	A single A100 80GB is enough
Llama 3 70B	70B	TP=4 or TP=8	4×A100 (TP=4) or 8×A100 (TP=8)
Llama 3.1 405B	405B	TP=8, PP=2-4	2-4 nodes, TP=8 within a node, PP across nodes
DeepSeek V3	671B	TP=8, PP=4-8, EP=32-64	Multi-node, large-scale EP

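11.4 The Bubble Problem in PP

The inherent problem of Pipeline Parallel is the pipeline bubble: while one micro-batch is being computed on a given stage, the other stages may sit idle.

PP=4, timeline with 1 micro-batch (worst case):

Stage 0: [████]
Stage 1:       [████]
Stage 2:             [████]
Stage 3:                   [████]
                                  ↑ at any moment only 1/4 of the GPUs are working

PP=4, 4 micro-batches (once the pipeline has filled):

Stage 0: [M0][M1][M2][M3]         ← bubbles are filled by micro-batches
Stage 1:     [M0][M1][M2][M3]
Stage 2:         [M0][M1][M2][M3]
Stage 3:             [M0][M1][M2][M3]
          ↑ bubble ↑              ↑ bubble ↑

vLLM mitigates the bubble by keeping several micro-batches in flight. As a rule of thumb, the number of micro-batches should be at least the number of PP stages so that the pipeline stays as full as possible.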
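
The two timelines can be quantified with the usual idealized bubble formula: with p stages and m micro-batches of equal cost, each stage is busy for m out of m + p − 1 time slots. The sketch below assumes equal stage times and a single forward sweep, which is a simplification of any real schedule.

```python
from fractions import Fraction


def bubble_fraction(pp_stages, micro_batches):
    """Idle fraction per stage in an idealized forward-only pipeline."""
    total_slots = micro_batches + pp_stages - 1
    return Fraction(pp_stages - 1, total_slots)


for m in (1, 4, 16):
    idle = bubble_fraction(4, m)
    print(f"PP=4, {m:2d} micro-batches: idle {idle} ({float(idle):.0%})")
```

With 1 micro-batch each stage is idle 3/4 of the time, matching the first timeline; with 4 micro-batches the idle share drops to 3/7, and it keeps shrinking as more micro-batches are in flight.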

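12. Summary

Looking back at distributed inference for Llama in vLLM, the core design boils down to a few principles:

① Column-Row Parallel pairing eliminates intermediate communication. The QKV projection uses Column Parallel (sharded output) and the O projection uses Row Parallel (sharded input); chained together they need only one AllReduce. The MLP's gate_up and down projections follow the same pattern.

② Exactly 2 AllReduce operations per Transformer layer. No more, no fewer: one at the end of attention and one at the end of the MLP. This is the classic scheme introduced by Megatron-LM in 2019, and it remains the standard TP strategy for dense Transformer layers.

③ The KV cache is naturally sharded by head. No extra sharding logic is needed: the output of QKVParallelLinear is already split by head, and the PagedAttention kernel operates directly on the local KV cache. The Block Table is shared globally, but its data volume is tiny.

④ Pipeline Parallel is implemented through layer assignment. make_layers + PPMissingLayer + IntermediateTensors form a clean PP abstraction: each rank builds only the layers it owns and passes intermediate state to the next rank via P2P during forward.

⑤ Sharding happens at weight-loading time, with no extra cost at inference time. All splitting and assignment is done in the load_weights phase (weight_loader extracts the matching shard from the full checkpoint). At inference time each GPU behaves as if it were running a "small model": the GEMM dimensions are simply smaller, plus two AllReduce operations per layer.

The shared goal of these designs is to keep the communication overhead of distributed inference as small as possible and each GPU's compute efficiency as close as possible to the single-GPU level. Within an NVLink-connected node, communication can often be kept to roughly 5% of total inference time, which gives close to linear scaling.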