Motivation
Training a large model is expensive, but serving one can be even more so.
Training is a one-time cost — you throw a few thousand GPUs at the problem for a few weeks and you are done. Inference, on the other hand, runs 24/7: every user query, every API call demands a real-time response from the model. OpenAI processes billions of tokens per day, and inference accounts for the lion’s share of operational costs. If you can double inference efficiency, you either halve costs or serve twice as many users on the same budget.
Inference optimization is also a fundamentally different game from training. Training optimizes aggregate compute throughput — burn through the data as fast as possible. Inference must simultaneously optimize two metrics that are often in tension:
- Throughput: How many tokens per second can the system generate? This determines how many concurrent users you can serve.
- Latency: How long does each user wait? The key metrics are TTFT (Time To First Token) and TPOT (Time Per Output Token).
More importantly, inference has completely different compute characteristics than training. As we discussed in Article 1, autoregressive decoding is essentially a matrix-vector multiply — arithmetic intensity around 1 FLOP/byte, GPU utilization below 1%. Inference is not compute-bound. It is memory-bandwidth-bound — and the KV Cache is the single largest consumer of that memory.
This article uses SGLang as a case study to dissect five core techniques in modern inference systems:
- KV Cache — the dominant memory consumer during inference, and why it becomes a bottleneck
- PagedAttention — borrowing the OS virtual memory paradigm to eliminate KV Cache fragmentation
- RadixAttention — SGLang’s core innovation, using a radix tree for cross-request KV Cache reuse
- Continuous Batching — iteration-level scheduling to maximize GPU utilization
- Chunked Prefill — preventing long prompts from blocking decode operations
By the end, you should understand why vLLM and SGLang achieve 2-10x throughput improvements over naive implementations, and how their design philosophies differ.
Prerequisites
- GPU Memory Model and Distributed Communication Fundamentals (Article 1) — especially the HBM bandwidth bottleneck and the Roofline model
- Basic understanding of Transformer attention (Self-Attention, Q/K/V projections)
- Familiarity with autoregressive generation (token-by-token decoding)
KV Cache Fundamentals
Why KV Cache Exists
When an autoregressive language model generates token $t$, the attention computation is:
$$\text{Attention}(Q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{Q_t \cdot K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}$$
The key observation: the current token’s Query must attend to the Keys and Values of every previously generated token. If you recompute all K and V from scratch at each step, total computation is $O(t^2)$ — generating 1000 tokens requires $1 + 2 + \cdots + 1000 = 500{,}500$ attention computations.
The KV Cache insight is simple: previous tokens’ K and V never change, so compute them once and cache them for reuse.
flowchart TD
subgraph NO["Autoregressive Decoding (without KV Cache)"]
N1["Step 1: Input [t1] → Compute K1,V1 → Attention → Output t2"]
N2["Step 2: Input [t1,t2] → Recompute K1,V1,K2,V2 → Attention"]
N3["Step 3: Input [t1,t2,t3] → Recompute K1,K2,K3... → Attention"]
NN["Step N: Input [t1,...,tN] → Recompute all K,V → O(N²) total"]
N1 --> N2 --> N3 -.-> NN
end
subgraph YES["Autoregressive Decoding (with KV Cache)"]
Y1["Step 1: Input [t1] → Compute K1,V1 → Cache → Attention → Output t2"]
Y2["Step 2: Input [t2] → Compute K2,V2 → Append to cache → Attention"]
Y3["Step 3: Input [t3] → Compute K3,V3 → Append to cache → Attention"]
YN["Step N: Input [tN] → Compute KN,VN → Append to cache → O(N) total"]
Y1 --> Y2 --> Y3 -.-> YN
YNOTE["Each step computes only 1 token's K,V, then attends over the cache"]
end
NO --> YES
style NO fill:#fff3cd,stroke:#856404
style YES fill:#d4edda,stroke:#155724
KV Cache reduces total computation from $O(N^2)$ to $O(N)$ — this is the first principle of inference optimization. Every modern inference engine implements it.
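The savings can be counted directly. Here is a minimal sketch, pure bookkeeping with no real tensors, that tallies how many K/V projections each strategy performs:

```python
# Count K/V projections per decoding strategy (bookkeeping only, no tensors).
def kv_work_without_cache(n_tokens):
    """Step t re-projects K and V for all t tokens seen so far: O(N^2) total."""
    return sum(t for t in range(1, n_tokens + 1))

def kv_work_with_cache(n_tokens):
    """Step t projects K/V only for the newest token and appends it: O(N) total."""
    cache = []                        # stands in for the stored K/V tensors
    for _ in range(n_tokens):
        cache.append(("K_t", "V_t"))  # exactly one new K/V pair per step
    return len(cache)

print(kv_work_without_cache(1000))  # 500500
print(kv_work_with_cache(1000))     # 1000
```

The first number matches the 500,500 attention computations from the example above; the second is what every modern inference engine actually pays.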
How Large Is the KV Cache?
The KV Cache size per token is:
$$\text{KV per token} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{dtype_size}$$
The factor of 2 accounts for storing both K and V. (For models using grouped-query attention, such as LLaMA-2-70B, $n_{\text{heads}}$ should be the number of KV heads, which is much smaller; the table below assumes standard multi-head attention.) Here are concrete numbers for several popular models (FP16, 2 bytes per element):
| Model | $n_{\text{layers}}$ | $n_{\text{heads}}$ | $d_{\text{head}}$ | KV per token | 2048 tokens |
|---|---|---|---|---|---|
| LLaMA-7B | 32 | 32 | 128 | 0.5 MB | 1.0 GB |
| LLaMA-13B | 40 | 40 | 128 | 0.8 MB | 1.6 GB |
| LLaMA-70B | 80 | 64 | 128 | 2.5 MB | 5.0 GB |
| GPT-3 175B | 96 | 96 | 128 | 4.5 MB | 9.2 GB |
A single LLaMA-70B request generating 2048 tokens consumes 5 GB of KV Cache alone. If you want to serve 16 such requests concurrently, the KV Cache requires 80 GB — the entire memory of an A100. And the model weights themselves take another 140 GB (in FP16).
This is the fundamental tension of inference memory management: model weights are a fixed cost, KV Cache is a dynamic cost, and KV Cache grows linearly with both the number of concurrent requests and sequence length — easily becoming the bottleneck.
How many requests you can batch simultaneously depends on how much KV Cache you can fit — and batch size directly determines throughput. So the efficiency of KV Cache memory management sets the performance ceiling for the entire inference system.
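The formula is easy to turn into a quick calculator. The helper below simply evaluates it; the function name is illustrative, and FP16 is assumed:

```python
def kv_bytes_per_token(n_layers, n_heads, d_head, dtype_size=2):
    """KV Cache bytes per token; the 2x covers K and V, FP16 => dtype_size=2."""
    return 2 * n_layers * n_heads * d_head * dtype_size

# LLaMA-70B row from the table (multi-head attention assumption)
per_token = kv_bytes_per_token(n_layers=80, n_heads=64, d_head=128)
print(per_token / 2**20)           # 2.5 (MB per token)
print(per_token * 2048 / 2**30)    # 5.0 (GB for a 2048-token request)
```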
The Waste Problem with Naive Allocation
The most straightforward approach is to pre-allocate a contiguous memory region of size max_seq_len for each request’s KV Cache. The problem is that you do not know how many tokens a request will actually generate — it could be 10 or it could be 2000.
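A back-of-envelope sketch of that waste, using the LLaMA-7B per-token size from the table and made-up request lengths:

```python
# Hypothetical request lengths; per-token size is the LLaMA-7B figure (0.5 MB).
KV_PER_TOKEN = 0.5 * 2**20
MAX_SEQ_LEN = 2048

def naive_waste_fraction(actual_lengths):
    """Fraction of pre-allocated KV memory that is never written."""
    allocated = len(actual_lengths) * MAX_SEQ_LEN * KV_PER_TOKEN
    used = sum(actual_lengths) * KV_PER_TOKEN
    return 1 - used / allocated

print(f"{naive_waste_fraction([37, 152, 8, 420, 91]):.1%}")  # 93.1%
```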
The companion code demonstrates that when request lengths are non-uniform (which is nearly always the case in practice), naive pre-allocation wastes 60-80% of GPU memory:
5 requests, max_seq_len=2048
Actual lengths: [37, 152, 8, 420, 91]
Memory allocated: 160.0 MB
Memory actually used: 22.1 MB
Waste: 86.2%
--> This is why we need PagedAttention!
Wasting 86% of memory means a GPU that could handle 7 concurrent requests is limited to just 1. What you are wasting is not just memory — it is throughput and revenue.

PagedAttention
Borrowing Wisdom from Operating Systems
PagedAttention is the core contribution of vLLM (SOSP'23), and its inspiration comes from a technology that has been battle-tested for 50 years — OS virtual memory.
In an operating system, each process sees a virtual address space that appears contiguous and private. But physical memory is allocated in fixed-size pages, and a page table maps virtual addresses to physical addresses. The process never needs contiguous physical memory — the OS stitches together scattered physical pages into what looks like a contiguous virtual space behind the scenes.
PagedAttention transplants this idea to KV Cache management:
flowchart LR
subgraph OS["OS Virtual Memory"]
A1["Process"]
A2["Virtual page"]
A3["Physical page frame"]
A4["Page table"]
A5["Demand paging (page fault)"]
A6["Fixed page size (4 KB)"]
A7["CoW (fork)"]
A8["Page swap"]
end
subgraph PA["PagedAttention"]
B1["Request (sequence)"]
B2["Logical KV block"]
B3["Physical KV block"]
B4["Block table"]
B5["On-demand allocation (new token)"]
B6["Fixed block size (16 tokens)"]
B7["CoW (beam search)"]
B8["Preemption (swap out request)"]
end
A1 -.-> B1
A2 -.-> B2
A3 -.-> B3
A4 -.-> B4
A5 -.-> B5
A6 -.-> B6
A7 -.-> B7
A8 -.-> B8
style OS fill:#cce5ff,stroke:#004085
style PA fill:#d4edda,stroke:#155724
Block Allocation Mechanism
PagedAttention divides GPU KV Cache memory into a pool of fixed-size blocks. Each block stores the KV data for a fixed number of tokens (typically 16 or 32). Each request maintains a block table that maps its logical blocks to physical blocks.
flowchart TD
subgraph POOL["Physical Block Pool (GPU Memory)"]
B0["B0<br/>Seq0"]
B1["B1<br/>Seq1"]
B2["B2<br/>Free"]
B3["B3<br/>Seq0"]
B4["B4<br/>Seq1"]
B5["B5<br/>Free"]
B6["B6<br/>Seq0"]
B7["B7<br/>Free"]
end
subgraph T0["Seq 0 Block Table"]
L0_0["Logic 0 → B0"]
L0_1["Logic 1 → B3"]
L0_2["Logic 2 → B6"]
end
subgraph T1["Seq 1 Block Table"]
L1_0["Logic 0 → B1"]
L1_1["Logic 1 → B4"]
end
L0_0 --> B0
L0_1 --> B3
L0_2 --> B6
L1_0 --> B1
L1_1 --> B4
NOTE["Physical blocks need NOT be contiguous — only logical order matters"]
style POOL fill:#cce5ff,stroke:#004085
style B2 fill:#f8f9fa,stroke:#6c757d
style B5 fill:#f8f9fa,stroke:#6c757d
style B7 fill:#f8f9fa,stroke:#6c757d
The workflow:
- New request arrives: Allocate an empty block table — no blocks pre-allocated
- First token generated: Grab a block from the free pool, store the KV, update the block table
- Current block fills up (16 tokens worth of KV written): Grab another block from the free pool
- Request completes: Return all blocks to the free pool, immediately available for other requests
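This workflow can be sketched with block ids alone, no real KV tensors. The class and method names below are illustrative, not vLLM's actual API:

```python
class BlockManager:
    """Minimal sketch of PagedAttention block bookkeeping."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free physical block ids
        self.tables = {}                      # seq_id -> list of physical blocks
        self.filled = {}                      # seq_id -> tokens written so far

    def add_request(self, seq_id):
        self.tables[seq_id] = []              # empty block table, nothing pre-allocated
        self.filled[seq_id] = 0

    def append_token(self, seq_id):
        if self.filled[seq_id] % self.block_size == 0:    # first token, or block full
            self.tables[seq_id].append(self.free.pop())   # grab a block on demand
        self.filled[seq_id] += 1

    def free_request(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))         # immediately reusable
        del self.filled[seq_id]

bm = BlockManager(num_blocks=8)
bm.add_request("seq0")
for _ in range(20):                    # 20 tokens -> ceil(20/16) = 2 blocks
    bm.append_token("seq0")
print(len(bm.tables["seq0"]), len(bm.free))   # 2 6
```

Note that the physical block ids handed out need not be contiguous or ordered; only the block table's logical order matters.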
From 86% Waste to Under 3%
With PagedAttention, memory waste comes from exactly one source: internal fragmentation in each request’s last block. If block_size=16, the last block is on average half-full, wasting roughly $\frac{1}{2} \times \frac{1}{N_{\text{blocks}}}$ of total memory.
The companion code shows the comparison:
Config: 32 heads, head_dim=128, max_seq=2048, block_size=16
10 requests with lengths: [47, 183, 12, 891, 256, 5, 1024, 73, 330, 15]
Metric Naive Paged Actual
-----------------------------------------------------------------------
KV cache memory (MB) 320.0 45.5 44.3
Memory utilization 13.8% 97.4% 100.0%
Waste (MB) 275.7 1.2 0.0
Waste (%) 86.2% 2.6% 0.0%
Naive allocation wastes 86%; PagedAttention wastes just 2.6%. This means the same GPU can batch 2-4x more requests, directly translating to a 2-4x throughput increase.

Copy-on-Write: Efficient Sequence Forking
In beam search and parallel sampling, a request forks into multiple candidate sequences. These candidates share the prefix’s KV Cache — copying the prefix KV for every candidate would multiply memory consumption.
PagedAttention solves this with Copy-on-Write (CoW), again borrowing from the OS fork() system call:
flowchart TD
subgraph BEFORE["Before fork"]
S0a["Seq 0: Block A → Block B → Block C (ref=1)"]
end
subgraph AFTER["After fork (Seq 0 + Seq 1)"]
S0b["Seq 0: Block A → Block B → Block C"]
S1b["Seq 1: Block A → Block B → Block C"]
REF["Block C ref=2 (shared, no copy!)"]
end
subgraph WRITE["Seq 1 writes new token to Block C"]
S0c["Seq 0: Block A → Block B → Block C (ref=1)"]
S1c["Seq 1: Block A → Block B → Block C' (ref=1, copied now)"]
NOTE2["Only the modified block is copied; earlier blocks remain shared"]
end
BEFORE --> AFTER --> WRITE
style BEFORE fill:#cce5ff,stroke:#004085
style AFTER fill:#fff3cd,stroke:#856404
style WRITE fill:#d4edda,stroke:#155724
The core logic lives in append_token: before writing, check the block’s reference count. If it exceeds 1 (the block is shared), copy it first, then write:
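The check can be sketched as follows, with illustrative structures (a refcount map and plain lists for block tables; not SGLang's or vLLM's actual API):

```python
# Copy-on-write sketch for the last (shared) KV block.
ref_count = {}                     # physical block id -> sequences sharing it

def fork(table):
    """Fork a sequence: share all blocks, just bump refcounts (no copying)."""
    for b in table:
        ref_count[b] += 1
    return list(table)

def append_with_cow(table, alloc_block):
    """Before writing to the last block, copy it if it is shared."""
    last = table[-1]
    if ref_count[last] > 1:        # shared -> copy-on-write
        new = alloc_block()
        ref_count[last] -= 1
        ref_count[new] = 1
        table[-1] = new            # earlier blocks stay shared
    # ... write the new token's K/V into table[-1] here ...

free = [10, 11, 12]
parent = [0, 1, 2]
ref_count.update({0: 1, 1: 1, 2: 1})
child = fork(parent)               # all three blocks now have ref=2
append_with_cow(child, free.pop)   # only the last block is copied
print(parent, child)               # [0, 1, 2] [0, 1, 12]
```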
CoW pays off most dramatically in beam search: with beam_size=4, four candidate sequences share the vast majority of their prefix KV. Only the last block (the part being actively generated) needs independent storage. Memory cost approaches that of 1 sequence rather than 4.
RadixAttention (SGLang’s Core Innovation)
From PagedAttention to Prefix Caching
PagedAttention solves memory fragmentation within a single request. But it misses a larger optimization opportunity: KV Cache reuse across requests.
In real-world LLM applications, many requests share identical prefixes:
flowchart TD
subgraph S1["Scenario 1: System Prompt"]
SP["Shared prefix: You are a helpful assistant..."]
UA["+ User A: What is machine learning?"]
UB["+ User B: Write a Python function for..."]
UC["+ User C: Translate this text..."]
SP --> UA
SP --> UB
SP --> UC
end
subgraph S2["Scenario 2: Few-shot Learning"]
FS["Shared prefix: Here are some examples x5"]
Q1["+ Input: new query 1"]
Q2["+ Input: new query 2"]
FS --> Q1
FS --> Q2
end
subgraph S3["Scenario 3: Multi-turn Chat"]
MT["Shared prefix: System + Turn 1 + Turn 2"]
T3["+ User Turn 3 (new message)"]
MT --> T3
end
style S1 fill:#cce5ff,stroke:#004085
style S2 fill:#d4edda,stroke:#155724
style S3 fill:#fff3cd,stroke:#856404
If a system prompt is 500 tokens long and every request recomputes and stores those 500 tokens’ KV Cache independently, then 100 concurrent requests store 100 identical copies — a massive waste.
The prefix caching insight: if two requests share the same token prefix, their KV Cache at those positions is identical and can be shared directly, with no recomputation needed.
vLLM eventually added prefix caching too, but its implementation matches cached prefixes by hashing token blocks, so sharing is detected at fixed block granularity. SGLang’s RadixAttention offers a more flexible, token-level approach.
Radix Tree: A Data Structure for Indexing Arbitrary Prefixes
SGLang (LMSYS, 2024) introduces a Radix Tree (radix trie) to index and manage all cached KV data.
A radix tree is a compressed trie. In a standard trie, each node represents a single character; a radix tree compresses single-child paths into a single node, reducing tree depth. In SGLang, each node represents a token sequence segment, and nodes store pointers to the corresponding KV Cache blocks.
graph TD
ROOT["Root"]
SYS["System Prompt<br/>KV blocks: B0, B1, B2"]
ML["User: What is ML?<br/>KV blocks: B3, B4"]
PY["User: Write Python<br/>KV blocks: B5, B6"]
AST["Asst: ML is...<br/>KV blocks: B7"]
ROOT --> SYS
SYS --> ML
SYS --> PY
ML --> AST
REQD["New Req D: System Prompt + What is ML? + new query<br/>Matches B0-B4 directly, only new query needs fresh KV"]
style SYS fill:#cce5ff,stroke:#004085
style ML fill:#d4edda,stroke:#155724
style PY fill:#d4edda,stroke:#155724
style AST fill:#fff3cd,stroke:#856404
style REQD fill:#f8f9fa,stroke:#6c757d
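The matching logic can be sketched with a plain token-level trie. (A real radix tree additionally compresses single-child paths into one node, but longest-prefix matching behaves the same; all names here are illustrative.)

```python
# Token-level prefix index sketch for cached KV.
class Node:
    def __init__(self):
        self.children = {}        # token -> Node
        self.kv_block = None      # illustrative handle to this position's cached KV

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())
        node.kv_block = f"kv[{t}]"    # pretend this token's KV is now cached

def match_prefix(root, tokens):
    """Return how many leading tokens already have cached KV."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = Node()
insert(root, [1, 2, 3, 4, 5])            # e.g. a cached system prompt
print(match_prefix(root, [1, 2, 3, 9]))  # 3 tokens reusable; prefill only the rest
```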

Workflow
When a new request arrives, SGLang’s scheduler executes the following steps:
flowchart TD
subgraph FLOW["RadixAttention: Processing a New Request"]
A["1. PREFIX MATCHING<br/>Search Radix Tree for longest matching prefix<br/>e.g., sys_prompt KV blocks are reused directly"]
B["2. REUSE CACHED KV<br/>Skip prefill for the matched prefix<br/>Only compute KV for the unmatched suffix"]
C["3. INSERT INTO TREE<br/>After completion, insert new KV into Radix Tree<br/>Future requests with the same prefix can reuse it"]
A --> B --> C
end
subgraph TIMELINE["Timeline Comparison"]
T1["No cache: ====== prefill 500 tokens ====== decode"]
T2["With cache: == prefill 100 tokens == decode<br/>400 tokens KV reused from cache, skipped"]
end
FLOW --> TIMELINE
style FLOW fill:#cce5ff,stroke:#004085
style T1 fill:#fff3cd,stroke:#856404
style T2 fill:#d4edda,stroke:#155724
In the system prompt scenario (where nearly every request shares the same prompt), prefix caching skips the bulk of prefill computation, reducing TTFT (Time To First Token) by 50-90%.
LRU Eviction
GPU memory is finite — you cannot cache every prefix you have ever seen. When the block pool is full, some cached KV data must be evicted to make room for new requests.
SGLang uses LRU (Least Recently Used) eviction: the KV Cache node that has gone the longest without being accessed is evicted first.
flowchart TD
subgraph NODES["Radix Tree Nodes (sorted by last access time)"]
N1["sys_prompt + user_A — 10 sec ago — keep"]
N2["sys_prompt + user_B — 30 sec ago — keep"]
N3["sys_prompt + user_C — 120 sec ago — eviction candidate"]
N4["old_prompt + old_query — 300 sec ago — evict first"]
end
subgraph EVICT["Eviction Process"]
E1["1. Take least-recently-used node from LRU list"]
E2["2. Free that node's KV blocks"]
E3["3. Remove node from Radix Tree"]
E4["4. If parent becomes childless leaf, recursively clean up"]
E1 --> E2 --> E3 --> E4
end
style N1 fill:#d4edda,stroke:#155724
style N2 fill:#d4edda,stroke:#155724
style N3 fill:#fff3cd,stroke:#856404
style N4 fill:#f8d7da,stroke:#721c24
LRU works well here because of temporal locality: recently accessed prefixes are likely to be accessed again (the same user continuing a multi-turn conversation, or a popular system prompt being reused across many concurrent requests).
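The bookkeeping can be sketched with an OrderedDict as the recency list (structures are illustrative, not SGLang's implementation):

```python
from collections import OrderedDict

# LRU eviction sketch over cached prefixes; OrderedDict order = recency.
class LRUPrefixCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0
        self.nodes = OrderedDict()     # prefix id -> KV blocks held

    def touch(self, prefix_id):
        self.nodes.move_to_end(prefix_id)              # mark most recently used

    def add(self, prefix_id, n_blocks):
        while self.used + n_blocks > self.capacity and self.nodes:
            _, freed = self.nodes.popitem(last=False)  # evict least recently used
            self.used -= freed
        self.nodes[prefix_id] = n_blocks
        self.used += n_blocks

cache = LRUPrefixCache(capacity_blocks=10)
cache.add("old_prompt", 4)
cache.add("sys_prompt", 3)
cache.touch("old_prompt")      # a user reuses old_prompt: it becomes most recent
cache.add("new_prompt", 4)     # over capacity: sys_prompt is evicted, not old_prompt
print(list(cache.nodes))       # ['old_prompt', 'new_prompt']
```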
Cache-Aware Scheduling
With the Radix Tree in place, the scheduler can make a remarkably effective optimization: prioritize requests that have longer prefix matches in the cache.
flowchart TD
subgraph QUEUE["Waiting Queue"]
RA["Req A: Matches 500 cached tokens → prefill only 100 tokens"]
RB["Req B: Matches 0 tokens → prefill needs 600 tokens"]
RC["Req C: Matches 300 cached tokens → prefill only 200 tokens"]
end
subgraph SCHED["Scheduling Strategy Comparison"]
FCFS["Naive FCFS: A → B → C"]
CACHE["Cache-aware: A → C → B (prioritize high cache hits)"]
end
subgraph BENEFIT["Benefits"]
B1["1. High cache-hit requests prefill faster → lower TTFT"]
B2["2. KV stays in tree → similar future requests also hit → positive feedback"]
B3["3. Less total prefill compute → more GPU cycles for decode"]
end
QUEUE --> SCHED --> BENEFIT
style RA fill:#d4edda,stroke:#155724
style RB fill:#f8d7da,stroke:#721c24
style RC fill:#fff3cd,stroke:#856404
style CACHE fill:#d4edda,stroke:#155724
This scheduling strategy gives SGLang a significant edge in workloads with shared prefixes — few-shot learning, multi-turn chat, and any application using system prompts.
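The ranking itself is essentially one sort: order the waiting queue by how many uncached tokens each request would need to prefill. A hypothetical sketch mirroring the queue above (field names are made up):

```python
# Cache-aware admission sketch: fewest uncached tokens first.
queue = [
    {"id": "A", "prompt_len": 600, "cached_prefix": 500},
    {"id": "B", "prompt_len": 600, "cached_prefix": 0},
    {"id": "C", "prompt_len": 500, "cached_prefix": 300},
]

def prefill_cost(req):
    """Tokens that still need fresh prefill compute."""
    return req["prompt_len"] - req["cached_prefix"]

order = sorted(queue, key=prefill_cost)
print([r["id"] for r in order])   # ['A', 'C', 'B']
```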
Comparison with vLLM
vLLM (v0.3+) later added prefix caching as well, but with a different design philosophy:
| Feature | SGLang (RadixAttention) | vLLM (Prefix Caching) |
|---|---|---|
| Data structure | Radix Tree | Hash Map |
| Prefix matching | Automatic matching of any shared prefix | Token block hash-based matching |
| Matching granularity | Token-level | Block-level (block_size-aligned) |
| Eviction policy | LRU on tree nodes | LRU on blocks |
| Cache-aware scheduling | Native support | Added in later versions |
| Multi-turn conversation | Natural fit (tree structure) | Requires hash matching |
SGLang’s Radix Tree approach has clear advantages in flexibility:
- Arbitrary prefix sharing: Not limited to predefined prompt templates — any two requests with a common prefix are automatically matched
- Tree structure naturally fits multi-turn dialogue: Conversation history naturally forms a tree, with different branches representing different follow-up exchanges
- Token-level precision: Matching can occur at exact token boundaries, not just at block-aligned positions
vLLM’s hash-based approach has advantages in implementation simplicity and stability at scale. In practice, the performance gap depends heavily on the specific workload characteristics.
Continuous Batching
The Problem with Static Batching
Now that we understand KV Cache memory management, let us look at a different dimension of optimization: scheduling.
The most straightforward inference approach is static batching: assemble a group of requests into a batch, run prefill together, then decode together, and wait until every request in the batch finishes before processing the next batch.
The problem: output lengths vary wildly across requests.
flowchart TD
subgraph STATIC["Static Batching — Batch = Req 0-3, Output lengths: 3, 8, 2, 5"]
direction LR
R0["Req 0: ## ## ## .. .. .. .. ..<br/>Done at step 3"]
R1["Req 1: ## ## ## ## ## ## ## ##<br/>Longest — everyone waits"]
R2["Req 2: ## ## .. .. .. .. .. ..<br/>Done at step 2"]
R3["Req 3: ## ## ## ## ## .. .. ..<br/>Done at step 5"]
end
WASTE["## = useful compute · .. = GPU idle<br/>GPU utilization: 18/32 = 56.25%<br/>43.75% of GPU cycles wasted<br/>Req 4, 5 queued — must wait for Req 1 to finish"]
style STATIC fill:#fff3cd,stroke:#856404
style WASTE fill:#f8d7da,stroke:#721c24
When output length variance is high (and it nearly always is in practice — one user asks “what time is it?” while another requests a long essay), static batching can drive GPU utilization as low as 20-50%.
Iteration-Level Scheduling
Continuous Batching (Orca, OSDI'22) makes one fundamental change: it drops the scheduling granularity from batch level to iteration level.
After every single decode step, the scheduler checks:
- Have any requests finished (generated EOS or hit max_length)? If so, evict them and free their KV Cache.
- Are there waiting requests that can be admitted? If so, admit them and begin their prefill.
flowchart TD
subgraph CB["Continuous Batching — Iteration-Level Scheduling"]
direction LR
R0["Req 0: ## ## ## — done step 3"]
R1["Req 1: ## ## ## ## ## ## ## ## — done step 8"]
R2["Req 2: ## ## — done step 2"]
R3["Req 3: ## ## ## ## ## — done step 5"]
R4["Req 4: admitted step 3 → ## ## ## ##"]
R5["Req 5: admitted step 4 → ## ## ## ## ## ##"]
end
RESULT["Active count: 4 4 4 4 4 3 3 3 1<br/>GPU stays nearly full! Finished requests immediately replaced"]
style CB fill:#d4edda,stroke:#155724
style RESULT fill:#cce5ff,stroke:#004085
The companion code demonstrates that continuous batching typically delivers a 2-3x throughput improvement:
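A toy version of that comparison, where the cost model (one token per request per step) and all output lengths are made up:

```python
import random

# Static vs continuous batching: count decode steps to drain the same workload.
random.seed(0)
lengths = [random.randint(1, 20) for _ in range(40)]   # output lengths per request
BATCH = 4

def static_steps(lengths):
    """Each batch of 4 runs until its longest member finishes."""
    return sum(max(lengths[i:i + BATCH]) for i in range(0, len(lengths), BATCH))

def continuous_steps(lengths):
    """Finished requests are evicted and replaced at every iteration."""
    waiting, running, steps = list(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < BATCH:   # admit at iteration level
            running.append(waiting.pop(0))
        running = [r - 1 for r in running]        # one decode step for everyone
        running = [r for r in running if r > 0]   # evict finished requests
        steps += 1
    return steps

s, c = static_steps(lengths), continuous_steps(lengths)
print(s, c, f"speedup ~{s / c:.2f}x")
```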

Preemption: Graceful Handling of Memory Pressure
Continuous batching introduces another critical mechanism: preemption. When there are not enough free KV Cache blocks to allocate for all running requests, the scheduler can:
- Pause one or more running requests (typically the most recently admitted — LIFO strategy, since they have done the least work)
- Free the paused request’s KV Cache blocks
- Use the reclaimed space to continue serving other requests
- Resume the paused request later (re-prefill to rebuild its KV Cache)
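The decision logic can be sketched as follows; all structures are illustrative, and a real engine would also track the preempted requests for later resumption:

```python
# LIFO preemption sketch: when free KV blocks run out, pause the most recently
# admitted request and reclaim its blocks.
def try_allocate(need, free_blocks, running):
    """running: list of (seq_id, blocks_held), oldest first.
    Returns (free blocks left after allocating, preempted seq ids), or
    (None, preempted) if even preempting everyone is not enough."""
    preempted = []
    while free_blocks < need and running:
        seq_id, held = running.pop()      # LIFO: newest request has done least work
        free_blocks += held               # its KV blocks are reclaimed
        preempted.append(seq_id)          # it will re-prefill when resumed
    if free_blocks < need:
        return None, preempted
    return free_blocks - need, preempted

free_after, paused = try_allocate(
    need=30,
    free_blocks=10,
    running=[("req0", 50), ("req1", 40), ("req3", 20)],
)
print(free_after, paused)    # 0 ['req3']
```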
flowchart TD
A["Normal operation<br/>Running: Req 0, 1, 2, 3<br/>KV blocks: 200/256 (78%)"]
B["Req 1 generates long response, KV keeps growing...<br/>KV blocks: 253/256 (99%) — about to OOM!"]
C["Preemption triggered<br/>Pause Req 3 (LIFO: newest, least work done)<br/>Free Req 3's 20 blocks"]
D["KV blocks: 233/256 (91%)<br/>Room to continue"]
E["Later, Req 0/1 finishes, space freed<br/>→ Resume Req 3, re-prefill to rebuild KV"]
F["Alternative: no preemption → OOM crash → ALL requests fail!"]
A --> B --> C --> D --> E
B -.-> F
style A fill:#d4edda,stroke:#155724
style B fill:#f8d7da,stroke:#721c24
style C fill:#fff3cd,stroke:#856404
style D fill:#d4edda,stroke:#155724
style F fill:#f8d7da,stroke:#721c24
Preemption has a cost: the paused request must re-prefill (rebuild its KV Cache) when resumed, increasing that request’s latency. But compared to an OOM crash that kills every request in flight, this is a very reasonable trade-off.
The companion code’s preemption demo shows that even under extremely constrained memory (32 blocks), preemption ensures all requests eventually complete:
Config: 20 requests, 32 KV blocks, max_batch=4
Preemptions triggered: 12
All 20 requests completed: True
Preemption allows the system to handle load spikes without OOM crashes.
Chunked Prefill
Long Prompts Blocking Decode
The prefill phase (processing the user’s input prompt) and the decode phase (generating tokens one at a time) have fundamentally different compute characteristics:
- Prefill: Processes the entire prompt at once (potentially thousands of tokens) — a compute-bound large GEMM operation
- Decode: Generates 1 token per step — a memory-bound GEMV operation
The problem: if a request’s prompt is 8000 tokens, prefill might take several hundred milliseconds. During that time, every decode request is blocked — they are all waiting for the GPU to finish this one long prefill.
flowchart TD
subgraph NOCHUNK["Without Chunked Prefill"]
GPU1["GPU: ====== Prefill 8K tokens (200ms) ====== → D D D D"]
RA1["Req A: waiting 200ms... then decode"]
RB1["Req B: waiting 200ms... then decode"]
end
PROBLEM["Problem: Req A, B TPOT spikes from 10ms to 200ms+<br/>User experience: streaming text suddenly freezes for 200ms"]
style NOCHUNK fill:#fff3cd,stroke:#856404
style PROBLEM fill:#f8d7da,stroke:#721c24
For a user in the middle of a streaming conversation, TPOT jumping from 10ms to 200ms produces a very noticeable stutter — the text was flowing smoothly, then it just… stops.
Splitting Prefill into Chunks
Chunked Prefill takes a straightforward approach: break the long prompt’s prefill into multiple smaller chunks (e.g., 512 tokens each), and interleave decode steps between chunks.
flowchart TD
subgraph CHUNK["Chunked Prefill — 8192 tokens, chunk_size=512, 16 chunks"]
GPU2["GPU: P1 → D → P2 → D → P3 → D → P4 → D → P5 ..."]
LEGEND["P = prefill chunk (512 tokens, ~12ms) · D = decode step (~3ms)"]
end
subgraph DECODE["From decode requests' perspective"]
RA2["Req A: D — D — D — D — D (every ~15ms)"]
RB2["Req B: D — D — D — D — D (every ~15ms)"]
end
RESULT2["TPOT: ~15ms (stable!) vs 200ms spike without chunking"]
CHUNK --> DECODE
DECODE --> RESULT2
style CHUNK fill:#d4edda,stroke:#155724
style RESULT2 fill:#cce5ff,stroke:#004085
The trade-off:
- Total prefill time increases: Because decode steps are interleaved between chunks, an 8K-token prefill goes from 200ms to roughly 240ms (TTFT goes up slightly)
- Decode latency becomes stable: TPOT drops from a 200ms spike to a consistent ~15ms
This is a TTFT vs TPOT trade-off:
| Metric | Without Chunked Prefill | With Chunked Prefill |
|---|---|---|
| TTFT (first token) | 200ms | ~240ms (+20%) |
| TPOT (worst case) | 200ms (blocked) | ~15ms (stable) |
| TPOT (P99) | Very poor | Well-controlled |
| User experience | Occasional severe stutter | Smooth and consistent |
For most online serving scenarios, stable TPOT matters far more than slightly lower TTFT — users care more about “smooth streaming” than “seeing the first character 40ms sooner.”
Choosing the Chunk Size
Chunk size requires balancing competing concerns:
- Too small (e.g., 64 tokens): Prefill GPU utilization suffers (small matrix multiplies are inefficient), and TTFT increases significantly
- Too large (e.g., 4096 tokens): Approaches the unchunked case, and decode requests still experience long blocks
- Typical range: 512-1024 tokens, balancing prefill efficiency with decode stability
Both SGLang and vLLM support chunked prefill and can dynamically adjust chunk size based on the current running batch — if no decode requests are active, there is no need to chunk at all, so the system does a full prefill to minimize TTFT.
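The interleaving itself is simple to sketch; the chunk size and the fixed two-decodes-per-chunk pattern below are illustrative:

```python
# Chunked prefill sketch: emit one prefill chunk, then one decode step
# per active decode request, until the prompt is fully prefilled.
def schedule(prompt_len, chunk=512, decode_reqs=2):
    timeline, done = [], 0
    while done < prompt_len:
        n = min(chunk, prompt_len - done)
        timeline.append(f"P{n}")               # prefill chunk of n tokens
        done += n
        timeline.extend(["D"] * decode_reqs)   # decode requests are never starved
    return timeline

t = schedule(8192)
print(len([x for x in t if x.startswith("P")]))  # 16 chunks of 512
print(t[:6])                                     # ['P512', 'D', 'D', 'P512', 'D', 'D']
```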
System Architecture Overview
Putting all the pieces together, here is the overall architecture of a modern LLM inference engine like SGLang:
flowchart TD
HTTP["HTTP Server<br/>Receives user requests"]
subgraph SCHED["Scheduler (the brain)"]
WQ["Waiting Queue"]
RT["Radix Tree<br/>(Prefix Cache Index)"]
STEPS["Every iteration:<br/>1. Evict completed requests<br/>2. Match new request prefixes<br/>3. Cache-aware ranking + admit<br/>4. Preempt if memory insufficient<br/>5. Chunked prefill scheduling"]
end
subgraph BM["Block Manager (the memory)"]
POOL2["Physical Block Pool: B0 B1 B2 B3 B4 B5 ..."]
BMOPS["On-demand alloc/free · CoW · LRU eviction"]
end
subgraph ME["Model Executor (the compute)"]
ATTN["Attention kernel (FlashAttention / FlashInfer)"]
KV["Reads KV from Block Table (non-contiguous phys.)"]
PAR["TP / PP parallel inference"]
end
HTTP --> SCHED --> BM --> ME
style SCHED fill:#cce5ff,stroke:#004085
style BM fill:#fff3cd,stroke:#856404
style ME fill:#d4edda,stroke:#155724
The Scheduler is the brain — it decides which requests to run at each step, whether to prefill or decode, and whether preemption is needed. The Block Manager is the memory manager — it allocates, frees, and shares KV Cache blocks. The Model Executor is the compute engine — it takes the scheduler’s decisions and the block table mappings and executes the actual attention computation.
These three components work in concert to form an efficient inference pipeline.
Key Takeaways
KV Cache is the memory bottleneck of inference. Each token requires $2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{dtype_size}$ bytes of KV data. For LLaMA-70B, that is ~2.5 MB per token. With 100 concurrent requests at 2048 tokens each, the KV Cache alone needs 500 GB — far beyond any single GPU. How many requests you can batch and how long your sequences can be depends entirely on how efficiently you manage the KV Cache.
PagedAttention applies OS virtual memory to eliminate memory fragmentation. Fixed-size blocks are allocated on demand, a block table provides logical-to-physical mapping, and CoW supports efficient sequence forking. Memory waste drops from 60-80% to under 5%, allowing the same GPU to batch 2-4x more requests.
RadixAttention enables cross-request KV Cache reuse. A Radix Tree indexes all cached token prefixes; new requests automatically match their longest cached prefix and reuse the corresponding KV. Combined with LRU eviction and cache-aware scheduling, this dramatically reduces TTFT in shared-prefix workloads (system prompts, few-shot examples, multi-turn conversations).
Continuous Batching drops scheduling granularity from batch level to iteration level. Finished requests are evicted and waiting requests are admitted at every decode step, pushing GPU utilization from 20-50% to 80-95%. Preemption provides a safety valve against OOM crashes under memory pressure.
Chunked Prefill prevents long prompts from blocking decode. Splitting prefill into smaller chunks with interleaved decode steps trades a modest TTFT increase for dramatically more stable TPOT. Chunk size (typically 512-1024 tokens) balances prefill efficiency against decode smoothness.
These techniques are not alternatives — they stack. Modern inference engines (SGLang, vLLM) simultaneously employ PagedAttention + Prefix Caching + Continuous Batching + Chunked Prefill. Each technique addresses a different dimension of efficiency, and together they deliver 2-10x throughput over naive implementations.
Companion Code
The companion code for this article is in code/03-inference-sglang/:
- kv_cache_from_scratch.py — KV Cache and PagedAttention Block Manager built from scratch. Contains four parts: (1) demonstrating naive KV Cache waste; (2) full PagedAttention implementation with block allocation, deallocation, and CoW; (3) a mini decode loop using the block-managed KV cache; (4) head-to-head memory efficiency comparison of naive vs paged allocation. Run with: python kv_cache_from_scratch.py
- continuous_batching.py — Static batching vs continuous batching comparison simulation. Contains four parts: (1) throughput comparison between static and continuous batching; (2) ASCII scheduling timeline visualization; (3) preemption behavior under memory pressure; (4) scaling analysis across different batch sizes. Run with: python continuous_batching.py
All code runs in CPU mode (no GPU required), making it easy to study the core logic in any environment.
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., SOSP'23 (vLLM) — The original PagedAttention paper, bringing OS virtual memory concepts to KV Cache management.
- SGLang: Efficient Execution of Structured Language Model Programs — Zheng et al., 2024 (LMSYS) — The original paper introducing RadixAttention and cache-aware scheduling.
- Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al., OSDI'22 — The pioneering work on continuous batching (iteration-level scheduling).
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS'22 — The core attention kernel optimization used in inference engines.
- vLLM Documentation — docs.vllm.ai — Official vLLM documentation.
- SGLang Documentation — sgl-project.github.io — Official SGLang documentation.
- Sarathi-Serve: Chunked-Prefills for Efficient LLM Serving — Agrawal et al., 2024 — Systematic analysis of chunked prefill strategies.