Motivation
Training a large model is expensive, but serving one can be even more so.
Training is a one-time cost — you throw a few thousand GPUs at the problem for a few weeks and you are done. Inference, on the other hand, runs 24/7: every user query, every API call demands a real-time response from the model. OpenAI processes billions of tokens per day, and inference accounts for the lion’s share of operational costs. If you can double inference efficiency, you either halve costs or serve twice as many users on the same budget.
Inference optimization is also a fundamentally different game from training. Training optimizes aggregate compute throughput — burn through the data as fast as possible. Inference must simultaneously optimize two metrics that are often in tension:
- Throughput: How many tokens per second can the system generate? This determines how many concurrent users you can serve.
- Latency: How long does each user wait? The key metrics are TTFT (Time To First Token) and TPOT (Time Per Output Token).
More importantly, inference has completely different compute characteristics than training. As we discussed in Article 1, autoregressive decoding is essentially a matrix-vector multiply — arithmetic intensity around 1 FLOP/byte, GPU utilization below 1%. Inference is not compute-bound. It is memory-bandwidth-bound — and the KV Cache is the single largest consumer of that memory.
This article uses SGLang as a case study to dissect five core techniques in modern inference systems:
- KV Cache — the dominant memory consumer during inference, and why it becomes a bottleneck
- PagedAttention — borrowing the OS virtual memory paradigm to eliminate KV Cache fragmentation
- RadixAttention — SGLang’s core innovation, using a radix tree for cross-request KV Cache reuse
- Continuous Batching — iteration-level scheduling to maximize GPU utilization
- Chunked Prefill — preventing long prompts from blocking decode operations
By the end, you should understand why vLLM and SGLang achieve 2-10x throughput improvements over naive implementations, and how their design philosophies differ.
Prerequisites
- GPU Memory Model and Distributed Communication Fundamentals (Article 1) — especially the HBM bandwidth bottleneck and the Roofline model
- Basic understanding of Transformer attention (Self-Attention, Q/K/V projections)
- Familiarity with autoregressive generation (token-by-token decoding)
KV Cache Fundamentals
Why KV Cache Exists
When an autoregressive language model generates token $t$, the attention computation is:
$$\text{Attention}(Q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{Q_t \cdot K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}$$
The key observation: the current token’s Query must attend to the Keys and Values of every previously generated token. If you recompute all K and V from scratch at each step, total computation is $O(t^2)$ — generating 1000 tokens requires $1 + 2 + \cdots + 1000 = 500{,}500$ attention computations.
The KV Cache insight is simple: previous tokens’ K and V never change, so compute them once and cache them for reuse.
flowchart TD
subgraph NO["Autoregressive Decoding (without KV Cache)"]
N1["Step 1: Input [t1] → Compute K1,V1 → Attention → Output t2"]
N2["Step 2: Input [t1,t2] → Recompute K1,V1,K2,V2 → Attention"]
N3["Step 3: Input [t1,t2,t3] → Recompute K1,K2,K3... → Attention"]
NN["Step N: Input [t1,...,tN] → Recompute all K,V → O(N²) total"]
N1 --> N2 --> N3 -.-> NN
end
subgraph YES["Autoregressive Decoding (with KV Cache)"]
Y1["Step 1: Input [t1] → Compute K1,V1 → Cache → Attention → Output t2"]
Y2["Step 2: Input [t2] → Compute K2,V2 → Append to cache → Attention"]
Y3["Step 3: Input [t3] → Compute K3,V3 → Append to cache → Attention"]
YN["Step N: Input [tN] → Compute KN,VN → Append to cache → O(N) total"]
Y1 --> Y2 --> Y3 -.-> YN
YNOTE["Each step computes only 1 token's K,V, then attends over the cache"]
end
NO --> YES
style NO fill:#fff3cd,stroke:#856404
style YES fill:#d4edda,stroke:#155724
KV Cache reduces total computation from $O(N^2)$ to $O(N)$ — this is the first principle of inference optimization. Every modern inference engine implements it.
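The savings can be counted directly. Here is a minimal sketch, pure bookkeeping with no real tensors, that tallies how many K/V projections each strategy performs:

```python
# Count K/V projections per decoding strategy (bookkeeping only, no tensors).
def kv_work_without_cache(n_tokens):
    """Step t re-projects K and V for all t tokens seen so far: O(N^2) total."""
    return sum(t for t in range(1, n_tokens + 1))

def kv_work_with_cache(n_tokens):
    """Step t projects K/V only for the newest token and appends it: O(N) total."""
    cache = []                        # stands in for the stored K/V tensors
    for _ in range(n_tokens):
        cache.append(("K_t", "V_t"))  # exactly one new K/V pair per step
    return len(cache)

print(kv_work_without_cache(1000))  # 500500
print(kv_work_with_cache(1000))     # 1000
```

The first number matches the 500,500 attention computations from the example above; the second is what every modern inference engine actually pays.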
How Large Is the KV Cache?
The KV Cache size per token is:
$$\text{KV per token} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{dtype_size}$$
The factor of 2 accounts for storing both K and V. (For models using grouped-query attention, such as LLaMA-2-70B, $n_{\text{heads}}$ should be the number of KV heads, which is much smaller; the table below assumes standard multi-head attention.) Here are concrete numbers for several popular models (FP16, 2 bytes per element):
| Model | $n_{\text{layers}}$ | $n_{\text{heads}}$ | $d_{\text{head}}$ | KV per token | 2048 tokens |
|---|---|---|---|---|---|
| LLaMA-7B | 32 | 32 | 128 | 0.5 MB | 1.0 GB |
| LLaMA-13B | 40 | 40 | 128 | 0.8 MB | 1.6 GB |
| LLaMA-70B | 80 | 64 | 128 | 2.5 MB | 5.0 GB |
| GPT-3 175B | 96 | 96 | 128 | 4.5 MB | 9.2 GB |
A single LLaMA-70B request generating 2048 tokens consumes 5 GB of KV Cache alone. If you want to serve 16 such requests concurrently, the KV Cache requires 80 GB — the entire memory of an A100. And the model weights themselves take another 140 GB (in FP16).
This is the fundamental tension of inference memory management: model weights are a fixed cost, KV Cache is a dynamic cost, and KV Cache grows linearly with both the number of concurrent requests and sequence length — easily becoming the bottleneck.
How many requests you can batch simultaneously depends on how much KV Cache you can fit — and batch size directly determines throughput. So the efficiency of KV Cache memory management sets the performance ceiling for the entire inference system.
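The formula is easy to turn into a quick calculator. The helper below simply evaluates it; the function name is illustrative, and FP16 is assumed:

```python
def kv_bytes_per_token(n_layers, n_heads, d_head, dtype_size=2):
    """KV Cache bytes per token; the 2x covers K and V, FP16 => dtype_size=2."""
    return 2 * n_layers * n_heads * d_head * dtype_size

# LLaMA-70B row from the table (multi-head attention assumption)
per_token = kv_bytes_per_token(n_layers=80, n_heads=64, d_head=128)
print(per_token / 2**20)           # 2.5 (MB per token)
print(per_token * 2048 / 2**30)    # 5.0 (GB for a 2048-token request)
```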
The Waste Problem with Naive Allocation
The most straightforward approach is to pre-allocate a contiguous memory region of size max_seq_len for each request’s KV Cache. The problem is that you do not know how many tokens a request will actually generate — it could be 10 or it could be 2000.
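A back-of-envelope sketch of that waste, using the LLaMA-7B per-token size from the table and made-up request lengths:

```python
# Hypothetical request lengths; per-token size is the LLaMA-7B figure (0.5 MB).
KV_PER_TOKEN = 0.5 * 2**20
MAX_SEQ_LEN = 2048

def naive_waste_fraction(actual_lengths):
    """Fraction of pre-allocated KV memory that is never written."""
    allocated = len(actual_lengths) * MAX_SEQ_LEN * KV_PER_TOKEN
    used = sum(actual_lengths) * KV_PER_TOKEN
    return 1 - used / allocated

print(f"{naive_waste_fraction([37, 152, 8, 420, 91]):.1%}")  # 93.1%
```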
The companion code demonstrates that when request lengths are non-uniform (which is nearly always the case in practice), naive pre-allocation wastes 60-80% of GPU memory:
5 requests, max_seq_len=2048
Actual lengths: [37, 152, 8, 420, 91]
Memory allocated: 160.0 MB
Memory actually used: 22.1 MB
Waste: 86.2%
--> This is why we need PagedAttention!
Wasting 86% of memory means a GPU that could handle 7 concurrent requests is limited to just 1. What you are wasting is not just memory — it is throughput and revenue.

PagedAttention
Borrowing Wisdom from Operating Systems
PagedAttention is the core contribution of vLLM (SOSP'23), and its inspiration comes from a technology that has been battle-tested for 50 years — OS virtual memory.
In an operating system, each process sees a virtual address space that appears contiguous and private. But physical memory is allocated in fixed-size pages, and a page table maps virtual addresses to physical addresses. The process never needs contiguous physical memory — the OS stitches together scattered physical pages into what looks like a contiguous virtual space behind the scenes.
PagedAttention transplants this idea to KV Cache management:
flowchart LR
subgraph OS["OS Virtual Memory"]
A1["Process"]
A2["Virtual page"]
A3["Physical page frame"]
A4["Page table"]
A5["Demand paging (page fault)"]
A6["Fixed page size (4 KB)"]
A7["CoW (fork)"]
A8["Page swap"]
end
subgraph PA["PagedAttention"]
B1["Request (sequence)"]
B2["Logical KV block"]
B3["Physical KV block"]
B4["Block table"]
B5["On-demand allocation (new token)"]
B6["Fixed block size (16 tokens)"]
B7["CoW (beam search)"]
B8["Preemption (swap out request)"]
end
A1 -.-> B1
A2 -.-> B2
A3 -.-> B3
A4 -.-> B4
A5 -.-> B5
A6 -.-> B6
A7 -.-> B7
A8 -.-> B8
style OS fill:#cce5ff,stroke:#004085
style PA fill:#d4edda,stroke:#155724
Block Allocation Mechanism
PagedAttention divides GPU KV Cache memory into a pool of fixed-size blocks. Each block stores the KV data for a fixed number of tokens (typically 16 or 32). Each request maintains a block table that maps its logical blocks to physical blocks.
flowchart TD
subgraph POOL["Physical Block Pool (GPU Memory)"]
B0["B0<br/>Seq0"]
B1["B1<br/>Seq1"]
B2["B2<br/>Free"]
B3["B3<br/>Seq0"]
B4["B4<br/>Seq1"]
B5["B5<br/>Free"]
B6["B6<br/>Seq0"]
B7["B7<br/>Free"]
end
subgraph T0["Seq 0 Block Table"]
L0_0["Logic 0 → B0"]
L0_1["Logic 1 → B3"]
L0_2["Logic 2 → B6"]
end
subgraph T1["Seq 1 Block Table"]
L1_0["Logic 0 → B1"]
L1_1["Logic 1 → B4"]
end
L0_0 --> B0
L0_1 --> B3
L0_2 --> B6
L1_0 --> B1
L1_1 --> B4
NOTE["Physical blocks need NOT be contiguous — only logical order matters"]
style POOL fill:#cce5ff,stroke:#004085
style B2 fill:#f8f9fa,stroke:#6c757d
style B5 fill:#f8f9fa,stroke:#6c757d
style B7 fill:#f8f9fa,stroke:#6c757d
The workflow:
- New request arrives: Allocate an empty block table — no blocks pre-allocated
- First token generated: Grab a block from the free pool, store the KV, update the block table
- Current block fills up (16 tokens worth of KV written): Grab another block from the free pool
- Request completes: Return all blocks to the free pool, immediately available for other requests
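This workflow can be sketched with block ids alone, no real KV tensors. The class and method names below are illustrative, not vLLM's actual API:

```python
class BlockManager:
    """Minimal sketch of PagedAttention block bookkeeping."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free physical block ids
        self.tables = {}                      # seq_id -> list of physical blocks
        self.filled = {}                      # seq_id -> tokens written so far

    def add_request(self, seq_id):
        self.tables[seq_id] = []              # empty block table, nothing pre-allocated
        self.filled[seq_id] = 0

    def append_token(self, seq_id):
        if self.filled[seq_id] % self.block_size == 0:    # first token, or block full
            self.tables[seq_id].append(self.free.pop())   # grab a block on demand
        self.filled[seq_id] += 1

    def free_request(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))         # immediately reusable
        del self.filled[seq_id]

bm = BlockManager(num_blocks=8)
bm.add_request("seq0")
for _ in range(20):                    # 20 tokens -> ceil(20/16) = 2 blocks
    bm.append_token("seq0")
print(len(bm.tables["seq0"]), len(bm.free))   # 2 6
```

Note that the physical block ids handed out need not be contiguous or ordered; only the block table's logical order matters.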
From 86% Waste to Under 3%
With PagedAttention, memory waste comes from exactly one source: internal fragmentation in each request’s last block. If block_size=16, the last block is on average half-full, wasting roughly $\frac{1}{2} \times \frac{1}{N_{\text{blocks}}}$ of total memory.
The companion code shows the comparison:
Config: 32 heads, head_dim=128, max_seq=2048, block_size=16
10 requests with lengths: [47, 183, 12, 891, 256, 5, 1024, 73, 330, 15]
Metric Naive Paged Actual
-----------------------------------------------------------------------
KV cache memory (MB) 320.0 45.5 44.3
Memory utilization 13.8% 97.4% 100.0%
Waste (MB) 275.7 1.2 0.0
Waste (%) 86.2% 2.6% 0.0%
Naive allocation wastes 86%; PagedAttention wastes just 2.6%. This means the same GPU can batch 2-4x more requests, directly translating to a 2-4x throughput increase.

Copy-on-Write: Efficient Sequence Forking
In beam search and parallel sampling, a request forks into multiple candidate sequences. These candidates share the prefix’s KV Cache — copying the prefix KV for every candidate would multiply memory consumption.
PagedAttention solves this with Copy-on-Write (CoW), again borrowing from the OS fork() system call:
flowchart TD
subgraph BEFORE["Before fork"]
S0a["Seq 0: Block A → Block B → Block C (ref=1)"]
end
subgraph AFTER["After fork (Seq 0 + Seq 1)"]
S0b["Seq 0: Block A → Block B → Block C"]
S1b["Seq 1: Block A → Block B → Block C"]
REF["Block C ref=2 (shared, no copy!)"]
end
subgraph WRITE["Seq 1 writes new token to Block C"]
S0c["Seq 0: Block A → Block B → Block C (ref=1)"]
S1c["Seq 1: Block A → Block B → Block C' (ref=1, copied now)"]
NOTE2["Only the modified block is copied; earlier blocks remain shared"]
end
BEFORE --> AFTER --> WRITE
style BEFORE fill:#cce5ff,stroke:#004085
style AFTER fill:#fff3cd,stroke:#856404
style WRITE fill:#d4edda,stroke:#155724
The core logic lives in append_token: before writing, check the block’s reference count. If it exceeds 1 (the block is shared), copy it first, then write:
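The check can be sketched as follows, with illustrative structures (a refcount map and plain lists for block tables; not SGLang's or vLLM's actual API):

```python
# Copy-on-write sketch for the last (shared) KV block.
ref_count = {}                     # physical block id -> sequences sharing it

def fork(table):
    """Fork a sequence: share all blocks, just bump refcounts (no copying)."""
    for b in table:
        ref_count[b] += 1
    return list(table)

def append_with_cow(table, alloc_block):
    """Before writing to the last block, copy it if it is shared."""
    last = table[-1]
    if ref_count[last] > 1:        # shared -> copy-on-write
        new = alloc_block()
        ref_count[last] -= 1
        ref_count[new] = 1
        table[-1] = new            # earlier blocks stay shared
    # ... write the new token's K/V into table[-1] here ...

free = [10, 11, 12]
parent = [0, 1, 2]
ref_count.update({0: 1, 1: 1, 2: 1})
child = fork(parent)               # all three blocks now have ref=2
append_with_cow(child, free.pop)   # only the last block is copied
print(parent, child)               # [0, 1, 2] [0, 1, 12]
```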
CoW pays off most dramatically in beam search: with beam_size=4, four candidate sequences share the vast majority of their prefix KV. Only the last block (the part being actively generated) needs independent storage. Memory cost approaches that of 1 sequence rather than 4.
RadixAttention (SGLang’s Core Innovation)
From PagedAttention to Prefix Caching
PagedAttention solves memory fragmentation within a single request. But it misses a larger optimization opportunity: KV Cache reuse across requests.
In real-world LLM applications, many requests share identical prefixes:
flowchart TD
subgraph S1["Scenario 1: System Prompt"]
SP["Shared prefix: You are a helpful assistant..."]
UA["+ User A: What is machine learning?"]
UB["+ User B: Write a Python function for..."]
UC["+ User C: Translate this text..."]
SP --> UA
SP --> UB
SP --> UC
end
subgraph S2["Scenario 2: Few-shot Learning"]
FS["Shared prefix: Here are some examples x5"]
Q1["+ Input: new query 1"]
Q2["+ Input: new query 2"]
FS --> Q1
FS --> Q2
end
subgraph S3["Scenario 3: Multi-turn Chat"]
MT["Shared prefix: System + Turn 1 + Turn 2"]
T3["+ User Turn 3 (new message)"]
MT --> T3
end
style S1 fill:#cce5ff,stroke:#004085
style S2 fill:#d4edda,stroke:#155724
style S3 fill:#fff3cd,stroke:#856404
If a system prompt is 500 tokens long and every request recomputes and stores those 500 tokens’ KV Cache independently, then 100 concurrent requests store 100 identical copies — a massive waste.
The prefix caching insight: if two requests share the same token prefix, their KV Cache at those positions is identical and can be shared directly, with no recomputation needed.
vLLM eventually added prefix caching too, but its implementation matches cached prefixes by hashing token blocks, so sharing is detected at fixed block granularity. SGLang’s RadixAttention offers a more flexible, token-level approach.
Radix Tree: A Data Structure for Indexing Arbitrary Prefixes
SGLang (LMSYS, 2024) introduces a Radix Tree (radix trie) to index and manage all cached KV data.
A radix tree is a compressed trie. In a standard trie, each node represents a single character; a radix tree compresses single-child paths into a single node, reducing tree depth. In SGLang, each node represents a token sequence segment, and nodes store pointers to the corresponding KV Cache blocks.
graph TD
ROOT["Root"]
SYS["System Prompt<br/>KV blocks: B0, B1, B2"]
ML["User: What is ML?<br/>KV blocks: B3, B4"]
PY["User: Write Python<br/>KV blocks: B5, B6"]
AST["Asst: ML is...<br/>KV blocks: B7"]
ROOT --> SYS
SYS --> ML
SYS --> PY
ML --> AST
REQD["New Req D: System Prompt + What is ML? + new query<br/>Matches B0-B4 directly, only new query needs fresh KV"]
style SYS fill:#cce5ff,stroke:#004085
style ML fill:#d4edda,stroke:#155724
style PY fill:#d4edda,stroke:#155724
style AST fill:#fff3cd,stroke:#856404
style REQD fill:#f8f9fa,stroke:#6c757d
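The matching logic can be sketched with a plain token-level trie. (A real radix tree additionally compresses single-child paths into one node, but longest-prefix matching behaves the same; all names here are illustrative.)

```python
# Token-level prefix index sketch for cached KV.
class Node:
    def __init__(self):
        self.children = {}        # token -> Node
        self.kv_block = None      # illustrative handle to this position's cached KV

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())
        node.kv_block = f"kv[{t}]"    # pretend this token's KV is now cached

def match_prefix(root, tokens):
    """Return how many leading tokens already have cached KV."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = Node()
insert(root, [1, 2, 3, 4, 5])            # e.g. a cached system prompt
print(match_prefix(root, [1, 2, 3, 9]))  # 3 tokens reusable; prefill only the rest
```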

Workflow
When a new request arrives, SGLang’s scheduler executes the following steps:
flowchart TD
subgraph FLOW["RadixAttention: Processing a New Request"]
A["1. PREFIX MATCHING<br/>Search Radix Tree for longest matching prefix<br/>e.g., sys_prompt KV blocks are reused directly"]
B["2. REUSE CACHED KV<br/>Skip prefill for the matched prefix<br/>Only compute KV for the unmatched suffix"]
C["3. INSERT INTO TREE<br/>After completion, insert new KV into Radix Tree<br/>Future requests with the same prefix can reuse it"]
A --> B --> C
end
subgraph TIMELINE["Timeline Comparison"]
T1["No cache: ====== prefill 500 tokens ====== decode"]
T2["With cache: == prefill 100 tokens == decode<br/>400 tokens KV reused from cache, skipped"]
end
FLOW --> TIMELINE
style FLOW fill:#cce5ff,stroke:#004085
style T1 fill:#fff3cd,stroke:#856404
style T2 fill:#d4edda,stroke:#155724
In the system prompt scenario (where nearly every request shares the same prompt), prefix caching skips the bulk of prefill computation, reducing TTFT (Time To First Token) by 50-90%.
LRU Eviction
GPU memory is finite — you cannot cache every prefix you have ever seen. When the block pool is full, some cached KV data must be evicted to make room for new requests.
SGLang uses LRU (Least Recently Used) eviction: the KV Cache node that has gone the longest without being accessed is evicted first.
flowchart TD
subgraph NODES["Radix Tree Nodes (sorted by last access time)"]
N1["sys_prompt + user_A — 10 sec ago — keep"]
N2["sys_prompt + user_B — 30 sec ago — keep"]
N3["sys_prompt + user_C — 120 sec ago — eviction candidate"]
N4["old_prompt + old_query — 300 sec ago — evict first"]
end
subgraph EVICT["Eviction Process"]
E1["1. Take least-recently-used node from LRU list"]
E2["2. Free that node's KV blocks"]
E3["3. Remove node from Radix Tree"]
E4["4. If parent becomes childless leaf, recursively clean up"]
E1 --> E2 --> E3 --> E4
end
style N1 fill:#d4edda,stroke:#155724
style N2 fill:#d4edda,stroke:#155724
style N3 fill:#fff3cd,stroke:#856404
style N4 fill:#f8d7da,stroke:#721c24
LRU works well here because of temporal locality: recently accessed prefixes are likely to be accessed again (the same user continuing a multi-turn conversation, or a popular system prompt being reused across many concurrent requests).
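The bookkeeping can be sketched with an OrderedDict as the recency list (structures are illustrative, not SGLang's implementation):

```python
from collections import OrderedDict

# LRU eviction sketch over cached prefixes; OrderedDict order = recency.
class LRUPrefixCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0
        self.nodes = OrderedDict()     # prefix id -> KV blocks held

    def touch(self, prefix_id):
        self.nodes.move_to_end(prefix_id)              # mark most recently used

    def add(self, prefix_id, n_blocks):
        while self.used + n_blocks > self.capacity and self.nodes:
            _, freed = self.nodes.popitem(last=False)  # evict least recently used
            self.used -= freed
        self.nodes[prefix_id] = n_blocks
        self.used += n_blocks

cache = LRUPrefixCache(capacity_blocks=10)
cache.add("old_prompt", 4)
cache.add("sys_prompt", 3)
cache.touch("old_prompt")      # a user reuses old_prompt: it becomes most recent
cache.add("new_prompt", 4)     # over capacity: sys_prompt is evicted, not old_prompt
print(list(cache.nodes))       # ['old_prompt', 'new_prompt']
```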
Cache-Aware Scheduling
With the Radix Tree in place, the scheduler can make a remarkably effective optimization: prioritize requests that have longer prefix matches in the cache.
flowchart TD
subgraph QUEUE["Waiting Queue"]
RA["Req A: Matches 500 cached tokens → prefill only 100 tokens"]
RB["Req B: Matches 0 tokens → prefill needs 600 tokens"]
RC["Req C: Matches 300 cached tokens → prefill only 200 tokens"]
end
subgraph SCHED["Scheduling Strategy Comparison"]
FCFS["Naive FCFS: A → B → C"]
CACHE["Cache-aware: A → C → B (prioritize high cache hits)"]
end
subgraph BENEFIT["Benefits"]
B1["1. High cache-hit requests prefill faster → lower TTFT"]
B2["2. KV stays in tree → similar future requests also hit → positive feedback"]
B3["3. Less total prefill compute → more GPU cycles for decode"]
end
QUEUE --> SCHED --> BENEFIT
style RA fill:#d4edda,stroke:#155724
style RB fill:#f8d7da,stroke:#721c24
style RC fill:#fff3cd,stroke:#856404
style CACHE fill:#d4edda,stroke:#155724
This scheduling strategy gives SGLang a significant edge in workloads with shared prefixes — few-shot learning, multi-turn chat, and any application using system prompts.
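The ranking itself is essentially one sort: order the waiting queue by how many uncached tokens each request would need to prefill. A hypothetical sketch mirroring the queue above (field names are made up):

```python
# Cache-aware admission sketch: fewest uncached tokens first.
queue = [
    {"id": "A", "prompt_len": 600, "cached_prefix": 500},
    {"id": "B", "prompt_len": 600, "cached_prefix": 0},
    {"id": "C", "prompt_len": 500, "cached_prefix": 300},
]

def prefill_cost(req):
    """Tokens that still need fresh prefill compute."""
    return req["prompt_len"] - req["cached_prefix"]

order = sorted(queue, key=prefill_cost)
print([r["id"] for r in order])   # ['A', 'C', 'B']
```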
Comparison with vLLM
vLLM (v0.3+) later added prefix caching as well, but with a different design philosophy:
| Feature | SGLang (RadixAttention) | vLLM (Prefix Caching) |
|---|---|---|
| Data structure | Radix Tree | Hash Map |
| Prefix matching | Automatic matching of any shared prefix | Token block hash-based matching |
| Matching granularity | Token-level | Block-level (block_size-aligned) |
| Eviction policy | LRU on tree nodes | LRU on blocks |
| Cache-aware scheduling | Native support | Added in later versions |
| Multi-turn conversation | Natural fit (tree structure) | Requires hash matching |
SGLang’s Radix Tree approach has clear advantages in flexibility:
- Arbitrary prefix sharing: Not limited to predefined prompt templates — any two requests with a common prefix are automatically matched
- Tree structure naturally fits multi-turn dialogue: Conversation history naturally forms a tree, with different branches representing different follow-up exchanges
- Token-level precision: Matching can occur at exact token boundaries, not just at block-aligned positions
vLLM’s hash-based approach has advantages in implementation simplicity and stability at scale. In practice, the performance gap depends heavily on the specific workload characteristics.
Continuous Batching
The Problem with Static Batching
Now that we understand KV Cache memory management, let us look at a different dimension of optimization: scheduling.
The most straightforward inference approach is static batching: assemble a group of requests into a batch, run prefill together, then decode together, and wait until every request in the batch finishes before processing the next batch.
The problem: output lengths vary wildly across requests.
flowchart TD
subgraph STATIC["Static Batching — Batch = Req 0-3, Output lengths: 3, 8, 2, 5"]
direction LR
R0["Req 0: ## ## ## .. .. .. .. ..<br/>Done at step 3"]
R1["Req 1: ## ## ## ## ## ## ## ##<br/>Longest — everyone waits"]
R2["Req 2: ## ## .. .. .. .. .. ..<br/>Done at step 2"]
R3["Req 3: ## ## ## ## ## .. .. ..<br/>Done at step 5"]
end
WASTE["## = useful compute · .. = GPU idle<br/>GPU utilization: 18/32 = 56.25%<br/>43.75% of GPU cycles wasted<br/>Req 4, 5 queued — must wait for Req 1 to finish"]
style STATIC fill:#fff3cd,stroke:#856404
style WASTE fill:#f8d7da,stroke:#721c24
When output length variance is high (and it nearly always is in practice — one user asks “what time is it?” while another requests a long essay), static batching can drive GPU utilization as low as 20-50%.
Iteration-Level Scheduling
Continuous Batching (Orca, OSDI'22) makes one fundamental change: it drops the scheduling granularity from batch level to iteration level.
After every single decode step, the scheduler checks:
- Have any requests finished (generated EOS or hit max_length)? If so, evict them and free their KV Cache.
- Are there waiting requests that can be admitted? If so, admit them and begin their prefill.
flowchart TD
subgraph CB["Continuous Batching — Iteration-Level Scheduling"]
direction LR
R0["Req 0: ## ## ## — done step 3"]
R1["Req 1: ## ## ## ## ## ## ## ## — done step 8"]
R2["Req 2: ## ## — done step 2"]
R3["Req 3: ## ## ## ## ## — done step 5"]
R4["Req 4: admitted step 3 → ## ## ## ##"]
R5["Req 5: admitted step 4 → ## ## ## ## ## ##"]
end
RESULT["Active count: 4 4 4 4 4 3 3 3 1<br/>GPU stays nearly full! Finished requests immediately replaced"]
style CB fill:#d4edda,stroke:#155724
style RESULT fill:#cce5ff,stroke:#004085
The companion code demonstrates that continuous batching typically delivers a 2-3x throughput improvement:
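A toy version of that comparison, where the cost model (one token per request per step) and all output lengths are made up:

```python
import random

# Static vs continuous batching: count decode steps to drain the same workload.
random.seed(0)
lengths = [random.randint(1, 20) for _ in range(40)]   # output lengths per request
BATCH = 4

def static_steps(lengths):
    """Each batch of 4 runs until its longest member finishes."""
    return sum(max(lengths[i:i + BATCH]) for i in range(0, len(lengths), BATCH))

def continuous_steps(lengths):
    """Finished requests are evicted and replaced at every iteration."""
    waiting, running, steps = list(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < BATCH:   # admit at iteration level
            running.append(waiting.pop(0))
        running = [r - 1 for r in running]        # one decode step for everyone
        running = [r for r in running if r > 0]   # evict finished requests
        steps += 1
    return steps

s, c = static_steps(lengths), continuous_steps(lengths)
print(s, c, f"speedup ~{s / c:.2f}x")
```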

Preemption: Graceful Handling of Memory Pressure
Continuous batching introduces another critical mechanism: preemption. When there are not enough free KV Cache blocks to allocate for all running requests, the scheduler can:
- Pause one or more running requests (typically the most recently admitted — LIFO strategy, since they have done the least work)
- Free the paused request’s KV Cache blocks
- Use the reclaimed space to continue serving other requests
- Resume the paused request later (re-prefill to rebuild its KV Cache)
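The decision logic can be sketched as follows; all structures are illustrative, and a real engine would also track the preempted requests for later resumption:

```python
# LIFO preemption sketch: when free KV blocks run out, pause the most recently
# admitted request and reclaim its blocks.
def try_allocate(need, free_blocks, running):
    """running: list of (seq_id, blocks_held), oldest first.
    Returns (free blocks left after allocating, preempted seq ids), or
    (None, preempted) if even preempting everyone is not enough."""
    preempted = []
    while free_blocks < need and running:
        seq_id, held = running.pop()      # LIFO: newest request has done least work
        free_blocks += held               # its KV blocks are reclaimed
        preempted.append(seq_id)          # it will re-prefill when resumed
    if free_blocks < need:
        return None, preempted
    return free_blocks - need, preempted

free_after, paused = try_allocate(
    need=30,
    free_blocks=10,
    running=[("req0", 50), ("req1", 40), ("req3", 20)],
)
print(free_after, paused)    # 0 ['req3']
```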
flowchart TD
A["Normal operation<br/>Running: Req 0, 1, 2, 3<br/>KV blocks: 200/256 (78%)"]
B["Req 1 generates long response, KV keeps growing...<br/>KV blocks: 253/256 (99%) — about to OOM!"]
C["Preemption triggered<br/>Pause Req 3 (LIFO: newest, least work done)<br/>Free Req 3's 20 blocks"]
D["KV blocks: 233/256 (91%)<br/>Room to continue"]
E["Later, Req 0/1 finishes, space freed<br/>→ Resume Req 3, re-prefill to rebuild KV"]
F["Alternative: no preemption → OOM crash → ALL requests fail!"]
A --> B --> C --> D --> E
B -.-> F
style A fill:#d4edda,stroke:#155724
style B fill:#f8d7da,stroke:#721c24
style C fill:#fff3cd,stroke:#856404
style D fill:#d4edda,stroke:#155724
style F fill:#f8d7da,stroke:#721c24
Preemption has a cost: the paused request must re-prefill (rebuild its KV Cache) when resumed, increasing that request’s latency. But compared to an OOM crash that kills every request in flight, this is a very reasonable trade-off.
The companion code’s preemption demo shows that even under extremely constrained memory (32 blocks), preemption ensures all requests eventually complete:
Config: 20 requests, 32 KV blocks, max_batch=4
Preemptions triggered: 12
All 20 requests completed: True
Preemption allows the system to handle load spikes without OOM crashes.
Chunked Prefill
Long Prompts Blocking Decode
The prefill phase (processing the user’s input prompt) and the decode phase (generating tokens one at a time) have fundamentally different compute characteristics:
- Prefill: Processes the entire prompt at once (potentially thousands of tokens) — a compute-bound large GEMM operation
- Decode: Generates 1 token per step — a memory-bound GEMV operation
The problem: if a request’s prompt is 8000 tokens, prefill might take several hundred milliseconds. During that time, every decode request is blocked — they are all waiting for the GPU to finish this one long prefill.
flowchart TD
subgraph NOCHUNK["Without Chunked Prefill"]
GPU1["GPU: ====== Prefill 8K tokens (200ms) ====== → D D D D"]
RA1["Req A: waiting 200ms... then decode"]
RB1["Req B: waiting 200ms... then decode"]
end
PROBLEM["Problem: Req A, B TPOT spikes from 10ms to 200ms+<br/>User experience: streaming text suddenly freezes for 200ms"]
style NOCHUNK fill:#fff3cd,stroke:#856404
style PROBLEM fill:#f8d7da,stroke:#721c24
For a user in the middle of a streaming conversation, TPOT jumping from 10ms to 200ms produces a very noticeable stutter — the text was flowing smoothly, then it just… stops.
Splitting Prefill into Chunks
Chunked Prefill takes a straightforward approach: break the long prompt’s prefill into multiple smaller chunks (e.g., 512 tokens each), and interleave decode steps between chunks.
flowchart TD
subgraph CHUNK["Chunked Prefill — 8192 tokens, chunk_size=512, 16 chunks"]
GPU2["GPU: P1 → D → P2 → D → P3 → D → P4 → D → P5 ..."]
LEGEND["P = prefill chunk (512 tokens, ~12ms) · D = decode step (~3ms)"]
end
subgraph DECODE["From decode requests' perspective"]
RA2["Req A: D — D — D — D — D (every ~15ms)"]
RB2["Req B: D — D — D — D — D (every ~15ms)"]
end
RESULT2["TPOT: ~15ms (stable!) vs 200ms spike without chunking"]
CHUNK --> DECODE
DECODE --> RESULT2
style CHUNK fill:#d4edda,stroke:#155724
style RESULT2 fill:#cce5ff,stroke:#004085
The trade-off:
- Total prefill time increases: Because decode steps are interleaved between chunks, an 8K-token prefill goes from 200ms to roughly 240ms (TTFT goes up slightly)
- Decode latency becomes stable: TPOT drops from a 200ms spike to a consistent ~15ms
This is a TTFT vs TPOT trade-off:
| Metric | Without Chunked Prefill | With Chunked Prefill |
|---|---|---|
| TTFT (first token) | 200ms | ~240ms (+20%) |
| TPOT (worst case) | 200ms (blocked) | ~15ms (stable) |
| TPOT (P99) | Very poor | Well-controlled |
| User experience | Occasional severe stutter | Smooth and consistent |
For most online serving scenarios, stable TPOT matters far more than slightly lower TTFT — users care more about “smooth streaming” than “seeing the first character 40ms sooner.”
Choosing the Chunk Size
Chunk size requires balancing competing concerns:
- Too small (e.g., 64 tokens): Prefill GPU utilization suffers (small matrix multiplies are inefficient), and TTFT increases significantly
- Too large (e.g., 4096 tokens): Approaches the unchunked case, and decode requests still experience long blocks
- Typical range: 512-1024 tokens, balancing prefill efficiency with decode stability
Both SGLang and vLLM support chunked prefill and can dynamically adjust chunk size based on the current running batch — if no decode requests are active, there is no need to chunk at all, so the system does a full prefill to minimize TTFT.
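The interleaving itself is simple to sketch; the chunk size and the fixed two-decodes-per-chunk pattern below are illustrative:

```python
# Chunked prefill sketch: emit one prefill chunk, then one decode step
# per active decode request, until the prompt is fully prefilled.
def schedule(prompt_len, chunk=512, decode_reqs=2):
    timeline, done = [], 0
    while done < prompt_len:
        n = min(chunk, prompt_len - done)
        timeline.append(f"P{n}")               # prefill chunk of n tokens
        done += n
        timeline.extend(["D"] * decode_reqs)   # decode requests are never starved
    return timeline

t = schedule(8192)
print(len([x for x in t if x.startswith("P")]))  # 16 chunks of 512
print(t[:6])                                     # ['P512', 'D', 'D', 'P512', 'D', 'D']
```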
System Architecture Overview
Putting all the pieces together, here is the overall architecture of a modern LLM inference engine like SGLang:
flowchart TD
HTTP["HTTP Server<br/>Receives user requests"]
subgraph SCHED["Scheduler (the brain)"]
WQ["Waiting Queue"]
RT["Radix Tree<br/>(Prefix Cache Index)"]
STEPS["Every iteration:<br/>1. Evict completed requests<br/>2. Match new request prefixes<br/>3. Cache-aware ranking + admit<br/>4. Preempt if memory insufficient<br/>5. Chunked prefill scheduling"]
end
subgraph BM["Block Manager (the memory)"]
POOL2["Physical Block Pool: B0 B1 B2 B3 B4 B5 ..."]
BMOPS["On-demand alloc/free · CoW · LRU eviction"]
end
subgraph ME["Model Executor (the compute)"]
ATTN["Attention kernel (FlashAttention / FlashInfer)"]
KV["Reads KV from Block Table (non-contiguous phys.)"]
PAR["TP / PP parallel inference"]
end
HTTP --> SCHED --> BM --> ME
style SCHED fill:#cce5ff,stroke:#004085
style BM fill:#fff3cd,stroke:#856404
style ME fill:#d4edda,stroke:#155724
The Scheduler is the brain — it decides which requests to run at each step, whether to prefill or decode, and whether preemption is needed. The Block Manager is the memory manager — it allocates, frees, and shares KV Cache blocks. The Model Executor is the compute engine — it takes the scheduler’s decisions and the block table mappings and executes the actual attention computation.
These three components work in concert to form an efficient inference pipeline.
Key Takeaways
KV Cache is the memory bottleneck of inference. Each token requires $2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{dtype_size}$ bytes of KV data. For LLaMA-70B, that is ~2.5 MB per token. With 100 concurrent requests at 2048 tokens each, the KV Cache alone needs 500 GB — far beyond any single GPU. How many requests you can batch and how long your sequences can be depends entirely on how efficiently you manage the KV Cache.
PagedAttention applies OS virtual memory to eliminate memory fragmentation. Fixed-size blocks are allocated on demand, a block table provides logical-to-physical mapping, and CoW supports efficient sequence forking. Memory waste drops from 60-80% to under 5%, allowing the same GPU to batch 2-4x more requests.
RadixAttention enables cross-request KV Cache reuse. A Radix Tree indexes all cached token prefixes; new requests automatically match their longest cached prefix and reuse the corresponding KV. Combined with LRU eviction and cache-aware scheduling, this dramatically reduces TTFT in shared-prefix workloads (system prompts, few-shot examples, multi-turn conversations).
Continuous Batching drops scheduling granularity from batch level to iteration level. Finished requests are evicted and waiting requests are admitted at every decode step, pushing GPU utilization from 20-50% to 80-95%. Preemption provides a safety valve against OOM crashes under memory pressure.
Chunked Prefill prevents long prompts from blocking decode. Splitting prefill into smaller chunks with interleaved decode steps trades a modest TTFT increase for dramatically more stable TPOT. Chunk size (typically 512-1024 tokens) balances prefill efficiency against decode smoothness.
These techniques are not alternatives — they stack. Modern inference engines (SGLang, vLLM) simultaneously employ PagedAttention + Prefix Caching + Continuous Batching + Chunked Prefill. Each technique addresses a different dimension of efficiency, and together they deliver 2-10x throughput over naive implementations.
Companion Code
The companion code for this article is in code/03-inference-sglang/:
- kv_cache_from_scratch.py — KV Cache and PagedAttention Block Manager built from scratch. Contains four parts: (1) demonstrating naive KV Cache waste; (2) full PagedAttention implementation with block allocation, deallocation, and CoW; (3) a mini decode loop using the block-managed KV cache; (4) head-to-head memory efficiency comparison of naive vs paged allocation. Run with: python kv_cache_from_scratch.py
- continuous_batching.py — Static batching vs continuous batching comparison simulation. Contains four parts: (1) throughput comparison between static and continuous batching; (2) ASCII scheduling timeline visualization; (3) preemption behavior under memory pressure; (4) scaling analysis across different batch sizes. Run with: python continuous_batching.py
All code runs in CPU mode (no GPU required), making it easy to study the core logic in any environment.
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., SOSP'23 (vLLM) — The original PagedAttention paper, bringing OS virtual memory concepts to KV Cache management.
- SGLang: Efficient Execution of Structured Language Model Programs — Zheng et al., 2024 (LMSYS) — The original paper introducing RadixAttention and cache-aware scheduling.
- Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al., OSDI'22 — The pioneering work on continuous batching (iteration-level scheduling).
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS'22 — The core attention kernel optimization used in inference engines.
- vLLM Documentation — docs.vllm.ai — Official vLLM documentation.
- SGLang Documentation — sgl-project.github.io — Official SGLang documentation.
- Sarathi-Serve: Chunked-Prefills for Efficient LLM Serving — Agrawal et al., 2024 — Systematic analysis of chunked prefill strategies.