
KV Cache & Flash Attention

🧠 Transformer Architecture · 10 min · 75 BASE XP

The KV Cache

During autoregressive generation, each new token requires attending to all previous tokens. Without caching, the model would recompute K and V for the entire history at every step.

The KV Cache stores the K and V vectors computed at earlier steps, so each new step only computes Q, K, and V for the single new token. This is essential for performance but creates a memory bottleneck:
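The caching pattern can be sketched with a toy single-head attention in NumPy (hypothetical weight names, no positional encoding; real implementations cache per layer and per head):

```python
import numpy as np

def attend(q, K, V):
    """Attention for one query vector over all cached keys/values (single head)."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

xs, K_cache, V_cache, outs = [], [], [], []
for _ in range(5):                      # autoregressive decode steps
    x = rng.standard_normal(d)          # hidden state of the new token
    xs.append(x)
    # Only the NEW token's K/V are computed; the history is reused from the cache.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    outs.append(attend(Wq @ x, np.array(K_cache), np.array(V_cache)))
```

Each iteration appends one K/V pair instead of re-projecting the whole history, which is exactly the work the cache saves.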

KV Cache Size = 2 × layers × heads × head_dim × seq_len × batch × bytes_per_param

For a 70B model at 4K context: ~5-10GB of VRAM just for the cache.
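Plugging Llama-2-70B-like shapes into the formula (assumed here: 80 layers, 64 heads, head_dim 128, FP16) reproduces that estimate; grouped-query attention (GQA), which shares KV heads, lands at the low end of the range:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_param=2):
    # The leading 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_param

# Assumed 70B-class shapes: 80 layers, head_dim 128, 4K context, FP16 (2 bytes).
full = kv_cache_bytes(80, 64, 128, 4096)   # all 64 heads cached (plain MHA)
gqa  = kv_cache_bytes(80, 8, 128, 4096)    # 8 shared KV heads (GQA)
print(f"MHA: {full / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB")
```

With full multi-head attention the cache is ~10 GiB at 4K context; with 8 KV heads it drops 8x, matching the ~5-10 GB figure above.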

⚠️ Critical: At long contexts (32K+ tokens), the KV cache often consumes more VRAM than the model weights themselves.

Flash Attention

Standard attention materializes the full N×N score matrix in GPU HBM (high-bandwidth but comparatively slow off-chip memory). Flash Attention uses tiling to break the computation into small blocks processed in fast on-chip SRAM, combining them with an online softmax so the full score matrix is never stored.
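The tiling-plus-online-softmax idea can be sketched in NumPy (single head, no causal mask, and purely illustrative; a real Flash Attention kernel fuses these steps on-chip and also tiles over queries):

```python
import numpy as np

def tiled_attention(Q, K, V, block=16):
    """Attention computed tile-by-tile with an online softmax, so the
    full N x N score matrix is never materialized."""
    n, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)
    m = np.full(n, -np.inf)    # running row-wise max of scores (for stability)
    l = np.zeros(n)            # running softmax normalizer
    for j in range(0, n, block):                   # stream over K/V tiles
        S = Q @ K[j:j + block].T / np.sqrt(d)      # small (n, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale previous partial sums
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return out / l[:, None]
```

Because every partial sum is rescaled when a larger score appears, the result is bit-for-bit the same softmax as the standard formulation, which is why Flash Attention is exact rather than an approximation.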

Flash Attention Evolution

Version | Key Feature
FA-1    | Tiling + fused kernels, 2-4x speedup
FA-2    | Better parallelism, variable-length sequences
FA-3    | Hopper/Blackwell native, FP8, async compute

Flash Attention is exact — it produces identical results to standard attention, just faster and with less memory.

KNOWLEDGE CHECK
QUERY 1 // 2
Why does the KV cache become a bottleneck at long contexts?
- It slows down tokenization
- It can consume more VRAM than model weights
- It reduces model accuracy
- It requires CPU computation
KV Cache & Flash Attention | Transformer Architecture — Open Source AI Academy