During autoregressive generation, each new token requires attending to all previous tokens. Without caching, the model would recompute K and V for the entire history at every step.
The KV cache stores each layer's already-computed K and V vectors, so at each step only the new token's Q, K, and V need to be computed. This is essential for performance but creates a memory bottleneck:
KV cache size = 2 × layers × heads × head_dim × seq_len × batch × bytes_per_param (the leading factor of 2 covers K and V)
For a 70B model at 4K context: ~5-10GB of VRAM just for the cache.
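The formula can be checked with a quick calculation. The hyperparameters below (80 layers, 64 heads, head_dim 128) are illustrative assumptions for a 70B-class model with full multi-head attention, not the spec of any particular checkpoint:

```python
# Worked example of the KV cache size formula.
# Hyperparameters are assumptions for a 70B-class model, not a real spec.
layers, heads, head_dim = 80, 64, 128
seq_len, batch = 4096, 1
bytes_per_param = 2  # fp16/bf16

# 2x for K and V, times per-layer, per-head, per-position storage.
cache_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_param
print(f"{cache_bytes / 1024**3:.1f} GiB")  # prints "10.0 GiB"
```

Grouped-query attention shrinks this by sharing K/V across query heads, which is why the real-world range is quoted as roughly 5-10 GB rather than a single number.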
Standard attention materializes the full N×N score matrix in GPU HBM (high-bandwidth but comparatively slow off-chip memory). Flash Attention instead tiles the computation into small blocks that fit in fast on-chip SRAM, so the full score matrix is never written out.
| Version | Key Feature |
|---|---|
| FA-1 | Tiling + fused kernels, 2-4x speedup |
| FA-2 | Better parallelism, variable-length sequences |
| FA-3 | Hopper/Blackwell native, FP8, async compute |
Flash Attention is exact: using an online (streaming) softmax, it rescales partial sums as each block is processed, so it produces results identical to standard attention, just faster and with less memory.
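The exactness claim can be verified numerically. The sketch below is a minimal NumPy illustration of block-wise attention with an online softmax, not the actual fused CUDA kernel; the block size and tensor shapes are arbitrary choices for the demo:

```python
import numpy as np

def standard_attention(Q, K, V):
    # Reference implementation: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # Flash-Attention-style tiling with online softmax: process K/V in
    # blocks, rescaling the running output and normalizer whenever the
    # running row-max changes. A simplified sketch, not a GPU kernel.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row max
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T * scale                  # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)             # rescale factor for old stats
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
assert np.allclose(standard_attention(Q, K, V), tiled_attention(Q, K, V))
```

Because each block's contribution is rescaled by `alpha` when a larger row-max appears, the final result matches the full softmax to floating-point precision; no approximation is involved.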