Virtually every modern language model is built on the Transformer architecture (Vaswani et al., 2017). At its core is the self-attention mechanism.
For every token in a sequence, the model computes three vectors: a query (Q), a key (K), and a value (V), each produced by a learned linear projection of the token's embedding.
The attention output is: Attention(Q,K,V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the key vectors; scaling by √d_k keeps the dot products from growing so large that the softmax saturates.
This lets each token "attend" to every other token, capturing long-range dependencies.
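The formula above can be sketched directly in NumPy. This is a minimal single-head version with no batching, masking, or learned projections, just the scaled dot-product itself:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) — one head, one sequence
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each row of the weights matrix says how much that token attends to every other token, which is exactly the "every token attends to every other token" behavior described above.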
Instead of a single attention computation, Multi-Head Attention (MHA) runs many heads in parallel, each free to learn a different kind of relationship (syntax, semantics, coreference). Large models typically use on the order of 32-128 heads.
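A sketch of how the heads fit together: the model dimension is split across heads, each head attends independently, and the results are concatenated and projected back. The weight names and sizes here are illustrative, not from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (seq_len, d_model); each W*: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    out = softmax(scores) @ V                            # (n_heads, seq, d_head)
    # Concatenate heads back into d_model, then apply the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
Y = multi_head_attention(X, *Ws, n_heads=n_heads)
print(Y.shape)  # (5, 16)
```

Because each head works in a d_model/n_heads subspace, running many heads costs roughly the same as one full-width head, while letting each specialize.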
| Variant | What It Does | Used By |
|---|---|---|
| MHA | Full Q/K/V per head | Original Transformer |
| GQA | Groups share K/V heads (reduces memory) | Llama 3/4, Mistral |
| MLA | Compresses KV cache via latent projection | DeepSeek-V3/R1 |
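The GQA row in the table can be made concrete: several query heads share one K/V head, so the KV cache stores only the smaller number of K/V heads, which are broadcast back to the query heads at compute time. A minimal sketch (head counts here are illustrative, not from any listed model):

```python
import numpy as np

def gqa_expand_kv(K, V, n_query_heads):
    # K, V: (n_kv_heads, seq_len, d_head) — fewer heads than queries
    n_kv_heads = K.shape[0]
    group = n_query_heads // n_kv_heads
    # Each contiguous group of query heads reuses the same K/V head
    return np.repeat(K, group, axis=0), np.repeat(V, group, axis=0)

n_q, n_kv, seq, d = 8, 2, 4, 16
rng = np.random.default_rng(0)
K = rng.standard_normal((n_kv, seq, d))
V = rng.standard_normal((n_kv, seq, d))
Kx, Vx = gqa_expand_kv(K, V, n_q)
print(Kx.shape)    # (8, 4, 16) — ready for 8 query heads
print(n_q // n_kv) # 4 — the KV cache is 4x smaller than full MHA here
```

MLA goes further in the same direction: instead of sharing heads, it caches a low-dimensional latent per token and reconstructs K and V from it, shrinking the cache below what head-sharing alone can achieve.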