
Self-Attention & Multi-Head Attention

🧠 Transformer Architecture · 10 min · 75 BASE XP

The Engine Behind Every LLM

Every modern language model is built on the Transformer architecture (Vaswani et al., 2017). At its core is the Self-Attention mechanism.

How Attention Works

For every token in a sequence, the model computes three vectors:

  • Query (Q): "What information am I looking for?"
  • Key (K): "What information do I contain?"
  • Value (V): "What information do I provide?"
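These three projections can be sketched in a few lines of NumPy. The dimensions and the random weight matrices below are purely illustrative stand-ins for a model's learned parameters, not values from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8   # illustrative sizes: four tokens, 8-dim embeddings

X = rng.standard_normal((seq_len, d_model))   # token embeddings

# In a real model these are learned weight matrices;
# random matrices stand in for them here.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# One Q, K, and V vector per token, each a linear projection of its embedding.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)   # (4, 8) (4, 8) (4, 8)
```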

The attention output is computed as: Attention(Q,K,V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the key vectors; dividing by √d_k keeps the dot products from growing too large and saturating the softmax.

This lets each token "attend" to every other token, capturing long-range dependencies.
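The formula above maps directly to code. Here is a minimal scaled dot-product attention in NumPy, with random Q/K/V as placeholder inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq): token-to-token similarities
    weights = softmax(scores)          # each row is a distribution over tokens
    return weights @ V, weights        # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out, w = attention(Q, K, V)
print(out.shape)                         # (4, 8) — one output vector per token
print(np.allclose(w.sum(axis=1), 1.0))   # True — softmax rows sum to 1
```

Row i of the weight matrix shows how strongly token i attends to every other token, which is exactly the long-range dependency capture described above.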

Multi-Head Attention (MHA)

Instead of one attention computation, MHA runs multiple heads in parallel, each learning different relationship types (syntax, semantics, coreference). Modern large models typically use 32-128 heads.
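MHA splits the model dimension across heads, runs attention in each head independently, then concatenates the results. A minimal sketch, reusing the scaled dot-product formula and illustrative sizes (4 heads over a 16-dim model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, n_heads):
    # (seq, d_model) -> (n_heads, seq, d_head)
    seq, d_model = x.shape
    return x.reshape(seq, n_heads, d_model // n_heads).transpose(1, 0, 2)

def multi_head_attention(Q, K, V, n_heads):
    Qh, Kh, Vh = (split_heads(t, n_heads) for t in (Q, K, V))
    d_head = Qh.shape[-1]
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head (seq, seq) scores
    out = softmax(scores) @ Vh                              # (n_heads, seq, d_head)
    seq = Q.shape[0]
    return out.transpose(1, 0, 2).reshape(seq, -1)          # concatenate the heads

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))
K = rng.standard_normal((4, 16))
V = rng.standard_normal((4, 16))

print(multi_head_attention(Q, K, V, n_heads=4).shape)   # (4, 16)
```

Because each head sees only a 4-dim slice, the four heads together cost about the same as one full-width attention, yet each can specialize in a different relationship type.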

Modern Variants

Variant | What It Does                                     | Used By
--------|--------------------------------------------------|----------------------
MHA     | Full Q/K/V per head                              | Original Transformer
GQA     | Groups of query heads share K/V (reduces memory) | Llama 3/4, Mistral
MLA     | Compresses the KV cache via a latent projection  | DeepSeek-V3/R1
💡 Pro Tip: GQA is the current industry default — it provides 90%+ of MHA quality with significantly less memory usage.
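The memory saving in GQA comes from storing fewer K/V heads and letting each serve a group of query heads. A minimal sketch, with illustrative sizes (4 query heads sharing 2 K/V heads, so the KV cache halves):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V, n_q_heads, n_kv_heads):
    # Q carries n_q_heads heads; K/V carry fewer, each shared by a group of Q heads.
    seq = Q.shape[0]
    d_head = Q.shape[1] // n_q_heads
    Qh = Q.reshape(seq, n_q_heads, d_head).transpose(1, 0, 2)
    Kh = K.reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)
    Vh = V.reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)

    group = n_q_heads // n_kv_heads
    Kh = np.repeat(Kh, group, axis=0)   # replicate each K/V head across its group
    Vh = np.repeat(Vh, group, axis=0)

    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ Vh
    return out.transpose(1, 0, 2).reshape(seq, -1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))   # 4 query heads of size 4
K = rng.standard_normal((4, 8))    # only 2 K/V heads — half the cache to store
V = rng.standard_normal((4, 8))

print(grouped_query_attention(Q, K, V, n_q_heads=4, n_kv_heads=2).shape)   # (4, 16)
```

Only K and V are cached during generation, so cutting their head count directly shrinks inference memory while the query side keeps its full expressiveness.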
KNOWLEDGE CHECK
QUERY 1 // 2
What are the three vectors computed in self-attention?
  • Input, Output, Hidden
  • Query, Key, Value
  • Encoder, Decoder, Attention
  • Weight, Bias, Gradient
Open Source AI Academy