DeepSeek stunned the industry by producing models that achieve GPT-4-class performance at a fraction of the training cost, all released under the MIT license.
| Innovation | What It Does | Why It Matters |
|---|---|---|
| DeepSeekMoE | 671B parameters total, only 37B activated per token | Capacity of a massive model at the inference cost of a much smaller one |
| Multi-head Latent Attention (MLA) | Compresses the KV cache via learned low-rank projections (sketched below) | Dramatically reduces memory for long contexts |
| Multi-Token Prediction (MTP) | Predicts multiple future tokens simultaneously | Denser training signals, better understanding |
| Auxiliary-loss-free Load Balancing | Balances expert usage via per-expert routing biases instead of an extra loss term (sketched below) | Avoids the performance degradation that forced balancing losses cause |
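
To make MLA concrete, here is a minimal PyTorch sketch of the core idea: instead of caching full per-head keys and values, each token caches one small latent vector from which K and V are reconstructed at attention time. The module, names, and dimensions are illustrative assumptions, not DeepSeek's implementation; the real MLA also compresses queries and uses a decoupled RoPE path, both omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Sketch of MLA-style KV-cache compression (illustrative, not DeepSeek's code)."""

    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_down = nn.Linear(d_model, d_latent)   # compress: only this output is cached
        self.w_up_k = nn.Linear(d_latent, d_model)   # reconstruct K from the latent
        self.w_up_v = nn.Linear(d_latent, d_model)   # reconstruct V from the latent
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.w_down(x)                      # (b, t, d_latent)
        if latent_cache is not None:                 # append past tokens' latents at decode time
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_up_k(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Causal mask applies during prefill; single-token decode attends to everything cached.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(latent_cache is None))
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), latent                 # caller stores `latent` as the KV cache
```

In this toy configuration the cache holds 128 floats per token instead of 2 × 1024 for full K and V, a 16× reduction, which is where the long-context memory savings come from.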
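
The auxiliary-loss-free balancing idea also fits in a few lines. This sketch follows the DeepSeek-V3 paper's description as I read it: a per-expert bias steers which experts get *selected*, while the mixing weights still come from the unbiased scores, and the bias is nudged after each step according to observed load. The function names and the update speed `gamma` are illustrative assumptions.

```python
import torch

def biased_topk_routing(scores, bias, k=2):
    """Select top-k experts using biased scores, but mix with unbiased gates.
    scores: (n_tokens, n_experts) router affinities; bias: (n_experts,)."""
    topk_idx = (scores + bias).topk(k, dim=-1).indices          # bias affects selection only
    gates = torch.gather(scores, 1, topk_idx).softmax(dim=-1)   # mixing weights stay unbiased
    return topk_idx, gates

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """After each batch, make overloaded experts less attractive and
    underloaded ones more so (gamma is an illustrative step size)."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```

Because no gradient-carrying auxiliary loss tugs against the language-modeling objective, balance is achieved without paying for it in model quality.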
On top of these base-model innovations, DeepSeek-R1 adds explicit chain-of-thought reasoning (emitted inside `<think>` tags), trained via GRPO reinforcement learning.
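
GRPO (Group Relative Policy Optimization) replaces PPO's learned value network with a group baseline: several completions are sampled per prompt, and each one's advantage is its reward normalized against its own group. A minimal sketch, assuming per-sequence log-probabilities; real implementations work per-token and add a KL penalty to a reference policy, both omitted here.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each completion's reward
    against the mean/std of its own group. rewards: (n_groups, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate; the group baseline above replaces the critic.
    All tensors: (n_groups, group_size)."""
    ratio = (logp_new - logp_old).exp()              # importance ratio new/old policy
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()     # maximize the clipped objective
```

Dropping the critic saves the memory and compute of training a second model, one of GRPO's main selling points at this scale.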