Meta's Llama 4 family (April 2025) introduced a Mixture-of-Experts (MoE) architecture, a departure from the dense designs of every previous Llama generation. The lineup:
| Model | Total Params | Active Params | Experts | Context |
|---|---|---|---|---|
| Scout | 109B | 17B | 16 | 10M tokens |
| Maverick | 400B | 17B | 128 | 1M tokens |
| Behemoth | ~2T | 288B | — | Unreleased |
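The split between the two parameter columns follows directly from the table; a trivial Python check makes it concrete (the numbers are copied from the rows above):

```python
# Active vs. total parameters, taken from the table above. MoE decouples
# model capacity (total params) from per-token compute (active params).
specs = {"Scout": (109, 17), "Maverick": (400, 17)}  # (total B, active B)
for name, (total_b, active_b) in specs.items():
    print(f"{name}: {active_b / total_b:.1%} of parameters active per token")
# Scout: 15.6%; Maverick: 4.2%
```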
In a dense model, every parameter participates in every token's forward pass. In an MoE model, a lightweight router network selects only a few "expert" sub-networks per token, so compute per token scales with the 17B active parameters while the weights of every expert must still be held in memory.
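The sketch below shows generic top-k expert routing in PyTorch. It is illustrative only, not Meta's implementation; the class name, layer sizes, and `top_k` value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Generic top-k MoE layer: score all experts, run only the top few."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert per token...
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():  # ...but only the selected experts ever run
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because every expert stays resident in memory even though only a few fire per token, memory requirements track total parameters rather than active ones. Rough hardware requirements for local inference: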
| Model | Quantization | Min VRAM | Recommended Hardware |
|---|---|---|---|
| Scout | Q4_K_M | ~48GB | 2× RTX 4090 or 1× A100 80GB |
| Maverick | Q4_K_M | ~200GB | Multi-GPU cluster (4-8× A100) |
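These figures can be sanity-checked with a back-of-envelope weight-memory estimate. The ~4.8 bits per weight for Q4_K_M is an approximation, and the calculation ignores KV cache and activation overhead:

```python
# Approximate weight footprint of a quantized checkpoint:
# billions of params * bits per weight / 8 bits per byte = GB.
def weight_memory_gb(total_params_billion: float, bits_per_weight: float = 4.8) -> float:
    return total_params_billion * bits_per_weight / 8

for name, params_b in [("Scout", 109), ("Maverick", 400)]:
    print(f"{name}: ~{weight_memory_gb(params_b):.0f} GB of weights at Q4_K_M")
# Scout: ~65 GB; Maverick: ~240 GB
```

These full-residency estimates sit above the table's minimums, so a floor like Scout's ~48GB presumably assumes offloading part of the model to system RAM. Either way, it is the total parameter count, not the 17B active, that sets the memory bill.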