Llama 4: Architecture & Models

🦙 The Meta Llama Family

The Llama 4 Family

Meta's Llama 4 (April 2025) introduced a Mixture-of-Experts (MoE) architecture — a paradigm shift from previous dense Llama models.

Model Comparison

Model    | Total Params | Active Params | Experts | Context
Scout    | 109B         | 17B           | 16      | 10M tokens
Maverick | 400B         | 17B           | 128     | 1M tokens
Behemoth | ~2T          | 288B          | 16      | Unreleased

Mixture-of-Experts Explained

In a dense model, every parameter activates for every token. In MoE, a router network selects only a few "expert" sub-networks per token. This means:

  • Maverick has 400B total parameters but only runs 17B per token
  • Inference cost is proportional to active parameters, not total
  • You get large-model quality at small-model speed
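To make the routing concrete, below is a minimal NumPy sketch of a top-k MoE layer for a single token. Everything here (`moe_forward`, the toy linear experts) is illustrative rather than Meta's implementation; in particular, Llama 4's MoE layers also route every token through an always-on shared expert, which this sketch omits.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=1):
    """Route one token through the top-k experts (illustrative sketch)."""
    logits = router_w @ x                      # one routing score per expert
    top_k = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                   # softmax over selected experts
    # Only the chosen experts execute; all other parameters stay idle,
    # so per-token compute scales with ACTIVE params, not TOTAL params.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy demo: 4 experts over an 8-dim hidden state, routing each token to 1.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((d, d)): W @ v
           for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d))
print(moe_forward(rng.standard_normal(d), router_w, experts).shape)  # (8,)
```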
💡 Scout's 10M Token Context: At launch, the largest context window of any open-weight model. At a rough ~4 characters per token, 10M tokens is on the order of 40 MB of raw text, so you can ingest entire codebases or book collections in a single prompt.

Hardware Requirements

Model    | Quantization | Min VRAM | Recommended
Scout    | Q4_K_M       | ~48GB    | 2× RTX 4090 or 1× A100 80GB
Maverick | Q4_K_M       | ~200GB   | Multi-GPU cluster (4-8× A100)
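These numbers follow from a rule of thumb: memory footprint scales with total parameters (every expert must stay resident, even the idle ones), while per-token compute scales with active parameters (roughly 2 FLOPs per active weight per token). Here is a back-of-envelope sketch with illustrative numbers; real deployments also need headroom for the KV cache and runtime overhead.

```python
def weight_vram_gb(total_params_b, bits_per_weight):
    """Weight memory in GB for a model with `total_params_b` billion params.
    MoE must keep ALL experts loaded, so this uses TOTAL params."""
    return total_params_b * bits_per_weight / 8

def gflops_per_token(active_params_b):
    """Decode compute: roughly 2 FLOPs per ACTIVE parameter per token."""
    return 2 * active_params_b

# Maverick: 400B total / 17B active, ~4 bits per weight under Q4 quantization.
print(weight_vram_gb(400, 4))    # 200.0 -> matches the ~200GB in the table
print(gflops_per_token(17))      # 34.0 GFLOPs/token for MoE Maverick
print(gflops_per_token(400))     # 800.0 GFLOPs/token for a dense 400B model
```

That asymmetry is the whole MoE trade: Maverick decodes like a 17B model but stores like a 400B one, which is why it needs a multi-GPU cluster despite its small-model speed.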
KNOWLEDGE CHECK
QUERY 1 // 2
How many parameters does Llama 4 Maverick activate per token?
  • 400B
  • 288B
  • 17B
  • 109B