Raw pre-trained models are "completion engines": they continue text rather than follow instructions. Alignment transforms them into useful assistants, typically in three stages.
| Stage | Purpose | Technique |
|---|---|---|
| 1. SFT | Instruction following, format, conversational style | Supervised Fine-Tuning on instruction datasets |
| 2. Preference | Align with human values and preferences | DPO, KTO, SimPO (no reward model needed) |
| 3. RL | Push beyond the training data on reasoning tasks | GRPO, RLVR (reinforcement learning with verifiable rewards, e.g. math/code checkers) |
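A minimal sketch of what the SFT stage consumes: each instruction/response pair is rendered into one token sequence, with loss computed only on the response. The tag strings and the whitespace "tokenizer" below are placeholders, not any real model's chat template; real pipelines use the model's own template and tokenizer.

```python
def build_sft_example(instruction, response, tokenize=str.split):
    """Format one instruction/response pair for SFT with a loss mask.

    The <|user|>/<|assistant|>/<|end|> tags and whitespace tokenizer
    are illustrative placeholders only.
    """
    prompt = f"<|user|> {instruction} <|assistant|>"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response + " <|end|>")
    input_ids = prompt_ids + response_ids
    # Loss is computed only where the mask is 1: the model learns to
    # produce the answer, not to regenerate the prompt.
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask
```

Masking the prompt tokens is the standard choice: it teaches the model the assistant role and output format without wasting gradient signal on text it will always be given.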
The alignment methods themselves trade off complexity against memory footprint:

| Method | Complexity | Memory | Best For |
|---|---|---|---|
| RLHF (PPO) | High (needs reward model + critic) | ~4x model size | Classic, proven approach |
| DPO | Low (direct from preference pairs) | ~2x model size | Simple, stable preference alignment |
| GRPO | Medium (group-wise comparison) | ~2x model size | Reasoning, no critic needed |
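To make DPO's "low complexity" concrete, here is its per-pair loss in plain Python. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; `beta=0.1` is a common default, not a universal constant.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are log-probabilities (summed over response tokens) under
    the trainable policy (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

No reward model, no critic, no sampling loop: the loss is computed directly from four log-probabilities per preference pair, which is why DPO needs only the policy and reference models (~2x memory).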
GRPO (popularized by DeepSeek-R1) generates multiple answers per prompt, scores each one, and uses its reward relative to the group's mean as the advantage, which eliminates the separate "critic" (value) model that PPO requires.
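The group-relative comparison can be sketched in a few lines: each sampled answer's advantage is its reward standardized against its own group's mean and standard deviation, replacing PPO's learned value baseline.

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages as in GRPO.

    `rewards` holds one scalar reward per sampled answer for the same
    prompt. Each advantage is the reward standardized within the group,
    so no learned critic is needed as a baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All answers scored the same: no signal to prefer any of them.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With verifiable rewards (e.g. 1 if a math answer checks out, 0 otherwise), correct answers in a mixed group get positive advantages and incorrect ones negative, so the policy gradient pushes probability toward the answers that passed.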