
The Alignment Stack: RLHF → DPO → GRPO

🏗️ Training & Alignment · 12 min · 150 BASE XP

Modern Post-Training Pipeline

Raw pre-trained models are "completion engines" — they continue text, not follow instructions. Alignment transforms them into useful assistants.

The Three Stages

| Stage | Purpose | Technique |
| --- | --- | --- |
| 1. SFT | Instruction following, format, conversational style | Supervised Fine-Tuning on instruction datasets |
| 2. Preference | Align with human values and preferences | DPO, KTO, SimPO (no reward model needed) |
| 3. RL | Push beyond training data for reasoning | GRPO, RLVR (for math/code verification) |
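To make the SFT stage concrete, here is a minimal, toy sketch of its core objective: cross-entropy over the response tokens only, with the prompt tokens masked out so the model learns to answer instructions rather than reproduce them. The function name, the masking scheme, and the log-probability values are illustrative assumptions, not from any specific library.

```python
def sft_loss(token_logprobs, prompt_len):
    """Toy SFT objective: average negative log-likelihood over the
    response tokens only. Prompt tokens are masked out of the loss,
    a common choice when fine-tuning on instruction datasets."""
    response_lps = token_logprobs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Hypothetical per-token log-probs for "<prompt> <response>":
# the first 2 tokens are the prompt and contribute nothing to the loss.
logprobs = [-0.9, -1.1, -0.2, -0.4, -0.3]
loss = sft_loss(logprobs, prompt_len=2)  # averages only the last 3 tokens
```

In real pipelines this masking is usually handled by the training framework (e.g. via label masking), but the objective being minimized is the same.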

Technique Comparison

| Method | Complexity | Memory | Best For |
| --- | --- | --- | --- |
| RLHF (PPO) | High (needs reward model + critic) | ~4x model size | Classic, proven approach |
| DPO | Low (direct from preference pairs) | ~2x model size | Simple, stable preference alignment |
| GRPO | Medium (group-wise comparison) | ~2x model size | Reasoning, no critic needed |
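The reason DPO is listed as low-complexity is that its loss can be computed directly from preference pairs, with no reward model or critic in the loop. A pure-Python sketch of that loss for a single pair (the function name and all numeric inputs are hypothetical):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of
    each completion under the policy (pi_*) and a frozen reference
    model (ref_*). The loss pushes the policy's chosen-vs-rejected
    margin above the reference model's margin, scaled by beta."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Hypothetical log-probs: the policy already slightly prefers the chosen answer
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-13.0)
```

Only two model copies are needed (policy and frozen reference), which is where the ~2x memory figure in the table comes from.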

GRPO (popularized by DeepSeek-R1) generates multiple answers per prompt, compares them within the group, and optimizes accordingly — eliminating the separate "critic" model that PPO requires.
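The group-wise comparison can be sketched in a few lines: each sampled answer is scored against the mean and standard deviation of its own group, which serves as the baseline a learned critic would otherwise provide. The rewards below are hypothetical 0/1 verifier outcomes (e.g. a math answer checked for correctness).

```python
def group_advantages(rewards):
    """GRPO's critic-free baseline: normalize each answer's reward by
    the mean and std of its sampling group. Answers better than the
    group average get positive advantage, worse ones negative."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid divide-by-zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt; two pass verification, two fail
advs = group_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight the policy-gradient update, so no separate value network ever has to be trained or kept in memory.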

💡 2026 Consensus: The era of one-size-fits-all alignment is over. Modern stacks are modular: SFT for format → DPO for preferences → GRPO for reasoning. Mix and match based on your use case.
KNOWLEDGE CHECK (1 of 2)

What does DPO eliminate compared to RLHF?
- The training data
- The need for a separate reward model
- The base model
- The tokenizer