
Quantization Formats Compared

📐 Quantization Mastery · 12 min · 125 BASE XP

Why Quantize?

A 70B-parameter model in FP16 requires ~140GB of VRAM: 70 billion weights × 2 bytes each, before counting the KV cache and activations. Quantization reduces weight precision to fit models on smaller hardware while preserving most of their quality.
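
The arithmetic is straightforward: parameter count × bits per weight ÷ 8 gives bytes of weight memory. A minimal Python sketch, illustrative only, since real deployments add KV cache, activations, and per-block quantization overhead on top:

```python
# Back-of-the-envelope weight memory for a 70B model at various precisions.
# Illustrative only: ignores KV cache, activations, and the small per-block
# scale/zero-point overhead that real quant formats carry.
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {label:4}: ~{weight_vram_gb(70e9, bits):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ INT8: ~70 GB
# 70B @ INT4: ~35 GB
```

The GGUF tier table below lists ~40GB rather than 35GB for a 4-bit 70B model; the gap comes from block scales and from K-quants keeping some tensors at higher precision.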

The Decision Matrix

| Format | Best For | Key Advantage | Hardware |
|--------|----------|---------------|----------|
| GGUF | Local / CPU / hybrid | Runs on anything (CPU, Mac, consumer GPU) | Universal |
| AWQ | Production GPU serving | Best quality at 4-bit, vLLM optimized | NVIDIA GPUs |
| GPTQ | Broad GPU inference | Wide ecosystem support, mature | NVIDIA GPUs |
| EXL2 | Maximum speed (single GPU) | Lowest latency for local high-end setups | High-end NVIDIA |
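
To make the AWQ row concrete, here is a minimal serving sketch with vLLM. The checkpoint name is a placeholder, and the call assumes vLLM's `LLM` constructor with its `quantization` flag:

```python
# Minimal sketch: serving an AWQ-quantized checkpoint with vLLM.
# The model ID below is a placeholder -- substitute any AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-model-AWQ",  # hypothetical AWQ checkpoint
    quantization="awq",               # select vLLM's AWQ kernels
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain 4-bit quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```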

GGUF Quality Tiers

| Quant | Bits/Weight | Quality | 70B VRAM |
|-------|-------------|---------|----------|
| Q8_0 | 8-bit | Near-lossless | ~70GB |
| Q6_K | 6-bit | Excellent | ~54GB |
| Q4_K_M | 4-bit | Great (recommended) | ~40GB |
| Q3_K_S | 3-bit | Acceptable | ~30GB |
| Q2_K | 2-bit | Quality cliff ⚠️ | ~20GB |

⚠️ The 4-Bit Rule: In 2026, 4-bit quantization is the industry standard. Going below 3-bit causes significant quality degradation (the "quality cliff"). If you have VRAM headroom, prefer Q6_K.
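
To run one of these tiers locally, a minimal llama-cpp-python sketch; the file path is a placeholder for any downloaded Q4_K_M GGUF:

```python
# Minimal sketch: running a GGUF quant with llama-cpp-python.
# The path is a placeholder -- point it at any downloaded Q4_K_M file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-70b-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload every layer to the GPU if one is present
    n_ctx=4096,        # context window
)
out = llm("Q: What does Q4_K_M mean? A:", max_tokens=64)
print(out["choices"][0]["text"])
```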

Calibration Best Practice

Post-training quantization quality depends on calibration data. For domain-specific use (medical, legal, coding), always calibrate with a sample of your actual production data rather than generic datasets.
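
A hedged sketch of domain calibration with AutoAWQ follows. The model ID and corpus file are placeholders, and it assumes AutoAWQ's `quantize()` accepts a `calib_data` list of strings, as recent versions do:

```python
# Minimal sketch: AWQ quantization calibrated on domain text.
# Model ID and corpus path are placeholders; assumes AutoAWQ's quantize()
# accepts a calib_data list of strings (check your installed version).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "some-org/some-model"               # hypothetical base model
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pull calibration samples from your own production data, not a generic set.
with open("domain_samples.txt") as f:          # hypothetical domain corpus
    calib_data = [line.strip() for line in f if line.strip()][:512]

quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized("model-awq-domain")
tokenizer.save_pretrained("model-awq-domain")
```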

KNOWLEDGE CHECK
QUERY 1 // 3
Which quantization format is recommended for production GPU serving with vLLM?
GGUF
AWQ
EXL2
FP16