A 70B-parameter model in FP16 requires ~140GB of VRAM (two bytes per weight). Quantization stores weights at lower precision, shrinking the memory footprint so the model fits on smaller hardware while preserving most of its quality.
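The arithmetic behind that figure is just parameter count times bytes per weight. A rough weight-only estimate (ignoring KV cache, activations, and runtime overhead, which add several GB in practice):

```python
# Weight-only memory estimate: params * (bits per weight / 8) bytes.
# Ignores KV cache, activations, and framework overhead.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1e9 params * bits/8 bytes = GB

print(weight_gb(70, 16))   # FP16    -> 140.0 GB
print(weight_gb(70, 4.8))  # ~Q4_K_M -> 42.0 GB (K-quants average ~4.8 bits)
```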
| Format | Best For | Key Advantage | Hardware |
|---|---|---|---|
| GGUF | Local / CPU / hybrid | Runs on anything (CPU, Mac, consumer GPU) | Universal |
| AWQ | Production GPU serving | Best quality at 4-bit, vLLM optimized | NVIDIA GPUs |
| GPTQ | Broad GPU inference | Wide ecosystem support, mature | NVIDIA GPUs |
| EXL2 | Maximum speed (single GPU) | Lowest latency for local high-end setups | High-end NVIDIA |
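For the production GPU path, loading an AWQ checkpoint with vLLM takes a few lines. A minimal sketch, assuming an example checkpoint id (`TheBloke/Llama-2-70B-AWQ`) and two GPUs; both are illustrative, not requirements:

```python
from vllm import LLM, SamplingParams

# Example AWQ checkpoint; substitute your own quantized model.
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",      # select vLLM's AWQ kernels
    tensor_parallel_size=2,  # shard across 2 GPUs; omit for a single GPU
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Recent vLLM versions can usually also infer the quantization method from the checkpoint's config, in which case the explicit `quantization` flag is optional.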
On the GGUF side, the quantization level sets the size/quality trade-off. Approximate weight-only footprints for a 70B model:

| GGUF Quant | Bits/Weight | Quality | 70B VRAM |
|---|---|---|---|
| Q8_0 | 8-bit | Near-lossless | ~70GB |
| Q6_K | 6-bit | Excellent | ~54GB |
| Q4_K_M | 4-bit | Great (recommended) | ~40GB |
| Q3_K_S | 3-bit | Acceptable | ~30GB |
| Q2_K | 2-bit | Quality cliff ⚠️ | ~20GB |
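A Q4_K_M file like the one recommended above can be run locally with llama-cpp-python, including hybrid CPU/GPU execution. A minimal sketch, assuming a placeholder model path and a build of the package with GPU support:

```python
from llama_cpp import Llama

# Placeholder path to a Q4_K_M quantized GGUF file.
llm = Llama(
    model_path="./llama-70b-Q4_K_M.gguf",
    n_gpu_layers=40,  # offload 40 layers to the GPU; the rest run on CPU
    n_ctx=4096,       # context window
)

out = llm("Q: What is quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```

Setting `n_gpu_layers=-1` offloads every layer; a smaller value is how you split a model that doesn't fully fit in VRAM between GPU and CPU.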
Post-training quantization quality depends heavily on the calibration data used to fit the quantization scales. For domain-specific use (medical, legal, coding), always calibrate with a sample of your actual production data rather than a generic corpus; the sketch below shows what this looks like in practice.
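One library that exposes calibration directly is AutoAWQ, which accepts raw strings as calibration data. A minimal sketch, assuming example model paths and a hypothetical `load_domain_samples()` helper (not a real API):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "meta-llama/Llama-2-70b-hf"  # example base model
QUANT_PATH = "llama-2-70b-awq-medical"    # example output directory

def load_domain_samples() -> list[str]:
    # Placeholder: return a few hundred representative documents from
    # your production distribution (clinical notes, contracts, code),
    # not a generic web corpus.
    return ["Patient presents with acute chest pain radiating to..."]

model = AutoAWQForCausalLM.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# AutoAWQ accepts a list of raw strings as calibration data.
model.quantize(
    tokenizer,
    quant_config={"w_bit": 4, "q_group_size": 128,
                  "zero_point": True, "version": "GEMM"},
    calib_data=load_domain_samples(),
)

model.save_quantized(QUANT_PATH)
tokenizer.save_pretrained(QUANT_PATH)
```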