Understanding quantization theory is one thing — actually converting models is another. This lesson walks through the complete practical workflow.
The llama.cpp project provides convert_hf_to_gguf.py to convert Hugging Face models to the GGUF format:
# Clone llama.cpp and install dependencies
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt
# Convert HF model to GGUF (FP16 baseline)
python convert_hf_to_gguf.py /path/to/hf-model/ \
--outfile model-f16.gguf \
--outtype f16
Once you have the FP16 GGUF, quantize it to your target precision:
# Build llama.cpp (with CUDA support)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Quantize to Q4_K_M (recommended default)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
# Quantize to Q6_K (higher quality)
./build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K
# Quantize with importance matrix (imatrix) for better quality
./build/bin/llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat
./build/bin/llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
For GPU-optimized quantization, use auto-gptq with a calibration dataset:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-4")
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=True, # Activation-order quantization
)
# Load model and prepare calibration data
model = AutoGPTQForCausalLM.from_pretrained(
"mistralai/Mistral-Small-4", quantize_config
)
# Use domain-specific calibration data for best results
calib_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:256]")
calib_texts = [tokenizer(t["text"], return_tensors="pt", max_length=2048, truncation=True)
for t in calib_data if len(t["text"]) > 100]
model.quantize(calib_texts)
model.save_quantized("./Mistral-Small-4-GPTQ")
| Your Situation | Recommended Quant | Why |
|---|---|---|
| Plenty of VRAM | Q6_K or Q8_0 | Minimal quality loss, best results |
| Consumer GPU (24GB) | Q4_K_M | Best quality/size balance (golden standard) |
| Low VRAM (8-12GB) | IQ4_XS with imatrix | Importance-weighted compression preserves critical weights |
| Production GPU serving | AWQ 4-bit | Optimized for vLLM/TGI throughput |
| Extreme constraints | Q3_K_S (with caution) | Last resort — test quality carefully |