
Practical Quantization Workflow

Quantization Mastery | 14 min | 125 BASE XP

From Hugging Face to Quantized Model

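Understanding quantization theory is one thing; actually converting models is another. This lesson walks through the practical workflow end to end: converting a Hugging Face checkpoint to GGUF, quantizing it with llama.cpp, quantizing with GPTQ as a GPU-oriented alternative, and choosing a quant level.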

Step 1: HF Safetensors → GGUF

The llama.cpp project provides convert_hf_to_gguf.py to convert Hugging Face models to the GGUF format:

# Clone llama.cpp and install dependencies
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert HF model to GGUF (FP16 baseline)
python convert_hf_to_gguf.py /path/to/hf-model/ \
  --outfile model-f16.gguf \
  --outtype f16
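
Before moving on to quantization, it can be worth confirming that the converter actually produced a GGUF file. The sketch below reads the fixed-size header with the standard library only. It assumes GGUF version 2 or later, where the tensor and metadata counts are 64-bit little-endian integers; the function name read_gguf_header is illustrative and not part of llama.cpp.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Every GGUF file starts with the 4-byte magic "GGUF"
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
    # (assumes GGUF v2+; v1 stored the counts as 32-bit values)
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling read_gguf_header("model-f16.gguf") on the Step 1 output should return a small version number and a non-zero tensor count; a ValueError suggests the conversion did not complete.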

Step 2: Quantize with llama-quantize

Once you have the FP16 GGUF, quantize it to your target precision:

# Build llama.cpp (with CUDA support)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Quantize to Q4_K_M (recommended default)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Quantize to Q6_K (higher quality)
./build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K

# Quantize with importance matrix (imatrix) for better quality
./build/bin/llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat
./build/bin/llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
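
When producing several quant levels from the same FP16 file, the commands above can be generated rather than typed by hand. This is a minimal sketch: quantize_command is a hypothetical helper, and the output naming (replacing the -f16 suffix with the quant type) is an assumption that simply mirrors the file names used in this lesson. It only builds the argument list and does not run anything.

```python
from pathlib import Path

def quantize_command(f16_path, quant_type, imatrix=None,
                     binary="./build/bin/llama-quantize"):
    """Build the llama-quantize argument list for one target quant type."""
    src = Path(f16_path)
    # model-f16.gguf -> model-Q4_K_M.gguf (naming convention assumed from this lesson)
    stem = src.stem[:-len("-f16")] if src.stem.endswith("-f16") else src.stem
    out = src.with_name(f"{stem}-{quant_type}.gguf")
    cmd = [binary]
    if imatrix is not None:
        # Optional importance matrix produced earlier by llama-imatrix
        cmd += ["--imatrix", str(imatrix)]
    cmd += [str(src), str(out), quant_type]
    return cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True), for example in a loop over ["Q4_K_M", "Q6_K"], so a failed quantization stops the batch instead of being silently skipped.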

GPTQ Quantization with auto-gptq

For GPU-optimized quantization, use auto-gptq with a calibration dataset:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-4")
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # Activation-order quantization
)

# Load model and prepare calibration data
model = AutoGPTQForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-4", quantize_config
)

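# WikiText is a generic placeholder here; use domain-specific calibration data for best results
calib_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Many WikiText rows are blank lines or headings, so filter first, then keep 256 samples
texts = [t["text"] for t in calib_data if len(t["text"]) > 100][:256]
calib_texts = [tokenizer(t, return_tensors="pt", max_length=2048, truncation=True)
               for t in texts]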

model.quantize(calib_texts)
model.save_quantized("./Mistral-Small-4-GPTQ")
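
To make the bits and group_size settings above more concrete, here is a rough pure-Python sketch of group-wise 4-bit quantization. It is plain round-to-nearest, not the GPTQ algorithm: GPTQ additionally uses the calibration data to compensate for rounding error. What the two share is the storage idea, where every group of group_size weights gets its own scale and offset. All function names are illustrative.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = (1 << bits) - 1                 # 15 integer steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    # Map each weight to an integer code in [0, levels]
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + lo for c in codes]

def quantize_row(row, bits=4, group_size=128):
    """Split a weight row into groups, each with its own scale and offset."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

A smaller group_size means more scales to store but a tighter value range per group, so each weight is reconstructed to within half of a smaller step.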

Choosing the Right Quant Level

Your Situation | Recommended Quant | Why
Plenty of VRAM | Q6_K or Q8_0 | Minimal quality loss, best results
Consumer GPU (24GB) | Q4_K_M | Best quality/size balance (the gold standard)
Low VRAM (8-12GB) | IQ4_XS with imatrix | Importance-weighted compression preserves critical weights
Production GPU serving | AWQ 4-bit | Optimized for vLLM/TGI throughput
Extreme constraints | Q3_K_S (with caution) | Last resort; test quality carefully
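
To check which row applies to a given model, a back-of-the-envelope size estimate helps. The sketch below derives bits per weight from the llama.cpp block layouts for a few base types. It covers weights only, ignoring file metadata, the KV cache, and activations, and mixed presets such as Q4_K_M keep some tensors at higher precision, so real files come out somewhat larger than the pure Q4_K figure. Treat the result as a rough guide, not a guarantee that a model fits.

```python
# Bits per weight = bytes per block * 8 / weights per block
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 34 * 8 / 32,     # 8.5    (32 int8 values + one fp16 scale)
    "Q6_K": 210 * 8 / 256,   # 6.5625 (256-weight super-block)
    "Q4_K": 144 * 8 / 256,   # 4.5    (256-weight super-block)
}

def estimate_gib(n_params, quant_type):
    """Approximate weight storage in GiB for a model with n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 2**30
```

For example, estimate_gib(7e9, "Q4_K") gives the approximate weight footprint of a 7-billion-parameter model; leave headroom on top of that number for context and runtime buffers.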
⚠️ Calibration Matters: For domain-specific deployments (medical, legal, coding), always quantize with calibration data from your actual domain. Generic calibration (WikiText) may degrade quality on specialized tasks. Even 256 representative samples can significantly improve quantized model quality.
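
One way to assemble such a calibration set is sketched below, assuming your domain documents are already available as a list of strings. The helper name build_calibration_text, the 100-character minimum, and the default of 256 samples are illustrative choices that mirror numbers used earlier in this lesson, not requirements of any tool.

```python
import random

def build_calibration_text(documents, n_samples=256, min_chars=100, seed=0):
    """Pick up to n_samples sufficiently long documents and join them into one text blob."""
    # Drop fragments too short to be representative of the domain
    usable = [d.strip() for d in documents if len(d.strip()) >= min_chars]
    rng = random.Random(seed)   # fixed seed keeps the calibration set reproducible
    rng.shuffle(usable)
    return "\n\n".join(usable[:n_samples])
```

The returned string can be written to calibration_data.txt and passed to llama-imatrix with -f; for GPTQ, the selected documents would be tokenized individually instead of joined.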
KNOWLEDGE CHECK
Query 1 of 3: What is the first step in creating a GGUF quantized model?
A. Run llama-quantize directly on the HF model
B. Convert HF safetensors to an FP16 GGUF with convert_hf_to_gguf.py
C. Download a pre-quantized model
D. Install Ollama
Practical Quantization Workflow | Quantization Mastery — Open Source AI Academy