Full fine-tuning updates every one of a model's billions of parameters, which requires massive GPU clusters. PEFT (parameter-efficient fine-tuning) freezes the base model and trains only a tiny fraction of additional parameters.
LoRA (Low-Rank Adaptation) injects small trainable matrices into frozen model layers. Instead of updating a 4096×4096 weight matrix (~16.8M parameters), you train two low-rank matrices (e.g., 4096×16 and 16×4096, ~131K parameters), cutting trainable parameters by more than 99%.
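The arithmetic behind that reduction is worth seeing once; a quick sketch using the example dimensions above (hidden size 4096, rank 16):

```python
d, r = 4096, 16  # hidden size and LoRA rank from the example above

full_params = d * d            # parameters in the original weight matrix
lora_params = d * r + r * d    # the two low-rank factors: A (d x r) and B (r x d)

print(full_params)   # 16777216
print(lora_params)   # 131072
print(f"{lora_params / full_params:.2%} of the original")  # 0.78%
```

At rank 16 the adapters are under 1% of the layer's original parameter count, and shrinking `r` shrinks them further.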
QLoRA goes further: quantize the frozen base to 4-bit (NF4), then apply LoRA on top. This cuts memory by ~75%:
| Method | 70B Model VRAM | Trainable Params |
|---|---|---|
| Full Fine-Tune | ~280GB (multi-GPU) | 70B (100%) |
| LoRA (FP16) | ~140GB | ~50M (0.07%) |
| QLoRA (4-bit) | ~36GB (1× A100) | ~50M (0.07%) |
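The VRAM column follows roughly from bytes per parameter. A back-of-envelope sketch, ignoring activations, KV cache, and quantization overhead (which is why the table's QLoRA figure is ~36GB rather than a flat 35GB); full fine-tuning needs far more because it also stores FP16 gradients and optimizer states:

```python
params = 70e9  # a 70B-parameter model

fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9   # NF4:  4 bits (0.5 bytes) per parameter

print(fp16_gb)  # 140.0
print(nf4_gb)   # 35.0
```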
Several frameworks make LoRA/QLoRA fine-tuning practical:

| Tool | Strength | Best For |
|---|---|---|
| Unsloth | 2-5x faster via hand-written Triton kernels | Speed and efficiency |
| Axolotl | YAML-driven config, multi-GPU | Reproducible, complex pipelines |
| HF trl | Official HF library for SFT + RLHF | Integration with HF ecosystem |
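To give a sense of the Axolotl workflow, a minimal QLoRA config might look like the sketch below. The key names follow Axolotl's documented schema, but the model, dataset path, and hyperparameters are illustrative placeholders, not a tested recipe:

```yaml
base_model: Qwen/Qwen3-8B   # placeholder; any HF causal LM
load_in_4bit: true
adapter: qlora

lora_r: 16
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

sequence_len: 8192
micro_batch_size: 2
num_epochs: 1
output_dir: ./outputs

datasets:
  - path: my_dataset.jsonl   # placeholder dataset
    type: alpaca
```

The YAML file fully describes the run, which is what makes Axolotl pipelines easy to version-control and reproduce.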
```python
from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher = more capacity, more trainable params
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
```