SEQ. 1

LoRA & QLoRA Explained

🎯 Fine-Tuning & LoRA · 15 min · 150 BASE XP

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all of a model's parameters, often billions of them, and so requires massive GPU clusters. Parameter-efficient fine-tuning (PEFT) freezes the base model and trains only a tiny fraction of additional parameters.

LoRA: Low-Rank Adaptation

LoRA injects small trainable matrices into frozen model layers. Instead of updating a full 4096×4096 weight matrix (~16.8M parameters), you train two small matrices (e.g., 4096×16 and 16×4096, ~131K parameters), cutting the trainable parameters for that layer by over 99%.
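The factorization above can be sketched numerically. This is a minimal illustration of the idea with NumPy (the alpha/r scaling and zero-initialized up-projection follow the LoRA paper; this is not any particular library's API):

```python
import numpy as np

d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight (4096×4096)
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection (16×4096)
B = np.zeros((d, r))                     # trainable up-projection, init to zero

alpha = 16
W_eff = W + (alpha / r) * (B @ A)        # effective weight seen at inference

full_params = W.size                     # 16,777,216
lora_params = A.size + B.size            # 131,072
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Because B starts at zero, the adapted model is exactly the base model at step 0; training only ever moves the 131K adapter parameters.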

QLoRA: Quantized LoRA

QLoRA goes further: quantize the frozen base model to 4-bit NormalFloat (NF4), then apply LoRA adapters on top. This cuts base-weight memory by ~75% compared to FP16:

Method         | 70B Model VRAM     | Trainable Params
Full Fine-Tune | ~280GB (multi-GPU) | 70B (100%)
LoRA (FP16)    | ~140GB             | ~50M (0.07%)
QLoRA (4-bit)  | ~36GB (1× A100)    | ~50M (0.07%)
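The weight-storage portion of these VRAM figures follows from simple arithmetic. This is a rough estimate only; real training also needs gradients, optimizer state, and activations, which is why the table's totals sit above the raw weight sizes computed here:

```python
params = 70e9  # 70B parameters

# Bytes per parameter: FP16 = 2 bytes, NF4 = 4 bits = 0.5 bytes
fp16_weights_gb = params * 2 / 1e9
nf4_weights_gb = params * 0.5 / 1e9

print(f"FP16 weights: {fp16_weights_gb:.0f} GB")  # 140 GB
print(f"NF4 weights:  {nf4_weights_gb:.0f} GB")   # 35 GB
```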

Tooling

Tool    | Strength                                    | Best For
Unsloth | 2-5x faster via hand-written Triton kernels | Speed and efficiency
Axolotl | YAML-driven config, multi-GPU               | Reproducible, complex pipelines
HF TRL  | Official HF library for SFT + RLHF          | Integration with HF ecosystem

Best Practices

  • Apply LoRA to all linear layers (q, k, v, o, gate, up, down) — not just attention
  • Data quality > quantity: 1000 high-quality examples often beat 100K noisy ones
  • After training, merge adapters into base model for zero-latency inference
  • Export merged model to GGUF for local deployment
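The adapter-merge step above can be illustrated numerically: once B·A is folded into the base weight, inference is a single matmul with no adapter overhead. A toy sketch with small matrices (not a specific library API):

```python
import numpy as np

d, r, alpha = 64, 4, 16
rng = np.random.default_rng(1)

W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trained LoRA matrices
B = rng.standard_normal((d, r))
x = rng.standard_normal(d)        # an input activation

# Adapter path: base matmul plus the low-rank detour (extra work per token)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Merged path: fold the adapter into W once, then serve a single matmul
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # same outputs, zero added latency
```

Because the merged weight has the same shape as the original, the merged model can be exported and quantized like any plain checkpoint.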
💡 Quick Example (Unsloth):
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model (the QLoRA setup)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters to all linear projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,             # scaling factor (alpha/r = 1 here)
    lora_dropout=0,
)
KNOWLEDGE CHECK
QUERY 1 // 3
What percentage of parameters does QLoRA typically train?
  • 100%
  • 10%
  • 1%
  • Less than 0.1%
LoRA & QLoRA Explained | Fine-Tuning & LoRA — Open Source AI Academy