
llama.cpp Deep Dive

llama.cpp Engine · 15 min · 125 BASE XP

The Universal Inference Engine

llama.cpp is the industry-standard engine for running LLMs on any hardware — from Raspberry Pis to multi-GPU servers.

Architecture

  • GGML: Custom tensor library optimized for quantized inference
  • GGUF: Single-file model format (weights plus metadata) used by all llama.cpp models
  • Backends: CPU (AVX/NEON), CUDA, Metal, ROCm, Vulkan, SYCL (Intel GPUs)
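GGUF files start with a small fixed header that is easy to inspect directly. As a minimal sketch based on the GGUF spec (a `GGUF` magic, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count), a header reader might look like:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor_count, uint64 metadata_kv_count.
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

This only parses the header; the metadata key/value section and tensor index that follow have their own typed encodings, for which llama.cpp ships full reader scripts.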

Inference Optimization Techniques

| Technique | What It Does | Speedup |
|---|---|---|
| GPU Layer Offloading | Offload N layers to GPU, run the rest on CPU | 2-10x vs CPU-only |
| Speculative Decoding | Draft model proposes tokens, main model verifies | 1.5-3x throughput |
| Speculative Checkpointing | Extends speculative decoding to MoE models | Variable (MoE-specific) |
| Flash Attention | Memory-efficient attention computation | 2x+ for long contexts |
| Batch Processing | Process multiple requests simultaneously | Linear with batch size |
| Mmap Loading | Memory-map model files for instant cold start | Near-zero startup |
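To make the speculative decoding row concrete, here is a toy sketch of the draft/verify loop. `draft` and `target` are hypothetical stand-ins (callables mapping a token list to the next token, greedy and deterministic); a real implementation verifies all k draft tokens in a single batched forward pass of the target model, which is where the speedup comes from.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Toy speculative decoding: cheap `draft` proposes k tokens,
    `target` confirms a prefix of them; on the first mismatch the
    target's own token is kept and drafting restarts."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # 2. Verify: compare each proposal with the target's greedy choice.
        accepted = []
        for t in proposal:
            expect = target(tokens + accepted)
            if expect == t:
                accepted.append(t)          # draft token confirmed
            else:
                accepted.append(expect)     # keep the target's correction
                break
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new]
```

Note that the output is identical to greedy decoding with `target` alone; the draft model only changes how much target-model work is needed per emitted token.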

llama-server (HTTP API)

# Basic server
llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Optimized production server:
#   -ngl 99           offload all layers to GPU
#   --ctx-size 32768  context window (tokens)
#   -np 4             4 parallel request slots
#   --flash-attn      enable Flash Attention
#   --cont-batching   continuous batching
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --ctx-size 32768 -np 4 \
  --flash-attn --cont-batching

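Once running, llama-server speaks an OpenAI-compatible API at `/v1/chat/completions`. A minimal stdlib-only client sketch, assuming a server like the one above is listening on localhost:8080:

```python
import json
import urllib.request

def build_payload(prompt, temperature=0.2, max_tokens=256):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080", timeout=120):
    """POST one chat turn to llama-server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the API is OpenAI-compatible, existing OpenAI SDK clients can also be pointed at the server by overriding the base URL.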
MCP Integration

llama-server supports tool/function calling through its OpenAI-compatible chat API (start the server with --jinja to enable chat-template-based tool-call parsing), which lets Model Context Protocol clients and agent frameworks drive tools through your local model.
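As a sketch, a tool-calling request uses the OpenAI function-calling schema; the `get_weather` tool below is a made-up example, and the request body would be POSTed to `/v1/chat/completions` as in any other chat turn:

```python
# Hypothetical example tool in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # made-up tool name for illustration
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def tool_request(prompt, tools):
    """Build a chat request offering the model a set of callable tools."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }
```

If the model decides to call a tool, the response contains a `tool_calls` entry with the function name and JSON arguments; the client executes the tool and sends the result back as a `tool` role message.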

🐳 Production Container:
docker run -d --gpus all \
  -v ./models:/models \
  -p 8080:8080 \
  --name llama-server \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/mistral-large-Q4_K_M.gguf \
  --host 0.0.0.0 -ngl 99 --flash-attn \
  -np 8 --cont-batching
KNOWLEDGE CHECK
QUERY 1 // 2

What does the -ngl 99 flag do in llama-server?

  • Limits output to 99 tokens
  • Sets 99 parallel slots
  • Offloads all layers to GPU
  • Sets context to 99K