llama.cpp is one of the most widely deployed inference engines for running LLMs on commodity hardware, from Raspberry Pis to multi-GPU servers.
| Technique | What It Does | Typical Gain |
|---|---|---|
| GPU Layer Offloading | Offload N layers to GPU, rest on CPU | 2-10x vs CPU-only |
| Speculative Decoding | Draft model proposes tokens, main model verifies | 1.5-3x throughput |
| Speculative Checkpointing | Extends speculative decoding to MoE models | Variable (MoE-specific) |
| Flash Attention | Memory-efficient attention computation | 2x+ for long contexts |
| Batch Processing | Process multiple requests simultaneously | Near-linear with batch size until compute-bound |
| Mmap Loading | Memory-map model files (instant cold start) | Near-zero startup |
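The speculative decoding row above is worth unpacking: a cheap draft model proposes several tokens, and the expensive target model verifies them all in a single forward pass, accepting the prefix that matches. A toy sketch in pure Python (both "models" are stand-in functions, not real LLMs; token IDs are plain integers):

```python
# Toy illustration of speculative decoding. The draft model is cheap but
# imperfect; the target model checks all k proposed positions in one pass
# and tokens are accepted up to the first disagreement.

def draft_model(prefix, k):
    # Cheap stand-in: greedily propose k next tokens, but (deliberately)
    # wrong past token 5, to show a rejected draft.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = min(last + 1, 5)
        out.append(last)
    return out

def target_model(prefix):
    # Expensive stand-in: scores the whole sequence in ONE pass and
    # returns its greedy choice after every position (rule: prev + 1).
    return [t + 1 for t in prefix]

def speculative_step(prefix, k=4):
    proposal = draft_model(prefix, k)
    extended = prefix + proposal
    # Single verification pass covers all k proposed positions at once.
    verified = target_model(extended[:-1])[len(prefix) - 1:]
    accepted = []
    for p, v in zip(proposal, verified):
        if p != v:
            accepted.append(v)  # first mismatch: keep target's token, stop
            break
        accepted.append(p)
    return prefix + accepted

print(speculative_step([1, 2, 3]))  # -> [1, 2, 3, 4, 5, 6]
```

Three of the four drafted tokens survive verification here; the win is that one target-model pass replaced several sequential decode steps.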
```bash
# Basic server
llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Optimized production server:
#   -ngl 99           offload all layers to the GPU
#   --ctx-size 32768  context window
#   -np 4             4 parallel request slots
#   --flash-attn      enable Flash Attention
#   --cont-batching   continuous batching across slots
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --ctx-size 32768 -np 4 \
  --flash-attn --cont-batching
```
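Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only Python client for the server configured above (assumes it is listening on localhost:8080; the `"model"` field is included for client compatibility and is not used to select a model on a single-model server):

```python
# Minimal client for llama-server's OpenAI-compatible chat endpoint.
import json
import urllib.request

def build_chat_request(prompt, max_tokens=128):
    # Standard OpenAI chat-completions payload.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, host="http://localhost:8080"):
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Explain GGUF in one sentence.")  # needs a running llama-server
```

Because the API is OpenAI-compatible, existing SDKs work too by pointing their base URL at the server.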
llama-server also supports tool calling through its OpenAI-compatible chat API, which lets Model Context Protocol clients drive tools from your local model.
```bash
# Use the CUDA server image so --gpus all has a GPU backend to use;
# bind mounts need an absolute host path.
docker run -d --gpus all \
  -v "$(pwd)/models":/models \
  -p 8080:8080 \
  --name llama-server \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/mistral-large-Q4_K_M.gguf \
  --host 0.0.0.0 -ngl 99 --flash-attn \
  -np 8 --cont-batching
```
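Large models take a while to load inside the container, and llama-server answers `/health` with 503 until the model is ready. A small readiness probe, assuming the port mapping above:

```python
# Poll llama-server's /health endpoint until the model has loaded.
# Returns False if the server never becomes ready within the timeout.
import time
import urllib.error
import urllib.request

def wait_ready(base="http://localhost:8080", timeout=120.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/health", timeout=5) as r:
                if r.status == 200:  # 503 while the model is still loading
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container may still be starting up
        time.sleep(2)
    return False
```

Gating deployment scripts on this probe avoids sending requests that would fail or queue while the model file is still being mapped into memory.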