Llama models are available on Hugging Face and can be run via multiple engines:
| Method | Command | Best For |
|---|---|---|
| Ollama | `ollama run llama4:scout` | Quick local experimentation |
| llama.cpp | `llama-server -m scout-Q4.gguf` | CPU/hybrid inference, max flexibility |
| vLLM | `vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct` | Production GPU serving |
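Both `llama-server` and vLLM expose an OpenAI-compatible HTTP API, so any of the serving options above can be queried the same way. A minimal stdlib sketch, assuming a server listening at `localhost:8080` (the URL and model name are assumptions; adjust to your deployment):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def send(url: str, payload: dict) -> dict:
    """POST the payload as JSON and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("llama4-scout", "Say hello in one word.")
# Against a running server (hypothetical local endpoint):
# reply = send("http://localhost:8080/v1/chat/completions", payload)
# print(reply["choices"][0]["message"]["content"])
```

Because the request shape is the same across engines, switching from a local llama.cpp server to a vLLM deployment only changes the URL and model name.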
Choose your quality vs memory tradeoff by picking a GGUF quantization level: lower-bit quants (e.g. Q4) cut memory use at some quality cost, while higher-bit quants (e.g. Q8) stay closer to full precision. For example, to serve a Q4 quant with the llama.cpp server image in Docker:

```
docker run --gpus all -v ./models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/scout-Q4.gguf --host 0.0.0.0
```
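To pick a quant level, a back-of-envelope estimate helps: weight memory is roughly parameter count times effective bits per weight, divided by 8 (KV cache and activation overhead are ignored here). The parameter count and per-quant bit widths below are illustrative assumptions; check the model card and GGUF file sizes for real numbers.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8

# Illustrative: ~109B total parameters, approximate effective bits per quant.
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{weight_memory_gb(109, bits):.0f} GB of weights")
```

The estimate makes the tradeoff concrete: a Q4 quant needs roughly a third of the memory of full F16 weights, which is often the difference between fitting on one GPU and needing several.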