Ollama is the easiest way to run open-source models locally. It handles downloading, quantization, GPU detection, and API serving automatically.
```bash
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (auto-downloads on first use)
ollama run llama4-scout
ollama run mistral-large
ollama run qwen3.5:32b
ollama run gemma4:4b
```
Ollama exposes an API on localhost:11434 that's compatible with the OpenAI SDK — just change the base URL:
```python
from openai import OpenAI

# Point the SDK at the local Ollama server; the api_key is required
# by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral-large",
    messages=[{"role": "user", "content": "Explain Docker networking"}],
)
print(response.choices[0].message.content)
```
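Under the hood, the SDK is hitting Ollama's native `/api/chat` endpoint, which takes a JSON body with the model name, a message list, and optional sampling parameters. A minimal sketch of building that body by hand — the `chat_payload` helper is our own, not part of any library:

```python
import json

def chat_payload(model, prompt, temperature=0.7, stream=False):
    """Build the JSON body Ollama's POST /api/chat endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # False = return one complete response
        "options": {"temperature": temperature},
    }

body = json.dumps(chat_payload("mistral-large", "Explain Docker networking"))
```

POST that body to `http://localhost:11434/api/chat` with any HTTP client if you'd rather not pull in the OpenAI SDK.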
```
# Modelfile
FROM mistral-small:latest
SYSTEM "You are a senior DevOps engineer. Always provide Docker and Kubernetes examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
```
Build it with `ollama create devops-assistant -f Modelfile`.
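Since a Modelfile is just text, you can template it in a script before running `ollama create` — handy if you maintain several assistants. A sketch under that idea; the `make_modelfile` helper is our own invention:

```python
def make_modelfile(base, system, **params):
    """Render Modelfile text from a base model, system prompt, and PARAMETER pairs."""
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"

text = make_modelfile(
    "mistral-small:latest",
    "You are a senior DevOps engineer. Always provide Docker and Kubernetes examples.",
    temperature=0.3,
    num_ctx=32768,
)
# Write it out, then run: ollama create devops-assistant -f Modelfile
with open("Modelfile", "w") as f:
    f.write(text)
```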
```bash
# Run Ollama in Docker with GPU access and a persistent model volume
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull a model inside the container
docker exec ollama ollama pull mistral-large
```

The OpenAI-compatible endpoint is then available at http://host:11434/v1/chat/completions.
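A quick way to confirm the container is actually serving is to query `/api/tags`, which lists the installed models. A minimal sketch using only the standard library; `list_models` is our own helper name:

```python
import json
import urllib.error
import urllib.request

def list_models(base="http://localhost:11434"):
    """Return installed model names from Ollama's /api/tags, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None
```

Once the pull finishes, `list_models()` should include the model you pulled; `None` means the server isn't reachable yet.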