Ollama is more than a chat interface: it is a local AI platform with embedding, vision, and multi-model capabilities.
Generate vector embeddings for RAG, search, and similarity matching:
# Generate embeddings via CLI
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["Explain quantum computing", "What is Docker?"]
}'
# Python with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.embeddings.create(
model="nomic-embed-text",
input=["Explain quantum computing"]
)
print(response.data[0].embedding[:5]) # [0.0123, -0.045, ...]
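For the "similarity matching" use case, the vectors returned by the embedding endpoint are usually compared with cosine similarity. The following is a minimal sketch using only the standard library, not a definitive implementation. The helper names (`embed`, `cosine_similarity`, `rank_by_similarity`) are hypothetical, and the code assumes the `/api/embed` response carries its vectors under an `embeddings` key.

```python
import json
import math
import urllib.request

# Same endpoint as the curl example; assumes a local Ollama server.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def embed(texts, model="nomic-embed-text", url=OLLAMA_EMBED_URL):
    """Request one embedding per input text from a running Ollama server."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)
```

With a server running and the model pulled, `vectors = embed(["What is Docker?", "Explain containers", "Explain quantum computing"])` followed by `rank_by_similarity(vectors[0], vectors[1:])` would order the last two texts by closeness to the first.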
Run multimodal models that understand images:
# Run a vision model
ollama run llama4:scout
# Send an image via the API
# The "images" field takes base64-encoded image data, not a file path
curl http://localhost:11434/api/chat -d '{
"model": "llama4:scout",
"messages": [{
"role": "user",
"content": "What is in this image?",
"images": ["<BASE64_IMAGE_DATA>"]
}]
}'
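The native chat API takes images as base64-encoded strings, so a client has to read and encode the file before sending it. Here is a minimal sketch of that step, assuming a local server and the non-streaming response shape (`message.content`). The function names are hypothetical, and `model` should be whatever vision model you have pulled.

```python
import base64
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_vision_payload(image_bytes, prompt, model):
    """Build an /api/chat request body with one base64-encoded image attached."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # ask for a single JSON object instead of a stream
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
    }


def describe_image(path, model, prompt="What is in this image?"):
    """Send a local image file to a running Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_vision_payload(f.read(), prompt, model)
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]
```

Keeping payload construction separate from the network call makes the encoding step easy to check without a server.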
Ollama can serve multiple models at once, loading them into VRAM and unloading them on demand:
# Pull multiple models
ollama pull mistral-large
ollama pull qwen3.5:32b
ollama pull nomic-embed-text
# Each request specifies which model to use
# Ollama manages GPU memory automatically
curl http://localhost:11434/v1/chat/completions -d '{
"model": "mistral-large",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Set OLLAMA_NUM_PARALLEL for concurrent requests per model
# Set OLLAMA_MAX_LOADED_MODELS to control how many stay in VRAM
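The two environment variables are read by the server process, so they need to be set before it starts. One way to do that from a script is sketched below, assuming `ollama` is on `PATH` and no other instance is already bound to the port. The function names and the example values (2 and 3) are illustrative, not recommendations.

```python
import os
import subprocess


def build_server_env(num_parallel, max_loaded_models, base_env=None):
    """Return a copy of the environment with Ollama's concurrency limits set."""
    if num_parallel < 1 or max_loaded_models < 1:
        raise ValueError("both limits must be positive integers")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_server(num_parallel=2, max_loaded_models=3):
    """Launch `ollama serve` as a child process with the limits applied."""
    env = build_server_env(num_parallel, max_loaded_models)
    return subprocess.Popen(["ollama", "serve"], env=env)
```

If Ollama already runs as a desktop app or system service, the variables would instead have to be set in that service's environment.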
Import any GGUF model directly into Ollama:
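# Method 1: Minimal Modelfile (the -f flag expects a Modelfile, not the GGUF file itself)
echo "FROM /path/to/model.gguf" > Modelfile.minimal
ollama create my-model -f Modelfile.minimal
# Method 2: Full Modelfile with system prompt, parameters, and template
# Modelfile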
FROM ./mistral-custom-Q4_K_M.gguf
SYSTEM "You are a helpful coding assistant specializing in Python."
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER stop "<|im_end|>"
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
# Build from Modelfile
ollama create coding-assistant -f Modelfile
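When several variants of a model are needed (different system prompts or parameters), the Modelfile can be generated rather than written by hand. The sketch below renders the `FROM`, `SYSTEM`, and `PARAMETER` lines only. `render_modelfile` is a hypothetical helper, it leaves out `TEMPLATE` for brevity, and it assumes string parameter values should be wrapped in double quotes, as the `stop` line in the Modelfile above is.

```python
def render_modelfile(gguf_path, system=None, parameters=None):
    """Render Modelfile text with FROM, optional SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {gguf_path}"]
    if system is not None:
        # Triple quotes allow multi-line system prompts
        lines.append(f'SYSTEM """{system}"""')
    for name, value in (parameters or {}).items():
        # Quote strings (e.g. stop sequences); write numbers as-is
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"PARAMETER {name} {rendered}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "./mistral-custom-Q4_K_M.gguf",
        system="You are a helpful coding assistant specializing in Python.",
        parameters={"temperature": 0.2, "num_ctx": 32768, "stop": "<|im_end|>"},
    )
    print(text)
```

Writing the returned text to a file named `Modelfile` and running the `ollama create` command above would then build the model.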
Ollama runs natively on Windows with full GPU acceleration:
Install: download from ollama.com/download/windows or use winget install Ollama.Ollama
Model storage: C:\Users\USERNAME\.ollama\models
API endpoint: localhost:11434
Network access: set OLLAMA_ORIGINS=* to allow cross-origin requests from web apps. Set OLLAMA_HOST=0.0.0.0 to expose the API on your network (with caution: add firewall rules).
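Scripts that need to find the model files on Windows can derive the location from the user profile. The sketch below assumes the default layout of `.ollama\models` under the profile directory. It also honors `OLLAMA_MODELS`, an environment variable not covered in this section that relocates model storage. `windows_models_dir` is a hypothetical helper name.

```python
import os
from pathlib import PureWindowsPath


def windows_models_dir(user_profile, env=None):
    """Return the expected Ollama model directory for a Windows user profile."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA_MODELS")  # optional relocation of model storage
    if override:
        return PureWindowsPath(override)
    return PureWindowsPath(user_profile) / ".ollama" / "models"
```

On a real Windows machine, `user_profile` would typically come from the `USERPROFILE` environment variable.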