
Advanced Ollama Features

🦙 Ollama: Local AI · 10 min · 100 BASE XP

Beyond Basic Chat

Ollama is more than a chat interface: it also generates embeddings, runs vision models, and serves multiple models from a single local server.

Embedding API

Generate vector embeddings for RAG, search, and similarity matching:

# Generate embeddings via the REST API (curl)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Explain quantum computing", "What is Docker?"]
}'

# Python with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.embeddings.create(
    model="nomic-embed-text",
    input=["Explain quantum computing"]
)
print(response.data[0].embedding[:5])  # first five components of the vector
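
For the similarity-matching use case mentioned above, here is a minimal sketch of how embedding vectors might be compared. It assumes the default local address and that the /api/embed response carries its vectors under an "embeddings" key; the helper names embed, cosine_similarity, and rank_documents are hypothetical and not part of Ollama.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local address


def embed(texts, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Return one embedding vector per input text via POST /api/embed."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumption: the response JSON holds a list of vectors in "embeddings".
        return json.load(resp)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

With a running server, a call such as rank_documents(embed(["query"])[0], embed(documents)) would order the documents by relevance to the query. A real RAG pipeline would usually store the document vectors in a vector index instead of re-embedding them per query.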

Vision Model Support

Run multimodal models that understand images:

# Run a vision model
ollama run llama4:scout

# Send an image via the API
# "images" takes base64-encoded image data, not a file path
curl http://localhost:11434/api/chat -d '{
  "model": "llama4:scout",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["<base64-encoded image data>"]
  }]
}'
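
The /api/chat endpoint expects each entry in "images" to be base64-encoded image data, so a client typically reads the file and encodes it first. The sketch below shows one way to do that with the standard library. The function names build_vision_request and describe_image are hypothetical, and the code assumes a non-streaming response whose reply text sits under message.content.

```python
import base64
import json
import urllib.request


def build_vision_request(model, prompt, image_bytes):
    """Build an /api/chat payload with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # ask for a single JSON response instead of a stream
        "messages": [
            {"role": "user", "content": prompt, "images": [encoded]}
        ],
    }


def describe_image(model, prompt, image_path, base_url="http://localhost:11434"):
    """Send an image file to a vision model and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_vision_request(model, prompt, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumption: the non-streaming reply is found at message.content.
        return json.load(resp)["message"]["content"]
```

Pass the name of whichever vision model you pulled as the model argument, for example describe_image(model_name, "What is in this image?", "photo.jpg").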

Multi-Model Serving

Ollama can serve multiple models from one server. It loads each model into VRAM on demand and unloads it when the memory is needed or the model has been idle:

# Pull multiple models
ollama pull mistral-large
ollama pull qwen3.5:32b
ollama pull nomic-embed-text

# Each request specifies which model to use
# Ollama manages GPU memory automatically
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "mistral-large",
  "messages": [{"role": "user", "content": "Hello"}]
}'

# Set these environment variables on the server process:
# OLLAMA_NUM_PARALLEL: maximum number of concurrent requests per model
# OLLAMA_MAX_LOADED_MODELS: maximum number of models kept loaded at the same time
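
Both variables are read by the server process, so they need to be in its environment when it starts. As a rough sketch, a script could launch ollama serve with them set, assuming the ollama binary is on PATH and no other Ollama instance already holds the port. The helper names build_server_env and start_server and the default values are illustrative, not recommendations.

```python
import os
import subprocess


def build_server_env(num_parallel, max_loaded_models, base_env=None):
    """Copy the environment and add Ollama's concurrency settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_server(num_parallel=2, max_loaded_models=3):
    """Start `ollama serve` as a child process with the settings applied."""
    env = build_server_env(num_parallel, max_loaded_models)
    # Assumption: `ollama` is on PATH and port 11434 is free.
    return subprocess.Popen(["ollama", "serve"], env=env)
```

If Ollama already runs as a desktop app or system service, the variables would have to be set for that service instead, and the service restarted.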

Creating Models from GGUF

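Import a GGUF model file into Ollama by referencing it from a Modelfile:

# Method 1: Minimal Modelfile containing only a FROM line
# (the -f flag expects a Modelfile, not the GGUF file itself)
echo "FROM /path/to/model.gguf" > Modelfile
ollama create my-model -f Modelfile

# Method 2: Modelfile with a system prompt, parameters, and chat template
# Modelfile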
FROM ./mistral-custom-Q4_K_M.gguf
SYSTEM "You are a helpful coding assistant specializing in Python."
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER stop "<|im_end|>"
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Build from Modelfile
ollama create coding-assistant -f Modelfile
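
When several variants of a model are needed, the Modelfile can be generated instead of written by hand. The following is a minimal sketch that covers only the FROM, SYSTEM, and PARAMETER instructions; render_modelfile and create_model are hypothetical helper names, and the system prompt is assumed to contain no double quotes.

```python
import subprocess
from pathlib import Path


def render_modelfile(gguf_path, system=None, parameters=None):
    """Render a simple Modelfile as text (FROM, SYSTEM, PARAMETER only)."""
    lines = [f"FROM {gguf_path}"]
    if system:
        # Assumption: `system` has no double quotes; no escaping is done here.
        lines.append(f'SYSTEM "{system}"')
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


def create_model(name, modelfile_text, workdir="."):
    """Write the Modelfile to disk and run `ollama create` on it."""
    path = Path(workdir) / "Modelfile"
    path.write_text(modelfile_text, encoding="utf-8")
    # Requires the `ollama` CLI on PATH and a running Ollama server.
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)
```

For example, render_modelfile("./mistral-custom-Q4_K_M.gguf", parameters={"temperature": 0.2}) produces a two-line Modelfile that create_model can then build.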

Ollama on Windows

Ollama runs natively on Windows with full GPU acceleration:

  • Install: Download from ollama.com/download/windows or use winget install Ollama.Ollama
  • GPU Support: Automatic NVIDIA CUDA and AMD ROCm detection
  • Models stored at: C:\Users\USERNAME\.ollama\models
  • Service: Runs as a background service, accessible via localhost:11434
  • WSL2: Also works inside WSL2 with GPU passthrough for Linux-native workflows
💡 Pro Tip: Set OLLAMA_ORIGINS=* to allow cross-origin requests from web apps on any origin. Set OLLAMA_HOST=0.0.0.0 to expose the API on your network. Ollama has no built-in authentication, so restrict access with firewall rules before doing this.
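
After installing or changing OLLAMA_HOST, it can help to confirm that the API is reachable from where you expect. The sketch below queries /api/tags, which lists the locally installed models. It assumes the response has a "models" list whose entries carry a "name" field; model_names and list_local_models are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url="http://localhost:11434", timeout=3.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None
```

To test network exposure, call list_local_models from another machine with base_url pointing at the host's LAN address. A None result suggests the server is not listening on that interface or a firewall is blocking the port.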

KNOWLEDGE CHECK

How do you generate vector embeddings with Ollama?

  • Use a separate embedding server
  • Use the /api/embed endpoint with an embedding model
  • Export model weights manually
  • Use the chat endpoint with a special flag
Advanced Ollama Features | Ollama: Local AI — Open Source AI Academy