# From Prototype to Production

## Model Selection Framework
| Requirement | Recommended Model | Engine |
| --- | --- | --- |
| Quick prototyping | Mistral Small 4 / Qwen3-8B | Ollama |
| Production chat (single GPU) | Qwen3-32B-AWQ / Mistral-24B | vLLM |
| Enterprise multi-user | Mistral Large 3 / Llama 4 Scout | vLLM + Kubernetes |
| Edge / IoT | Gemma 4 E2B / Ministral 3B | llama.cpp / Ollama |
| RAG / agents | DeepSeek-V3 / Qwen3-72B | SGLang |
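Whichever tier you land in, the serving engines above (vLLM, SGLang, and recent Ollama builds) all expose an OpenAI-compatible REST API, so application code can stay the same as you move between them. A minimal stdlib-only client sketch; the URL, port, and model name are placeholders to match to your own deployment, and the `send` hook is an illustrative addition that makes the function testable:

```python
# Minimal chat client for an OpenAI-compatible /v1/chat/completions endpoint,
# as served by vLLM, SGLang, or Ollama. URL and model name are placeholders.
import json
import urllib.request

def chat(prompt: str, base_url: str = "http://localhost:8000/v1",
         model: str = "Qwen3-8B", send=None) -> str:
    """POST a chat completion; `send(url, payload)` is injectable for tests."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    if send is None:
        def send(url, body):
            req = urllib.request.Request(
                url, data=json.dumps(body).encode("utf-8"),
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
    reply = send(f"{base_url}/chat/completions", payload)
    return reply["choices"][0]["message"]["content"]
```

Because the wire format is identical across engines, swapping Ollama for vLLM later is a one-line change to `base_url`.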
## Container-Based Production Stack
```yaml
# docker-compose.yml: full production stack
services:
  inference:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ports: ["8000:8000"]
    volumes: ["./models:/models"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /models/Mistral-Large-3-AWQ
      --quantization awq
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.9
      --max-model-len 32768
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  proxy:
    image: nginx:alpine
    ports: ["443:443", "80:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on: [inference]

  monitoring:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes: ["./grafana:/var/lib/grafana"]
```
## Key Metrics to Monitor
- Throughput: Tokens/second (aggregate and per-request)
- Latency: P50, P95, P99 response times
- VRAM Usage: Model weights + KV cache + overhead
- Queue Depth: Pending requests (indicates capacity limits)
- Cost/Token: Hardware amortization per token generated
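The latency percentiles above come straight from per-request timings. A minimal, dependency-free computation using the nearest-rank method (production monitoring stacks typically estimate these from histograms instead; the sample latencies are illustrative):

```python
# Nearest-rank percentiles over recorded request latencies (seconds).
import math

def percentile(samples: list[float], p: float) -> float:
    """p in (0, 100]; nearest-rank definition over sorted samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative per-request response times from a load test:
latencies = [0.8, 1.2, 0.9, 3.5, 1.1, 0.7, 2.0, 1.3, 0.95, 1.05]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
```

Watching P95/P99 rather than the mean is what surfaces queue-depth problems: a healthy average can hide a long tail of requests stuck behind a saturated GPU.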
## Security Checklist
- ✅ Reverse proxy with TLS termination
- ✅ API key authentication at proxy layer
- ✅ Rate limiting per client
- ✅ Input sanitization (prompt injection defense)
- ✅ Output filtering (PII, harmful content)
- ✅ Network isolation (no direct internet access for inference)
- ✅ Regular model updates and security patches
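Per-client rate limiting is normally enforced at the proxy layer (nginx, Envoy, or an API gateway), but the mechanics are worth seeing once. A token-bucket sketch; the class name and injectable `now` clock are illustrative:

```python
# Token-bucket rate limiter keyed by API client, as a proxy or middleware
# might apply it. The `now` clock is injectable so the logic is testable.
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # bucket capacity (max burst size)
        self.now = now
        # Each client starts with a full bucket: (tokens, last_refill_time).
        self.buckets = defaultdict(lambda: (burst, self.now()))

    def allow(self, client_id: str) -> bool:
        """Consume one token for this client if available."""
        tokens, last = self.buckets[client_id]
        t = self.now()
        tokens = min(self.burst, tokens + (t - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_id] = (tokens - 1.0, t)
            return True
        self.buckets[client_id] = (tokens, t)
        return False
```

Keying the bucket by API key rather than IP address pairs naturally with the key-authentication item above, since several clients may share a NAT address.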
💡 For teams without heavy iron: Start with a single consumer NVIDIA GPU (e.g., an RTX 4090 with 24GB VRAM). Run Mistral Small 4 or Qwen3-8B in a Docker container. This handles most small-team production workloads at near-zero marginal cost. Scale to multi-GPU with Kubernetes + vLLM only when throughput demands it.