
Production Architecture

🏭 Production MLOps · 15 min

From Prototype to Production

Model Selection Framework

| Requirement | Recommended Model | Engine |
| --- | --- | --- |
| Quick prototyping | Mistral Small 4 / Qwen3-8B | Ollama |
| Production chat (single GPU) | Qwen3-32B-AWQ / Mistral-24B | vLLM |
| Enterprise multi-user | Mistral Large 3 / Llama 4 Scout | vLLM + Kubernetes |
| Edge / IoT | Gemma 4 E2B / Ministral 3B | llama.cpp / Ollama |
| RAG / agents | DeepSeek-V3 / Qwen3-72B | SGLang |

Container-Based Production Stack

# docker-compose.yml — Full production stack
services:
  inference:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ports: ["8000:8000"]
    volumes: ["./models:/models"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # tensor-parallel-size 2 shards the model across two GPUs;
    # set it to match the number of GPUs actually reserved above
    command: >
      --model /models/Mistral-Large-3-AWQ
      --quantization awq
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.9
      --max-model-len 32768
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  proxy:
    image: nginx:alpine
    ports: ["443:443", "80:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on: [inference]

  monitoring:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes: ["./grafana:/var/lib/grafana"]
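With the stack up, the inference container exposes an OpenAI-compatible API on port 8000. A minimal client sketch using only the standard library; the endpoint URL and model path are assumptions matching the compose file above:

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "/models/Mistral-Large-3-AWQ",
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the OpenAI-compatible endpoint served by vLLM."""
    data = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

In production, clients would call through the nginx proxy over TLS and add an `Authorization` header carrying their API key, rather than hitting the inference container directly.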

Key Metrics to Monitor

  • Throughput: Tokens/second (aggregate and per-request)
  • Latency: P50, P95, P99 response times
  • VRAM Usage: Model weights + KV cache + overhead
  • Queue Depth: Pending requests (indicates capacity limits)
  • Cost/Token: Hardware amortization per token generated
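The latency and cost metrics above are simple to compute from raw request logs. A sketch of both calculations; the sample numbers are illustrative, not benchmarks:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, as used for P50/P95/P99 latency."""
    ranked = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[idx]

def cost_per_token(hardware_cost_per_hour: float,
                   tokens_per_second: float) -> float:
    """Amortized hardware cost per generated token."""
    tokens_per_hour = tokens_per_second * 3600
    return hardware_cost_per_hour / tokens_per_hour

# Illustrative per-request latencies in milliseconds
latencies_ms = [110, 120, 95, 400, 130, 105, 115, 900, 125, 100]
p50 = percentile(latencies_ms, 50)   # typical request
p95 = percentile(latencies_ms, 95)   # tail latency, dominated by outliers
# e.g. a $1.50/hour GPU sustaining 1200 tokens/s aggregate throughput
cost = cost_per_token(1.50, 1200)
```

Note how the P95 is pulled far above the P50 by the two slow requests; tail percentiles, not averages, reveal queueing and capacity problems.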

Security Checklist

  • ✅ Reverse proxy with TLS termination
  • ✅ API key authentication at proxy layer
  • ✅ Rate limiting per client
  • ✅ Input sanitization (prompt injection defense)
  • ✅ Output filtering (PII, harmful content)
  • ✅ Network isolation (no direct internet access for inference)
  • ✅ Regular model updates and security patches
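Input sanitization is the least standardized item on this list; in practice it is a layered defense, not a single regex. A naive pre-filter sketch for illustration only; the pattern list is an assumption and far from exhaustive:

```python
import re

# Illustrative deny-patterns; a real deployment would layer pattern checks
# with a classifier and strict output-side filtering.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now in developer mode", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes the naive injection pre-filter."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)
```

Such a filter belongs at the proxy layer, before the request reaches the inference container, so rejected prompts never consume GPU time.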
💡 For teams without heavy iron: Start with a single NVIDIA GPU (RTX 4090 = 24GB VRAM). Run Mistral Small 4 or Qwen3-8B in a Docker container. This handles most small-team production workloads at near-zero marginal cost. Scale to multi-GPU with Kubernetes + vLLM only when throughput demands it.
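The single-GPU starting point from the tip above needs far less machinery than the full stack. A minimal compose sketch using Ollama; the model name and port follow Ollama's defaults, and the file is illustrative rather than tuned:

```yaml
# docker-compose.yml — single-GPU starter stack (sketch)
services:
  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes: ["./ollama:/root/.ollama"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

When request queues grow, the same clients can be pointed at the vLLM stack above; both expose OpenAI-compatible endpoints, so the migration is a URL change.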
Production Architecture | Production MLOps — Open Source AI Academy