You cannot rely on the LLM's built-in safety alone. You must build defenses into the orchestrator:
Sandboxing: Run all agent-generated code in isolated environments with no network access to internal systems.
Least Privilege: Only give the agent the exact tools it needs. Don't give a read-only agent a delete_row tool.
Human-in-the-Loop (HITL): Require a human to click "Approve" before any irreversible action (e.g., sending an email, dropping a table).
Input/Output Filters: Before executing any planned action, pass it through a smaller, faster model trained specifically to detect malicious intent.
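Two of the defenses above, least privilege and HITL approval gates, can be sketched in a minimal orchestrator. All names here (Tool, Orchestrator, read_row, drop_table, the approver callback) are illustrative assumptions, not a real agent framework's API:

```python
# Minimal sketch of two guardrails: a least-privilege tool allowlist and a
# human-in-the-loop (HITL) approval gate for irreversible actions.
# All class and tool names are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., str]
    irreversible: bool = False  # irreversible tools require HITL approval

class Orchestrator:
    def __init__(self, allowed_tools: list[Tool], approver: Callable[[str], bool]):
        # Least privilege: the agent can only call tools explicitly granted here.
        self.tools = {t.name: t for t in allowed_tools}
        self.approver = approver  # e.g. prompts a human to click "Approve"

    def execute(self, tool_name: str, *args) -> str:
        tool = self.tools.get(tool_name)
        if tool is None:
            return f"DENIED: '{tool_name}' is not in this agent's allowlist"
        if tool.irreversible and not self.approver(f"Run {tool_name}{args}?"):
            return f"BLOCKED: human rejected irreversible action '{tool_name}'"
        return tool.fn(*args)

# Usage: a mostly read-only agent; drop_table exists but is gated on a human.
read_row = Tool("read_row", lambda rid: f"row {rid}")
drop_table = Tool("drop_table", lambda name: f"dropped {name}", irreversible=True)

agent = Orchestrator([read_row, drop_table], approver=lambda msg: False)  # human says no
print(agent.execute("read_row", 42))         # allowed
print(agent.execute("drop_table", "users"))  # blocked by the HITL gate
print(agent.execute("delete_row", 7))        # denied: never granted to this agent
```

The key design choice is that enforcement lives in the orchestrator, not in the prompt: even a fully jailbroken model cannot call a tool that was never registered, and cannot fire an irreversible one without the human callback returning True.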
SYNAPSE VERIFICATION
QUERY 1 // 1
What is the most effective defense against an agent making a catastrophic, irreversible mistake?
A) A very strong system prompt
B) Human-in-the-Loop (HITL) approval gates for sensitive tools
C) Using temperature 0
D) Prompt caching
Defense in Depth | Safety & Guardrails — AI Agents Academy