Error Handling & Recovery

🔧 Tool Use & Function Calling | 10 min | 80 BASE XP

Making Agents Resilient

In production, things break constantly. APIs time out, databases go down, and rate limits are hit. A robust agent must handle these failures gracefully.

The Error Handling Pyramid

| Layer | Who Handles | Strategy | Example |
| --- | --- | --- | --- |
| 1. Tool Level | Your code | Retry with backoff, circuit breakers | Retry API call 3 times with exponential backoff |
| 2. Orchestrator Level | Your code | Catch exceptions, format errors for the LLM | Catch timeout, send "Tool timed out. Try alternative." |
| 3. Agent Level | The LLM | Reason about the error and try a different approach | "API returned 404. Let me try searching by name instead of ID." |
| 4. Human Level | The user | Escalate when all else fails | "I cannot complete this task. Here's what I tried..." |
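
To make layer 2 concrete, here is a minimal sketch of an orchestrator that catches a timeout and hands the LLM a short message instead of an exception. The names `withTimeout` and `callToolForAgent`, the 5-second default, and the result shape are all assumptions made for illustration, not part of any particular framework.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Layer 2 (orchestrator): catch the failure and turn it into a short,
// readable message for the LLM. `tool` is any function returning a promise.
async function callToolForAgent(toolName, tool, args, timeoutMs = 5000) {
  try {
    const data = await withTimeout(tool(args), timeoutMs);
    return { success: true, data };
  } catch (error) {
    return {
      success: false,
      message: `Tool '${toolName}' failed: ${error.message}. Try an alternative.`,
    };
  }
}
```

The agent (layer 3) only ever sees the `message` string, which gives it something it can reason about when choosing a different approach.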

Implementation Pattern

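// Resolve after `ms` milliseconds; used for the backoff delay.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run a tool with retries. `executeTool` is assumed to be defined elsewhere
// and to throw (or reject) when the call fails.
async function executeToolSafely(toolName, args, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await executeTool(toolName, args);
      return { success: true, data: result };
    } catch (error) {
      if (attempt === maxRetries) {
        // Out of retries: format the error for the LLM to reason about
        return {
          success: false,
          error: error.message,
          suggestion: `Tool '${toolName}' failed after ${maxRetries} attempts. Error: ${error.message}. Consider trying an alternative approach.`
        };
      }
      // Exponential backoff: wait 2 s, then 4 s, ... before the next attempt
      await sleep(2 ** attempt * 1000);
    }
  }
}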

Key Error Recovery Strategies

  • Graceful Degradation: If the primary tool fails, have a fallback. If the database search fails, try web search.
  • Error Context Injection: When sending an error back to the agent, include what failed, why it failed, and what to try instead.
  • Circuit Breaker: If a tool fails 5 times in a row, stop calling it entirely and inform the agent it's unavailable.
  • Checkpoint Recovery: In graph-based agents, save state before risky operations. If they fail, roll back to the last checkpoint.
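
The circuit breaker and graceful degradation strategies above can be sketched together as follows. This is a simplified illustration: `CircuitBreaker`, `searchWithFallback`, and the result shape are hypothetical names, and the breaker has no cooldown or half-open state, which real implementations usually add so that a recovered tool can be called again.

```javascript
// Minimal circuit breaker: opens after N consecutive failures.
class CircuitBreaker {
  constructor(failureThreshold = 5) {
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = 0;
  }

  get isOpen() {
    return this.consecutiveFailures >= this.failureThreshold;
  }

  async call(tool, args) {
    if (this.isOpen) {
      // Fail fast without touching the tool at all.
      throw new Error("circuit open: tool is temporarily unavailable");
    }
    try {
      const result = await tool(args);
      this.consecutiveFailures = 0; // any success resets the streak
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      throw error;
    }
  }
}

// Graceful degradation: try the primary tool through the breaker,
// and use the fallback tool if the primary fails or the circuit is open.
async function searchWithFallback(breaker, primary, fallback, args) {
  try {
    return { source: "primary", data: await breaker.call(primary, args) };
  } catch (error) {
    return {
      source: "fallback",
      data: await fallback(args),
      primaryError: error.message,
    };
  }
}
```

Returning `primaryError` alongside the fallback data lets the orchestrator tell the agent that the primary tool is unavailable, rather than hiding the failure.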
🚧 Critical Rule: Never send raw stack traces to the LLM. They waste tokens and confuse the model. Always format errors into a structured, human-readable summary with actionable suggestions.
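
One way to apply this rule is a small formatter that keeps the first line of the error message and drops the stack entirely. The function name `formatErrorForLLM`, the field names, and the 200-character cap below are assumptions chosen for illustration.

```javascript
// Build a compact, structured error summary for the LLM.
// Only the first line of the message is kept; `error.stack` is never read.
function formatErrorForLLM(toolName, error, suggestion) {
  const rawMessage = error && error.message ? error.message : String(error);
  const reason = rawMessage.split("\n")[0].slice(0, 200);
  return {
    tool: toolName,
    what_failed: `Call to '${toolName}' did not complete.`,
    why: reason,
    try_instead: suggestion,
  };
}

// Usage: the summary is what goes back into the agent's context.
const summary = formatErrorForLLM(
  "lookup_user",
  new Error("404 Not Found\n    at lookupUser (users.js:10:5)"),
  "Search by name instead of ID."
);
console.log(JSON.stringify(summary, null, 2));
```

The three fields mirror the what/why/what-to-try-instead structure from the Error Context Injection strategy.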
SYNAPSE VERIFICATION
QUERY 1 // 2
At which layer of the Error Handling Pyramid does the LLM reason about the error and try a different approach?
  • Tool Level
  • Orchestrator Level
  • Agent Level
  • Human Level
Error Handling & Recovery | Tool Use & Function Calling — AI Agents Academy