In production, failures are routine: APIs time out, databases go down, and rate limits are hit. A robust agent must handle these failures gracefully, which in practice means handling them at several layers, each with its own owner and strategy:

| Layer | Who Handles | Strategy | Example |
|---|---|---|---|
| 1. Tool Level | Your code | Retry with backoff, circuit breakers | Retry API call 3 times with exponential backoff |
| 2. Orchestrator Level | Your code | Catch exceptions, format errors for the LLM | Catch timeout, send "Tool timed out. Try alternative." |
| 3. Agent Level | The LLM | Reason about the error and try a different approach | "API returned 404. Let me try searching by name instead of ID." |
| 4. Human Level | The user | Escalate when all else fails | "I cannot complete this task. Here's what I tried..." |
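
The function below combines the first two layers: it retries a failing tool call with exponential backoff and, once the retries are exhausted, returns a structured error that the LLM can reason about.

```javascript
// Resolve after `ms` milliseconds; used to pause between retries.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run a tool with retries. `executeTool` is assumed to be defined elsewhere
// and to throw (or reject) when the tool call fails.
async function executeToolSafely(toolName, args, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await executeTool(toolName, args);
      return { success: true, data: result };
    } catch (error) {
      if (attempt === maxRetries) {
        // Out of retries: format the error for the LLM to reason about
        return {
          success: false,
          error: error.message,
          suggestion: `Tool '${toolName}' failed after ${maxRetries} attempts. Error: ${error.message}. Consider trying an alternative approach.`
        };
      }
      // Exponential backoff: wait 2s after the first failure, 4s after the second
      await sleep(2 ** attempt * 1000);
    }
  }
}
```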
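
The table also lists circuit breakers as a tool-level strategy, which the retry function does not cover. The following is a minimal sketch of one, not a production implementation: the class name `CircuitBreaker`, its options (`failureThreshold`, `cooldownMs`, `now`), and the default values are all hypothetical choices made for illustration. After a run of consecutive failures it "opens" and rejects calls immediately for a cooldown period, so a struggling service is not hit with further retries.

```javascript
// Minimal circuit breaker sketch. All names and defaults are illustrative
// assumptions, not part of any particular library.
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs;             // how long the circuit stays open
    this.now = now;                           // injectable clock, useful in tests
    this.failures = 0;                        // current run of consecutive failures
    this.openedAt = null;                     // null means the circuit is closed
  }

  // Run `fn` (a function returning a promise) through the breaker.
  async call(fn) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      // Still cooling down: fail fast without calling the tool at all
      throw new Error("Circuit open: skipping call");
    }
    // Either the circuit is closed, or the cooldown has elapsed and this
    // call acts as a trial ("half-open") request.
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      throw error;
    }
  }
}
```

One way to combine this with retries, assuming one breaker instance per tool, is to wrap the tool call inside the retry loop: `await breaker.call(() => executeTool(toolName, args))`. Once the breaker opens, the remaining attempts fail fast, and the resulting "Circuit open" message is what gets reported to the LLM. This sketch does not limit concurrent trial calls after the cooldown, which a fuller implementation would likely do.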