Without caching and compaction, every turn resends the full conversation history, so per-call input cost grows linearly with turn count and cumulative cost grows roughly quadratically. A 50-turn conversation can cost 100x what a well-managed one should.
| Strategy | How It Works | Savings | Trade-off |
|---|---|---|---|
| Prompt Caching | Cache the system prompt + tool definitions (Anthropic charges ~90% less for cache reads; cache writes cost extra) | 60-90% on repeated calls | Cached prefix must stay byte-identical across calls |
| Result Caching | Cache tool results (e.g., same API call = cached response) | 100% for repeated queries | Stale data risk |
| Embedding Caching | Cache query embeddings to skip re-embedding identical queries | 50-70% on embedding costs | Cache invalidation complexity |
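As a sketch of the result-caching row, here is a minimal in-memory, single-process cache. The names (`cachedToolCall`, `TTL_MS`) and the TTL value are illustrative, not from any particular library; a production setup would typically use Redis or similar.

```javascript
// Tool result caching: identical tool calls within the TTL window return the
// cached result instead of re-running the tool. The stale-data risk noted in
// the table is bounded by the TTL.
const toolCache = new Map();
const TTL_MS = 60_000; // illustrative: tune per tool's staleness tolerance

async function cachedToolCall(toolName, args, runTool) {
  const key = `${toolName}:${JSON.stringify(args)}`;
  const hit = toolCache.get(key);
  if (hit && Date.now() - hit.at < TTL_MS) return hit.result; // cache hit
  const result = await runTool(toolName, args); // cache miss: run the tool
  toolCache.set(key, { result, at: Date.now() });
  return result;
}
```

Note that keying on `JSON.stringify(args)` means argument objects with the same fields in a different order miss the cache; canonicalize the key if that matters for your tools.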
When a conversation exceeds 80% of the context window, compact it:
```javascript
// Conversation compaction strategy:
// Before: 120 messages (~80K tokens)
// After:  1 summary (~2K tokens) + last 10 messages
async function compactConversation(messages) {
  // Only compact once we approach the context limit.
  if (tokenCount(messages) < MAX_TOKENS * 0.8) return messages;

  const oldMessages = messages.slice(0, -10);  // everything except the tail
  const recentMessages = messages.slice(-10);  // keep verbatim for continuity

  const summary = await llm.generate({
    system: "Summarize this conversation. Keep ALL key decisions, facts, and action items. Be thorough.",
    user: JSON.stringify(oldMessages)
  });

  return [
    { role: "system", content: `Previous conversation summary: ${summary}` },
    ...recentMessages
  ];
}
```
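The snippet above assumes a `tokenCount` helper. A rough stand-in, assuming ~4 characters per token (a common heuristic for English text; a real tokenizer or the provider's token-counting endpoint will differ), might look like:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// A hypothetical stand-in for a real tokenizer; treat the result as an
// approximation and leave headroom when comparing against MAX_TOKENS.
function tokenCount(messages) {
  const chars = messages.reduce(
    (sum, m) => sum + JSON.stringify(m).length,
    0
  );
  return Math.ceil(chars / 4);
}
```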
| Technique | Implementation Effort | Typical Savings |
|---|---|---|
| Prompt Caching (Anthropic) | Low (add cache_control breakpoints) | 60-90% |
| Conversation Compaction | Medium (summarization logic) | 40-70% |
| Tool Result Caching | Low (Redis/in-memory cache) | 20-50% |
| Model Routing (Haiku for easy tasks) | Medium (classifier needed) | 50-80% |
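Of these, prompt caching is the cheapest to adopt. A minimal sketch of an Anthropic Messages API request body with a `cache_control` breakpoint on the system prompt (the model name and prompt text are placeholders; everything before the breakpoint must stay byte-identical across calls for the cache to hit):

```javascript
// Prompt caching: mark the end of the stable prefix (tool definitions +
// system prompt) with a cache_control breakpoint so repeated calls reuse
// the cached prefix instead of reprocessing it.
function buildRequest(systemPrompt, tools, messages) {
  return {
    model: "claude-sonnet-4-20250514", // placeholder model name
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" } // cache everything up to here
      }
    ],
    tools,    // tool definitions sit above the breakpoint, so they are cached
    messages  // per-turn messages stay below the cached prefix
  };
}
```

Because tool definitions are processed before the system prompt, a single breakpoint on the system block caches both; only the growing `messages` array is reprocessed each turn.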