Prompt Caching
5-minute server-side cache that cuts tokens by 90% on hits.
Mark a content block as cacheable; subsequent calls that share that block read it for ~10% of input cost. Cache lives for 5 minutes. Architect prompts so static content is first and shared across calls — that's how you get hits.
Real-world example: Internal-docs Q&A bot
A 50,000-token internal handbook is the context for an HR Q&A bot. Without caching, every employee question re-pays the 50K input cost — burning thousands of dollars a day at scale.
The architecture:
[system + handbook] ← cache breakpoint here, lives 5 min
[user question] ← cheap, ~50 tokensAfter warm-up, 97% of calls hit the cache. Input cost drops from $0.15 per question to about $0.018. Latency drops too because the cache is faster than full re-ingestion.
Why this matters: prompt caching is one of the few "free" wins in LLM ops. Architect prompts so the static block is at the start, reused across calls, and explicitly marked.
- Cache TTL is 5 minutes
- Cache key matches on prefix bytes
- Order content static → dynamic to maximize hits
- Cache hits ~10% of input cost
- Mark cache breakpoints explicitly
- Reorder kills cache hits
