Lesson 4 · 8 min

Prompt Caching

5-minute server-side cache that cuts tokens by 90% on hits.

Mark a content block as cacheable; subsequent calls that share that block read it for ~10% of input cost. Cache lives for 5 minutes. Architect prompts so static content is first and shared across calls — that's how you get hits.

Production scenario

Real-world example: Internal-docs Q&A bot

A 50,000-token internal handbook is the context for an HR Q&A bot. Without caching, every employee question re-pays the 50K input cost — burning thousands of dollars a day at scale.

The architecture:

[system + handbook]   ← cache breakpoint here, lives 5 min
[user question]       ← cheap, ~50 tokens

After warm-up, 97% of calls hit the cache. Input cost drops from $0.15 per question to about $0.018. Latency drops too because the cache is faster than full re-ingestion.

Why this matters: prompt caching is one of the few "free" wins in LLM ops. Architect prompts so the static block is at the start, reused across calls, and explicitly marked.

Knowledge points in this lesson

Cache TTL is 5 minutes
Cache key matches on prefix bytes
Order content static → dynamic to maximize hits
Cache hits ~10% of input cost
Mark cache breakpoints explicitly
Reorder kills cache hits

Quick check

Prompt EngineeringSelect one

You're using JSON mode and want bulletproof structured output. What's the additional safety layer?