Claude Certification
Prompt Engineering & Structured Output
Lesson 4 · 8 min

Prompt Caching

5-minute server-side cache that cuts tokens by 90% on hits.

Mark a content block as cacheable; subsequent calls that share that block read it for ~10% of input cost. Cache lives for 5 minutes. Architect prompts so static content is first and shared across calls — that's how you get hits.

Production scenario

Real-world example: Internal-docs Q&A bot

A 50,000-token internal handbook is the context for an HR Q&A bot. Without caching, every employee question re-pays the 50K input cost — burning thousands of dollars a day at scale.

The architecture:

[system + handbook]   ← cache breakpoint here, lives 5 min
[user question]       ← cheap, ~50 tokens

After warm-up, 97% of calls hit the cache. Input cost drops from $0.15 per question to about $0.018. Latency drops too because the cache is faster than full re-ingestion.

Why this matters: prompt caching is one of the few "free" wins in LLM ops. Architect prompts so the static block is at the start, reused across calls, and explicitly marked.

Knowledge points in this lesson
  • Cache TTL is 5 minutes
  • Cache key matches on prefix bytes
  • Order content static → dynamic to maximize hits
  • Cache hits ~10% of input cost
  • Mark cache breakpoints explicitly
  • Reorder kills cache hits
Quick check
Prompt EngineeringSelect one
You're using JSON mode and want bulletproof structured output. What's the additional safety layer?