Observability
Logging traces, costs, and quality signals.
Log every LLM call with: model, input/output tokens, latency, tool calls, cache hits, and a task ID. Aggregate to dashboards for cost per task and tail latency. Track quality with sampled human review of outputs.
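As a rough illustration of the fields above, here is a minimal sketch of a per-call log record with two aggregations: cost per task and a tail-latency percentile. The record type `LLMCallRecord`, the model name `example-model`, and the prices in `PRICE_PER_MTOK` are all hypothetical placeholders, and the cost formula ignores any cached-token discount for simplicity.

```python
import json
import math
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class LLMCallRecord:
    """One log entry per LLM call (hypothetical schema)."""
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    tool_calls: int
    cache_hit: bool


# Hypothetical prices in USD per million tokens; substitute real rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def to_log_line(record: LLMCallRecord) -> str:
    """Serialize one call as a JSON line for a log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)


def call_cost(record: LLMCallRecord) -> float:
    """Cost of one call in USD, ignoring cache discounts."""
    price = PRICE_PER_MTOK[record.model]
    total = (record.input_tokens * price["input"]
             + record.output_tokens * price["output"])
    return total / 1_000_000


def cost_per_task(records: list[LLMCallRecord]) -> dict[str, float]:
    """Sum call costs by task ID."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.task_id] += call_cost(record)
    return dict(totals)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Grouping by task ID rather than by call is what makes "cost per task" meaningful when a single task fans out into several calls.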
Real-world example: Catching a regression after a prompt change
A growth team ships a prompt change Monday morning. Their dashboard tracks five signals per call: model, input/output tokens, latency, cache hits, and tool errors.
By Monday afternoon, the dashboard shows:
- input tokens: +14%
- cache hit rate: down from 72% to 38%
- p99 latency: +1.6s
- cost per task: +21%
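Deltas like these can also be flagged automatically instead of waiting for someone to look at the dashboard. The following is a minimal sketch, assuming metrics are already aggregated into a baseline window and a current window; the metric names and the limits in `THRESHOLDS` are hypothetical and would need tuning for a real workload.

```python
# Hypothetical alert limits, expressed as relative change versus baseline.
# Positive limit: flag when the metric rises by more than this fraction.
# Negative limit: flag when the metric falls by more than this fraction.
THRESHOLDS = {
    "cost_per_task": 0.10,
    "p99_latency_s": 0.20,
    "cache_hit_rate": -0.20,
}


def relative_change(baseline: float, current: float) -> float:
    """Fractional change from baseline to current."""
    return (current - baseline) / baseline


def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     thresholds: dict[str, float] = THRESHOLDS) -> dict[str, float]:
    """Return the metrics whose relative change crosses their limit."""
    flagged = {}
    for metric, limit in thresholds.items():
        change = relative_change(baseline[metric], current[metric])
        crossed = change > limit if limit > 0 else change < limit
        if crossed:
            flagged[metric] = change
    return flagged
```

With the figures from this example, a cache hit rate falling from 0.72 to 0.38 is a relative drop of about 47%, which would cross the hypothetical 20% limit.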
Root cause: the new prompt moved a piece of static content from the system block into the user message, breaking the cache prefix. The team rolls back the change, and costs return to baseline within an hour.
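To see why moving static content matters, here is a deliberately simplified model of prefix-based caching. It assumes the cache can only reuse the leading portion that two consecutive requests share, and it measures that portion in characters rather than tokens. The strings `INSTRUCTIONS` and `REFERENCE` and both builder functions are hypothetical.

```python
import os

# Hypothetical static content that is identical on every request.
INSTRUCTIONS = "You are a support assistant.\n"
REFERENCE = "Refund policy: refunds are accepted within 30 days.\n"


def build_prompt_before(question: str) -> str:
    """Original layout: all static content first, dynamic content last."""
    return INSTRUCTIONS + REFERENCE + question


def build_prompt_after(question: str) -> str:
    """Changed layout: static reference text now follows the question."""
    return INSTRUCTIONS + question + "\n" + REFERENCE


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the leading run of characters common to both prompts."""
    return len(os.path.commonprefix([a, b]))


q1 = "How do I request a refund?"
q2 = "Where is my order?"

# All static content is reusable in the original layout.
reusable_before = shared_prefix_len(build_prompt_before(q1),
                                    build_prompt_before(q2))
# Only the instructions are reusable once the reference text moves.
reusable_after = shared_prefix_len(build_prompt_after(q1),
                                   build_prompt_after(q2))
```

In this toy model the reusable prefix shrinks from the full static content to the instructions alone, which is the kind of change that shows up on a dashboard as a falling cache hit rate.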
Why this matters: observability isn't decoration. It's the only way to catch prompt regressions before they blow your unit economics.
Key takeaways:
- Log model, tokens, latency, tool calls
- Track cache hit rate alongside cost
- Dashboard tail latency (p95/p99)
- Attach a task ID to every call
- Sample human review for quality
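For the last item, one way to pick outputs for human review is deterministic sampling keyed on the task ID, so the same task is always in or out of the sample regardless of which service makes the decision. This is a minimal sketch; the function name `selected_for_review` and the 5% rate in the usage line are hypothetical choices.

```python
import hashlib


def selected_for_review(task_id: str, sample_rate: float) -> bool:
    """Deterministically select a fraction of tasks for human review.

    The task ID is hashed to a number in [0, 1); the task is selected
    when that number falls below sample_rate.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Usage: route roughly 5% of tasks to reviewers (hypothetical rate).
needs_review = selected_for_review("task-123", sample_rate=0.05)
```

Hashing instead of calling a random number generator keeps the decision reproducible, which helps when a reviewer's verdict needs to be joined back to the logged calls for that task.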
