Where the money was going
OmniSupport's "AI Reply Assistant" was costing $11K/day on 2 million conversations. Quality was fine; cost was eating the gross margin. The team ran an audit and found three big leaks.
Leak 1 — Volatile content in the system prompt
The original system prompt looked like this:
```
You are the support assistant for tenant: {{tenant_name}}.
Today is: {{today}}.
Current ticket: {{ticket_summary}}.
Tools: ...
```
`today` and `ticket_summary` changed on every call, breaking cache hits constantly. The cache hit rate sat at 3%.
The fix: split *stable* and *per-call* content cleanly.
```
[ system, cached ]
You are the support assistant for tenant: Acme Co. (enterprise plan)
Tools: ...
Tone: warm but concise. Never make medical / legal claims.

[ user, per call ]
Today: 2026-05-12.
Ticket: Customer reports duplicate webhook events since 2026-05-08.
Question: how should we respond?
```
Cache hit rate climbed to 78% within a week. Input cost dropped 60% on that path alone.
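In code, the split maps directly onto prompt-caching APIs. A minimal sketch, assuming the Anthropic TypeScript SDK (`@anthropic-ai/sdk`) and its `cache_control` breakpoints; the function name, arguments, and model string are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Stable, per-tenant content lives in the system block and gets the cache
// breakpoint; the volatile date and ticket go in the user turn below it.
async function draftReply(stableSystem: string, today: string, ticket: string) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-5", // illustrative model name
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: stableSystem, // identical across calls for this tenant → cacheable
        cache_control: { type: "ephemeral" }, // cache everything up to here
      },
    ],
    messages: [
      {
        role: "user",
        content: `Today: ${today}.\nTicket: ${ticket}\nQuestion: how should we respond?`,
      },
    ],
  });
}
```

Everything above the breakpoint is billed at the cached rate on a hit; the volatile user turn sits below it and never pollutes the cached prefix.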
Leak 2 — JSON output that didn't validate
The ticket-summarizer produced JSON the parser frequently failed on — a missing comma here, an extra trailing key there. The team's "fix" had been a try/catch that re-prompted with the raw error text. Worked, but the retry rate was 22%.
The new recipe:
```typescript
import { z } from "zod";

const schema = z.object({
  category: z.enum(["billing", "tech", "feature_request", "other"]),
  sentiment: z.enum(["positive", "neutral", "negative"]),
  needs_human: z.boolean(),
});

const parsed = await callWithRetry({
  system: SUMMARIZE_SYSTEM,
  user: ticketText,
  schema,
  maxRetries: 1,
});
```
JSON mode + Zod validation + a single targeted retry brought the retry rate under 3%. The retry prompt now includes the validator's specific error: "Last attempt failed because: 'needs_human' was the string 'true' instead of a boolean."
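`callWithRetry` itself is just Zod's `safeParse` plus one re-prompt that carries the validator's own error. A hypothetical sketch of its shape, with `completeJson` standing in for whatever JSON-mode model call you use:

```typescript
// Hypothetical helper: parse, validate with Zod, and on failure retry once
// with the validator's specific error appended to the prompt.
declare function completeJson(opts: { system: string; user: string }): Promise<string>;

function tryJsonParse(s: string): unknown {
  try { return JSON.parse(s); } catch { return null; }
}

async function callWithRetry<T>(opts: {
  system: string;
  user: string;
  schema: z.ZodType<T>;
  maxRetries: number;
}): Promise<T> {
  let lastError = "";
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    const user = lastError
      ? `${opts.user}\n\nLast attempt failed because: ${lastError}`
      : opts.user;
    const raw = await completeJson({ system: opts.system, user });
    const result = opts.schema.safeParse(tryJsonParse(raw));
    if (result.success) return result.data;
    // Turn Zod's issues into the targeted error text for the retry prompt.
    lastError = result.error.issues
      .map((i) => `'${i.path.join(".")}': ${i.message}`)
      .join("; ");
  }
  throw new Error(`Validation failed after retries: ${lastError}`);
}
```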
Leak 3 — Knowledge-base questions paying for the same KB every call
The "Knowledge Search" feature passed the relevant knowledge-base article (often 8–12K tokens) inline with every customer question. Even with prompt caching enabled, the *order* was wrong — they had the KB *after* the user message, so the cache breakpoint sat after the volatile bit. Almost no hits.
The fix was structural: put the KB first and the question last. Mark the cache breakpoint after the KB. Now identical KB articles fetched within the 5-minute window cost ~10% of the original input price.
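Under the same SDK assumption as the earlier sketch, the breakpoint simply moves inside the user turn, after the KB block; the variable names here are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// KB article first, breakpoint after it, volatile question last.
async function answerFromKb(kbArticle: string, customerQuestion: string) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-5", // illustrative model name
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: kbArticle, // 8-12K tokens, identical across questions
            cache_control: { type: "ephemeral" }, // cached prefix ends here
          },
          { type: "text", text: customerQuestion }, // volatile, after the breakpoint
        ],
      },
    ],
  });
}
```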
Few-shot tuning on email triage
For email triage, the team had been packing 18 examples into the prompt. Performance was middling and tokens were heavy. They cut to five carefully chosen examples — newsletter, refund request, partnership pitch, urgent outage, ambiguous upsell — and put the ambiguous one last to lean into recency bias.
| Examples | Accuracy on hard cases | Cost / call |
| ---------------------- | ---- | ------ |
| 18 mixed | 79% | $0.041 |
| 5 chosen, hardest last | 84% | $0.014 |
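The assembly is plain ordering. A sketch; the five example texts below are invented stand-ins for the team's real picks:

```typescript
// Five examples, the ambiguous one last to lean into recency bias.
const FEW_SHOT_EXAMPLES = [
  { email: "Your weekly product digest is here...", label: "newsletter" },
  { email: "I was charged twice this month, please refund one.", label: "refund_request" },
  { email: "We'd love to explore a co-marketing partnership.", label: "partnership_pitch" },
  { email: "Production is down for all of our users!", label: "urgent_outage" },
  { email: "Love the product. What would it take to get SSO?", label: "ambiguous_upsell" },
];

const TRIAGE_SYSTEM = [
  "Classify each email into exactly one label.",
  ...FEW_SHOT_EXAMPLES.map((ex) => `Email: ${ex.email}\nLabel: ${ex.label}`),
].join("\n\n");
```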
Extended thinking, gated
The math-help feature (yes, support sometimes needs math) had extended thinking enabled globally. They gated it on a heuristic: only enable for messages classified as "computational." The simple lookups stayed fast and cheap; the genuinely hard ones got the budget.
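The gate itself is a one-line conditional once the classifier exists. A sketch under the same SDK assumption as above; `classifyIntent` is a hypothetical cheap pre-classifier (a small model call, or even a regex):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Hypothetical cheap classifier that tags a message as "computational" or not.
declare function classifyIntent(message: string): Promise<string>;

async function answerMathHelp(message: string) {
  const computational = (await classifyIntent(message)) === "computational";
  return anthropic.messages.create({
    model: "claude-sonnet-4-5", // illustrative model name
    max_tokens: 16000, // must exceed the thinking budget when enabled
    // Only pay for a thinking budget when the heuristic says it's warranted.
    ...(computational
      ? { thinking: { type: "enabled" as const, budget_tokens: 8000 } }
      : {}),
    messages: [{ role: "user", content: message }],
  });
}
```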
Prompts in git, eval on every PR
Everything above only stuck because the team moved prompts into version control. Every prompt-changing PR runs the eval harness against a 100-conversation golden set and posts a diff comment: success rate, cost per call, p95 latency. PRs that regress any metric require explicit override.
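In sketch form, the CI gate might look like the script below; `runGoldenSet`, the file paths, and the `EVAL_OVERRIDE` escape hatch are all illustrative stand-ins for the team's real harness:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical harness entry point: runs the 100-conversation golden set
// against the PR's prompts and returns aggregate metrics.
declare function runGoldenSet(path: string): Promise<EvalResult>;

interface EvalResult {
  successRate: number;  // fraction of golden conversations handled correctly
  costPerCall: number;  // USD
  p95LatencyMs: number; // milliseconds
}

function regresses(current: EvalResult, baseline: EvalResult): boolean {
  return (
    current.successRate < baseline.successRate ||
    current.costPerCall > baseline.costPerCall ||
    current.p95LatencyMs > baseline.p95LatencyMs
  );
}

const current = await runGoldenSet("golden/conversations-100.jsonl");
const baseline: EvalResult = JSON.parse(
  await readFile("eval/baseline.json", "utf8"),
);

// Block the merge on any regression unless an explicit override is set.
if (regresses(current, baseline) && !process.env.EVAL_OVERRIDE) {
  console.error("Eval regression detected; blocking merge.");
  process.exit(1);
}
```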
What stuck
- Stable in system, volatile in user. That alone unlocked the biggest savings.
- Validate with a schema, retry on the validator's own error.
- Cache breakpoint placement is structural, not a flag you flip.
- Few-shot quality > quantity. Hardest example last.
- Extended thinking is a tool, not a default. Gate it.
- Prompts deserve the same review/eval/rollback rigor as application code.
