The demo worked. Production didn't.
Research Co's analyst team had a working demo of a "compile a sector brief" agent: feed it a watchlist, let it pull SEC filings, news, internal notes, and produce a 30-page deck. Demo? Beautiful. Production? Crashed at hour three on long brief jobs. Cost ran 4× the estimate. Output quality was inconsistent week-to-week.
Problem 1 — Token budget face-plants
The agent had no explicit budget. It happily ingested every tool result verbatim. Around hour two on a long job, prompts hit 190K and any new tool call put it over 200K. The team set explicit budget targets:
[ system + role ] ~2K
[ task brief skeleton ] ~10K
[ summarized older steps ] ~30K (grows over time, capped)
[ verbatim recent 20 steps ] ~80K
[ headroom for tool results ] ~60K
total ~180K (90% of 200K, still some wiggle room)

The 70% target became a hard rule: when the running total crossed 140K, the next step had to be a compaction pass.
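A minimal sketch of that crossing-point check, in TypeScript. The character-based token estimate and the injected compact() pass are assumptions for illustration, not the team's actual code:

const CONTEXT_WINDOW = 200_000;
const COMPACTION_THRESHOLD = 0.7 * CONTEXT_WINDOW; // 140K: the hard crossing point

type Msg = { role: string; content: string };

// Rough stand-in: ~4 characters per token. Real code would call the
// provider's tokenizer; this heuristic is an assumption for the sketch.
function estimateTokens(messages: Msg[]): number {
  return Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);
}

async function enforceBudget(
  messages: Msg[],
  compact: (msgs: Msg[]) => Promise<Msg[]>, // the compaction pass, injected
): Promise<Msg[]> {
  // Hard rule: once 140K is crossed, the next step must be a compaction pass.
  if (estimateTokens(messages) > COMPACTION_THRESHOLD) {
    return compact(messages);
  }
  return messages;
}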
Problem 2 — Naïve compaction lost everything important
The first compaction implementation just summarized the last hour of conversation into "the user asked about ABC and the agent did XYZ." Useless. What the agent actually needed to remember was: which sources were authoritative, what hypotheses had been tested and discarded, what numbers had been pulled, and what was still open.
They rewrote the compaction step to produce a structured brief:
{
"sources_seen": ["ACME 10-K", "ACME 10-Q Q1", "Bloomberg 2026-04-22"],
"facts": ["FY revenue 2.4B (10-K p.31)", "Q1 op margin 18.4% (10-Q)"],
"hypotheses_open": ["EU lawsuit could compress margin"],
"hypotheses_closed": ["Margin compression from raw inputs — disproved by Q1"],
"next_action": "Pull sector comparables for op margin"
}

Compaction was now selective summarization, not a recap. Token reclamation: similar. Usefulness: dramatically higher.
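A sketch of how such a pass can be wired, assuming a generic completion client; the llm parameter and the prompt wording are illustrative, not the team's code:

type Msg = { role: string; content: string };

interface CompactionBrief {
  sources_seen: string[];
  facts: string[];
  hypotheses_open: string[];
  hypotheses_closed: string[];
  next_action: string;
}

async function compactOlderSteps(
  olderSteps: Msg[],
  llm: (system: string, user: string) => Promise<string>,
): Promise<Msg> {
  const system =
    "Reduce this transcript to JSON with keys sources_seen, facts (with " +
    "citations), hypotheses_open, hypotheses_closed, next_action. Keep only " +
    "what is needed to decide the next step.";
  const transcript = olderSteps.map((m) => `${m.role}: ${m.content}`).join("\n");
  const brief: CompactionBrief = JSON.parse(await llm(system, transcript));
  // One structured message replaces the whole span of older steps.
  return { role: "user", content: JSON.stringify(brief) };
}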
Problem 3 — Retries weren't idempotent
The pricing-data tool occasionally got 429s mid-run. The retry policy retried, and the audit log started showing double-pulls. Two fixes, stacked: backoff-with-jitter retries plus a client-side idempotency key:
import { randomUUID } from "node:crypto";

async function pullPrice(args: PullArgs) {
  // One ID per logical call, generated before the retry loop, so every
  // retry carries the same request_id and the API can deduplicate.
  const requestId = randomUUID();
  return retryWithBackoff(
    () => pricingApi.post("/v1/quote", { ...args, request_id: requestId }),
    { codes: [429, 500, 502, 503, 504], maxRetries: 4, jitter: true },
  );
}

The pricing API uses request_id to detect duplicates and short-circuit. The double-pull problem disappeared in a single deploy.
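retryWithBackoff itself isn't shown above; a minimal sketch, assuming the HTTP client surfaces the status code as err.status:

type RetryOpts = { codes: number[]; maxRetries: number; jitter: boolean };

async function retryWithBackoff<T>(fn: () => Promise<T>, opts: RetryOpts): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Give up on non-transient codes, or once the retry cap is hit.
      if (attempt >= opts.maxRetries || !opts.codes.includes(err?.status)) throw err;
      // Exponential backoff (0.5s, 1s, 2s, ...) with optional full jitter.
      const base = 500 * 2 ** attempt;
      await new Promise((r) => setTimeout(r, opts.jitter ? Math.random() * base : base));
    }
  }
}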
Problem 4 — A prompt change quietly tanked tail latency
A "small wording tweak" on Monday morning bumped p99 latency by 1.6 seconds and unit cost by 21%; it had edited the shared prompt prefix, silently killing cache hits. Nobody noticed for nearly a week. The observability pass that followed added five required metrics per call:
- model
- input/output tokens
- latency (p50, p95, p99)
- cache hit rate
- tool-call counts and error rates
A dashboard cross-references these against prompt revisions. The next time someone broke the cache prefix, the dashboard caught it in 40 minutes.
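A sketch of the per-call record; the field names, and the promptRevision tag that enables the cross-reference, are assumptions about the schema:

interface CallMetrics {
  model: string;
  promptRevision: string; // what lets the dashboard tie regressions to prompt changes
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;      // aggregated downstream into p50/p95/p99
  cacheHit: boolean;      // aggregated into a hit rate
  toolCalls: number;
  toolErrors: number;
}

function emitMetrics(m: CallMetrics): void {
  // Stand-in sink; production code would write to a metrics pipeline.
  console.log(JSON.stringify(m));
}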
Problem 5 — Outbound briefs sometimes leaked internal details
Internal research notes occasionally surfaced in the customer-facing brief — names of clients, draft commentary not meant for circulation. The team added a deterministic post-call filter:
function redactOutbound(text: string): string {
  return text
    // Mask client names captured on "Client: <name>" lines.
    .replace(/Client: [^\n]+/g, "Client: [redacted]")
    // Strip internal draft commentary wholesale.
    .replace(/<draft>[\s\S]*?<\/draft>/g, "");
}

Pure regex, ~3ms in the hot path. No LLM-based judging in production; heavier judge models run in an async review queue instead.
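On a made-up input:

const draft = "Client: Meridian Fund\n<draft>internal only</draft>Margins look stable.";
redactOutbound(draft); // => "Client: [redacted]\nMargins look stable."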
Problem 6 — A model upgrade went sideways
When Sonnet 4.6 shipped, the team wanted to flip everything. They didn't. The rollout was: shadow for a week (run both, compare on a golden set), 1% canary for a week, 10% ramp, 100%. A regression appeared at 10% (specific financial-language phrasing got worse); they paused, fixed the system prompt that was over-constraining the model, and ramped again. Total elapsed: three weeks. Zero customer-facing surprises.
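A sketch of what that staged routing can look like; the stage names, bucketing, and "old"/"new" model labels are placeholders, not the team's infrastructure:

type Stage = "shadow" | "canary_1pct" | "ramp_10pct" | "full" | "killed";

function routeModel(stage: Stage, bucket: number /* 0-99, stable per request */): "old" | "new" {
  switch (stage) {
    case "killed":      return "old"; // kill switch: everything back on the old model
    case "full":        return "new";
    case "canary_1pct": return bucket < 1 ? "new" : "old";
    case "ramp_10pct":  return bucket < 10 ? "new" : "old";
    case "shadow":      return "old"; // new model runs off-path against the golden set
  }
}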
What stuck
- Token budgets are not aspirational. Enforce them and act on the crossing point.
- Compaction is selective summarization — keep what the agent needs to *decide*.
- Idempotency at the tool layer makes retries safe.
- Observability is the only way to catch quiet regressions before they compound.
- Guardrails fast and deterministic; LLM judges async, not on the hot path.
- Shadow → 1% → 10% → 100% with gates at each step. Always have a kill switch.
