Week 1 — Pick the shape
The first instinct was "one agent that does everything." We walked through a renewal call with an account manager and counted the branches: pull the account, check usage trends, surface health-score outliers, draft talking points, check the contract terms, decide whether to offer a discount. Some steps never branch on runtime data — pulling the account, fetching usage — those are a workflow. The talking-points draft and the discount-tier decision *do* branch on what the AM hears live. Those are agent territory.
We pinned the deterministic part:
```ts
async function prepRenewal(accountId: string) {
  const account = await crm.getAccount(accountId);
  const usage = await analytics.getUsageTrend(account.id, "90d");
  const contract = await billing.getContract(account.id);
  return { account, usage, contract };
}
```

Then we gave the agent a tool registry — `request_discount_approval`, `book_followup`, `escalate_to_engineering`, `draft_renewal_email` — and let it route from a goal.
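For concreteness, the registry was roughly this shape. A sketch: the tool names come from our system, but `ToolDef` and the stub handlers are illustrative, standing in for real CRM, calendar, and ticketing calls.

```ts
// Sketch only: handler bodies are stubs; the real ones call out to services.
type ToolDef = {
  description: string;
  handler: (args: Record<string, unknown>) => Promise<unknown>;
};

const toolRegistry: Record<string, ToolDef> = {
  request_discount_approval: {
    description: "Route a proposed discount tier to a human approver.",
    handler: async (args) => ({ status: "pending_approval", ...args }),
  },
  book_followup: {
    description: "Put a follow-up call on the AM's calendar.",
    handler: async (args) => ({ status: "booked", ...args }),
  },
  escalate_to_engineering: {
    description: "Open an engineering ticket for a product blocker.",
    handler: async (args) => ({ status: "escalated", ...args }),
  },
  draft_renewal_email: {
    description: "Draft a renewal email for review; never sends directly.",
    handler: async (args) => ({ status: "drafted", ...args }),
  },
};
```

The agent sees only the descriptions and a goal; picking the tool is its whole job.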
Week 2 — Decompose with orchestrator-worker
The first version blew through 60K tokens on every call because it tried to summarize 90 days of usage data inline. We refactored: the orchestrator dispatches to specialized workers — usage-summarizer, risk-classifier, talking-points-writer, contract-clause-extractor — each running on a tight, scoped prompt with only the data it needs. The orchestrator stitches their JSON outputs into a renewal brief.
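A compressed sketch of the fan-out, assuming a generic `llm.complete(system, input)` client; the real prompts and JSON schemas are elided:

```ts
// Stand-ins: a generic LLM client and the scoped worker prompts.
declare const llm: { complete(system: string, input: unknown): Promise<unknown> };
declare const USAGE_SUMMARIZER_PROMPT: string;
declare const RISK_CLASSIFIER_PROMPT: string;
declare const CLAUSE_EXTRACTOR_PROMPT: string;
declare const TALKING_POINTS_PROMPT: string;

async function buildRenewalBrief(prep: {
  account: unknown;
  usage: unknown;
  contract: unknown;
}) {
  // Workers run in parallel, each seeing only its slice of the prep data.
  const [usageSummary, risk, clauses] = await Promise.all([
    llm.complete(USAGE_SUMMARIZER_PROMPT, prep.usage),
    llm.complete(RISK_CLASSIFIER_PROMPT, prep.usage),
    llm.complete(CLAUSE_EXTRACTOR_PROMPT, prep.contract),
  ]);
  // The writer sees the workers' compact JSON, never the raw 90-day data.
  const talkingPoints = await llm.complete(TALKING_POINTS_PROMPT, {
    usageSummary,
    risk,
    clauses,
  });
  return { usageSummary, risk, clauses, talkingPoints };
}
```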
| Approach                  | Time | Cost (Sonnet) | Output quality    |
| ------------------------- | ---- | ------------- | ----------------- |
| Single agent              | 22s  | $0.31         | 71% on golden set |
| Orchestrator + 4 workers  | 7s   | $0.18         | 79% on golden set |
Week 3 — Stop one bad call from costing $50K
Two cases convinced us we needed humans in the loop. First, the agent suggested a 25% discount to retain a customer who could have accepted 5%. Second, it drafted an outbound email and we caught it five seconds before it auto-sent. We rebuilt: anything that moves money or emails a customer has to call `request_human_approval` first.
```ts
tools.request_human_approval({
  action: "send_email",
  to: "vp.ops@bigprospect.com",
  subject: "Following up on yesterday's call",
  body: draft,
  expires_in_sec: 600,
});
```

The AM gets a one-click approve in their UI. The action only fires after they click.
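On the service side, the gate is roughly this shape. A sketch: `pending`, `requestHumanApproval`, and `approve` are illustrative names, not our production API.

```ts
// Sketch of the gate: the action is parked, not executed, until someone
// clicks approve in the UI; unapproved actions expire and never fire.
type PendingAction = {
  id: string;
  payload: Record<string, unknown>;
  fire: () => Promise<void>;
};

const pending = new Map<string, PendingAction & { expiresAt: number }>();

function requestHumanApproval(action: PendingAction, ttlSec: number) {
  pending.set(action.id, { ...action, expiresAt: Date.now() + ttlSec * 1000 });
  // ...push an approve/reject card to the AM's UI here
}

async function approve(id: string) {
  const action = pending.get(id);
  if (!action) throw new Error("unknown or already-handled action");
  pending.delete(id);
  if (Date.now() > action.expiresAt) throw new Error("approval window expired");
  await action.fire(); // only now does the email actually send
}
```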
Week 4 — Multi-day cases stop blowing context
Renewals span weeks. The agent kept losing track of what had been tried two calls ago. We persisted a tight state vector:
```json
{
  "account_id": "A-1042",
  "stage": "negotiation",
  "decisions": ["offered 5% multi-year discount on 2026-04-15"],
  "blocked_on": "customer's procurement review",
  "next_action": "follow up Mon if no reply by Thu"
}
```

Each new turn loads this — usually 200–400 tokens — plus the latest user message. Multi-day continuity without bloat.
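Loading and persisting it each turn is a few lines. A sketch, assuming a hypothetical `kv` store keyed by account and an `agent.run` entry point:

```ts
type RenewalState = {
  account_id: string;
  stage: string;
  decisions: string[];
  blocked_on: string | null;
  next_action: string;
};

// Stand-ins: any KV store and agent runner with these shapes will do.
declare const kv: {
  get(key: string): Promise<RenewalState | null>;
  set(key: string, value: RenewalState): Promise<void>;
};
declare const agent: {
  run(input: { state: RenewalState; userMessage: string }): Promise<{
    reply: string;
    nextState: RenewalState;
  }>;
};

async function runTurn(accountId: string, userMessage: string) {
  const state = (await kv.get(`renewal:${accountId}`)) ?? {
    account_id: accountId,
    stage: "prep",
    decisions: [],
    blocked_on: null,
    next_action: "book intro call",
  };
  // The whole context is the state vector plus the new message, not a transcript.
  const { reply, nextState } = await agent.run({ state, userMessage });
  await kv.set(`renewal:${accountId}`, nextState);
  return reply;
}
```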
Week 5 — Build the evaluator
A junior PM tagged 200 historical renewal calls with the "right" actions (offer this discount, escalate to engineering, etc.). The eval harness runs the agent against those 200 tasks every time the prompt changes; PRs that drop success below 80% are blocked from merging. The first time the eval blocked a PR (a "small wording tweak" that dropped success to 73%), we realized this was the most valuable infrastructure we'd built.
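The harness itself stayed small. A sketch, assuming each golden case pairs a call transcript with the PM-tagged action:

```ts
// Stand-in: the agent under the candidate prompt.
declare const agent: { decide(transcript: string): Promise<{ action: string }> };

type GoldenCase = { transcript: string; expectedAction: string };

async function runGoldenSet(cases: GoldenCase[]): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const { action } = await agent.decide(c.transcript);
    if (action === c.expectedAction) passed++;
  }
  return passed / cases.length;
}

// CI gate: any prompt change that drops the rate below the floor blocks merge.
export async function ciGate(cases: GoldenCase[]) {
  const rate = await runGoldenSet(cases);
  if (rate < 0.8) {
    throw new Error(`golden-set success ${(rate * 100).toFixed(0)}% < 80%; blocking merge`);
  }
}
```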
Week 6 — Add an evaluator-optimizer loop on the highest-stakes step
The discount-tier recommendation is the highest-leverage step in the whole flow. We wrapped it in a three-pass loop: generator proposes, evaluator scores against a six-criterion rubric (margin impact, retention-probability lift, precedent risk, etc.), optimizer revises if any criterion < 4/5. Capped at three iterations. Quality jumped, cost rose ~25% on that step alone, and we accepted the tradeoff because that one call moves real money.
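In code the loop is the least interesting part. A sketch with stand-in `llm.generate`/`score`/`revise` calls; the 4/5 threshold and three-iteration cap come from the setup above:

```ts
// Stand-ins: generator, rubric scorer, and reviser are separate LLM calls.
declare const llm: {
  generate(input: { task: string; brief: unknown }): Promise<string>;
  score(input: { proposal: string; rubric: readonly string[] }): Promise<Record<string, number>>;
  revise(input: { proposal: string; feedback: [string, number][] }): Promise<string>;
};

const RUBRIC = [
  "margin_impact",
  "retention_probability_lift",
  "precedent_risk",
  // ...plus the other three criteria from the rubric
] as const;

async function recommendDiscountTier(brief: unknown): Promise<string> {
  let proposal = await llm.generate({ task: "discount_tier", brief });
  for (let pass = 0; pass < 3; pass++) {
    const scores = await llm.score({ proposal, rubric: RUBRIC }); // 1-5 per criterion
    const weak = Object.entries(scores).filter(([, s]) => s < 4);
    if (weak.length === 0) break; // every criterion at 4/5 or better
    proposal = await llm.revise({ proposal, feedback: weak });
  }
  return proposal;
}
```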
What stuck
- Wrap the deterministic 80% in a workflow. Agent-ify only the steps that need LLM judgment at runtime.
- Orchestrator-worker for fan-out when sub-prompts don't share heavy context.
- Persist state vectors, not transcripts.
- Permission tools, not prose-level "are you sure?"
- Golden-set eval is non-optional once you're shipping to production.
- Evaluator-optimizer earns its cost on the *highest-stakes* steps, not everywhere.
