Sprint 1 — Stop hammering one prompt
PrintWize was running every inbound custom-order email through a single 2,400-token prompt that tried to extract artwork metadata, sizing, shipping, and payment terms all at once. Accuracy was 67% and the engineers were tired. We split it: a deterministic intake (parse email + attachments, normalize addresses, OCR the artwork) followed by four specialized extractors running in parallel under an orchestrator.
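The shape of that split can be sketched in a few lines. This is a minimal sketch, not PrintWize's actual code: the extractor stubs and field names here are placeholders for what would be focused LLM calls in production, and the intake is assumed to have already produced a normalized `order` dict.

```python
import asyncio

# Hypothetical extractor stubs — in production each wraps one focused LLM call.
async def extract_artwork(order: dict) -> dict:
    return {"dimensions": order.get("art_dims"), "color_profile": "CMYK"}

async def extract_sizing(order: dict) -> dict:
    return {"sku": order.get("sku"), "qty": order.get("qty")}

async def extract_shipping(order: dict) -> dict:
    return {"address": order.get("address"), "service_level": "ground"}

async def extract_payment_terms(order: dict) -> dict:
    return {"terms": "net30", "po": order.get("po")}

async def orchestrate(order: dict) -> dict:
    # Deterministic intake has already parsed and normalized `order`.
    # The four extractors are independent, so they run concurrently.
    artwork, sizing, shipping, payment = await asyncio.gather(
        extract_artwork(order),
        extract_sizing(order),
        extract_shipping(order),
        extract_payment_terms(order),
    )
    return {"artwork": artwork, "sizing": sizing,
            "shipping": shipping, "payment": payment}

result = asyncio.run(orchestrate({"sku": "TEE-XL", "qty": 50}))
```

The point is the structure, not the stubs: each worker sees a small, single-purpose prompt instead of one 2,400-token everything-prompt.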
Sprint 2 — Tool design pass
Each extractor was a tool the orchestrator could call. We rewrote the schemas: extract_artwork (returns dimensions + color profile + transparency flags), extract_sizing (returns SKU + qty), extract_shipping (returns address + service level), extract_payment_terms (returns terms + PO if any). Flat schemas, enums on the bounded fields, error contracts with a hint field.
Sprint 3 — Eval + roll out
A junior PM labeled 150 historical orders. We built a tiny eval harness and ran it on every prompt change. Two prompt regressions caught in week 1 alone. Shadow → 10% → 100% over the third sprint.
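A harness like that fits in a page. This sketch assumes labeled cases are dicts with the raw email and gold field values; `run_extractors` is a hypothetical stand-in for calling the real orchestrator, and the 0.9 gate threshold is made up for illustration.

```python
# Minimal eval harness: per-field exact-match accuracy over labeled orders,
# with a gate that fails the run (e.g. in CI) below a threshold.
def run_extractors(email_text: str) -> dict:
    # Hypothetical stand-in; the real version calls the orchestrator.
    return {"sku": "TEE-XL", "qty": 50}

def evaluate(labeled: list[dict], threshold: float = 0.9) -> dict:
    field_hits: dict[str, int] = {}
    field_total: dict[str, int] = {}
    for case in labeled:
        predicted = run_extractors(case["email"])
        for field, gold in case["labels"].items():
            field_total[field] = field_total.get(field, 0) + 1
            if predicted.get(field) == gold:
                field_hits[field] = field_hits.get(field, 0) + 1
    per_field = {f: field_hits.get(f, 0) / n for f, n in field_total.items()}
    overall = sum(field_hits.get(f, 0) for f in field_total) / sum(field_total.values())
    return {"per_field": per_field, "overall": overall,
            "passed": overall >= threshold}

report = evaluate([{"email": "...", "labels": {"sku": "TEE-XL", "qty": 50}}])
```

Per-field scores matter more than the overall number: they tell you which extractor a prompt change broke, which is how the two week-1 regressions were pinned down quickly.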
What stuck
Going from "one big prompt" to orchestrator + workers was the single biggest unlock. Schema discipline closed most of the remaining error rate. Eval-gated PRs kept it from regressing.
