Evaluating Agent Quality
Golden tasks, success rates, and step-level traces.
Evaluate agents on *task success rate*, not per-call accuracy. Keep a golden set of 20–50 representative tasks. Run them on every prompt change. Capture full traces (tool calls + LLM messages) so regressions are debuggable.
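As a rough illustration of that loop, here is a minimal sketch of a golden-set runner that scores whole tasks and keeps a per-task trace. Everything in it is hypothetical and chosen for illustration: `GoldenTask`, `Trace`, `run_golden_set`, and the toy agent are not from any particular framework, and the sketch assumes the agent accepts a callback for recording each step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Everything needed to debug one task run (hypothetical structure)."""
    task_id: str
    steps: list = field(default_factory=list)  # tool calls and LLM messages, in order
    success: bool = False


@dataclass
class GoldenTask:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the final output counts as a success


def run_golden_set(agent, tasks):
    """Run every golden task once and return (success_rate, traces)."""
    traces = []
    for task in tasks:
        trace = Trace(task_id=task.task_id)
        # Assumption: the agent reports each step through the callback it is given.
        output = agent(task.prompt, trace.steps.append)
        trace.success = task.check(output)
        traces.append(trace)
    rate = sum(t.success for t in traces) / len(traces) if traces else 0.0
    return rate, traces


def toy_agent(prompt, record):
    """Stand-in agent: records one LLM message and one tool call, then answers."""
    record({"type": "llm_message", "content": prompt})
    record({"type": "tool_call", "name": "lookup", "args": {"q": prompt}})
    return prompt.upper()


tasks = [
    GoldenTask("t1", "renew", lambda out: out == "RENEW"),
    GoldenTask("t2", "escalate", lambda out: out == "escalated"),
]
rate, traces = run_golden_set(toy_agent, tasks)
failed = [t.task_id for t in traces if not t.success]
```

The score is computed per task, not per call, and each failing task keeps its full step list, so a regression can be traced to the step where the run went wrong.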
Real-world example: Customer-success copilot at a 200-seat SaaS
The copilot helps account managers handle renewal conversations. Quality regressions are expensive — a bad prompt change can tank renewal close rates. The team maintains a golden set of 200 real conversations with labeled "good outcome" actions (offered the right discount tier, escalated, ran the right playbook).
On every prompt PR:
golden-eval --prompt-rev abc123 --tasks 200
→ success rate: 81% (was 79%)
→ cost / task: $0.024 (was $0.022)
→ top regression: 4 tasks where the agent now skips the "request feedback" step
Prompt changes that drop success below 80% block the merge.
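The merge rule can be sketched as a small gate function. This is an assumption about how such a check might look, not the team's actual tooling: `evaluate_gate` and the per-task result dictionaries are hypothetical, and only the 80% threshold comes from the text.

```python
def evaluate_gate(current, baseline, threshold=0.80):
    """Compare per-task results (task_id -> bool) against a baseline run.

    Blocks the merge when the current success rate falls below `threshold`,
    and lists tasks that passed in the baseline but fail now.
    """
    rate = sum(current.values()) / len(current)
    base_rate = sum(baseline.values()) / len(baseline)
    regressions = sorted(
        task_id
        for task_id, ok in current.items()
        if not ok and baseline.get(task_id, False)
    )
    return {
        "success_rate": rate,
        "baseline_rate": base_rate,
        "regressions": regressions,
        "blocked": rate < threshold,
    }


# Illustrative data: 10 tasks, 8 passing in the baseline, one new failure.
baseline = {f"t{i}": i < 8 for i in range(10)}
current = dict(baseline)
current["t3"] = False

report = evaluate_gate(current, baseline)
```

Reporting the regressed task IDs next to the rate points reviewers at the traces worth opening first.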
Why this matters: per-call accuracy can be high while *task* success quietly tanks. Track task success on a representative golden set — that's the metric users feel.
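A back-of-the-envelope calculation shows how the gap opens up. The sketch below assumes a simplified model in which a task succeeds only if every call succeeds and calls fail independently. Real agents can recover from some errors, so treat the numbers as an illustration, not a prediction.

```python
def task_success(per_call_accuracy, n_calls):
    """Task success under the simplifying assumption that all calls must
    succeed and that calls fail independently of one another."""
    return per_call_accuracy ** n_calls


for n in (1, 5, 10, 20):
    print(n, round(task_success(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under this model, 95% per-call accuracy over a 10-call task gives roughly 60% task success, which is why a per-call number alone can look healthy while users see frequent failures.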
- Measure task success on a golden set
- Golden set is 20–50 representative tasks
- Per-call accuracy can hide task failure
- Run evals on every prompt change
- Capture full traces for debugging
- Sample traces for human review as an additional quality signal
