Lesson 7 · 7 min

Evaluating Agent Quality

Golden tasks, success rates, and step-level traces.

Evaluate agents on *task success rate*, not per-call accuracy. Keep a golden set of 20–50 representative tasks. Run them on every prompt change. Capture full traces (tool calls + LLM messages) so regressions are debuggable.
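As a minimal sketch of what such a harness could look like: the names below (`GoldenTask`, `Trace`, `run_golden_set`, `fake_agent`) are hypothetical and not taken from any particular framework, and the stand-in agent only exists so the example runs. The point it illustrates is that success is judged once per task, on the full trace, rather than per call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Everything one task run produced, kept so regressions are debuggable."""
    task_id: str
    steps: list = field(default_factory=list)  # tool calls and LLM messages, in order
    success: bool = False


@dataclass
class GoldenTask:
    task_id: str
    prompt: str
    check: Callable[[Trace], bool]  # True if the final outcome is acceptable


def run_golden_set(tasks, run_agent):
    """Run every golden task and return (success_rate, traces)."""
    traces = []
    for task in tasks:
        trace = Trace(task_id=task.task_id)
        run_agent(task.prompt, trace)      # the agent appends its steps to the trace
        trace.success = task.check(trace)  # judge the whole task, not each call
        traces.append(trace)
    rate = sum(t.success for t in traces) / len(traces) if traces else 0.0
    return rate, traces


# Hypothetical stand-in agent: a real one would call an LLM and real tools.
def fake_agent(prompt, trace):
    trace.steps.append({"type": "llm_message", "content": f"plan for: {prompt}"})
    if "renewal" in prompt:
        trace.steps.append(
            {"type": "tool_call", "name": "offer_discount", "args": {"tier": "B"}}
        )


def offered_discount(trace):
    """Hypothetical success check: did the agent ever call offer_discount?"""
    return any(step.get("name") == "offer_discount" for step in trace.steps)


tasks = [
    GoldenTask("t1", "handle renewal for account 1", offered_discount),
    GoldenTask("t2", "handle churn risk for account 2", offered_discount),
]
rate, traces = run_golden_set(tasks, fake_agent)
print(f"success rate: {rate:.0%}")  # prints "success rate: 50%"
```

Because each `Trace` keeps the ordered steps, a failed task can be inspected after the run without re-executing the agent.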

Production scenario

Real-world example: Customer-success copilot at a 200-seat SaaS company

The copilot helps account managers handle renewal conversations. Quality regressions are expensive — a bad prompt change can tank renewal close rates. The team maintains a golden set of 200 real conversations with labeled "good outcome" actions (offered the right discount tier, escalated, ran the right playbook).

On every prompt PR:

golden-eval --prompt-rev abc123 --tasks 200
  → success rate: 81% (was 79%)
  → cost / task: $0.024 (was $0.022)
  → top regression: 4 tasks where the agent now skips the "request feedback" step

Prompt changes that drop success below 80% block the merge.
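One way such a merge gate might be implemented is sketched below. It assumes each eval run is reduced to a mapping from task ID to pass/fail; the function names `gate` and `regressions` and the sample data are illustrative only, not the output or internals of the `golden-eval` tool shown above.

```python
def regressions(baseline, candidate):
    """Task IDs that passed in the baseline run but fail in the candidate run."""
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


def gate(candidate, threshold=0.80):
    """Return (passed, success_rate) for a candidate run against a threshold."""
    if not candidate:
        raise ValueError("candidate run contains no tasks")
    rate = sum(candidate.values()) / len(candidate)
    return rate >= threshold, rate


# Illustrative results only: task ID -> did the task succeed?
baseline = {"t1": True, "t2": True, "t3": False, "t4": True, "t5": True}
candidate = {"t1": True, "t2": False, "t3": True, "t4": True, "t5": True}

passed, rate = gate(candidate)
print(passed, rate, regressions(baseline, candidate))  # True 0.8 ['t2']
```

In a CI job, the script would typically exit with a non-zero status when `passed` is false so that the merge is blocked. Listing regressed task IDs alongside the rate matters because an unchanged or improved overall rate can still hide individual tasks that newly fail.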

Why this matters: per-call accuracy can be high while *task* success quietly tanks. Track task success on a representative golden set — that's the metric users feel.
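A rough back-of-the-envelope calculation shows how this can happen. It assumes, as a simplification, that a task succeeds only if every call in it is correct and that errors are independent across calls; real agents can recover from some mistakes, so treat the numbers as illustrative rather than predictive.

```python
def task_success(per_call_accuracy: float, calls_per_task: int) -> float:
    """Task success rate if every call must be correct and errors are independent."""
    return per_call_accuracy ** calls_per_task


# 95% per-call accuracy, for tasks of increasing length.
for calls in (1, 5, 10, 20):
    print(calls, round(task_success(0.95, calls), 3))
```

Under these assumptions, 95% per-call accuracy yields roughly 60% task success for a 10-call task and about 36% for a 20-call task, which is why the per-call number alone can look healthy while users see frequent failures.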

Knowledge points in this lesson

Measure task success on a golden set
Golden set is 20–50 representative tasks
Per-call accuracy can hide task failure
Run evals on every prompt change
Capture full traces for debugging
Sample human review for quality signal

Quick check

Agentic Architecture · Select one

Why is parallel fan-out a poor fit when subtasks share heavy context?