Claude Certification
Agentic Architecture & Orchestration
Lesson 7 · 7 min

Evaluating Agent Quality

Golden tasks, success rates, and step-level traces.

Evaluate agents on *task success rate*, not per-call accuracy. Keep a golden set of 20–50 representative tasks. Run them on every prompt change. Capture full traces (tool calls + LLM messages) so regressions are debuggable.
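The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `GoldenTask` and `Trace` types, the `(steps, final_action)` agent interface, and the stub agent are all hypothetical names introduced here, and "success" is simplified to an exact match on a labeled action.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GoldenTask:
    task_id: str
    prompt: str
    expected_action: str  # the labeled "good outcome" for this task


@dataclass
class Trace:
    task_id: str
    steps: list[dict] = field(default_factory=list)  # tool calls + LLM messages, in order
    final_action: str = ""
    success: bool = False


# Hypothetical agent interface: takes a prompt, returns (steps, final_action).
Agent = Callable[[str], tuple[list[dict], str]]


def run_golden_set(tasks: list[GoldenTask], agent: Agent) -> tuple[float, list[Trace]]:
    """Run every golden task once; return the task success rate and the full traces."""
    traces = []
    for task in tasks:
        steps, action = agent(task.prompt)
        traces.append(Trace(task.task_id, steps, action, action == task.expected_action))
    success_rate = sum(t.success for t in traces) / len(traces)
    return success_rate, traces


def stub_agent(prompt: str) -> tuple[list[dict], str]:
    """Stand-in for a real agent (assumption): it always escalates."""
    steps = [
        {"type": "llm_message", "content": f"Handling: {prompt}"},
        {"type": "tool_call", "name": "lookup_account", "args": {}},
    ]
    return steps, "escalate"


tasks = [
    GoldenTask("t1", "Customer threatens to churn", "escalate"),
    GoldenTask("t2", "Customer asks about renewal pricing", "offer_discount"),
]
success_rate, traces = run_golden_set(tasks, stub_agent)
failed = [t.task_id for t in traces if not t.success]
print(f"success rate: {success_rate:.0%}, failed tasks: {failed}")
```

Because every `Trace` keeps its ordered steps, a failed task can be inspected call by call rather than only counted.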

Production scenario

Real-world example: Customer-success copilot at a 200-seat SaaS company

The copilot helps account managers handle renewal conversations. Quality regressions are expensive — a bad prompt change can tank renewal close rates. The team maintains a golden set of 200 real conversations with labeled "good outcome" actions (offered the right discount tier, escalated, ran the right playbook).

On every prompt PR:

golden-eval --prompt-rev abc123 --tasks 200
  → success rate: 81% (was 79%)
  → cost / task: $0.024 (was $0.022)
  → top regression: 4 tasks where the agent now skips the "request feedback" step

A prompt change that drops the success rate below 80% blocks the merge.
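One way such a merge gate might look, assuming each eval run is reduced to a mapping from task ID to pass/fail. The function names, the task IDs, and the ten-task example data are illustrative assumptions; only the 80% threshold comes from the scenario.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of golden tasks that succeeded."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Tasks that passed on the baseline prompt but fail on the candidate prompt."""
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


def merge_allowed(candidate: dict[str, bool], threshold: float = 0.80) -> bool:
    """Block the merge when the candidate's success rate falls below the threshold."""
    return pass_rate(candidate) >= threshold


# Illustrative data (assumption): 10 tasks, 8 passing on the baseline prompt.
baseline = {f"task-{i}": i < 8 for i in range(10)}
# The candidate prompt breaks one task that used to pass.
candidate = {**baseline, "task-3": False}

regressions = find_regressions(baseline, candidate)
allowed = merge_allowed(candidate)
print(f"success rate: {pass_rate(candidate):.0%} (was {pass_rate(baseline):.0%})")
print(f"regressions: {regressions}, merge allowed: {allowed}")
```

Reporting the regressed task IDs alongside the rate points reviewers at the traces worth reading first.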

Why this matters: per-call accuracy can be high while *task* success quietly tanks. Track task success on a representative golden set — that's the metric users feel.
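A rough way to see the gap: if a task needs several calls and every one must be right, per-call errors compound. The sketch below assumes calls fail independently and that any single failure fails the task. That is a simplification made here for illustration, not a claim from the lesson, and real agents can sometimes recover from a bad step.

```python
def task_success_estimate(per_call_accuracy: float, num_calls: int) -> float:
    """Estimated task success when every call must succeed (independence assumed)."""
    return per_call_accuracy ** num_calls


for n in (1, 5, 10, 20):
    estimate = task_success_estimate(0.95, n)
    print(f"{n:>2} calls at 95% per-call accuracy -> {estimate:.0%} task success")
```

Under these assumptions, a 10-call task at 95% per-call accuracy succeeds only about 60% of the time, which is why the task-level number is the one to track.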

Knowledge points in this lesson
  • Measure task success on a golden set
  • Golden set is 20–50 representative tasks
  • Per-call accuracy can hide task failure
  • Run evals on every prompt change
  • Capture full traces for debugging
  • Sample human review for quality signal