Lesson 7 · 7 min
Evaluating Agent Quality
Golden tasks, success rates, and step-level traces.
Evaluate agents on *task success rate*, not per-call accuracy. Keep a golden set of 20–50 representative tasks. Run them on every prompt change. Capture full traces (tool calls + LLM messages) so regressions are debuggable.
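The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference harness: the names `GoldenTask`, `TraceStep`, `TaskResult`, `run_golden_set`, `success_rate`, and `regressions` are hypothetical, and it assumes the agent is a callable that returns a final answer together with its trace. `toy_agent` is a stand-in so the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TraceStep:
    kind: str     # "llm_message" or "tool_call"
    content: str  # raw text of the message or tool invocation


@dataclass
class GoldenTask:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the final answer counts as success


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    trace: list[TraceStep] = field(default_factory=list)


# Assumed agent interface: prompt in, (final_answer, trace) out.
Agent = Callable[[str], tuple[str, list[TraceStep]]]


def run_golden_set(agent: Agent, tasks: list[GoldenTask]) -> list[TaskResult]:
    """Run every golden task once and keep the full trace for each."""
    results = []
    for task in tasks:
        answer, trace = agent(task.prompt)
        results.append(TaskResult(task.task_id, task.check(answer), trace))
    return results


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that passed end to end (not per-call accuracy)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def regressions(baseline: list[TaskResult], candidate: list[TaskResult]) -> list[TaskResult]:
    """Tasks that passed in the baseline run but fail in the candidate run."""
    passed_before = {r.task_id for r in baseline if r.passed}
    return [r for r in candidate if not r.passed and r.task_id in passed_before]


def toy_agent(prompt: str) -> tuple[str, list[TraceStep]]:
    # Stand-in agent: only handles one arithmetic prompt via a fake tool call.
    trace = [TraceStep("llm_message", f"user: {prompt}")]
    if prompt == "2+2":
        trace.append(TraceStep("tool_call", "calculator('2+2') -> 4"))
        return "4", trace
    return "unknown", trace


tasks = [
    GoldenTask("add", "2+2", lambda a: a == "4"),
    GoldenTask("capital", "capital of France", lambda a: a == "Paris"),
]
results = run_golden_set(toy_agent, tasks)
print(success_rate(results))  # 0.5
```

In practice the task list would hold the 20–50 representative tasks, and the results of the last known-good prompt would be stored as the baseline. Passing both runs to `regressions` returns the newly failing tasks with their traces attached, which is where debugging starts.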
Quick check
Agentic Architecture · Select one
Why is parallel fan-out a poor fit when subtasks share heavy context?
