Deep Curriculum

The long-form edition.

Each chapter is roughly four to six minutes of spoken prose — a hook, the concept, a production walkthrough, the edge cases, and a one-sentence bridge to the next lesson. Built for repeated focused listening rather than a single brisk pass.

New chapters are being added domain by domain; the player works at any size and will grow as we ship.

Now playing · Chapter 1 of 38

Welcome

Long-form orientation

Approx total runtime: 2h 22m · Voice can be changed in the header

Chapter list

Transcripts

The same words you hear, on the page. Skim, search, or quote — the audio and the transcript are kept in sync because they are the same source string.

Long-form orientation

Welcome

Welcome to the Claude Certification Deep Curriculum — the long-form audio companion to claudecert.com. This is not the quick read. The quick read is a brisk pass through the syllabus, useful when you want the shape of the exam in under an hour. The Deep Curriculum is the opposite: every lesson opens with why the concept matters in production, restates it in plain language, walks through a concrete fictional build at one of the companies that recur across the site, and closes with the edge cases that trip people up the first time they ship.

This edition is designed for repeated, focused listening. Cars, walks, the gym. You do not need to be at a screen. You do not need to take notes. The same passages you hear here exist as on-page transcripts, so when something lands and you want to bookmark it, the words are already there for you.

A note on how this edition is structured. We move through the five exam domains in the order Anthropic publishes them — Agentic Architecture, Tool Design, Migration, Cost Optimization, and Production Research. Each domain begins with an overview chapter that names the through-line of the domain and tells you what the exam tends to ask. Then the individual lessons. Then production stories — the case studies tagged for that domain, each one narrated end to end. The narration of a case study is roughly the same content you would read on the case studies page, but rewritten for the ear so that there are no code blocks, no inline links, and no acronyms left undefined the first time they appear.

A second note, on tone. We are not going to read the syllabus to you in a monotone. We are going to argue for things. We will tell you when we think a pattern is fragile in production, and we will tell you when we think it is solid. We will tell you when the exam is testing your familiarity with a concept versus your judgment about it. Where the difference matters, we will say so out loud.

Finally, a request. If you are listening to prepare for the certification exam, please also do the diagnostic on the site before you start. The diagnostic is fifteen minutes. It will tell you which two domains to listen to first, and which one you can save for the back end of your study window. Listening to all five domains in order is fine, but listening to the two that you are weakest at, twice each, before you touch the others, is faster.

That is the orientation. Take a breath. Let's begin with Domain One.

Overview · 27% of exam

Domain 1 — Agentic Architecture

The first domain on the exam is Agentic Architecture and Orchestration. It is twenty-seven percent of your final score. By weight, this is the largest single domain, and it tends to set the tone of the rest of the exam, because the design choices you learn here propagate into every other domain. Decisions about tool design, prompt engineering, context management — they all change shape depending on whether you have built a workflow or an agent, a single Claude session or an orchestrator with workers, a single-shot generator or an evaluator-optimizer loop.

The through-line of this domain is judgment about autonomy. Where in a system should the model decide what happens next? Where should the code decide? When should you let one Claude session route subtasks to other Claude sessions, and when is that just expensive theater? When does a critique-and-revise loop actually improve the output, and when does it just burn tokens? When do you fan out work into parallel calls, and when does that violate a constraint the synchronous code was quietly enforcing?

There are seven lessons in this domain. The first three are the architecture decisions: when an agent is the right shape, how to decompose work with an orchestrator and workers, and when to wrap a generator in an evaluator-optimizer loop. The middle two are about scaling and persistence: parallel fan-out when the work shards cleanly, and how to carry state across an agent's turns without dragging the whole conversation forward. The last two are about safety and quality: where human approval gates belong, and how to evaluate agent quality once you have one in production.

A pattern you will hear over and over in this domain: start simple, promote to more complex shapes only when a measured constraint makes you. Defaults beat cleverness. Let's begin.

Domain 1 · Lesson 1

Agent vs. Workflow

Here is the failure mode. A team builds a Claude agent because the framework documentation makes agents look easy. The agent has eight tools. On the happy path, it works. On the unhappy path, it sometimes calls the wrong tool, sometimes calls the right tool with the wrong arguments, sometimes loops through the same two tools waiting for a state that will never arrive. The team bolts on a ninth tool to handle that failure mode. Six weeks later they have fourteen tools and a chatbot that mostly works for the cases they tested. That is what happens when you reach for autonomy you did not need.

The distinction the exam wants you to internalize is this. An agent is a system in which the model decides, at runtime, which tool to call next based on a goal. A workflow is a system in which the order of tool calls is pinned in code. Both are valid shapes. You pick one or the other based on whether the next step depends on language the model has to read in order to decide.

Take the canonical contrast on the site. A merchant's order-status webhook does the same four steps every single time. Verify the signature, normalize the payload, update inventory, fire the confirmation email. The path never branches on the contents of the order. There is no language to read, no judgment call. That is a workflow. You write it as four lines of code.

Now picture the same merchant's customer-service chatbot. The user might ask for a refund, dispute a charge, ask for product specs, request a return label, or escalate to a human. Which tool to call next depends on what the user is asking, and the only thing that can read what the user is asking is the model. That is an agent.

The default the exam expects you to reach for is a workflow. Not because workflows are better in general — they are not — but because they are smaller, debuggable, and cheap. You promote a system from workflow to agent only when you hit a step where the next decision genuinely requires the model's judgment to route. If the next step is the same every time, no matter what the user said, that is a workflow step. Pin it.

A common edge case. People look at a workflow with a single conditional in the middle and wonder if it should be an agent because there is a branch. The answer is almost always no. Conditionals in code are not autonomy. The question is whether the branch can be expressed in code that does not need the model to pick the path. If a regular expression, a database lookup, or a one-line classifier can route the branch, the branch is workflow.

Another pitfall. Promoting a workflow to an agent because the requirements changed and there is now a new branch you have to handle. You can almost always add the branch to the workflow without changing the shape of the system. Reaching for an agent every time the requirements grow is how you end up with a fourteen-tool chatbot that is hard to reason about and harder to test.

Hold this idea steady as we move into the next lesson, because the orchestrator-worker pattern is what you reach for when even an agent gets too big. It is the next promotion. Workflow, then agent, then orchestrator with workers — only ever moving up the ladder when a measured limit forces you.

Domain 1 · Lesson 2

Orchestrator-Worker Pattern

The hook for this lesson is a context window that is full of the wrong things. A single Claude session reading a hundred-page contract, taking notes about indemnity clauses, then about payment terms, then about intellectual property, then about termination — by the time it gets to the last clause it has spent half its context on artifacts of the earlier clauses that have nothing to do with the question now in front of it. The signal-to-noise ratio collapses. The output gets worse.

The orchestrator-worker pattern is the answer. You take one Claude session and call it the orchestrator. Its only job is to read the goal, decompose it, dispatch subtasks to specialized worker sessions, and stitch the results back together. The workers are short-lived. Each worker sees only the slice of context relevant to its slice of the work.

The legal-ops example on the site is the canonical version. A team gets two hundred commercial contracts a week. Each contract has roughly sixty clauses spanning indemnity, intellectual property, payment terms, termination, governing law. A single agent reading the whole contract sequentially burns context and time. The orchestrator parses the contract into clause chunks and dispatches each chunk to a specialized worker. One worker is the indemnity classifier. Another is the payment-terms extractor. Another is the termination-clause analyzer. They run in parallel. Each one returns a structured finding. The orchestrator merges them into a risk report.

The numbers from that walk-through. A single agent took eighty seconds per contract and cost roughly forty-two cents. The orchestrator with eight workers took fourteen seconds per contract and cost thirty-one cents. Faster and cheaper. Both wins came from the same source — the workers each had a tighter context window and a tighter job, so they spent fewer tokens on each clause and got the right answer in fewer tries.

The exam wants you to know when to reach for this pattern and, just as importantly, when not to. Reach for it when the subtasks are independent — meaning one worker's output does not need to be visible to the next worker — and parallelizable. Reach for it when each subtask is well-scoped enough that you can describe what one worker does in a sentence. Avoid it when subtasks share heavy context. The marshalling cost — the orchestrator has to send the relevant slice of the contract to each worker, then receive each result, then merge — is real. You only get a win if the savings from tighter worker contexts outweigh the marshalling overhead.

A second pitfall. Treating every multi-step task as a candidate for orchestrator-worker. Most multi-step tasks are workflows or single agents. Orchestrator-worker is for the case where the work shards cleanly. If you have to invent a way to shard the work — if the workers need to coordinate, share intermediate state, or wait on each other — you are making the system worse, not better. The pattern works because the workers do not talk to each other.

A subtle one. The orchestrator is itself a model call. It has its own context budget. If you give the orchestrator the whole input and ask it to dispatch, the orchestrator reads the whole input — which was the cost you were trying to avoid. The trick is to have the orchestrator route based on the structure of the input, not the contents. The legal example chunks the contract by clause boundaries before any model sees the whole thing. The orchestrator never reads a single full contract.

Hold orchestrator-worker in your head as one specific shape. The next lesson is about a different shape — one in which a second Claude call critiques the first one's output and a third revises it. That is the evaluator-optimizer loop, and the trade-off it asks you to make is different.

Domain 1 · Lesson 3

Evaluator–Optimizer Loop

The hook is short copy that is almost good. A marketing team wires up a Claude call to generate landing-page headlines from a product brief. The first thirty headlines come back. Twenty of them are mediocre. Eight are decent. Two are quietly excellent. The team's editor reads the eight decent ones, sighs, picks one, rewrites half the words. Three weeks later they realize they have built a slightly worse version of their editor.

The evaluator-optimizer loop is the pattern that helps here. You run the generator first to produce candidates. Then you run a separate Claude call as an evaluator. The evaluator does not generate. It scores. It reads each candidate against an explicit rubric — relevance, call-to-action strength, length, brand tone, whatever the brief requires — and returns a numeric score per dimension and an overall ranking. Then a third call, the optimizer, takes the top candidates and rewrites them tighter against the rubric's weak points.

The principle behind the pattern is older than language models. It is easier to judge than to produce. Editing is faster than writing. Scoring is faster than scoring well. If you can produce many candidates cheaply and judge them against a clear rubric, the loop trades extra inference for higher quality on the final artifact.

The marketing example on the site walks through one full cycle. Brief in. Generator produces twenty headlines. Evaluator scores each on the rubric and surfaces the top three. Optimizer rewrites the top three with the evaluator's critiques in hand. Final three out. The whole loop is capped at three iterations. After three passes the marginal lift on the rubric flattens and the cost stops being worth it.

That cap is one of the most important things in this lesson. An uncapped loop will spend money forever, especially if the rubric is sloppy. A loop that is capped but the cap is high — say ten iterations — will spend most of those iterations producing nearly-identical versions of the same output and tell you it is improving because the rubric is being slightly better satisfied each time. Three is a good default. If you find yourself wanting more, the right move is almost always to fix the rubric, not raise the cap.

The pattern works for outputs that are easy to judge but hard to produce in one shot. Copy editing. Code review. Design critique. Test case generation. Anything where there is an explicit list of properties the output should have and you can ask the evaluator to check each one. The pattern works badly for outputs that are subjective in ways the rubric cannot capture. Asking an evaluator to score whether an essay is moving is a recipe for a loop that converges on something formally correct and emotionally dead.

A common pitfall. Using the same model and the same temperature for the generator and the evaluator. The two roles want different settings. The generator is producing — diversity helps, so a higher temperature is fine. The evaluator is judging — consistency matters, so a lower temperature is right, sometimes zero. Splitting them is two lines of configuration and meaningful improvement on the loop's behavior.

Another pitfall. Letting the optimizer see only the rubric and not the original brief. The optimizer's job is to tighten the candidate against the rubric in a way that still respects the brief. Drop the brief and you get rubric-maximizing prose that no one wanted to write.

The next lesson is about a different way of getting more inference for less cost — fanning the work out in parallel when the shards are independent.

Domain 1 · Lesson 4

Parallel Fan-Out

The hook is earnings season. A research desk needs an executive summary of every Standard and Poor's five hundred ten-K filing within forty-eight hours of release. Sequential summarization at thirty seconds per filing is over four hours. The analysts arrive on Monday morning to find that the Friday batch is still running. They go back to writing summaries by hand.

Parallel fan-out is the shape that fixes this kind of problem. The task decomposes into independent shards — one filing in, one summary out, no shared context — and you run all the shards at the same time instead of one after the other. With concurrency capped at twenty in-flight requests to stay inside the rate limit, the same five hundred filings come back in under twelve minutes. The numbers in this domain look ridiculous because the work is so cleanly shardable. Four hours becomes twelve minutes. That is not an optimization. That is the right shape.

The exam wants you to recognize this kind of shape from a description. The keyword you are listening for is independence. If each unit of work has no dependency on any other unit of work, fan-out is on the table. Document summaries. Image classification. Per-row data extraction from a spreadsheet. Per-customer email drafts. Per-question answers in a quiz. Per-product description rewrites for a catalog. They all share the property that no shard needs to know about any other shard.

There are two ways to fan out and the exam treats them as different choices. The first is concurrent inference using your own concurrency primitive — a Promise dot all in TypeScript, an asyncio gather in Python, a thread pool in Java. You control the concurrency limit. The work runs immediately. You handle rate-limit responses with exponential backoff. This is the right choice when latency matters — when you want the result in the next minute and you are happy to pay the cost.

The second is the Messages Batch API. You submit a batch job containing many requests. The platform runs them on its own schedule, usually within twenty-four hours, at a discounted price. This is the right choice for offline workloads where the result is needed by tomorrow morning but not by the next minute. Nightly catalog enrichment. Weekend retraining data preparation. Any pipeline where the latency budget is hours, not seconds.

A pitfall the exam likes to test. Fanning out work that is not actually independent. You have a thousand customer support tickets and you want to draft replies. Tempting to fan out one shard per ticket. But if the replies need to reference a knowledge base that the previous reply just updated, the shards are not independent — the order matters and the state is shared. Fan-out will produce one reply that contradicts another. Either keep this kind of work sequential or split it into a fan-out phase that drafts and a sequential phase that resolves conflicts.

Another pitfall. Forgetting the rate limit. Five hundred concurrent requests to a tier-two account will return four hundred and twenty-nine errors after the first thirty or so. Always set a concurrency cap. Always wire backoff. The right defaults are usually somewhere between ten and fifty in-flight requests for online traffic, depending on the tier.

A subtle one. Fanning out using one model when a tiered split would be cheaper. If half the shards are easy and the other half are hard, you do not have to use Sonnet for everything. A pre-classifier on Haiku that routes shards to the right model size can cut your fan-out cost in half without measurable quality loss.

The next lesson moves the focus inward. We have been talking about how an agent's work decomposes. Now we look at how an agent remembers, across turns, what it has already done.

Domain 1 · Lesson 5

Agent State and Memory

The hook is a forty-thousand-token context window by Wednesday afternoon. A B2B SaaS support team has wired up an agent to work cases that span days and multiple replies. By the third day on a single case, the agent is loading the entire conversation history on every turn — every message, every tool call, every tool result. The token bill is climbing. The agent's responses are slower. Worse, it is starting to make small errors that come from the noise of carrying around things it no longer needs.

The mistake here is treating the conversation as the agent's memory. It is not. The conversation is the agent's transcript. The agent's memory is what it has learned, not what it has said. Persisting a transcript-shaped memory means you are carrying around the cost of regenerating the same understanding on every turn. Persisting a learned-shaped memory means you carry exactly what the agent needs in order to decide what to do next.

The shape of a learned-shaped memory is what the site calls a state vector. It is a small structured document. For the SaaS support case, a state vector might say: case identifier C-1042, customer Acme, summary that the customer's webhook has been receiving duplicate events since the eighth of May, attempted approaches the replay tool and the deduplication index check, currently blocked on the customer sharing their consumer log, next planned action is to wait for the log and then inspect a signature mismatch theory. That is the entire memory of the case. It is under two thousand tokens. On the next turn, you load the state vector plus the latest user message, and the agent has everything it needs to act.

The principle generalizes. State is what the agent learned. Conversation is what the agent said. Persist the former, regenerate the latter. The state vector is also what survives a model upgrade, a prompt change, or a session restart. A transcript-shaped memory ties you to the exact configuration that produced it. A state-vector memory survives almost anything because it is just data.

A common pitfall. Designing the state vector after the fact. Teams build the agent first, get it working, and then try to bolt persistence on by serializing the conversation. That works for a week and breaks the moment the conversation gets long enough to matter. Design the state vector first, in the same sprint as the tool list. Ask the question: what does the next turn need to know that the previous turns figured out? That is your state.

Another pitfall. Putting too much in the state vector. The temptation is to record everything the agent has noticed, in case it might be needed later. Resist it. The state vector is meant to be the agent's working memory, not its archive. If you find yourself wanting to record long-term knowledge, that is a different system — a knowledge base, indexed and queried as a tool. Do not conflate them.

A subtle one. State vectors that include free-form text fields tend to grow without bound. A summary field that started at fifty words is six hundred by week three. The fix is the same as the fix for a long conversation — compact the field on a schedule. Every nth turn, run a one-shot summarization that rewrites the field at its original target length. Compaction is selective summarization, not a recap.

Speaking of which. The next lesson is not compaction — that lives in domain five. The next lesson is human-in-the-loop, the discipline of deciding which actions an agent is allowed to take alone and which actions require a human's hand on the trigger.

Domain 1 · Lesson 6

Human-in-the-Loop

The hook is an agent that almost sent the wrong wire transfer. Or almost emailed the wrong customer. Or almost dropped the wrong production database. The story is the same every time, and you do not want to be in the room when it happens. The lesson is about preventing it.

Human-in-the-loop is the discipline of deciding which actions an agent is allowed to take on its own and which actions require a human to approve them. The exam wants you to know two things. The first is the rule for which actions need approval. The second is how the approval is implemented in code.

The rule is simple. Any action that is irreversible, any action that is externally visible, and any action that moves money or modifies infrastructure requires a human approval. Deletes. Deploys. Outbound messages to people outside the team. Wire transfers. Tool calls that touch production data. Anything that, if done wrong, you could not undo by retrying.

The implementation is a permission tool. The agent is not allowed to take the consequential action directly. Instead, it has to call a tool that surfaces the proposed action to a human for approval. The tool returns either an approval token or a rejection. If it returns approval, the agent can proceed. If it returns rejection, the agent has to either pick a different action or stop.

The trading-desk example on the site is the canonical version. A buy-side desk uses an agent to draft foreign-exchange orders from research notes. The agent does the research, sizes the trade, builds the order ticket, and then stops. It must call a request-trader-approval tool that surfaces the ticket in the trader's interface for a one-click confirm. The trader's click is the only way the order leaves the system. The agent cannot fire the order itself. It does not have a tool that lets it.

Note what is happening at the design level. Approval is enforced in the tool surface, not in the prompt. There is no instruction in the system prompt that says please ask before placing an order. There is no tool in the agent's registry that places an order without going through approval. The agent could not place an order without approval if it tried, because the only way to place an order is the approval tool, which routes through a human first. Approval lives in code, not in prose.

This matters because models can be persuaded. A long enough conversation, the right user pressure, an edge case the prompt did not anticipate — and an agent that was instructed to ask permission will sometimes proceed without asking. An agent that has no tool to proceed without asking cannot. The system prompt is a hint. The tool registry is a constraint.

A pitfall the exam likes to test. Confusing logging with approval. A team adds a logging step before every consequential action and calls that human-in-the-loop. It is not. Logging records what happened. Approval prevents what should not happen. They are different surfaces and they are not substitutes for each other.

Another pitfall. Approval prompts that are unreadable. The trader cannot one-click approve a ticket if the ticket is not laid out in a way they can scan in one second. The approval tool's payload should be designed for the human's reading speed, not the model's writing convenience. Put the most important fields at the top. Use natural units. Surface the dollar amount, not the cents. Surface the recipient name, not the recipient identifier.

A subtle one. Approvals that have no expiry. The trader steps away from the desk for ten minutes. The approval prompt sits there. When they come back, market conditions have changed and the ticket is stale. Always give approval prompts an expiry — measured in seconds for fast trades, in minutes for slow ones, in hours only for things that are not time-sensitive.

The last lesson in this domain is about evaluation. Once you have an agent in production, how do you know it is still working?

Domain 1 · Lesson 7

Evaluating Agent Quality

The hook is a quiet regression you do not notice for two weeks. A team ships a prompt change to their renewal-call copilot. Per-call accuracy on the eval set is up half a percent. The team merges. Three weeks later renewal close rates are down six percent and nobody knows why. Eventually they trace it back to the prompt change. The model is now politer. It is not asking for the close.

The lesson is that per-call accuracy and task success are not the same thing. Per-call accuracy is whether each model call returned a sensible response. Task success is whether the agent's whole sequence of work produced the outcome the user wanted. You can have one without the other. An agent that is correct on every step and never asks for the close is correct on every step and failing at its job.

The fix is to measure task success, not per-call accuracy. The way you measure task success is a golden set. A golden set is a representative collection of real tasks — twenty to fifty if you are starting out, two hundred or more once you are mature — with labeled successful outcomes. For each task, you know what the right outcome looks like. The eval harness runs the agent on every task and checks whether it produced the right outcome.

The customer-success copilot example on the site walks through what this looks like in practice. The team maintains a golden set of two hundred real renewal conversations. Each one has labeled good-outcome actions. Maybe the right outcome for a particular conversation was to offer the second discount tier. Maybe it was to escalate to a senior account executive. Maybe it was to run the playbook for at-risk accounts. On every prompt pull request, the harness runs the agent against the golden set and reports task success rate, cost per task, and any regressions on specific tasks.

The interesting part of this is the regression report. Not just the headline number — eighty-one percent success, up from seventy-nine — but the per-task drift. Four tasks where the agent now skips the request-feedback step. The team can read the regression and decide whether the trade is acceptable. If the four regressions are on low-value accounts and the headline number is up, ship. If the regressions are on the highest-value renewals, do not ship.

A pitfall. Building a golden set out of synthetic tasks. The whole point of a golden set is that it represents real production use. Synthetic tasks are easier to write but they do not catch the failures real users would. Build the golden set from real tasks, with the user's identifying information stripped. If you cannot get to real tasks, the next best thing is real tasks reconstructed from production logs. Synthetic-only golden sets give you false confidence.

Another pitfall. Letting the golden set go stale. The product changes. New use cases appear. The golden set still tests the use cases from six months ago. The eval passes and the regressions go undetected because nothing in the eval matches what users are actually doing. Refresh the golden set quarterly. Add the failures users complained about. Drop the cases that no longer represent the product.

A subtle one. Treating eval cost as overhead instead of insurance. Running a golden set of two hundred tasks on every prompt change does cost money. But the cost of one bad prompt change in production — measured in lost renewals, support escalations, or customer trust — is almost always larger than a year of eval costs. Eval is the cheapest form of quality control you have.

That closes the seven lessons in Domain One. The next two chapters are production stories — case studies tagged for this domain. We will narrate two of them end to end. They braid most of the patterns from these seven lessons into single builds.

Domain 1 · Case study

Production Story — ScaleOps customer-success copilot

ScaleOps is a B2B SaaS company in workforce analytics. About two hundred and fifty employees. They wanted a customer-success copilot that helped account managers prepare for renewal calls. The first instinct was the usual instinct — one agent that does everything. The team talked an account manager through a renewal call and counted the branches: pull the account, check the usage trends, surface health-score outliers, draft talking points, check the contract terms, decide whether to offer a discount.

When they laid the steps out, two of them never branched on runtime data. Pulling the account from the customer database, fetching the ninety-day usage trend from analytics — those were the same call no matter which account they were renewing. Those were workflow steps. The talking-points draft and the discount-tier decision did branch on what the account manager heard live in the call. Those were agent territory.

Week one was about pinning the deterministic part as a workflow. Three calls in sequence — get the account, get the usage, get the contract. No model involvement. Just code.

Week two introduced the orchestrator-worker pattern. The agent's tool registry had the obvious tools — request-discount-approval, book-followup, escalate-to-engineering, draft-renewal-email — and the orchestrator routed each call's preparation work into specialized workers. One worker drafted the talking points. Another sized the discount risk. The workers ran in parallel against the prepped account context.

Weeks three and four were tool design. Schemas got rewritten with intent-named fields and explicit error contracts. The discount-approval tool became a real human-in-the-loop gate — the agent could not offer a discount, only request approval for one, with the discount tier and the reasoning surfaced for the manager.

Week five was the eval set. They labeled two hundred historical renewal conversations with the good outcome — what the right action would have been, given what the manager actually did. The harness ran on every prompt change.

Week six was the rollout. Shadow mode for two days, ten percent for three, full rollout by Friday.

The numbers. Renewal-call preparation time dropped from forty-five minutes to nine minutes per account. Account managers shipped two-point-one times more renewal touches per week. Quality, measured against the golden set, climbed from seventy-one percent to eighty-six percent.

The takeaway is the choreography. Workflow for the deterministic preparation, orchestrator-worker for the agent's reasoning, permission tools for the consequential actions, golden-set evaluation to catch regressions before they ship. Each pattern is its own lesson. Production is where they have to work together.

Domain 1 · Case study

Production Story — LedgerLine tax bot

LedgerLine is a small-business tax preparation startup. The brief was uncomfortable on its face — draft a U.S. small-business tax return from raw books and receipts, surface the draft for a senior CPA to review, and never, under any circumstance, e-file without a human click. Fourteen weeks of work to ship it. Eight thousand four hundred returns drafted in the first quarter. Zero filed without a CPA review. CPAs reviewed five times the throughput at the same quality bar.

Let us walk the design.

The agent's tool registry includes everything you would expect. There are tools to read source documents, tools to classify expenses against the chart of accounts, tools to compute deductions, tools to draft schedules, tools to assemble the final return, and tools to e-file. The interesting design choice is what is not in that registry. There is no tool that e-files without a human approval first. There is no tool that signs off on a deduction without a human approval first. There is no tool that attaches a schedule to a return without a human approval first.

Every action with a downstream consequence — sign off on a deduction, attach a schedule, file the return — is gated behind a request-CPA-approval tool. That tool's payload is shaped for the CPA's reading speed. It includes the artifact name, a diff summary of what changed since the last approval point, the total dollar change, and a rationale paragraph drafted by the agent. The CPA reviews and either approves with a click or rejects with a comment.

The agent's state vector across a return is small. Engagement identifier, taxpayer name, current draft state, list of items the CPA has reviewed and signed off on, list of items still pending review, list of items the CPA rejected with the reasons. The vector does not carry the conversation. It carries what the agent learned about the return so far. On the next turn — which might be tomorrow — the agent loads the vector and continues from there.

The eval set is two hundred returns from the prior tax year, fully labeled with the right outcome. Every prompt change runs against it. Acceptance criteria are explicit — a prompt change cannot ship if it lowers task success on the eval set or if it introduces any regression on a return where the right outcome involved a CPA-rejection scenario.

The lesson the team kept coming back to was the negative space in the tool registry. The reason the agent never filed without a human is that the agent literally had no tool to file without a human. Approval lived in code. The prose just made it readable.

That closes Domain One. We will pause here before moving into Domain Two, Tool Design and MCP Integration.

Overview · 18% of exam

Domain 2 — Tool Design & MCP

The second domain on the exam is Tool Design and MCP Integration. It is eighteen percent of your final score. This domain is smaller than Domain One by weight, but the lessons inside it land everywhere you build with Claude. Almost every production agent you ever ship will have tools attached to it. The quality of those tools — meaning the quality of their schemas, the shape of their errors, and the platform you expose them through — is one of the largest knobs you have on whether your agent works reliably.

The through-line of this domain is reliability through interface design. The model is calling tools. The tools are returning results. The whole loop only works if the model can read the tool definitions, fill in their arguments correctly, interpret the results, and recover from failures. That last part — recovery from failures — is where most teams underinvest, and it is where this domain quietly tests your judgment.

There are five lessons. The first is schema fundamentals — how to design tool input schemas that Claude reliably populates. The second is error contracts — how to return errors that teach the model how to recover. The third introduces the Model Context Protocol, called MCP, which is a standard for exposing tools, prompts, and resources to language model clients. The fourth is the distinction between MCP resources and MCP tools — when read-only addressable data is better served as a resource than as a tool call. The fifth is testing tools in isolation — how to ship tools with confidence without dragging the model into every test.

Two production stories close the domain. Both are about teams that replaced a tangle of bespoke connectors with reusable MCP servers and saw their tool-call error rates collapse along the way. Let's begin.

Domain 2 · Lesson 1

Tool Schema Fundamentals

The failure mode for this lesson is malformed arguments. An agent calls a tool. The tool refuses the call because the arguments do not validate. The agent reads the rejection, tries again with a slightly different shape, and is rejected again. Round and round it goes. Every iteration costs tokens and time. Most of the time, the bug is not in the tool. The bug is in the schema description, and the model is doing its best to interpret an interface that was not designed for it.

A tool schema is the surface the model reads to figure out how to call your tool. It includes the tool name, the description, and the input fields. The exam wants you to know what good looks like across all three.

Take the contrast on the site between a sloppy refund tool and a production one. The sloppy version was named refund and took a single parameter called args which was typed as a string. The model has nothing to work with. The structure lives in prose somewhere — maybe the description, maybe nowhere — and every call is a guess about what to stuff into that string.

The production version is called issue refund. It has a description that says the tool refunds a charge and that the customer is notified by email automatically. The input fields are flat — charge identifier, amount in cents, reason, and a boolean for whether to notify the customer. The reason field is an enum with three allowed values — duplicate, fraudulent, and requested by customer. The notify field has a sensible default.

That second schema produces near-perfect valid tool calls. Three things made the difference. First, the field names reveal intent. The model can read charge identifier and amount in cents and know what to put there without guessing. Second, the structure is flat. There are no nested objects, no parameter wrappers, no fields that contain other fields. Flat schemas are what Claude reliably populates. Third, the bounded values are constrained with enums. When the model knows there are only three valid reasons, it picks one of the three.

The principles generalize. Name fields for intent, not for implementation. A field called user identifier is clearer than a field called u-i-d. A field called amount in cents is clearer than a field called amt. The model has to guess less and the guesses get better.

Use enums anywhere the valid values are bounded. Status fields, type fields, level fields, region fields. Free-form strings are an invitation for the model to invent values that look reasonable to it but mean nothing to your code.

Provide descriptions for every field. The description does not need to be long. It needs to tell the model what shape the value takes, what it is used for, and any constraints that are not captured by the type alone. A description that says the value should be in upper case is better than relying on the model to guess.

Pick sensible defaults. The fewer fields the model is forced to populate, the more reliable the call. If a field is almost always one value, set that value as the default and let the model omit it. Require only what the model has to know.

A common pitfall is treating tool schemas as documentation for engineers. They are not. They are documentation for the model. The reader is different. A schema that an engineer would call complete might be missing the description that the model needed to call it correctly. A schema that an engineer would call over-specified might be exactly what the model needs to never get it wrong.

The next lesson is what happens when a tool call fails. That is where error contracts come in, and they are the second-biggest lever on reliability after schema design.

Domain 2 · Lesson 2

Tool Error Contracts

The hook for this lesson is an agent stuck in a retry loop. A SQL tool returned an error that said operation failed. The agent does not know what to change. It tries the same query again. It fails the same way. The agent improvises a different query — also wrong, but in a new way — and the tool returns the same operation failed. The loop continues until the agent gives up or the call budget runs out. The whole time, the underlying problem was that the table the agent referenced lived in a different schema. The agent could not have known. Nothing in the error told it.

The lesson is that errors are conversational signals to the agent. They are not log lines for engineers. They are not stack traces. They are the only feedback channel the agent has when something goes wrong. If you do not design them deliberately, your agent will spin.

The shape the exam expects you to know is three fields. An error code, a message, and a hint. The error code is a stable, machine-readable string that your own code can branch on — table not found, rate limited, permission denied. The message is a human-readable explanation that ends up in your logs and your dashboards. The hint is the recovery instruction for the model.

The site's canonical example uses a SQL tool against a multi-tenant warehouse. The error code is table not found. The message says the table named order events does not exist in the schema named public. The hint says that tables in this workspace live in the schema named analytics, and suggests retrying as analytics dot order events. The agent reads the hint, retries with the right schema, and gets the right answer on the next call. The whole interaction takes two calls instead of forty.

The hint field is the part that teams underinvest in. The error code is easy. The message is easy. The hint requires you to think about what a model that just hit this error needs to know in order to recover. If the answer is nothing — the error is genuinely unrecoverable — say so in the hint. Tell the agent to stop and surface the failure to the user. If the answer is something — a different parameter, a missing prerequisite, a known retry pattern — write the recovery instruction in plain English.

Stable error codes matter because they are what your own code branches on. A handler that needs to react differently to a rate-limit error than to a not-found error reads the code, not the message. Messages are for humans and they change over time. Codes are for code and they should not.

A common pitfall is overloading a single generic error. The temptation is to catch every failure and wrap it as operation failed with the underlying message tucked inside. Resist it. The categories of failures are usually small — auth, not found, rate limit, validation, transient, fatal. Define a code for each. Let the agent and your code both see what kind of failure happened.

Another pitfall is hints that are too generic. A hint that says try again is worse than no hint at all because it teaches the agent to retry indiscriminately. A hint that says wait one second and retry is better because it tells the agent what to do and what not to do. A hint that says this operation requires the customer to authorize the connection first; surface the connection link from get connection link is better still because it tells the agent the exact next action.

A subtle one. Including stack traces in the message field. Stack traces are useful for engineers reading logs and confusing for models trying to decide what to do. Put them in your logging layer if you want them, but keep them out of the field the model reads.

The next lesson is the protocol that wraps all of this together. The Model Context Protocol, called MCP, is what lets you ship the same tool surface across multiple clients without rewriting it for each one.

Domain 2 · Lesson 3

MCP Overview

The failure mode for this lesson is the same connector implemented three times. A team writes a Notion integration once for their command-line tool, once for their hosted product, and once for the SDK they ship to customers. Three implementations. Three sets of tests. Three places to fix the same bug. Three places to add the next feature. The team gets tired and the integrations drift apart.

The Model Context Protocol — MCP — is the standard that fixes this. It is a J-S-O-N R-P-C protocol that defines how a server exposes capabilities to a language model client. The capabilities come in three shapes. Tools are operations with side effects. Resources are read-only, addressable data the model can request by a U-R-I. Prompts are named templates the user or the model can invoke to start a canonical workflow.

The exam wants you to know what MCP is, why it exists, and when to reach for it. The why-it-exists answer is the reuse story. Once you build an MCP server, every client that speaks the protocol can use it. Claude Code can use it. The desktop app can use it. Your custom agent built on the Agent SDK can use it. The customer-built agent your customer is building on top of your product can use it. One implementation, many clients.

The when-to-reach-for-it answer is when you want the same toolset reachable across multiple surfaces, or when you want to expose your tools to anyone outside your immediate codebase. If you are building a one-off agent that lives in one place, MCP is overhead. If you are building an integration that needs to be used from a command-line tool today and from a desktop app next quarter and from a customer-facing agent next year, MCP is the right shape from the beginning.

The site's canonical example is an internal Notion server. An engineering organization runs a tiny MCP server that exposes their Notion workspace. The tools are search pages and create page. The resources are a U-R-I scheme that lets the model address any page by its Notion identifier. The prompts include a named template the team uses every Monday to summarize meetings. The same server is wired into Claude Code, where engineers use it from the terminal, and into the desktop app, where product managers use it from a normal chat window. One implementation. Two surfaces. Zero code duplication.

There are two transports the exam expects you to know about. Standard input output, called stdio, is the transport for local processes. The client launches the server as a subprocess and they talk over the standard streams. This is what you use for an MCP server that runs on the same machine as the client. The second is HTTPS, sometimes called HTTP plus server-sent events depending on the implementation. This is the transport for hosted MCP servers that the client connects to over the network. The same server implementation usually supports both, controlled by a flag at startup.

A common pitfall is reaching for MCP before you need it. Building an MCP server for an integration that only ever lives inside one agent is more ceremony than the integration needs. The protocol becomes overhead and the abstractions get in the way. The right test is whether the toolset will be reused. If yes, MCP. If no, plain tools wired into the agent directly are fine.

Another pitfall is treating MCP as an alternative to schema discipline. It is not. An MCP server with sloppy schemas is just as unreliable as a regular tool with sloppy schemas. The protocol does not save you from designing the tool surface well. It just makes the surface portable.

The next lesson is the distinction the protocol cares most about — when something belongs on the resource axis versus the tool axis. That choice has real implications for cost and latency.

Domain 2 · Lesson 4

MCP Resources vs. Tools

The failure mode for this lesson is a tool call for every file. A developer-productivity agent has read access to roughly three thousand source files. The team exposed each file as a tool call. Every read request triggers a model call that takes the file path as an argument, calls the read tool, and waits for the result. The schema for the read tool gets re-paid on every call. The model thinks for a beat before each call. The conversation gets slow and expensive, and the team does not know why because each individual operation looks reasonable.

The fix is to know which capabilities belong as tools and which belong as resources. Tools are operations with side effects. Resources are read-only, addressable data the model can request by a U-R-I. The two axes serve different purposes and they cost different amounts.

A tool call is a verb. The model says I want to do this thing. The client passes the call to the server, the server does the thing, and a result comes back. Side effects happen. State changes. The model pays the schema cost on every call because it has to populate the tool's input fields.

A resource is a noun. The model says I want this data. The client says I already have a copy of that data in my cache, here it is, no server round trip needed. Resources are addressable by a U-R-I scheme that the server defines. They are read-only. They are cacheable. The cost on a cache hit is roughly zero.

The codebase example on the site shows the contrast. Once the team moved their three thousand files from a read tool to a resource scheme — a U-R-I that begins with codebase and then names the file path — the model's read requests collapsed in cost. The client caches resources locally. When the agent says it wants to read a particular file, the client serves it from its own cache without going back to the server. The schema overhead for fetch resource is paid once, not per file.

The rule of thumb the exam wants you to apply is this. If the data is read-only and static or slow-changing, expose it as a resource. If the operation has side effects or returns data that changes per call, expose it as a tool. Database tables are usually resources. Database mutations are tools. A file system listing is a resource. A file deletion is a tool. An issue body is a resource. Creating an issue is a tool.

A subtler version of the rule. Resources are appropriate when the same data is likely to be requested more than once in the same session. Resources buy you caching. If your data changes on every request, the cache cannot help and you might as well use a tool.

A common pitfall is treating resources as a synchronous fetch. They can be. They do not have to be. Resource fetches go through the same client cache, and a properly designed server can mark a resource as having a short time-to-live so the client knows when to revalidate. The OrbitalDevs case study on the site settles on a sixty-second time-to-live on most read endpoints, which means the cache is fresh enough for interactive use and the server is not getting hammered.

Another pitfall is exposing tens of thousands of resources without thinking about discoverability. The model cannot know about a resource U-R-I unless something told it. Either the server publishes a listing capability — list all resources matching a pattern — or the prompt makes the U-R-I scheme explicit. Resources are discoverable through the protocol, but only if you wire the discovery.

A subtle one. Resources that depend on user authentication. The cache key has to include the user identity. Two different users hitting the same U-R-I should get two different cache entries. Forget this and one developer gets to read another developer's private documents.

The last lesson in this domain is about testing the tools you have just designed.

Domain 2 · Lesson 5

Testing Tools in Isolation

The hook for this lesson is a tool that worked in production for three weeks and then started failing silently. The team's continuous integration passed. The unit tests passed. Nothing in their pipeline caught it. What broke was not the tool's implementation. It was the tool's schema. The description had become subtly misleading after a feature change and the model started populating one of the fields with the wrong type of value. The tool's implementation rejected the calls with a validation error. The agent retried. The agent gave up. The user got a polite I cannot help with that right now.

The lesson is that there are two failure modes for a tool, and they require two different kinds of tests. The first failure mode is that the tool's implementation does not do what it claims. The second is that the tool's schema does not communicate what it claims. Implementation bugs are caught by function tests. Schema bugs are caught by schema probes. You want both.

A function test is the kind of test you already know how to write. You call the tool's implementation directly, with realistic inputs, and you assert on the output. No model is involved. The example on the site uses a send email tool. The function test mocks the underlying email service, calls send email with a recipient, a subject, and a body, and asserts that the underlying service was called exactly once with the expected message. That test passes or fails based on whether the implementation does the right thing.

A schema probe is the test you might not have written before. It is a tiny model call whose only job is to populate the tool's arguments from a short natural-language prompt. The example on the site says the probe asks Claude to email a particular person saying we will be late. The probe captures the tool argument the model produces and validates it against the input schema. If the schema is communicating well, the model produces a valid argument and the test passes. If the schema has drifted, the model produces an invalid argument and the test fails. The probe does not need to be expensive. A handful of probe calls per tool, run on every pull request, is enough to catch most schema regressions.

The exam wants you to understand why these are two separate test suites. They fail for different reasons. They point at different fixes. A function test failure means the implementation is wrong — go fix the implementation. A schema probe failure means the schema description is no longer guiding the model correctly — go fix the description. If you only have function tests, you ship schema drift without noticing it. If you only have schema probes, you ship implementation bugs that the schema cheerfully accepts.

A common pitfall is conflating the two by running probe tests as if they were function tests. A probe is fundamentally a model call, which means it is not deterministic in the same way a function test is. If your probe asserts on exact text matches, it will be flaky. The right assertion is structural — does the argument validate against the input schema, is the recipient field non-empty, does the message body include the key phrases the prompt implied. Loose, structural assertions are how probes stay green.

Another pitfall is skipping probe tests for tools that have not been updated recently. The model upgrades. The model's interpretation of your schema shifts. A tool that probed clean on one model version may probe dirty on the next. Run probes on every model change as well as every schema change. They are cheap and they catch problems early.

A subtle one. Probes that include the answer in the prompt. If you tell the model to email Bob at example dot com, the model just copies your text into the recipient field. That tests nothing. The right prompt is naturalistic — tell Bob we will be late — and the model has to infer Bob's address from somewhere the schema points it at.

That closes the five lessons in Domain Two. The next two chapters are production stories — case studies tagged for this domain. We will narrate two of them end to end. They braid most of the patterns from these five lessons into single builds.

Domain 2 · Case study

Production Story — DataLayer integrations platform

DataLayer is a developer-tooling startup. They ship data-source connectors. Postgres, Snowflake, Notion, Salesforce, and about a dozen others. Originally each connector was a hand-written Python wrapper with its own tool schemas, its own error format, and its own auth flow. Three problems compounded over the first year.

The first was a tool-call error rate of eighteen percent. Agents kept passing malformed arguments to the connectors because the schemas were inconsistent and underspecified. The second was that the same connector code lived in three places — their command-line tool, their hosted product, and the SDK they shipped to customers. Three implementations to maintain, three places to fix every bug. The third was that no two connectors shared an error shape. Some raised exceptions. Some returned strings. Some returned objects with different field names. Agents could not generalize across them.

The plan was ten weeks of work, split into five steps.

Step one was schemas. Before anything else, the team rewrote every connector's tool schemas with flat fields, intent-revealing names, and enumerated bounds. The representative before schema was a single tool called query whose only parameter was an input typed as a string. The after schema was named run SQL, took a SQL field with an explicit description that said read-only and only certain statement kinds were allowed, took a schema field with a default, and took a row limit field with a default and a maximum. Within two weeks the malformed-argument rate dropped from eighteen percent to five percent.

Step two was errors. The team standardized on the three-field error contract for every tool — code, message, hint. The biggest gain was not speed. It was that retries actually succeeded. Agents stopped looping on opaque operation-failed strings and started self-correcting on the next call. The malformed-argument rate fell from five percent to two percent.

Step three was the migration to MCP. Each connector became an MCP server. The Postgres server exposed tools for run SQL, describe table, and list schemas. It exposed resources for table metadata at a U-R-I scheme that addressed any table by schema and name. It exposed prompts for explain this query and find slow queries. The same server ran over standard input output for local development and over HTTPS for the hosted product. One implementation. Three surfaces.

Step four was using resources for big data. The warehouse metadata catalog had tens of thousands of tables. Exposing each as a tool call would burn schema tokens on every interaction. The team exposed them as resources. The client caches resources locally. When an agent says read the analytics order events schema, the client serves it from its cache without round-tripping the server. Token cost on metadata operations dropped roughly seventy percent.

Step five was tests. Every connector now ships with two test suites — a function suite for the tool implementations and a schema probe suite for the schema descriptions. When something breaks, the team knows whether the implementation or the schema needs the fix.

The numbers. New-connector development time dropped from four weeks to six days. Tool-call error rate dropped from eighteen percent to two percent. The same MCP servers now power the command-line tool, the desktop app, and three customer-built agents.

Domain 2 · Case study

Production Story — OrbitalDevs MCP migration

OrbitalDevs is a continuous integration and deployment platform. They maintained twelve connectors — GitHub, GitLab, Sentry, Datadog, PagerDuty, Slack, Linear, Jira, Vercel, AWS, Google Cloud, and Azure. Each was a Python wrapper with its own auth, its own schema style, and its own error format. The same code lived three times — once in their command-line tool, once in their hosted product, and once in the customer-facing SDK. Tool-call malformed-argument rate sat at fourteen percent.

The plan was to move every connector to an MCP server. Tools for the verbs — create issue, open pull request, page on-call. Resources for the read-only data — a U-R-I scheme that addressed any GitHub issue by owner, repository, and number. Prompts for the canonical workflows — slash triage this issue. Standard input output transport for local development. HTTPS for the hosted product. One server. Three clients.

The hard part was auth. Each connector had a different secret-handoff story. Some used personal access tokens. Some used OAuth. Some used service account keys. The team settled on a shared secrets resolver that the MCP server calls at startup, abstracting the secret source behind a single interface. Each connector implementation declares which secrets it needs and the resolver handles the rest.

Caching strategy for resources took another pass. The team experimented with no caching, with infinite caching, and finally settled on a sixty-second time-to-live on most read endpoints. Long enough that an interactive session benefits from cache hits; short enough that stale data does not become a debugging problem.

The numbers. New-connector development time dropped from five weeks to four days. The malformed-argument rate dropped from fourteen percent to two percent. The same MCP servers now serve the command-line tool, the JetBrains plugin, and a customer-facing agent that the team had been promising to ship for two years.

The lesson the team kept coming back to was that the protocol does not save you from designing the tool surface well — it just makes the surface portable. Once the schemas and the error contracts were right, MCP is what turned a single implementation into a library that lived across every client they cared about.

That closes Domain Two. The next domain is Claude Code — the command-line tool, hooks, skills, and the Agent SDK that lets you build your own.

Overview · 20% of exam

Domain 3 — Claude Code

The third domain on the exam is Claude Code Configuration and Workflows. It is twenty percent of your final score. This domain is different in flavor from the first two. Domains one and two were about systems you build. Domain three is about a system you use — Claude Code, the command-line tool that Anthropic ships, and the Agent SDK that exposes the same primitives so you can embed Claude Code's behavior inside your own products.

The through-line of this domain is reproducible, version-controlled agent configuration. Most teams that adopt Claude Code start by treating it as a personal assistant — one engineer's setup, living in one engineer's home directory. The teams that get the most out of it move quickly to a different posture. They commit the configuration to the repository. They wire safety policy into the configuration. They encode their repeated workflows as slash commands. They encode their tribal knowledge as skills. The result is that every engineer who runs Claude Code in that repository inherits the same guardrails and the same shortcuts, without any per-user setup.

There are six lessons in this domain. The first is what Claude Code actually is and the kind of work it is designed for. The second is the settings file, which is where the harness reads its permissions, its hooks, its environment, and its model preference. The third is hooks themselves — shell commands that fire on Claude Code events, the right place for org-wide observability and policy. The fourth is slash commands, which are reusable named workflows the team can invoke as a single token. The fifth is skills, which are modular capability packs with their own prompts and resources, designed to auto-activate when the conversation matches. The sixth is the Agent SDK, which is the surface you reach for when you want Claude Code's behavior inside your own product instead of as a separate command-line tool.

Two production stories close the domain. Both are about teams that turned Claude Code from one engineer's personal habit into shared infrastructure. Let's begin.

Domain 3 · Lesson 1

Claude Code: What It Is

The hook for this lesson is a four-hundred-thousand-line monolith and a deadline three weeks out. A team has to upgrade their Rails 5 application to Rails 7 before a security mandate cuts off support. Five engineers. Eight weeks of estimated work. Twenty-one days of calendar. The numbers do not add up. This is the kind of problem Claude Code was built to compress.

Claude Code is Anthropic's official agentic command-line tool for software work. The word "agentic" is doing real work in that sentence. Claude Code is not a code-completion tool. It is not a chatbot that you ask questions and copy answers from. It is an agent that runs in your shell, reads your files, edits them, runs your tests, runs your shell commands, and reports back. You give it a goal. It executes a loop of read, edit, run, observe, and repeat until the goal is met or until it needs your input.

The exam wants you to know what Claude Code is for and what it is not for. Where it shines is bounded, well-tested refactors. Migrations across a known framework upgrade. Codemod-style changes that touch many files. Plumbing a new field through several layers. Anything where the work is mechanical, the failure modes are caught by the test suite, and the cognitive load on the human comes from the volume of edits rather than the difficulty of any single edit.

The Rails upgrade example on the site is the canonical version. A senior engineer opens a fresh worktree. They tell Claude Code to read the application's main configuration file and the Gemfile, find every deprecated API usage that needs updating, and produce a migration checklist grouped by file. Eight minutes later Claude Code returns a seventy-three-item checklist with file paths and risk ratings. The team triages the checklist in person. Some items are marked Claude can do this. Some are marked needs a human. Some are marked needs a human and Claude as a pair. The work is then executed file by file, with Claude doing the boring eighty percent and the engineers reviewing diffs.

Where Claude Code is less effective is unbounded design work. If you do not know what you want, Claude Code will produce something that compiles and passes tests but is not what you wanted. If the failure modes are not caught by the test suite, the agent will confidently ship a regression. If the change touches code that does not have a test suite, the agent has no signal that it broke something. The mental model the exam expects you to bring is that Claude Code is a tireless executor. Give it well-scoped work and it will outperform a human on volume. Give it design judgment and it will give you back something plausible that you may not want.

A common pitfall is treating Claude Code as a black box. Engineers who do best with it watch what it does. The harness shows you every tool call. It shows you every diff before it ships. It pauses when it hits anything that looks risky. Reading along is half the value, especially in the first few sessions. You learn what kinds of prompts get good results and what kinds get expensive detours.

Another pitfall is letting Claude Code operate without test coverage. If you do not have tests, the agent's edits are unverifiable. Either land tests first, or constrain the agent to changes that are obviously correct from inspection.

The next lesson is where the harness reads its configuration — the settings file, which is the team's seam for safety policy and shared setup.

Domain 3 · Lesson 2

settings.json Configuration

The hook is a new engineer's first day. They installed Claude Code overnight. They cloned the repository at nine. By ten o'clock they have force-pushed over a colleague's branch, dropped a development database, and ended a stranger's session in a shared dev environment. None of this was malice. Claude Code did what they asked. There was no policy to stop it.

The settings file is where you stop this from happening. It is the harness's configuration surface. It is a JSON file that lives in two places. The user-level file, in your home directory under dot claude, holds your personal preferences and credentials. The project-level file, in the repository under dot claude, holds the policy that applies to anyone who runs Claude Code in that repository. The project-level file is the one that matters for teams.

The exam wants you to know what the settings file controls. There are four blocks. Permissions are allow and deny lists for shell commands. Hooks are shell commands that fire on Claude Code events. Environment variables are values the harness exposes to the agent. Model preference is the default model the harness selects.

The permissions block is the safety baseline. The site's canonical example shows a platform team that commits a project-scoped settings file to every repository. The deny list blocks force-pushes, recursive deletes, namespace deletions, and package publishes. The allow list whitelists the commands that are obviously safe — installing dependencies, running tests, reading git diffs. Every engineer who runs Claude Code in that repository inherits these guardrails the moment they enter the directory. New hires are productive immediately and cannot accidentally do anything catastrophic.

The hooks block is for the things you want to happen automatically on every Claude Code event. We cover hooks in detail in the next lesson. The settings file is where you wire them up.

The environment block is for the values the harness exposes to the agent's environment. API keys, base URLs, feature flags, anything the agent's tools might need. The harness reads these at session startup and passes them to subprocesses.

The model preference is the default Claude model the harness uses. Sonnet four point six for most work. Haiku four point five for cost-sensitive tasks. Opus when you need the strongest reasoning. The default lives in the settings file so the whole team uses the same model unless someone overrides it.

A common pitfall the exam likes to test is putting team-wide policy in the user-level settings file. That works for one engineer. It does not work for a team. The user file lives in the engineer's home directory and does not propagate to anyone else. If you want the policy to apply to your colleagues, it has to live in the project file, which is committed to the repository.

Another pitfall is treating the settings file as static. It is not. Teams that get the most out of Claude Code iterate on the file. When something almost went wrong, they add a deny rule. When a useful pattern emerges, they add an allow rule. When a new hook helps the team, they wire it up. The file is the team's record of what they have learned about running the agent safely.

A subtle one. Forgetting that deny rules take precedence over allow rules. If you allow git push and deny git push minus minus force, the latter wins for force-pushes specifically. Order does not matter. Specificity does.

Another subtle one. Settings files that disable confirmations for actions the team has not actually agreed are safe. The convenience of skipping a permission prompt is real. The cost of one wrong skipped prompt is usually larger. The default is to confirm. Override only after the team has decided the action is genuinely fine without confirmation.

The next lesson is the hook system itself — what hooks are, when they fire, and what they let you do that you cannot do without them.

Domain 3 · Lesson 3

Hooks

The hook for this lesson is on-call wanting visibility into what Claude Code is doing across the team's critical-path repositories. They do not want to slow engineers down. They do not want to mediate every action. They just want to know which files are being touched, by whom, and when, so that when something explodes the next morning they can read the timeline.

Hooks are the right place for this. A hook is a shell command that fires on a Claude Code event. There are several event types and the exam expects you to know the most important ones. PreToolUse fires before a tool is called. PostToolUse fires after a tool is called and its result is known. UserPromptSubmit fires when the user submits a new prompt. SessionStart fires when a Claude Code session begins. SessionEnd fires when it ends. Stop fires when the agent declares it is done.

Hooks let you do three things. The first is observability — emit a log line, post to a chat channel, write to a database. The second is policy — block an action that the team has decided should not happen, regardless of permission settings. The third is augmentation — add context to the agent's view at the start of every session, or after every tool call.

The site's canonical observability example is a PostToolUse hook that posts to Slack on every edit and write. The script reads the tool name from the input that the harness pipes in, reads the file path, and if the tool was an edit or a write, fires a curl call to a Slack webhook with the file name. The hook exits zero, meaning it does not block the agent. The agent proceeds as normal. The Slack channel fills up with a running log of every file Claude Code touches.

The channel gets noisy. That is a feature, not a bug. When something explodes the next morning, on-call can scroll back and see exactly which file had been touched at six fourteen PM. The visibility is cheap, the implementation is fifteen lines of shell, and no engineer has to remember to do anything special.

The policy example on the site is a PreToolUse hook that checks for secrets. The script reads the proposed tool call, scans it for API keys and tokens, and if it finds one, exits non-zero. The harness treats a non-zero exit as a hard block. The agent cannot proceed. The user sees a message explaining what was rejected. The agent has to choose a different action.

This is the contract the exam wants you to remember. Exit zero from a hook to allow the action to proceed. Exit non-zero to block it. The hook can also emit messages on standard error that the harness surfaces to the user. The combination — exit code as policy, standard error as explanation — is enough to build sophisticated guardrails.

The augmentation pattern is less common but it shows up on the exam. A SessionStart hook can emit context that the harness injects into the agent's first view. Maybe the team's current sprint goal. Maybe the open incident list. Maybe a reminder that production deploys are frozen until Friday. The agent treats this as additional context and behaves accordingly.

A common pitfall is putting hook logic that the team does not all agree on in the user-level settings file. As with permissions, hooks in the user file apply to one engineer. If you want the hook to fire for everyone, it goes in the project file.

Another pitfall is hooks that are too slow. Every PreToolUse hook adds latency before every tool call. Every PostToolUse hook adds latency after. If your hook takes a second, every tool call now takes a second longer. Keep hook logic fast. If you need to do something slow — write to a database, post to a chat — do it asynchronously, by spawning a background process and returning immediately.

A subtle one. Hooks that fail noisily. Hooks that exit non-zero block the action. Hooks that fail with a stack trace produce a stack trace in the user's terminal. Wrap your hook logic in a try-catch and exit cleanly even on internal errors. The team will thank you.

The next lesson is the closest cousin of hooks — slash commands, which are workflows the user invokes deliberately rather than ones that fire automatically.

Domain 3 · Lesson 4

Slash Commands

The hook is a twelve-step release checklist that someone has to remember in order. Confirm main is green on CI. Run the test suite locally. Bump the patch version in two manifest files. Commit with the right message format. Tag and push tags. Draft release notes from the commits since the last tag. Post to the releases channel. Twelve steps, every step easy, the whole sequence easy to get wrong if you skip one.

Slash commands are how you stop relying on memory for sequences like this. A slash command is a reusable named workflow that any team member can invoke by typing slash and the command's name. The command's body is a prompt — usually a numbered list of steps written in plain English, sometimes with conditional branches, often with a stop-and-ask point partway through. When a user types the command, the agent reads the body and executes it.

The exam wants you to know where slash commands live and how they are structured. They live in a commands directory under the dot claude folder, either at the user level for personal commands or at the project level for team commands. Each command is a single markdown file. The file name is the command name. The file body is the prompt.

The site's canonical example is a slash release command at a small SaaS company. The body is a numbered list. Confirm main is green on CI. Run the test suite locally and abort on any failure. Bump the patch version in the package and marketplace manifests. Commit with the message format the team uses. Tag and push tags. Draft GitHub release notes from the commits since the last tag. Post to the releases channel. The last instruction in the file is a stop-and-ask — pause before pushing if anything looks off. Anyone on the team types slash release and the agent handles the choreography. The institutional knowledge lives in version control.

The rule of thumb the exam expects you to apply is simple. If you would type a multi-step prompt more than twice, make it a slash command. The team gets the workflow for free. The next time anyone needs to run the same sequence, they type one token instead of typing the whole prompt. The next time the workflow needs to change — a new step, a new check, a new stopping point — you edit the file once and everyone gets the update.

Slash commands can take arguments. A slash command file can reference an argument the user passes after the command name. So a slash migrate Rails command can take a file path as its argument and apply the migration logic to that specific file. The same command, invoked across many files, is a faster way to drive a repetitive migration than typing the same prompt out for each one.

A common pitfall is over-engineering the command body. The body is a prompt, not a script. You do not need to encode every possible branch in the body itself. The agent is good at handling small variations. Write the body for the happy path with explicit stops at the points where human judgment is required. Trust the agent for the in-between.

Another pitfall is putting commands that depend on personal state in the project-level commands directory. A command that references your home directory or your local environment will not work for your colleagues. Commands at the project level should be hermetic — they should work for anyone who clones the repository.

A subtle one. Slash commands are not the same as skills, and the exam will test the distinction. A slash command is invoked deliberately by the user. A skill auto-activates based on the conversation. If you want the workflow to run when the user asks for it explicitly, that is a slash command. If you want the behavior to be available whenever the conversation matches a description, that is a skill, which is the topic of the next lesson.

Domain 3 · Lesson 5

Skills

The hook is a billing team whose Stripe integration is hairy. Junior engineers hit the same gotchas in their first month. Senior engineers spend four to six hours a week answering the same questions. The team writes a runbook. The runbook helps but only when the junior remembers to read it. The team writes a chatbot. The chatbot helps but the juniors have to know to use it. What the team wants is something that just shows up when the conversation mentions a Stripe charge identifier or a payment failure.

Skills are how you get that. A skill is a modular capability pack with its own prompt, its own reference files, and its own scripts. Each skill lives in a directory under dot claude slash skills. The directory contains a SKILL dot M-D file, which is the prompt and the auto-activation description, and any number of supporting files in subdirectories — reference documents, helper scripts, configuration the prompt references.

The exam wants you to know the difference between a slash command and a skill, because the two are easy to confuse. A slash command is invoked deliberately. The user types slash and the command name. A skill auto-activates based on the conversation. The model reads the skill's description and decides whether the current conversation matches. If it does, the skill loads automatically.

That auto-activation is the part that makes skills powerful. The user does not have to know the skill exists. The junior engineer mentions a Stripe charge identifier in a conversation about a customer issue. The Stripe debugger skill notices the charge identifier matches its description, loads, and the agent now has the prompt that knows how to pull the charge, the dispute history, and the recent invoices, plus the cheat sheet of every error code the team cares about. The agent's behavior changes without the user typing anything special.

The site's canonical example is a Stripe debugger skill at a fictional billing team. The SKILL file says when the user mentions a charge identifier or a payment failure, pull the charge, the dispute history, and the recent invoices, then explain the failure code in plain English. The skill directory also contains a script that wraps the Stripe command-line tool and a reference markdown file with the team's internal error code cheat sheet. The skill auto-loads. Engineers get expert behavior without typing the same setup prompt every time.

The HelioGrid case study, which we narrate in the next chapter, takes this to its conclusion — forty skills across the team's repeated questions, auto-activating roughly twelve times per day per engineer, and a measurable reduction in senior on-call pages because the agent handles the routine questions itself.

The exam wants you to know what to put in a skill's auto-activation description. It is the most important sentence in the skill. The description is what the model reads to decide whether to load the skill. A description that says general database help will activate everywhere and load constantly. A description that says when the user mentions a SCADA event identifier or a SCADA replay command will activate only in the right conversations. Write the description like a help-desk ticket title. Specific words and recognizable phrases that the model can match against the current conversation.

A common pitfall is skills with overlapping descriptions. Two skills that both auto-activate on database queries will both load every time, doubling the cost. Make descriptions disjoint. If two skills cover related territory, consider merging them.

Another pitfall is skills that are too large. A skill that loads forty pages of reference documentation costs forty pages of context every time it activates. Keep skill payloads small. Use the SKILL file for the prompt and the auto-activation description. Use the supporting files for things the model reads on demand, not on auto-load.

A subtle one. Skills can themselves invoke slash commands and call tools. A well-designed skill is often a thin prompt that points the agent at the right slash command or the right tool, rather than a heavy bundle that tries to do the work itself.

The last lesson in this domain is the Agent SDK — what you reach for when you want Claude Code's behavior inside your own product instead of as a separate command-line tool.

Domain 3 · Lesson 6

Claude Agent SDK

The hook is a team that wants their code-review behavior to fire automatically on every pull request, not just when an engineer happens to open Claude Code in their terminal. They have built a great review workflow inside Claude Code over the past year. They want it embedded in their continuous integration system, running on every pull request, posting comments inline, never tired, never inconsistent.

The Agent SDK is what they reach for. The Agent SDK is the programmatic surface that exposes the same primitives Claude Code uses — sessions, tools, permissions, history, the same loop — as a library you can call from your own code. You can build agents that run in your continuous integration system, in your hosted product, in a background worker, anywhere your own code runs.

The exam wants you to know when to reach for the Agent SDK versus when to use Claude Code itself. The rule is roughly this. If the agent's behavior runs in a developer's terminal, with a human watching and steering, Claude Code is the right surface. If the agent's behavior runs unattended — in CI, in a background job, inside a product the customer uses — the Agent SDK is the right surface. The Agent SDK is what you reach for when you want Claude Code's behavior, not Claude Code itself, as the deliverable.

The site's canonical example is a code-review agent in continuous integration. The team builds a small session at the top of the pull-request handler. The session is configured with a system prompt that says it reviews pull requests for security, performance, and readability. The session has three tools — read file, list files, and git diff. When a pull request opens, the handler runs the session with the prompt to review the pull request and comment inline. The session executes the same loop Claude Code would execute — read the diff, read the relevant files, form an opinion, write comments. The handler posts the comments to the pull request. The reviewer is consistent across pull requests and never gets tired.

The Agent SDK is the same loop that powers Claude Code, but stripped of the command-line interface and made callable from any code. The session manages history. The tools execute. The permissions system is the same. The hooks system is the same. If you have learned to configure Claude Code well, you have learned to configure the Agent SDK well. The settings file at the project level applies to both.

A common pitfall is using the Agent SDK when a simple model call would suffice. Not every automation needs to be an agent. If the work is a single inference — summarize this document, classify this ticket — a plain Claude API call is simpler. The Agent SDK is for work that requires the agent to take multiple steps, call tools, and decide what to do next based on results. If the work is one shot, do not build an agent.

Another pitfall is forgetting that agents in production are still agents. They will sometimes make wrong calls. They will sometimes loop. They will sometimes give up. Your continuous integration job needs the same kind of bounded budget that any agent needs — a maximum number of turns, a maximum runtime, a fallback path when the agent declares it cannot proceed.

A subtle one. The Agent SDK respects the same permission and hook configuration as Claude Code, but only if you load it. Sessions can be configured to read a specific settings file at startup, or to ignore the team's standard configuration entirely. The default is usually to respect the project settings. Make sure your continuous integration agent is loading the right configuration — the safety baseline you wrote for Claude Code should apply to the SDK too.

That closes the six lessons in Domain Three. The next two chapters are production stories — case studies tagged for this domain. The first is the LumenPay Rails upgrade, which braids every lesson in this domain into a single fourteen-day project. The second is HelioGrid's forty-skill internal library, which is the deeper version of how skills change a team's shape.

Domain 3 · Case study

Production Story — LumenPay Rails upgrade

LumenPay is a fictional payments startup. Their main monolith — four hundred thousand lines of Rails five point two, started in 2018 — had to land on Rails seven point one before a security mandate cut off support. Five engineers. A hairy migration specification. An immovable deadline three weeks out. Eight weeks of estimated work to compress.

Day one was about bounding the work. The senior engineer opened a fresh worktree and gave Claude Code its marching orders. Read the application configuration and the Gemfile. Produce a migration checklist for Rails five point two to seven point one, grouped by file. Note the deprecated APIs the application depends on and which gems block the upgrade. Eight minutes later Claude Code returned a seventy-three-item checklist with file paths and risk ratings. The team triaged the checklist in person. Some items were marked Claude can do this. Some were marked needs a human. Some were marked needs a human and Claude as a pair.

Day two was the safety baseline. Before letting Claude Code touch production code paths, the platform engineer committed a project-scoped settings file. The deny list blocked force-pushes, pushes to the main branch, database drops, and recursive deletes. The allow list whitelisted the bundle install, the test runner, and git diff. A PreToolUse hook checked for secrets in proposed tool calls. Every engineer's session inherited these rules. Nobody could accidentally force-push or wipe a database from a Claude Code session.

Day three was the slash migrate Rails command. The pattern repeated across files. Read the file. Apply the documented Rails seven point one changes. Run the spec for that area. Commit on green. The team committed it once as a slash command. The slash command turned eleven days of repetitive prompting into eleven days of slash migrate Rails calls, each one pointing at a specific file.

Day five was the observability hook. Engineering leadership wanted visibility without micromanaging. A PostToolUse hook posted every edit and write to a Rails upgrade Slack channel. The channel got noisy fast and that was fine. When something exploded the next morning, on-call could scroll back and see exactly which file had been touched at six fourteen PM.

Day eight was a payments debugging skill. LumenPay's payments code has tribal knowledge — which webhooks retry, which counter resets nightly, the exact log lines that mean an authorization failed silently. The team encoded this as a skill. The skill auto-loaded whenever the conversation involved a payment identifier. Junior engineers got expert behavior without typing a paragraph of setup every time.

Day eleven was a code-review agent in continuous integration, built on the Agent SDK. The team configured a session with a Rails seven review prompt and read-file and git-diff tools. The session ran on every pull request, posting review comments inline. It caught two deprecated patterns the human reviewers had missed across sixty pull requests.

The numbers. The upgrade landed in eleven working days instead of the original estimated eight weeks. Zero customer-facing incidents during the migration window.

The takeaway is what the team built around the agent, not the agent itself. The settings file was the team-wide safety baseline. The slash command was the repeated workflow as version-controlled choreography. The hooks were the observability surface that did not slow anyone down. The skill was the tribal knowledge made auto-loading. The Agent SDK was Claude Code's behavior embedded in continuous integration. Each of these primitives is its own lesson in this domain. The Rails upgrade is what happens when a team uses all of them together for two weeks straight.

Domain 3 · Case study

Production Story — HelioGrid skill library

HelioGrid is a fictional energy-infrastructure company. Their Python services touch supervisory control and data acquisition systems, billing, and a twenty-year-old Oracle database. The problem they wanted to solve was ramp time. Every junior engineer hit the same five gotchas in their first month. Senior engineers spent four to six hours a week answering the same questions. The team decided to encode the recurring questions as skills.

Sprint one was identifying the questions. The team pulled six months of internal help-engineering chat traffic, clustered the questions, and ranked them by frequency multiplied by time-to-resolve. The top forty questions covered about eighty-five percent of inbound. The list ranged from Oracle debugging to SCADA replay to billing recomputation to Grafana dashboard creation to pager rotation scheduling.

Sprints two through four were building forty skills, one per question. Each skill lived in a directory under dot claude slash skills, named after the topic. Each one had a SKILL file with the prompt and the auto-activation description. Each one had a reference directory with the relevant runbooks. Some had a scripts directory with helpers — small shell scripts that wrapped command-line tools the team uses regularly.

The descriptions were written deliberately. A description that says general database help would activate everywhere and load every time. A description that says when the user mentions a specific kind of SCADA event identifier would activate only in the conversations where it was useful. The team wrote each description like a help-desk ticket title — specific words and recognizable phrases that the model can match against the current conversation.

Sprints five through eight were tracking usage. The team added a PostToolUse hook that logged which skills auto-loaded and whether the conversation ended in a resolution. The top skills got attention — the team kept refining their prompts and reference material based on which ones were being used most. The rarely-used skills got reviewed — some were merged with related skills, some were removed entirely.

The numbers. New-engineer ramp dropped from nine weeks to three weeks. Skill auto-activation triggered roughly twelve times per day per engineer. Senior on-call pages dropped by thirty-eight percent because the agent handled the routine questions itself, leaving the seniors to focus on the genuinely hard problems.

The lesson the team kept coming back to was that skills are how you encode the institutional knowledge that previously lived in the heads of three senior engineers. Once the knowledge is in version control, it is reviewable, updatable, and inheritable. New engineers do not need to be told what skills exist. The skills find them.

That closes Domain Three. The next domain is Prompt Engineering and Structured Output — system prompts, JSON mode, few-shot examples, prompt caching, and extended thinking.

Overview · 20% of exam

Domain 4 — Prompt Engineering

The fourth domain on the exam is Prompt Engineering and Structured Output. It is twenty percent of your final score. This domain is the inner loop. Everything you have learned in the first three domains lives inside a prompt at some point. The quality of that prompt — its structure, its examples, its caching layout, whether it asks the model to think — is what separates an agent that works in production from an agent that works on the demo machine and falls apart on real traffic.

The through-line of this domain is engineering rigor applied to text. The exam wants you to think about prompts the way you would think about any other piece of production code. Where does each part of the prompt live? What is stable and what is volatile? How is the output structured and how is that structure verified? When should you give the model time to think? How are prompt changes reviewed and rolled back when they regress?

There are six lessons in this domain. The first is system prompts — where to put role, constraints, and the output contract so the cache hits and the model behaves consistently. The second is structured output — JSON mode, schema validation, and the one-retry pattern that closes the gap when the model produces a near-miss. The third is few-shot examples — how many, in what order, and what mistakes to avoid. The fourth is prompt caching — the five-minute server-side cache that cuts input cost by ninety percent on hits, and the static-first ordering that unlocks it. The fifth is extended thinking — when to enable a thinking budget and what it costs you. The sixth is iterating prompts like code — treating prompts the way you treat database schemas, with version control, evals, and review.

Two production stories close the domain. Both are about teams that cut their language-model costs by more than half through deliberate prompt engineering, without sacrificing quality. Let's begin.

Domain 4 · Lesson 1

System Prompts

The hook is a cache hit rate of three percent. A team has a support assistant that runs across thousands of tenants. They have prompt caching enabled. They expect to see most calls hit the cache. They are seeing three percent of calls hit it. Their input cost is fifty times what it should be. The cause is in the system prompt itself, and the lesson teaches you how to see it.

A system prompt is where you put the role the model is playing, the constraints the model has to respect, and the output contract you want the model to honor. The exam wants you to know two things. The first is what belongs in the system prompt as opposed to the user message. The second is why the placement matters for cost and consistency.

The placement rule is structural, not stylistic. Anything that is stable across calls belongs in the system prompt. Anything that changes per call belongs in the user message. Stable means the same string, today and tomorrow, for every call in the same workload. Volatile means a value that is unique to this call — today's date, the current user's question, the most recent ticket.

Why does this matter for cost? Because prompt caching works by matching prefixes. The server looks at the start of your prompt and asks whether it has seen this exact prefix in the last five minutes. If yes, it serves the cached version and charges you ten percent of the normal input price. If no, it processes the prompt from scratch. The cache cannot help you if your system prompt contains volatile content, because the prefix changes on every call.

The OmniSupport example on the site shows the contrast. The original system prompt looked sensible — it named the tenant, named today's date, and named the current ticket. The problem is that today's date changes daily and the current ticket changes every call. The cache prefix changes every call. The hit rate is three percent because the only calls that hit the cache are the few that happen in the same minute on the same ticket.

The fix is to move the volatile fields out of the system prompt and into the user message. The stable parts — the role, the tenant, the tone, the tool list, the constraints — stay in the system prompt. The volatile parts — today's date, the ticket summary, the question — move to the user message. The cache prefix stops changing. The hit rate climbs to seventy-eight percent within a day. The input cost on that path drops by sixty percent.

Why does placement matter for consistency? Because system prompts behave differently from user messages in subtle ways. The model treats the system prompt as authoritative. Instructions placed there are more likely to be followed across long conversations. Instructions placed in the user message can be drowned out by subsequent user messages or by the model's own outputs. If you want the model to always follow a constraint — never make medical claims, always use upper case for tool names, always return JSON — the constraint goes in the system prompt.

A common pitfall is putting the tenant configuration in the user message. The tenant configuration is stable for the duration of the conversation. It belongs in the system prompt where the cache can amortize it. Putting it in the user message means paying for it every call.

Another pitfall is treating the system prompt as documentation. The system prompt is what the model reads to decide how to behave. If your system prompt is a six-thousand-word manifesto, the model has to read six thousand words on every uncached call. Trim the system prompt to what the model actually needs to know in order to behave correctly. Move reference material to a tool or a resource.

A subtle one. System prompts that mix stable and volatile content because the team wanted the structure to look pretty. The shape that caches well looks ugly to engineers and that is fine. Cache before aesthetics.

The next lesson is what to do when the output the model produces has to be machine-readable. That is where structured output and JSON mode come in.

Domain 4 · Lesson 2

Structured Output

The hook is a JSON parser that fails on the third call out of every ten. A team's resume parser at an applicant tracking system takes free-text resumes and returns structured records. The recipe was a prompt that asked the model to return JSON and a try-catch that handled the failures. The failure rate was twenty-two percent. The team's logs were full of missing commas, trailing keys, and string fields where booleans were expected. Quality was nominally fine. The customer experience was not.

The lesson is that asking the model for JSON is necessary but not sufficient. You need three things stacked together. You need JSON mode enabled so the model is constrained to producing JSON syntax. You need a schema validator that rejects outputs that do not match the shape you expect. You need a targeted retry that runs once on a validation failure, including the validator's specific error in the retry prompt.

The exam wants you to know what each piece does. JSON mode is a server-side constraint. The model's outputs are forced to be syntactically valid JSON. This eliminates the missing-comma failures and the trailing-key failures. It does not eliminate semantic failures. The model can return valid JSON whose fields are wrong types or whose values are out of range. JSON mode handles syntax, not semantics.

A schema validator is what catches the semantic failures. The example on the site uses a Zod schema in TypeScript, but the principle applies in any language. The schema declares the fields you expect, their types, and any constraints — email is a valid email, end date is either a string or null, experience is an array of objects with specific fields. The validator runs after the model returns. If the output validates, you proceed. If it does not, you retry.

The retry is where most teams underinvest. A retry that just re-prompts the model with the same prompt does not help. The model will produce the same kind of error. The retry that helps is one that includes the validator's specific error in the prompt. Last attempt failed because end date for the second job was not a valid string. The model reads the error, fixes the specific field, and returns the corrected output. The recipe on the site succeeds ninety-five percent of the time on the second attempt.

The exam expects you to know the budget rule. One retry. Not two. Not five. If the second attempt fails, the prompt is fundamentally not communicating what you need, or the input is genuinely beyond the model's ability to parse. Retrying further wastes money without changing the outcome. Surface the failure to the next layer — log it, queue it for human review, fall back to a simpler extraction — and move on.

A common pitfall is conflating JSON mode with structured output. JSON mode is one specific server feature that constrains the syntax. Structured output is the broader discipline of producing machine-readable results reliably. JSON mode helps. It is not the whole answer.

Another pitfall is retrying with the raw exception message. A stack trace in the retry prompt does not help the model. The model needs a plain-English explanation of what went wrong. The validator usually produces good messages on its own. Forward those, not the underlying exception.

A subtle one. Schemas that are too loose. A schema that types every field as a string and never validates against ranges or enums will accept many wrong outputs. The model will produce them. The next layer will fail in confusing ways. Tighten your schemas with the same rigor you would apply to a database column definition.

Another subtle one. Schemas that include optional fields the model rarely populates. The model reads the schema and tries to populate everything. Optional fields encourage the model to fabricate values. Mark fields required if you actually need them, and accept the validation failure when the input genuinely does not have the data.

The next lesson is the oldest trick in the prompt engineering book — examples — and the version of it that the exam expects you to know.

Domain 4 · Lesson 3

Few-Shot Examples

The hook is an email triage system that gets the easy cases right and the hard cases wrong. A customer-success team auto-routes inbound mail. The prompt has eighteen examples. The accuracy on the obvious cases — newsletters, refund requests — is high. The accuracy on the ambiguous cases — the partnership pitch that looks like a customer email, the upsell that looks like a complaint — is middling. The team adds more examples. Accuracy does not improve. They add even more. Accuracy slightly decreases. The cost per call keeps climbing.

The lesson is that few-shot is about quality of examples and order, not quantity. Three to five carefully chosen examples produce better behavior than fifteen indifferently chosen ones. The exam wants you to know two specific things — how many examples to use and what order to put them in.

The number is three to five. Less than three is usually not enough for the model to see the pattern. More than five usually does not help. Beyond five, the model starts to overfit to the surface shape of the examples — the formatting, the phrasing, the length — and underfit to the underlying judgment you are trying to teach. Twenty similar examples produce a model that copies the examples. Five diverse examples produce a model that applies the principle.

The order matters because of recency bias. Models pay disproportionate attention to the most recent content in their context. The last example is the one the model remembers most clearly. The team's job is to deliberately exploit this. Put the hardest example last. The hardest example is the one whose judgment is most subtle, the one that distinguishes the case the team most often gets wrong from the case that looks like it.

The OmniSupport example on the site walks through this. The five examples are a newsletter, a refund request, a partnership pitch, an urgent outage report, and an ambiguous upsell. The first is easy. The second is the common case. The third is the edge case where the email looks like a customer email but is not. The fourth is the high-stakes case. The fifth — the ambiguous upsell — is the hardest, and it is the one placed last. Within a week, accuracy on the ambiguous bucket climbs eleven points. The model learned the judgment by seeing the hardest example most clearly.

The diversity rule reinforces the count rule. If your five examples cover five different categories — newsletter, refund, pitch, outage, upsell — the model is forced to learn the underlying classification rule. If your five examples are five variations of refund requests, the model learns one category very well and the others not at all. Diversity teaches generalization. Repetition teaches mimicry.

A common pitfall is choosing examples that are too representative. Examples that are dead center in their categories are the easiest for the model to handle without examples at all. Choose examples that sit near the boundary between categories — the partnership pitch that looks like a customer email, the urgent outage that initially sounds like a feature request. Boundary examples teach the model where the line is.

Another pitfall is letting examples drift. The product changes. New email patterns appear. The examples in the prompt still represent the email patterns from six months ago. Refresh the examples on the same cadence you refresh your golden set. The two should evolve together.

A subtle one. Few-shot examples that include explanations of the right answer. Some prompt formats include the reasoning alongside the answer. This sometimes helps and sometimes hurts. Helps when the reasoning teaches a principle the model can generalize. Hurts when the reasoning is post-hoc rationalization the model copies without understanding. Test both versions on your golden set. Whichever produces better accuracy on the hardest cases wins.

Another subtle one. Few-shot examples that contradict the system prompt. The system prompt says to classify urgency on a three-level scale. One of the examples uses a five-level scale because someone copy-pasted from an old version. The model will pick one or the other depending on which it pays more attention to, and the result will be inconsistent. Examples and instructions have to agree.

The next lesson is where most of the cost actually goes — the prompt caching layer that makes the difference between an integration that scales and one that bankrupts you.

Domain 4 · Lesson 4

Prompt Caching

The hook is an internal HR Q-and-A bot whose input cost is fifteen cents per question. A fifty-thousand-token internal handbook is the context. Every employee question re-pays the entire handbook on input. The team has a thousand employees asking a few questions a day. The monthly bill is a number that lands in front of the chief financial officer with a frown next to it.

Prompt caching is the answer. Prompt caching is a server-side cache that the platform maintains for five minutes after a request. If your next request starts with the same prefix as the cached request, the server serves the prefix from cache and charges you about ten percent of the normal input price for that portion. The cache is the single largest cost-optimization lever in language-model operations, and it requires no model changes — only structural prompt design.

The exam wants you to know two things. The first is how the cache decides what is a match. The second is how to architect your prompts so the cache hits.

The match rule is prefix-based. The server compares the start of your prompt to the start of cached prompts. If they match exactly, byte for byte, the cached portion is served. If they diverge anywhere, the cache misses from that point onward and you pay full price for the divergent portion and everything after it. The cache does not match suffixes. It does not match content in the middle. It is strictly prefix-based.

This is what dictates the architecture. The static block of your prompt — the parts that are the same on every call — must come first. The volatile block must come last. The HR bot example on the site lays out the right shape. The system prompt and the handbook come first, marked as a cache breakpoint. The user question comes last and is short. The cache prefix is the entire static block. After warm-up, ninety-seven percent of calls hit the cache. Input cost drops from fifteen cents per question to about one point eight cents. Latency drops too, because serving from cache is faster than re-ingesting.

The placement of the cache breakpoint is what most teams get wrong. The breakpoint tells the server where to stop trying to match the prefix and start treating the rest as volatile. If you put the breakpoint at the wrong place — say, before your handbook — the cache cannot help with the handbook, and the entire handbook gets re-ingested on every call. The OmniSupport case study calls this out specifically. The team had the knowledge base after the user message, which meant the cache breakpoint sat after the volatile bit. Hits were nearly zero. The structural fix — knowledge base first, question last, breakpoint after the knowledge base — turned that into a massive cost win.

The five-minute expiry is the other thing the exam expects you to know. The cache is not persistent. Each cached entry lives for five minutes from its last hit. A workload with steady traffic keeps cache entries warm indefinitely. A workload with bursty traffic — a few hundred questions in five minutes, then nothing for an hour — benefits from caching during each burst but pays full price for the first question of the next burst.

A common pitfall is forgetting that the cache is per-prefix, not per-prompt. Two slightly different system prompts produce two different cache entries. If you are running multiple tenants and giving each one a slightly different system prompt, you are paying for one cache entry per tenant rather than one cache entry total. Decide whether that trade is worth it. Sometimes the per-tenant customization is worth the cache fragmentation. Often it is not.

Another pitfall is treating the cache as automatic. It is not. You enable it by marking cache breakpoints in your prompt. The default behavior on most APIs is no caching at all. You have to opt in.

A subtle one. Cached prefixes that include rapidly stale content. A system prompt that includes a list of currently-trending topics, updated every hour, will produce a cache that expires every hour. Either remove the volatile content from the cached block or accept the lower cache hit rate.

Another subtle one. Underestimating the latency win. Cache hits are not just cheaper. They are faster. For interactive chat, the latency savings matter to the user as much as the cost savings matter to the chief financial officer.

The next lesson is the opposite of speed — when you want the model to slow down and think.

Domain 4 · Lesson 5

Extended Thinking

The hook is a math tutor that gets simple arithmetic right and word problems wrong. A K-12 math tutoring product handles two types of question. Arithmetic lookups — what is two hundred thirty-four times seven — and word problems — Lisa has three times as many marbles as Tom, after Tom gives her four marbles she has twice as many, how many did Tom start with. The model gets the arithmetic right almost always. The model gets the word problems right seventy-one percent of the time. The team adds more examples. Accuracy improves a few points. They want eighty-five.

Extended thinking is the feature that closes the gap. Extended thinking is a server-side capability where the model is given a separate budget of tokens to reason before producing its final answer. The reasoning is not shown to the user by default. It is the model's scratch work. The final answer is what the user sees.

The exam wants you to know two things. When to enable extended thinking and what it costs you.

The when-to-enable answer is on tasks that require genuine multi-step reasoning. Word problems. Multi-step math. Logic puzzles. Constrained optimization. Anything where the answer is not a single lookup but a chain of inferences. Extended thinking is worth its budget when the task is something a human would also stop and think about for a few seconds before answering.

The when-not-to-enable answer is everything else. Lookups. Classifications. Single-step transformations. Simple extractions. These tasks do not benefit from extended thinking because there is no extended reasoning to do. The thinking budget is spent generating tokens that do not improve the answer. You pay the cost without getting the benefit.

The math tutor example on the site shows the gating pattern. The system has a small classifier that decides whether each question is a lookup or a word problem. If it is a word problem, the call is made with a thinking budget of four thousand tokens. If it is a lookup, the call is made without any thinking budget. Pass rate on word problems jumps from seventy-one percent to eighty-nine percent with extended thinking enabled. Lookups stay fast and cheap because the budget never activates.

The cost story is the second thing the exam expects you to know. Extended thinking is billed as output tokens. The four thousand tokens of thinking the model used to reason about a word problem cost four thousand output tokens, even though those tokens are not shown to the user. If you enable extended thinking globally on a high-volume workload, you can double or triple your output token bill overnight. The gating decision is a real cost decision, not just a quality decision.

The budget itself is configurable. You can set the maximum number of tokens the model is allowed to spend on thinking. Lower budgets are faster and cheaper but help less on the hardest problems. Higher budgets help more but at proportional cost. The right budget for a workload is the number where additional thinking stops improving accuracy on your golden set. Most word-problem workloads settle in the two to eight thousand range.

A common pitfall is enabling extended thinking on tasks where it does not help. The temptation is to turn it on globally and let the model decide whether to use the budget. The model does not decide whether to use the budget. It always uses the budget, because the budget is allocated. Gate enablement on the task shape, not on the model's discretion.

Another pitfall is showing the thinking to the user. The thinking is the model's scratch work. It is often messy, sometimes wrong in places, and routinely much longer than the final answer. Showing it to the user is usually a worse experience than hiding it. There are exceptions — a math tutor that wants to show its work for pedagogical reasons. But the default is to suppress.

A subtle one. Extended thinking interacts with prompt caching in ways the exam may test. The cache prefix does not include the thinking budget. Two calls with the same prefix but different thinking budgets share a cache entry. This is usually fine. Be aware of it if you are debugging cache hit rates.

The last lesson in this domain is the discipline that ties everything together — treating prompts the way you treat any other piece of production code.

Domain 4 · Lesson 6

Iterating Prompts Like Code

The hook is a prompt change that quietly tanks production. A team ships an updated system prompt on a Tuesday afternoon. It feels like an improvement when they test it interactively. By Friday, customer support is fielding complaints. By Monday, the leadership team is asking why renewal rates are down. By Wednesday, someone traces the regression back to the Tuesday prompt change. They revert. The product recovers. The team has lost a week of revenue and a chunk of customer trust because they treated a prompt change like a documentation update instead of like a code change.

The lesson is that prompts have production impact and they deserve the same engineering rigor as any other piece of production code. The exam wants you to know what that rigor looks like. The pattern on the site, used by a fictional fifty-person AI team, is the canonical version. Every prompt lives in a file under a prompts directory, version-controlled with the rest of the codebase. Every pull request that changes a prompt runs the eval harness on a fixed set of golden tasks. A bot posts the results of the eval as a comment on the pull request — success rate moved from seventy-eight percent to eighty-two percent, cost moved up twelve percent, latency stable. The team reads the comment, decides whether the trade is acceptable, and either merges or sends it back for another iteration.

The exam expects you to know four pieces of this discipline. Version control, evaluation, review, and rollback.

Version control means the prompt lives in a file, not in a system you can edit at runtime without a paper trail. The history is visible. The blame is visible. The diff for any change is reviewable like any other diff. When something regresses, you can pinpoint the commit that introduced the regression and revert it.

Evaluation means there is a golden set the prompt is tested against on every change. Not by hand. By a harness that runs automatically on every pull request. The harness produces a comparable metric — task success rate, cost per call, latency — that you can read at a glance.

Review means a human reads the prompt diff and the eval result before the change ships. Pull requests get approved. Prompts get approved. The reviewer's job is to look at the diff and at the eval result and decide whether the change is worth shipping. They might catch a regression the eval missed. They might notice that the prompt is now contradicting another prompt in the same system. They might just say it looks fine, ship it.

Rollback means you can undo a prompt change as fast as you can undo a code change. Because the prompt is version-controlled, the rollback is a revert. Because the prompt is loaded from a file at deployment time, the revert ships the next time the application redeploys. If you cannot revert a prompt in under five minutes, you do not have the right control surface and a regression that lands on Tuesday will not be undone until Friday.

The pattern that ties this together is treating the prompt as a deployment artifact. Prompts are not configuration. Configuration changes happen at runtime, often without a paper trail. Prompts are code that produces production behavior. Treat them like code.

A common pitfall is letting prompts live in environment variables or in a database that someone can edit through an admin interface. That admin interface becomes the production-incident generator. Three months in, no one remembers who changed what. Move prompts into the codebase. Edit them through pull requests.

Another pitfall is shipping prompt changes without an eval gate. The eval is what keeps the team honest. Without it, every prompt change is a vibe check. Vibe checks are how regressions ship.

A subtle one. Eval harnesses that take too long to run. If the eval takes an hour, no one runs it before opening the pull request. The eval has to be fast enough to run on every pull request without complaint. Two hundred tasks running in under five minutes is the right zone for most teams.

That closes the six lessons in Domain Four. The next two chapters are production stories — case studies tagged for this domain. Both walked into the same problem from different angles and both came out on the other side with cost cuts of more than half.

Domain 4 · Case study

Production Story — OmniSupport cost cuts

OmniSupport is a fictional customer-experience SaaS company. Their AI Reply Assistant feature was costing eleven thousand dollars a day on two million conversations. Quality was fine. Cost was eating the gross margin. The team ran an audit and found three leaks, and the five-week project that followed cut cost per conversation by sixty-two percent while holding quality steady at four point six out of five across a five-hundred-conversation review sample.

The first leak was volatile content in the system prompt. The original prompt named the tenant, named today's date, and named the current ticket. Today's date changes daily. The current ticket changes every call. The cache hit rate was three percent. The fix was structural. Stable content — the role, the tenant, the tone, the constraints — stayed in the system prompt. Volatile content — today's date, the ticket summary, the question — moved to the user message. The cache hit rate climbed to seventy-eight percent in a week. Input cost on that path dropped sixty percent.

The second leak was JSON output that did not validate. The ticket summarizer produced JSON the parser frequently failed on — missing commas, trailing keys, string fields where booleans were expected. The team's fix had been a try-catch that re-prompted with the raw error text. It worked, but the retry rate was twenty-two percent. The new recipe was JSON mode plus a Zod schema plus a single targeted retry that included the validator's specific error in the prompt. Last attempt failed because the needs-human field was the string "true" instead of a boolean. Retry rate dropped under three percent.

The third leak was a knowledge-base feature paying for the same knowledge base on every call. The relevant article — usually eight to twelve thousand tokens — was passed inline with every customer question. Prompt caching was enabled, but the article came after the user message. The cache breakpoint sat after the volatile bit. The team flipped the order. Knowledge base first. Question last. Cache breakpoint after the knowledge base. Identical articles fetched within the five-minute window cost about ten percent of the original input price.

The team also tuned the few-shot examples on email triage. They had been packing eighteen examples into the prompt. Performance was middling and tokens were heavy. They cut to five carefully chosen examples — newsletter, refund request, partnership pitch, urgent outage, ambiguous upsell — and put the ambiguous one last to lean into recency bias. Accuracy on the hard cases climbed from seventy-nine to eighty-four percent. Cost per call dropped from four point one cents to one point four cents.

Extended thinking got gated. The math-help feature had it enabled globally. The team gated it on a heuristic — only enable for messages classified as computational. The simple lookups stayed fast and cheap. The genuinely hard ones got the budget.

The last piece was moving prompts into version control. Every prompt-changing pull request now runs the eval harness against a one-hundred-conversation golden set. A bot posts a diff comment with success rate, cost per call, and latency. Pull requests that regress any metric require explicit override. This is what kept the gains from rotting over time.

The takeaway is the leverage of structural prompt engineering. Stable in the system. Volatile in the user. Schemas to catch the near-misses. Cache breakpoints after the static block. Few examples chosen well, hardest last. Extended thinking gated. Prompts as code, with evals. Each lesson on its own is a small win. Together they were a sixty-two percent cost reduction with quality held steady.

Domain 4 · Case study

Production Story — NorthStar cost engineering

NorthStar Analytics is a fictional business-intelligence company. Their monthly language-model spend was fifty-eight thousand dollars. The four-week project that followed cut it to nineteen thousand dollars — a sixty-seven percent reduction. The prompt-cache hit rate moved from eleven percent to eighty-one percent. Customer-perceived latency on the busiest endpoint dropped twenty-two percent.

Week one was an audit. The team instrumented their workloads and asked one question. Where is the money going? The largest spender was a "smart questions" feature that re-sent a forty-thousand-token product schema on every customer query. Caching was enabled, but the variable user message came first and broke the prefix every time. Cache hit rate sat at eleven percent.

Week two was static-first ordering. The team moved the schema into the system prompt and dropped today's volatile values — the user identifier, the tenant identifier, the current date — into a small user-side block. Cache hits climbed past seventy percent within an afternoon. The fix was structural, not technical. The schema's value lived in being identical across calls. Putting it before the volatile content was the only way the cache could see that.

Week three was model tiering. The team's "did the customer mean X?" intent classifier was running on Sonnet. The eval showed that Haiku, the smaller and cheaper model in the family, was nearly as accurate on this specific task. The team switched the classifier to Haiku. Quality drop on the eval set was one point two percentage points — within the team's acceptable threshold. Cost on that endpoint dropped seventy-eight percent.

Week four was compaction inside long agent sessions. Long-running analyst conversations were carrying every tool result verbatim — every query result, every chart, every transformation. After a few hours, the agent's context window was full of artifacts that no longer mattered for the current question. The team added an explicit compaction step every fifteen turns that collapses prior tool outputs into a structured "what we learned" briefing. The agent's context stayed bounded. The cost per turn stayed flat instead of climbing as the session aged.

The numbers across the four weeks. Monthly spend dropped from fifty-eight thousand to nineteen thousand. Cache hit rate climbed from eleven percent to eighty-one percent. Latency on the busiest endpoint dropped twenty-two percent.

The takeaway, in the team's own framing. The prompt-cache audit alone paid for the whole project four times over. The right model for the right step is a real lever — Haiku for the easy work, Sonnet for the medium work, Opus for the genuinely hard work. Compaction is selective summarization, not a recap. None of these are exotic techniques. They are engineering hygiene applied to prompts.

That closes Domain Four. The last domain is Context Management and Reliability — token budgets, compaction, retries, observability, and the production patterns that keep an agent running when things go wrong.

Where to go from here

Closing

That is the Deep Curriculum, end to end. If you have made it this far in one stretch, take a break before you do anything else with your day — and then come back to the site for the parts that the audio cannot do for you.

The first thing to do, after a listening pass, is the diagnostic. The diagnostic is fifteen minutes long and produces a per-domain readiness score. If your scores are uneven, the fourteen-day plan will rebalance itself in real time so the weakest domains float to the front of your remaining days. If the scores are even and high, skip the plan and go straight to the full mock exam.

The drill mode is the second thing. Drill is built on spaced repetition over the same knowledge points the rest of the site tracks. A short drill session every day for a week is worth more than one long session the night before. The drill cards are quick. Most are under twenty seconds.

The full mock exam is the final gate. Sixty questions, ninety minutes, seven hundred and twenty out of a thousand to pass. Take it under exam conditions — phone away, single sitting, no tab switching. If you score above seven hundred and twenty on the mock, you are ready. If you score between six hundred and twenty and seven hundred and twenty, you are within striking distance and can probably close the gap with two or three more days of focused drill on your weakest domain. Below that, do another listening pass through the two weakest domains in this audiobook and try the diagnostic again before re-attempting the mock.

A last word, from the team that maintains this site. The Claude Certified Architect exam is a snapshot. It tests what is true today about the platform — the model lineup, the protocols, the patterns that work in production at the time the exam was written. The platform will change. The exam will change with it. The thing this audiobook is trying to give you, beneath the test prep, is taste. Taste for which architectural pattern fits which problem. Taste for when to reach for an agent loop versus a single inference. Taste for when a tool should exist and when its work belongs inside the model's response.

If you build that taste, you will pass the exam, and the exam will be the smaller of the two things you got out of the work. Good luck on test day.