Saturday, May 16, 2026

Agent Harness Engineering – O’Reilly

This article was originally published on Addy Osmani's blog. It is being reposted here with the author's permission.

Roughly: Anytime you notice an agent make a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

We've spent the last two years arguing about models. Which one is smartest, which one writes the cleanest React, which one hallucinates less. That conversation is fine as far as it goes, but it's missing the other half of the system. The model is one input into a working agent. The rest is the harness: the prompts, tools, context policies, hooks, sandboxes, subagents, feedback loops, and recovery paths wrapped around the model so it can actually finish something.

A decent model with a great harness beats a great model with a bad harness. I've watched this play out in my own work over and over. And increasingly the interesting engineering isn't in choosing the model; it's in designing the scaffolding around it.

That discipline now has a name. Viv Trivedy coined the term harness engineering, and his "Anatomy of an Agent Harness" post is the cleanest derivation of what a harness actually is and why each piece exists. Dex Horthy has been tracking the pattern as it emerges. HumanLayer frames most agent failures as "skill issues" that come down to configuration rather than model weights. Anthropic's engineering team has published what I think is the best public breakdown of how to design a harness for long-running work. And Birgitta Böckeler has an overview of what this looks like from the user's side.

This post is my attempt to pull these threads together.

What is a harness, really?

Viv's one-liner does most of the work:

Agent = Model + Harness. If you're not the model, you're the harness.

A harness is every piece of code, configuration, and execution logic that isn't the model itself. A raw model isn't an agent. It becomes one once a harness gives it state, tool execution, feedback loops, and enforceable constraints.

The model is one chip on the board. The harness is everything else that makes it useful.

Concretely, a harness includes:

  • System prompts, CLAUDE.md, AGENTS.md, skill files, and subagent prompts
  • Tools, skills, MCP servers, and their descriptions
  • Bundled infrastructure (filesystem, sandbox, browser)
  • Orchestration logic (subagent spawning, handoffs, model routing)
  • Hooks and middleware for deterministic execution (compaction, continuation, lint checks)
  • Observability (logs, traces, cost and latency metering)

Simon Willison reduces the loop half to its essence: an agent is a system that "runs tools in a loop to achieve a goal." The skill is in the design of both the tools and the loop.
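Willison's definition fits in a few lines of code. Here is a minimal sketch of that loop, assuming a hypothetical `llm` callable that returns either a tool call or a final answer; the message shapes and tool-call format are illustrative, not any vendor's actual API.

```python
# Minimal "tools in a loop" sketch. `llm` is a stand-in for a model
# client; `tools` maps tool names to plain Python callables.
def run_agent(llm, tools, goal, max_steps=20):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = llm(history)                 # model reasons over the history
        if reply.get("tool") is None:        # no tool call: the agent is done
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])   # harness executes
        history.append({"role": "tool", "content": result})  # observation
    return None  # budget exhausted; a real harness would compact and continue
```

Everything the rest of this post discusses — hooks, sandboxes, compaction, subagents — is elaboration on what happens inside and around this loop.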

If that sounds like a lot of surface area, it is. And it's your surface area, not the model provider's. Claude Code, Cursor, Codex, Aider, Cline: These are all harnesses. The model underneath is often the same, but the behavior you experience is dominated by what the harness does.

coding agent = AI model(s) + harness

This equation, articulated by Viv and echoed by HumanLayer, is where the work actually lives. The debate over the left-hand side is loud. Most of the actual leverage sits on the right.

The "skill issue" reframe

There's a pattern I watch engineers fall into. The agent does something dumb, the engineer blames the model, and the blame gets filed under "wait for the next model."

The harness-engineering mindset rejects that default. The failure is usually legible. The agent didn't know about a convention, so you add it to AGENTS.md. The agent ran a dangerous command, so you add a hook that blocks it. The agent got lost in a 40-step task, so you split it into a planner and an executor. The agent kept "finishing" broken code, so you wire a typecheck back-pressure signal into the loop.

HumanLayer says: "It's not a model problem. It's a configuration problem." Harness engineering is what happens when you take that seriously.

There's a striking data point that shows up in both Viv's write-up and HumanLayer's. On Terminal Bench 2.0, Claude Opus 4.6 running inside Claude Code scores far lower than the same model running in a custom harness. Viv's team moved a coding agent from Top 30 to Top 5 by changing only the harness. Models get posttrained coupled to the harness they were trained against. Moving them into a different harness, with better tools for your codebase, a tighter prompt, and sharper backpressure, can unlock capability the original harness was leaving on the floor.

This is the opposite of the "just wait for GPT-6" narrative. The gap between what today's models can do and what you see them doing is largely a harness gap.

The ratchet: Every mistake becomes a rule

The most important habit in harness engineering is treating agent mistakes as permanent signals. Not one-off stories to laugh about, not "bad runs" to retry. Signals.

If the agent ships a PR with a commented-out test and I merge it by accident, that's an input. The next version of my AGENTS.md says "never comment out tests; delete them or fix them." The next version of my precommit hook greps for .skip( and xit( in the diff. The next version of my reviewer subagent flags commented-out tests as a blocker.
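That precommit check is simple enough to sketch. This is an illustrative version, not any particular hook framework: it scans only the lines a diff adds for skip patterns, and the regex would be tuned to your own test framework.

```python
import re

# Patterns for skipped tests; illustrative, framework-specific in practice.
SKIP_PATTERNS = re.compile(r"\.skip\(|\bxit\(")

def added_lines(diff):
    """Yield only the lines a unified diff adds (skip removals, context, headers)."""
    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]

def check_diff(diff):
    """Return the offending added lines; an empty list means the diff is clean."""
    return [line for line in added_lines(diff) if SKIP_PATTERNS.search(line)]
```

Wired as a pre-commit hook, you would feed it the output of `git diff --cached` and exit nonzero when the list is nonempty, printing the offenders so the agent (or human) sees exactly what was blocked.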

You only add constraints when you've seen a real failure. You only remove them when a capable model has made them redundant. Every line in AGENTS.md should be traceable back to a specific thing that went wrong.

This is also why harness engineering is a discipline rather than a framework. The right harness for your codebase is shaped by your failure history. You can't download it.

Working backward from behavior

The framing from Viv that I find most useful when I'm actually designing a harness is to start from the behavior you want and derive the harness piece that delivers it. His pattern: behavior we want (or want to fix) → harness design to help the model achieve this.

Every harness feature is a bridge across a specific thing the model can't do on its own

The useful thing about deriving it this way is that every harness component has a specific job. If you can't name the behavior a component exists to deliver, it probably shouldn't be there.

The rest of this section walks the pieces in roughly the order Viv does, with the specific patterns I've found worth stealing.

Filesystem and Git: Durable state

The filesystem is the most foundational primitive, and it tends to be underrated because it's boring. Models can only directly operate on what fits in context. Without a filesystem, you're copy-pasting into a chat window, and that isn't a workflow.

Once you have a filesystem, the agent gets a workspace to read data, code, and docs; a place to dump intermediate work instead of holding it in context; and a surface where multiple agents and humans can coordinate through shared files. Adding Git on top gives you versioning for free, so the agent can track progress, roll back mistakes, and branch experiments.

Most of the other harness primitives end up pointing at the filesystem for something.

Bash and code execution: The general-purpose tool

The main agent loop today is a ReAct loop: The model reasons, takes an action via a tool call, observes the result, and repeats. But a harness can only execute the tools it has logic for. You can try to prebuild a tool for every possible action, or you can give the agent bash and let it build the tools it needs on the fly.

Willison's take on this is that agents already excel at shell commands; most tasks collapse to a few well-chosen CLI invocations. Harnesses still ship focused tools, but bash plus code execution has become the default general-purpose strategy for autonomous problem solving. It's the difference between teaching someone to use a single kitchen gadget and handing them a kitchen.
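A harness-side bash tool can be sketched in a few lines. This is a minimal illustration, assuming `bash` is available on the host; real harnesses run this inside a sandbox (next section) and add allow-lists. The timeout and output cap are arbitrary example values.

```python
import subprocess

def bash_tool(command, timeout=30, max_bytes=4096):
    """Run an arbitrary shell command with a timeout and an output cap,
    so one runaway invocation can't stall the loop or flood the context."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout}s"
    output = (proc.stdout + proc.stderr)[:max_bytes]
    return f"exit={proc.returncode}\n{output}"
```

The exit code matters as much as the output: it is the cheapest verification signal the loop gets, and models have been posttrained to act on it.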

Sandboxes and default tooling

Bash is only useful if it runs somewhere safe. Running agent-generated code on your laptop is risky, and a single local environment doesn't scale to many parallel agents.

Sandboxes give agents an isolated working environment. Instead of executing locally, the harness connects to a sandbox to run code, inspect files, install dependencies, and verify work. You can allow-list commands, enforce network isolation, spin up new environments on demand, and tear them down when the task is done.

A good sandbox ships with sensible defaults: preinstalled language runtimes and packages, Git and test CLIs, a headless browser for web interaction. Browsers, logs, screenshots, and test runners are what let the agent observe its own work and close the self-verification loop.

The model doesn't configure its execution environment. Deciding where the agent runs, what's available, and how it verifies its output are all harness-level calls.

Memory and search: Continual learning

Models have no knowledge beyond their weights and what's currently in context. Without the ability to edit weights, the only way to add knowledge is through context injection.

The filesystem is again the primitive. Harnesses support memory-file standards like AGENTS.md that get injected on every start. As the agent edits that file, the harness reloads it, and knowledge from one session carries into the next. It's a crude but effective form of continual learning.
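The injection side of that pattern is almost trivially small, which is part of why it works. A sketch, assuming the AGENTS.md convention and a prompt layout of my own invention:

```python
from pathlib import Path

def build_system_prompt(repo_root, base_prompt):
    """Prepend the repo's memory file (if any) to the system prompt at
    session start; edits the agent made last session carry into this one."""
    memory = Path(repo_root) / "AGENTS.md"
    if not memory.exists():
        return base_prompt
    return f"{base_prompt}\n\n# Project memory (AGENTS.md)\n{memory.read_text()}"
```

The "reload" half is just calling this again at the start of every session; the persistence comes entirely from the file living in the repo.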

For knowledge that didn't exist at training time (new library versions, current docs, today's data), web search and MCP tools like Context7 bridge the cutoff. These are useful primitives to bake into the harness rather than leaving to the user.

Battling context rot

Context rot is the observation that models get worse at reasoning and completing tasks as the context window fills up. Context is scarce, and harnesses are largely delivery mechanisms for good context engineering.

Three techniques show up repeatedly:

Compaction. When the window gets close to full, something has to give. Letting the API error out isn't an option for a production harness, so the harness intelligently summarizes and offloads older context so the agent can keep working.

Tool-call offloading. Large tool outputs (think 2,000-line log files) clutter context without adding much signal. The harness keeps the head and tail of outputs above a token threshold and offloads the full output to the filesystem, where the agent can read it on demand.

Skills with progressive disclosure. Loading every tool and MCP into context at startup degrades performance before the agent takes a single action. Skills let the harness reveal instructions and tools only when the task actually requires them.
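Tool-call offloading is the easiest of the three to make concrete. A sketch, with an illustrative character threshold (real harnesses count tokens) and a made-up scratch-file name:

```python
from pathlib import Path

def offload_output(output, scratch_dir, keep=500):
    """Above a size threshold, keep only the head and tail of a tool's
    output in context and spill the full text to a file for later reads."""
    if len(output) <= 2 * keep:
        return output  # small enough to stay in context verbatim
    path = Path(scratch_dir) / "tool_output.txt"
    path.write_text(output)
    return (
        output[:keep]
        + f"\n... [{len(output)} chars total; full output at {path}] ...\n"
        + output[-keep:]
    )
```

The placeholder line is load-bearing: it tells the model where the full output lives, so reading the rest becomes an ordinary filesystem tool call rather than lost information.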

Anthropic's harness post adds one more technique for the really long jobs: full context resets, where the harness tears the session down and rebuilds it from a compact handoff file. They're explicit that compaction alone wasn't sufficient for long tasks; sometimes you need to start fresh with a structured brief. This is closer to how humans onboard a new engineer than to how we usually think about "memory."

Long-horizon execution: Ralph loops, planning, verification

Autonomous long-horizon work is the holy grail and the hardest thing to get right. Today's models suffer from early stopping, poor decomposition of complex problems, and incoherence as work stretches across multiple context windows. The harness has to design around all of that.

I've written about autonomous coding loops like the Ralph loop before, in self-improving agents and in my 2026 trends piece, but it's worth restating in this framing: A hook intercepts the model's attempt to exit and reinjects the original prompt into a fresh context window, forcing the agent to continue toward a completion goal. Each iteration starts clean but reads state from the previous one through the filesystem. It's a surprisingly simple trick for turning a single-session agent into a multisession one, and it's the kind of primitive you'd never derive from "just use a smarter model."
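The control flow of a Ralph-style loop is small enough to sketch. `run_session` and `is_done` are stand-ins for your harness's session runner and your completion check (tests passing, a plan file fully checked off); the structure, not the names, is the point.

```python
def ralph_loop(run_session, is_done, prompt, max_iterations=10):
    """Keep relaunching fresh sessions with the same prompt until the
    completion condition holds. State persists only via the filesystem."""
    iterations = 0
    while iterations < max_iterations and not is_done():
        run_session(prompt)   # fresh context window every time
        iterations += 1
    return iterations
```

Because each session starts clean, the agent avoids context rot; because the completion check lives outside the model, "I think I'm done" early stopping is overruled by an external signal.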

Planning is when the model decomposes a goal into a sequence of steps, usually into a plan file on disk. The harness supports this with prompting and reminders about how to use the plan file. After each step, the agent checks its work via self-verification: Hooks run a predefined test suite and loop failures back to the model with the error text, or the model reviews its own output against explicit criteria.

Planner/generator/evaluator splits. Anthropic's long-running harness work is explicit that separating generation from evaluation into distinct agents outperforms self-evaluation, because agents reliably skew optimistic when grading their own work. It's GANs for prose. The related pattern is the sprint contract, where the generator and evaluator negotiate what "done" actually means before code gets written. In my own workflows, writing down the done condition before starting has caught more scope drift than any prompt change I've ever made.
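The generator/evaluator split reduces to a small coordination loop. In this sketch, both callables stand in for model-backed agents, and the `contract` string is the agreed-up-front done condition; the verdict shape is illustrative.

```python
def generate_until_accepted(generator, evaluator, contract, max_rounds=5):
    """Alternate a generator and an independent evaluator; only the
    evaluator, grading against the contract, can declare work done."""
    feedback = None
    for _ in range(max_rounds):
        draft = generator(contract, feedback)
        verdict = evaluator(contract, draft)   # separate agent, not self-review
        if verdict["accepted"]:
            return draft
        feedback = verdict["feedback"]         # loop the critique back in
    return None
```

The key design choice is that the generator never sees its own grade until the evaluator produces it, which is what defeats the optimistic-self-grading failure mode.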

Hooks: The enforcement layer

Hooks are what separate "I told the agent to do X" from "the system enforces X."

A hook is a script that runs at a specific lifecycle point: before a tool call, after a file edit, before commit, on session start. They're the right place for things the agent should never forget but sometimes does. Run typecheck, lint, and tests after every edit and surface failures. Block dangerous bash (rm -rf, git push --force, DROP TABLE). Require approval before opening a PR or pushing to main. Auto-format on write so the agent doesn't waste tokens on whitespace.

The principle HumanLayer highlights, and I've come to agree with, is: Success is silent; failures are verbose. If typecheck passes, the agent hears nothing. If it fails, the error text gets injected into the loop and the agent self-corrects. That makes the feedback loop almost free in the common case and directly actionable when something goes wrong.
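The silent-success contract translates directly into a hook's return value. A sketch of a post-edit hook; the verification command (a typechecker, a linter, a test runner) is whatever your repo uses:

```python
import subprocess

def post_edit_hook(command):
    """Run a verification command after an edit. Success is silent
    (return None, inject nothing); failure returns verbose error text
    for the harness to feed back into the agent's loop."""
    proc = subprocess.run(list(command), capture_output=True, text=True)
    if proc.returncode == 0:
        return None  # silent: nothing enters the agent's context
    return f"check failed (exit {proc.returncode}):\n{proc.stdout}{proc.stderr}"
```

The asymmetry is deliberate: a passing check costs the agent zero tokens, while a failing one hands it exactly the text it needs to self-correct.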

AGENTS.md and tool choice

The flat markdown rulebook at the root of your repo is still the single highest-leverage configuration point, because it lands in the system prompt every turn. Conventions go here: package manager, test framework, formatting, "never touch /legacy," "always use our logger." Two hard-won lessons:

Keep it short. HumanLayer keeps theirs under 60 lines. Every line is competing for attention, and more rules make each rule matter less. Pilot's rules, not style guides.

Earn each line. Rules should trace to a specific past failure or a hard external constraint. If they don't, they're noise. Ratchet; don't brainstorm.

The same discipline applies to tools. Each tool's name, description, and schema gets stamped into the prompt every request. Ten focused tools outperform fifty overlapping ones because the model can hold the menu in its head. HumanLayer also flags a real security concern here: Tool descriptions populate the prompt, so any MCP server you install is trusted text the model will read. A sloppy or malicious MCP can prompt-inject your agent before you've typed anything.

What this looks like in production

The clearest public picture I've seen of a mature harness is Fareed Khan's (estimated) breakdown of Claude Code's architecture.

Almost every concept from the previous section shows up in this diagram as a named component. Context injection is the knowledge layer. Loop state lives in the memory store and the worktree isolator. Dangerous-action hooks sit behind the permission gate. Subagent context firewalls are the entire multi-agent layer. The tool dispatch registry is where MCP servers and bash both plug in. Khan's argument is the same as Viv's, just worked through a shipping product: Claude Code's trajectory is about the harness at least as much as about the model underneath it.

Harnesses don't shrink; they move

One of the better observations in the Anthropic write-up is that as models improve, the space of interesting harness combinations doesn't shrink. It moves.

The naive story is that better models make harnesses obsolete. If the model can plan, no planner. If the model is coherent at long horizons, no context resets. And yes, Opus 4.6 mostly killed the context-anxiety failure mode (Sonnet 4.5 used to wrap up work prematurely as it approached what it thought was its context limit), which means an entire class of anxiety-mitigation scaffolding I was writing six months ago is now dead code.

But the ceiling moved with the model. Tasks that were unreachable are in play, and they have their own failure modes. The anxiety scaffolding goes away, and in its place you need a multiday memory policy, or a harness that coordinates three specialized agents, or evaluators for design quality in generated UIs. The assumptions shift, and so does the scaffolding that encodes them.

Anthropic puts it cleanly: "Every component in a harness encodes an assumption about what the model can't do on its own." When the model gets better at something, that component becomes load-bearing for nothing and can come out. When the model unlocks something new, new scaffolding is required to reach the new ceiling.

The model-harness training loop

The other thing that's happening, which Viv names explicitly, is a feedback loop between harness design and model training.

The harness doesn't shrink, it moves

Today's agent products are posttrained with harnesses in the loop. The model gets specifically better at the actions the harness designers think it should be good at: filesystem operations, bash, planning, subagent dispatch. That's why Opus 4.6 feels different inside Claude Code than inside someone else's harness, and it's why changing a tool's logic sometimes causes strange regressions. A genuinely general model wouldn't care whether you used apply_patch or str_replace, but cotraining creates overfitting.

The practical implication is twofold. A harness is a living system, not a config file you set up once. And the "best" harness isn't necessarily the one the model was trained inside; it's the one designed for your task. Viv's Top 30 to Top 5 Terminal Bench jump is the clearest proof point I've seen.

Harness as a service

Viv's other contribution is the HaaS framing: harness as a service. The observation is that we're moving from building on LLM APIs (which give you a completion) to building on harness APIs (which give you a runtime). The Claude Agent SDK, the Codex SDK, and the OpenAI Agents SDK all point in the same direction. You get the loop, the tools, the context management, the hooks, and the sandbox primitives out of the box, and you customize them.

The shift matters because the default path used to be: build your own loop, wire up your own tool-calling, handle your own conversation state, invent your own approval flow. Now the default path is: pick a harness framework, configure it along the four pillars (system prompt, tools, context, subagents), and put the rest of your effort into domain-specific prompt and tool design.

That's what makes "skill issue" tractable. You're not rebuilding an agent from scratch every time something goes wrong. You're tuning a configuration surface that's already well-factored.

Viv's line on this is also the best argument for starting messy: "Good agent building is an exercise in iteration. You can't do iterations if you don't have a v0.1."

Where this is going

Look at the top coding agents side by side (Claude Code, Cursor, Codex, Aider, Cline) and they look more like one another than their underlying models do. The models are different. The harness patterns are converging. I don't think that's an accident. It's the industry slowly discovering the load-bearing pieces of scaffolding that turn a generative model into something that can ship.

Viv's framing of the open problems is the one I find most exciting: orchestrating many agents working in parallel on a shared codebase; agents that analyze their own traces to identify and fix harness-level failure modes; harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being preconfigured at startup.

That last one, especially, feels like where harnesses stop being static config and start becoming something closer to a compiler.
