Tuesday, June 9, 2026
HomeTechnologyLengthy-Operating Brokers – O’Reilly

Lengthy-Operating Brokers – O’Reilly

The next article initially appeared on Addy Osmani’s weblog and is being reposted right here with the writer’s permission.

An extended-running AI agent can hold making progress over hours, days, or weeks. It could possibly do that throughout many context home windows and sandboxes, get better from failure, go away structured artifacts behind, and resume the place it left off.

For 2 years the dominant picture of an “AI agent” has been a chat window with a intelligent loop in it. You kind a objective; the agent calls some instruments; you watch tokens stream by; you cease watching when the work runs out of persistence or the context window fills up. That paradigm acquired us a good distance, but it surely has a ceiling. The mannequin forgets. It declares “process full” when it isn’t. It reintroduces a bug it fastened 9 turns in the past. The entire thing is structured round a single sitting.

Long-running AI agents

Lengthy-running brokers are what comes subsequent. The concept is straightforward to state: an agent that retains making ahead progress on a objective throughout many periods and plenty of sandboxes, presumably many days or even weeks, whereas leaving the workspace clear sufficient that the subsequent session can decide up the place the final one left off. The engineering is more durable. It’s important to clear up for persistence, restoration, and verification in a approach that doesn’t simply paper over the cracks. It’s important to construct a state layer that lives outdoors the mannequin’s context window, and you must design the handoff between periods so the agent doesn’t lose its thoughts when it wakes up and finds itself in a special sandbox with a special context window.

This submit is my try to put out what’s modified, who’s pushing on it, and the way an engineer can use long-running brokers at the moment with out writing the entire thing from scratch.

What “long-running” really means

“Lengthy-running” used to imply at the least three various things in follow, and it helps to maintain them separate.

Lengthy-horizon reasoning. The agent has to plan and execute over many dependent steps. That is principally a model-quality story: coherence, planning, the power to get better from a unsuitable flip 10 steps in the past. METR has been monitoring this with their time horizon metric, which estimates how lengthy a process a frontier mannequin can full with 50% reliability. The headline discovering is that the metric has been doubling roughly each seven months since 2019, and their TH1.1 replace earlier this yr doubled the rely of eight-hour-plus duties within the eval set. If that curve holds, frontier brokers full duties on the day scale by 2028 and the yr scale by 2034.

Lengthy-running execution. The agent’s course of runs for hours or days. Possibly it’s a coding job, possibly it’s a analysis sweep, possibly it’s a 24-7 monitoring service. The mannequin could be invoked hundreds of occasions throughout the run. That is principally a harness story, and it’s the one this submit is generally about.

Persistent company. The agent has an id that outlives any single process. It accumulates reminiscence, learns person preferences, and is at all times out there. That is the Reminiscence Financial institution taste of long-running.

In follow the three blur collectively. An actual manufacturing agent does long-horizon reasoning inside a long-running execution backed by persistent company. However the engineering issues are totally different in every, and so are the merchandise that clear up them.

Why this issues

There are two causes I imagine this work issues so much proper now.

The primary is a part change in what’s economically possible to delegate. An agent that runs for 10 minutes can reply a query, summarize a doc, repair a small bug. An agent that runs for 10 hours can personal a whole function, end a migration that was on the backlog for six quarters, or do the type of in a single day analysis sweep that used to require a junior analyst. One in all Anthropic’s Claude Sonnet bulletins put concrete numbers on this final fall: 30+ hours of autonomous coding in inside exams, together with one run that produced an 11,000-line Slack-style app. That’s already previous the edge the place the reply to “Ought to I delegate this?” is not apparent.

The second is that persistence modifications what the agent is. A stateless agent solutions your query and disappears. An extended-running one accumulates context: which competitor moved which approach final week, which check flaked twice on Tuesday, what you often imply by “the dashboard.” Anthropic’s Mission Vend was probably the most public early demonstration of this. That they had a Claude occasion run an precise workplace merchandising enterprise for a month, managing stock, setting costs, speaking to suppliers. It failed in informative methods, and the second part ran significantly better, however the level wasn’t profitability. The purpose was watching what sorts of bizarre coherence issues present up when an agent has to take care of id throughout weeks as an alternative of turns.

These are the identical issues each workforce constructing manufacturing brokers now hits.

The three partitions each long-running agent hits

Three partitions present up in mainly each write-up I’ve learn this yr.

Finite context. Even a 1M-token window fills. And context rot, the regular degradation of mannequin efficiency because the window will get full, kicks in properly earlier than the exhausting restrict. A 24-hour run will not be going to slot in any context window the sector has on its roadmap. One thing has to present.

No persistent state. A brand new session begins clean. Anthropic’s framing of their scientific computing submit is the cleanest model I’ve seen: “Think about a software program mission staffed by engineers working in shifts, the place every new engineer arrives with no reminiscence of what occurred on the earlier shift.” With out an express persistence story, each shift change is a productiveness catastrophe.

No self-verification. Fashions reliably skew constructive after they grade their very own work. Requested “Are you achieved?” they reply “sure” extra usually than they need to. And not using a separate sign that the work meets a bar, you get the agent that ships at 30% full with full confidence.

Lengthy-running agent designs are principally solutions to those three issues. The foremost labs have converged on comparable shapes of reply, however with very totally different floor space.

The Ralph loop: One of many easier practitioner variations of long-running brokers

The Ralph loop (typically referred to as the Ralph Wiggum approach) is considered one of “easier” practitioner model of long-running brokers, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is actually a bash script that loops:

  1. Choose the subsequent unfinished process from an inventory (prd.json or equal).
  2. Construct a immediate with the duty, the related context, and any persistent notes.
  3. Name the agent.
  4. Run exams or different checks.
  5. Append what occurred to progress.txt.
  6. Replace the duty record (achieved, failed, blocked).
  7. Return to step 1.

The explanation it really works is identical motive any of the harnesses under work: State lives outdoors the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, however the filesystem isn’t. Every iteration begins contemporary and reads sufficient state from disk to maintain going. Carson’s Compound Product extends the thought by chaining a number of loops (an evaluation loop that reads each day experiences, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open supply model of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Enhancing Coding Brokers”: process record construction, progress recordsdata, QA gates, monitoring, the failure modes you’ll really hit. The quick model is that you could construct a working long-running agent in a night with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of constructing this sample recoverable, safe, and observable at scale.

The massive-lab tales under are alternative ways of paying for that production-readiness.

Anthropic: Harnesses, then the mind/fingers/session cut up

Anthropic has been probably the most public in regards to the engineering. Two posts are price studying finish to finish.

The primary is “Efficient Harnesses for Lengthy-Operating Brokers,” which lays out a two-agent harness for autonomous full stack improvement. An initializer agent runs as soon as at the beginning of a mission to arrange the setting, broaden the immediate right into a structured feature-list.json, and write an init.sh that future periods will run on boot. A coding agent is then woken up again and again, every session requested to make incremental progress on one function, run exams, go away a claude-progress.txt word, and commit. A check ratchet (“it’s unacceptable to take away or edit exams as a result of this might result in lacking or buggy performance”) sits within the immediate to cease the quite common failure of an agent deleting failing exams to “make them move.” InfoQ’s writeup extends this right into a planner, generator, and evaluator triad, on the identical logic that separating era from analysis issues as a result of fashions grade their very own work too generously.

The second is “Scaling Managed Brokers: Decoupling the Mind from the Arms,” the architectural submit behind Claude Managed Brokers (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three elements that must be independently replaceable. The Mind is the mannequin and the harness loop that calls it. The Arms are sandboxed, ephemeral execution environments the place instruments really run. The Session is an append-only occasion log of each thought, device name, and statement.

This sounds summary, but it surely isn’t. Right here’s Anthropic’s framing: “Each element in a harness encodes an assumption about what the mannequin can’t do by itself.” Whenever you couple them, an assumption that goes stale (e.g., the mannequin used to wish an express planner and now plans natively) means the entire system has to vary without delay. Whenever you decouple them, the harness turns into stateless, sandboxes develop into cattle, not pets, and a mind crash doesn’t lose the run. A contemporary container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 simply from having the ability to begin inference earlier than the sandbox is prepared.

The session-as-event-log thought is the half most groups underappreciate. It’s what makes a long-running agent recoverable. With out it, a container failure is a session failure and also you’re debugging right into a stale snapshot. With it, the agent’s reminiscence is a queryable artifact that lives outdoors no matter course of occurs to be working in the mean time.

For the scientific computing crowd, Anthropic’s “long-running Claude” submit reduces all of this to an easier stack: CLAUDE.md as a residing plan the agent edits because it learns, CHANGELOG.md as transportable lab notes, tmux plus SLURM plus git because the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent again into context every time it claims completion and asks if it’s actually achieved. Their flagship case research is a Boltzmann solver Claude Opus 4.6 constructed over a number of days that reached subpercent settlement with a reference CLASS implementation. Months to years of researcher time, compressed.

Similar patterns throughout all three posts: an express plan file, an express progress file, structured handoffs between periods, separate era from analysis, and a loop that refuses to let the agent cease early.

Cursor: Planners, employees, judges

Cursor’s “Scaling Lengthy-Operating Autonomous Coding” is the opposite important learn this yr. They walked into partitions that Anthropic principally papered over.

Their first try was a flat coordination mannequin: equal-status brokers writing to shared recordsdata with locks. It grew to become a bottleneck and made the brokers threat averse, churning somewhat than committing. Their second try swapped locks for optimistic concurrency management, which eliminated the bottleneck however didn’t repair the coordination drawback. The third design is what’s working in manufacturing now and what they describe as fixing many of the drawback:

  • Planners repeatedly discover the codebase and emit duties. They’ll recursively spawn subplanners.
  • Staff are targeted executors. They don’t coordinate with one another and so they don’t fear in regards to the massive image.
  • Judges determine when an iteration is completed and when to restart.

Two issues stand out from the submit. One: “A shocking quantity of the system’s conduct comes right down to how we immediate the brokers” greater than the harness or the mannequin. Two: Totally different fashions slot into totally different roles. Their reported discovering is {that a} GPT mannequin was higher than Opus for prolonged autonomous work particularly as a result of Opus tended to cease early and take shortcuts. Similar process, totally different position, totally different mannequin. The matching is turning into a part of the design floor.

This pairs with Composer 2 (their proprietary frontier coding mannequin that ships in Cursor 3) and their background cloud brokers: long-running duties that run on Anysphere’s cloud infrastructure somewhat than your laptop computer. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can begin a process domestically, hit run in cloud once you notice it’ll take half-hour, and reattach later out of your telephone. Every agent runs in an remoted Git worktree and merges again through PR. The handoff between native and distant is the half most groups haven’t discovered but, and Cursor’s guess is that it needs to be its personal product floor.

The form finally ends up near Anthropic’s: Roles are cut up, periods are sturdy, judges sit beside the employee, and a protracted process runs in a cloud sandbox with Git because the coordination substrate.

Google: Lengthy-running brokers on the Agent Platform

Google’s announcement at Cloud Subsequent ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running brokers right into a named product, with named SLAs.

The items that matter for this submit:

  • Agent Runtime helps brokers that “run autonomously for days at a time” with sub-second chilly begins and on-demand sandbox provisioning. The launch submit’s instance use case is a gross sales prospecting sequence that takes every week to play out, which is roughly the appropriate form for it.
  • Agent Classes persist dialog and occasion historical past. You possibly can pin them to a customized session ID that maps to your individual CRM or DB file, so the agent’s state lives subsequent to the enterprise state as an alternative of in a separate AI silo.
  • Agent Reminiscence Financial institution is the persistent long-term reminiscence layer, typically out there as of Subsequent ’26. It curates reminiscences from periods, scopes them to a person id, and exposes a search API so the subsequent agent invocation can pull what’s related. Payhawk reported that auto-submitting bills via a Reminiscence Financial institution-backed agent minimize submission time by over 50%.
  • Agent Sandbox handles hardened code execution.
  • Agent-to-Agent Orchestration, Agent Registry, Agent Id, Agent Gateway, Agent Observability, and Agent Simulation cowl mainly each operational concern you’d in any other case construct by hand for a manufacturing fleet, together with the cryptographic-identity-and-audit-log story enterprises really have to ship.

Architecturally this is identical mind/fingers/session cut up Anthropic described, simply productized at platform scale and bundled with ADK (the code-first dev package) and Agent Studio (the visible one). In case you’re constructing inside Google Cloud, you don’t need to design a session log or a reminiscence retailer from scratch anymore. You wire an ADK agent into Reminiscence Financial institution and Classes, deploy onto Agent Runtime, and the persistence query is answered.

Discover how a lot this appears just like the sample Anthropic and Cursor describe, simply unbundled into named companies with SLAs. Three years in the past you’d have constructed all of this your self. Now you decide which model of “decoupled mind, fingers, and session” you need to lease.

5 patterns for long-running brokers in manufacturing

Shubham Saboo and I wrote up 5 design patterns we’ve seen separate working long-running brokers from demos. They aren’t Google-specific, however they map cleanly onto the primitives Agent Runtime now exposes, so it’s price strolling via them right here in shortened type.

Checkpoint-and-resume. The most typical multiday failure is context loss. An agent processes 200 paperwork over 4 hours, hits an error on doc 201, and with no checkpoint you begin from scratch. Deal with the agent like a long-running server course of: write intermediate state to disk, checkpoint each N items of labor, get better from failures. The Agent Runtime sandbox offers you a persistent filesystem, however selecting the best checkpoint granularity (not each step, not solely the tip) is on you.

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, hearth a webhook, hope somebody responds. The state goes stale, the notification will get buried, the agent re-deserializes right into a barely totally different world. Lengthy-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working reminiscence, device historical past, pending motion. Hours of human time move, the agent consumes zero compute, and it resumes with subsecond latency. Mission Management is Google’s inbox for this. The sample works no matter vendor.

Reminiscence-layered context. A seven-day agent wants greater than session state. Reminiscence Financial institution handles long-term curated reminiscence, Reminiscence Profiles add low-latency lookups, and the failure mode you’ll hit in manufacturing is reminiscence drift: The agent learns a procedural shortcut from a number of atypical interactions and begins making use of it broadly. Govern reminiscence such as you govern microservices. Agent Id controls who can learn and write which banks. Agent Registry tracks which model of which agent is working. Agent Gateway enforces coverage on the wire. The auditing query stops being “What are my brokers doing?” and turns into “What are my brokers remembering, and the way is that altering their conduct?”

Ambient processing. Not each long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery desk and act on occasions as they arrive: content material moderation, anomaly detection, inbox triage. The architectural choice price making early is to not hardcode coverage into the agent. Outline it within the Gateway and the fleet picks up coverage modifications with out redeploys. Ambient brokers run unsupervised for lengthy stretches, and the one sane solution to replace 100 of them is to replace the coverage layer as soon as.

Fleet orchestration. In actual programs, you hardly ever have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), every working independently for various durations. Every specialist will get its personal Id (so the Outreach Agent can’t learn monetary information meant for Scoring), its personal coverage enforcement, its personal Registry entry. This is identical coordinator/employee form distributed programs have used for many years. What’s new is that ADK handles it declaratively with graph-based workflows, and a nasty deployment in a single specialist doesn’t cascade to the others.

The patterns compose. A compliance system may use checkpointing for doc processing, delegated approval for evaluation gates, reminiscence layering for cross-session information, and fleet orchestration to coordinate the specialists. The opening query is at all times the identical: What’s the longest uninterrupted unit of labor your agent must carry out? Minutes, and also you don’t want long-running brokers. Hours or days, and these patterns are the place to start out. The full write-up with code samples covers every sample in depth.

So how do you really construct one at the moment?

That is the sensible query, and it has a special reply relying on what you’re constructing.

You’re a developer who needs long-running coding work by yourself repo. Simply use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Deal with your AGENTS.md like a pilot’s guidelines: quick, each line earned by an actual failure. Add hooks for typecheck and lint that floor failures again to the agent. Write a plan file earlier than the agent begins. Use the Ralph loop when the agent claims it’s achieved and also you don’t imagine it. For multihour or in a single day jobs, run in a worktree so a closed laptop computer doesn’t kill the run, and have it commit progress each significant unit of labor. That is the trail most individuals ought to take, and it’s the place probably the most leverage is true now.

You’re constructing a hosted agent product. Don’t construct the runtime. Choose a managed one. The three actual choices at the moment: Google’s Agent Platform (Agent Engine + Reminiscence Financial institution + Classes), Claude Managed Brokers, or roll one thing on high of ADK, the Claude Agent SDK, or Codex SDK and host it your self. The trade-off is the standard one. Managed will get you the mind/fingers/session cut up, observability, id, and an audit path out of the field. Self-hosted will get you management and the power to make use of bizarre fashions for bizarre roles (Cursor’s sample). For many groups, the appropriate place to begin is a managed runtime plus your individual ADK or SDK code for the precise loop.

You’re doing one thing autonomous and operational (monitoring, analysis, ops). Reminiscence Financial institution-style persistence is what you need, and it’s the half that doesn’t exist in Claude Code. ADK + Reminiscence Financial institution + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs each N hours, accumulates state, alerts on a threshold.” That is additionally the place Cursor’s planner/employee/choose cut up begins to matter greater than it does for IDE coding, as a result of the work is genuinely parallel and the failure modes are totally different.

A number of issues matter no matter which path you are taking.

Write down the achieved situation earlier than the agent begins. That is the only highest-leverage transfer for lengthy runs. The Anthropic harness submit calls it the function record; Cursor calls it the planner’s process spec. Both approach, it’s an exterior file with express, testable completion standards, and it exists so the agent can’t quietly redefine achieved midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/employee/choose pipeline, or a generator/evaluator pair, is an actual architectural sample, not a stylistic choice. Even when it’s the identical mannequin in numerous roles with totally different prompts.

Put money into the session log, not simply the immediate. The append-only occasion log is what makes the agent recoverable, debuggable, and auditable. In case you can’t reconstruct what the agent did within the final 24 hours from sturdy storage, what you have got is a long-running shell script that occurs to name an LLM, not a long-running agent.

Deal with compaction and context resets as top notch. Anthropic is express that summarization-as-compaction wasn’t sufficient for very lengthy jobs; they needed to do full context resets the place the harness tears the session down and rebuilds it from a structured handoff file. It’s primarily how people onboard a brand new engineer.

There are some actual limitations proper now

A number of issues are nonetheless genuinely unsolved.

Value. A 24-hour run with a frontier mannequin and some instruments will not be low cost. With out budgets, circuit breakers, and a tough cap on device spend, an agent can quietly burn via every week’s API finances in a day. That is solvable, but it surely’s an express step you must take.

Safety. An extended-running agent with API keys, cloud entry, and the power to run shell instructions has a a lot bigger assault floor than a chat session. The mind/fingers separation sample issues right here too: Credentials must be unreachable from the sandbox the place model-generated code runs, which is among the advantages Anthropic calls out for Managed Brokers.

Alignment drift. Over many context home windows, brokers drift. The unique objective will get summarized, then resummarized, then loses constancy. That is the half hooks and judges exist to defend in opposition to. It’s also the commonest motive “the agent went off and did one thing I didn’t ask for.”

Verification. Auditing 24 hours of autonomous exercise is an actual human-time drawback. Observability and structured artifacts (PRs, commits, briefings, check runs) are the way you make this tractable. With out them, you’re scrolling logs and also you’ll miss what issues.

The human position. That is the one I hold coming again to. Defining work crisply sufficient that an agent can run for a day on it’s more durable than doing the work your self. The ability that’s appreciating in worth isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

The place that is going

Google, Anthropic, and Cursor have converged on roughly the identical form. Separate the mannequin loop from the execution sandbox from the sturdy session log. Cut up planning from era from analysis. Bake in compaction, hooks, and context resets. Expose reminiscence as a managed service that any agent invocation can question.

Floor space is what differs. Google’s Agent Platform is the enterprise-stack model, with the id and audit path story baked in. The patterns beneath are the identical. Claude Managed Brokers is “Anthropic’s harness, hosted.” Cursor’s background brokers are “long-running coding, pulled out of the IDE and into the cloud.”

The more durable issues for the subsequent yr aren’t in any of these layers individually. They’re within the coordination above them. Many long-running brokers on a shared codebase. Brokers that learn their very own traces and patch their very own harnesses. Harnesses that assemble instruments and context simply in time for a process as an alternative of being preconfigured at startup. That’s the place the agent stops trying like a wiser chat window and begins trying like a colleague who’s been on the mission longer than you have got.

The mannequin remains to be load-bearing. However the hole between a chat window and an agent you possibly can go away working in a single day is generally within the state, periods, and structured handoffs wrapped round it. That’s the place I’d spend my studying time proper now.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments