This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O'Reilly Radar.
The toolkit pattern is a way of documenting your project's configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool's configuration format, its constraints, and enough worked examples to make that possible. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. If you build the toolkit well, your users will never have to learn how your tool's configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don't have to compromise on the way your project is configured, because the config files can be more complex and more complete than they could be if a human had to edit and understand them.
To understand why all of this matters, let me take you back to the mid-1980s.
I was 12 years old, and our family got an AT&T PC 6300, an IBM-compatible that came with a user's guide roughly 159 pages long. Chapter 4 of that manual was called "What Every User Should Know." It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and genuinely useful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure could damage the magnetic surface.

I remember being fascinated by this manual. It wasn't our first computer. I'd been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all kinds of amazing things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. That seemed really weird to me, even as a kid, that you'd give someone a manual that had an entire page on using the backspace key to correct typing mistakes (really!) but didn't actually tell them how to use the thing to do anything useful.
That's how most developer documentation works. We write the stuff that's easy to write—installation, setup, the getting-started guide—because it's a lot easier than writing the stuff that's actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another "looking for your keys under the streetlight" problem: We write the documentation we write because it's easiest to write, even when it's not really the documentation our users need.
Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn't already know what you were doing. The tar man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a .tar.gz file, it's almost useless. (The right flags are -xzvf in case you're curious.) Stack Overflow exists largely because man pages like tar's left a gap between what the documentation said and what developers actually needed to know.
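For the record, here's the whole tar dance in one short, self-contained snippet: creating a small archive and then extracting it with the flags mentioned above.

```shell
# Make a tiny archive to practice on
mkdir -p demo && echo "hello" > demo/readme.txt
tar -czf demo.tar.gz demo   # c = create, z = gzip, f = archive filename

# The part everyone actually searches for: extracting a .tar.gz
tar -xzvf demo.tar.gz       # x = extract, z = gunzip, v = verbose, f = filename
cat demo/readme.txt
```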
And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you'll actually get useful answers, because these are all established projects that have been written about extensively and the training data is everywhere.
But AI hits a hard wall at the boundary of its training data. If you've built something new—a framework, an internal platform, a tool your team created—no model has ever seen it. Your users can't ask their AI assistant for help, because the AI doesn't know your thing even exists.
There's been plenty of great work moving AI documentation in the right direction. AGENTS.md tells AI coding agents how to work in your codebase, treating the AI as a developer. llms.txt gives models a structured summary of your external documentation, treating the AI as a search engine. What's been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.
The toolkit pattern solves that problem of getting AIs to write configuration files for a project that isn't in their training data. It consists of a documentation file that teaches any AI enough about your project's configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a method for doing it well. This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.
Build the AI its own manual
Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building Octobatch, the batch-processing orchestrator I've been writing about in this series. As I described in the previous articles in this series, "The Accidental Orchestrator" and "Keep Deterministic Work Deterministic," Octobatch runs complex multistep LLM pipelines that generate data or run Monte Carlo simulations. Each pipeline is defined using a complex configuration that consists of YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.
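To make that mix of pieces concrete, here is a minimal sketch of what a pipeline definition in this style might look like. The keys, stage types, and expression syntax below are invented for illustration; they are not Octobatch's actual schema.

```yaml
# Illustrative pipeline config (invented schema, not Octobatch's real format)
name: summarize-reviews
stages:
  - id: summarize
    type: llm
    prompt_template: templates/summarize.j2   # Jinja2 template renders the prompt
    output_schema: schemas/summary.json       # JSON schema the response must satisfy
  - id: score
    type: expression
    expr: "len(summary.split()) / max_words"  # deterministic expression step
retries:
  max_attempts: 3
  on: [validation_failure]
```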
As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely valuable. When I developed a new feature, I would work with the AIs to come up with the configuration structure to support it. At first I defined the configuration, but by the end of the project I relied on the AIs to come up with the first cut, and I'd push back when something seemed off or not forward-looking enough. Once we all agreed, I would have an AI produce the actual updated config for whatever pipeline we were working on. This move to having the AIs do the heavy lifting of writing the configuration was really valuable, because it let me create a very robust format very quickly without having to spend hours updating existing configurations every time I changed the syntax or semantics.
At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I'd already worked through with the AIs. The project already had a README.md file, and every time I changed the configuration I had an AI update it to keep the documentation current. But by this time, the README.md file was doing way too much work: It was genuinely comprehensive but a real headache to read. It had eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, with the bulk of it focused on configuration, and it was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I'd produced documentation that was genuinely painful to read.
Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:
I'm thinking about how to provide some kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we'd call "Octobatch Studio" where we make it easy to prompt for editing pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give lots of guidance for creating it.
I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I'd been doing:
My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.
The traditional trade-off between simplicity and flexibility comes from cognitive overhead: the cost of holding all of a system's rules, constraints, and interactions in your head while you work with it. It's why many developers opt for simpler config files, so they don't overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn't the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.
That toolkit-based workflow—users describe what they want, the AI reads TOOLKIT.md and generates the config—is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: "Read pipelines/TOOLKIT.md and use it as your guide." The AI reads the file, understands the project structure, and guides them step by step.
To see what this looks like in practice, take the Drunken Sailor pipeline I described in "The Accidental Orchestrator." It's a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.
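For readers who want the underlying math, here is a minimal Python sketch of the kind of random walk the pipeline simulates. The boundaries, probabilities, and function names are invented for illustration; the real pipeline expresses this logic in configuration rather than Python.

```python
import random

def drunken_sailor(p_toward_ship=0.5, start=0, ship=10, water=-10, rng=None):
    """One random walk: each step moves +1 toward the ship or -1 toward the water."""
    rng = rng or random.Random()
    pos = start
    while water < pos < ship:
        pos += 1 if rng.random() < p_toward_ship else -1
    return pos == ship  # True if the sailor reached the ship before the water

def monte_carlo(trials=10_000, seed=42, **kwargs):
    """Estimate the probability of reaching the ship over many independent trials."""
    rng = random.Random(seed)
    made_it = sum(drunken_sailor(rng=rng, **kwargs) for _ in range(trials))
    return made_it / trials
```

With a symmetric walk and symmetric boundaries, the estimate hovers around 0.5; shifting the start point or the step probability shifts the odds, which is exactly the kind of knob the pipeline's expression steps expose.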

Here's the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the entire configuration by reading TOOLKIT.md. This is the exact prompt I gave Claude Code to generate the Drunken Sailor pipeline—notice the first line of the prompt, telling it to read the toolkit file.

But configuration generation is only half of what the toolkit file does. Users can also add TOOLKIT.md and PROJECT_CONTEXT.md (which has information about the project) to any AI assistant—ChatGPT, Gemini, Claude, Copilot, whatever they like—and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, "What do I do?" and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.

What the Octobatch project taught me about the toolkit pattern
Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create configuration files that worked; it also yielded a set of practices, and those practices turn out to be pretty consistent regardless of what kind of project you're building. Here are the five that mattered most:
- Start with the toolkit file and grow it from failures. Don't wait until the project is done to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.
- Let the AI write the config files. Your job is product vision—what the project should do and how it should feel. The AI's job is translating that into valid configuration.
- Keep guidance lean. State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.
- Treat every use as a test. There's no separate testing phase for documentation. Every time someone uses the toolkit file to build something, that's a test of whether the documentation works.
- Use more than one model. Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.
I'm not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats vary wildly from tool to tool—that's the whole problem we're trying to solve—and a toolkit file that describes your project's building blocks is going to look completely different from one that describes someone else's. What I found is that the AI is perfectly capable of reading whatever you give it, and will probably be better at writing the file than you are anyway, because it's writing for another AI. These five practices should help build an effective toolkit regardless of what your project looks like.
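That said, it can help to see the general shape one of these files tends to take. The section headings below are a hypothetical skeleton, not a standard; your toolkit's sections will differ.

```markdown
# TOOLKIT.md (hypothetical skeleton; your sections will differ)

## What this tool does
One paragraph of orientation for the AI.

## File layout
Which files make up a configuration and how they reference each other.

## Format rules
The schema, the required keys, and the constraints that aren't obvious
from examples alone.

## Worked examples
Two or three complete, known-good configurations, each with a sentence
on what it demonstrates.

## Common failure modes
One principle and one example per failure; no sprawling warnings.
```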
Start with the toolkit file and grow it from failures
You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn't was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format—the structure, the rules, the constraints, the examples, everything we'd talked about—into a single TOOLKIT.md file. That first version wasn't great, but it was a starting point, and every failure after that made it better.
I didn't plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had—by working with an AI—but everything they'd need to do that was spread across months of chat logs and the CONTEXT.md files I'd been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single TOOLKIT.md file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.
That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.
You can do the same thing. If you're starting a new project, you could plan to create the toolkit at the end. But it's more effective to start with a simple version early and let it emerge over the course of development. That way you're dogfooding it the whole time instead of guessing what users will need.
Let the AI write the config files (but stay in control!)
Early Octobatch pipelines had simple enough configuration that a human could read and understand them, but not because I was writing them by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that although they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.
At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn't have to carry every single line in my head—it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.
Once the project really got rolling, I never wrote YAML by hand again. The cycle was always: need a feature, discuss it with Claude and Gemini, push back when something seems off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.
This division of labor, however, meant inevitable disagreements between me and the AI, and it's not always easy to find yourself disagreeing with a machine, because they're surprisingly stubborn (and occasionally shockingly stupid). It took patience and vigilance to stay in control of the project, especially when I turned over large tasks to the AIs.
The AIs consistently optimized for technical correctness—separation of concerns, code organization, effort estimation—which was great, because that's the job I asked them to do. I optimized for product value. I found that keeping that value as my north star and always focusing on building useful features consistently helped resolve these disagreements.
Keep guidance lean
Once you start growing the toolkit from failures, the natural tendency is to overdocument everything. Generative AIs are biased toward generating, and it's easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you need to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you need to bring is telling them when not to add something.
The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to TOOLKIT.md. The answer was no—the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse—for both you and the AI—is to add a WARNING block. Resist it. One principle, one example, move on.
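In toolkit terms, a lean guardrail can be as short as this. The rule and the snippet below are invented for illustration:

```markdown
### Expression steps
Expression steps must be pure: no I/O, no randomness outside the
provided `rng`. Example: `score = hits / max(1, trials)` is fine;
`score = read_file("weights.txt")` is not.
```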
Treat every use as a test
There was no separate "testing phase" for Octobatch's TOOLKIT.md. Every pipeline that I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted TOOLKIT.md, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.
That's the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn't, the toolkit has a bug.
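A smoke test of an Octobatch-style toolkit, for example, might be a fresh session seeded with nothing but a prompt like this one (the pipeline it describes is invented):

```text
Read pipelines/TOOLKIT.md and use it as your guide.
Build a pipeline that takes a CSV of product reviews, asks the LLM to
classify each review's sentiment, and writes a summary table. Don't
assume anything that isn't in the toolkit; ask me if a rule is unclear.
```

If the generated config fails validation, the fix belongs in the toolkit, not in the prompt.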
Use more than one model
When you're building and testing your toolkit, don't just use one AI. Run the same job through a second model. A pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.
Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you'll start to get a feel for the different kinds of questions they're good at answering.
When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That's a signal you can't get from using only one model.
The manual, revisited
That AT&T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: it described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.
The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project's configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.
If you're building a project and you want AI to be able to help your users, start here: write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model because no single AI catches everything.
The AT&T manual's Chapter 4 was called "What Every User Should Know." Your toolkit file is "What Every AI Should Know." The difference is that this time, the reader will actually use it.
In the next article, I'll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself—and use that to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. It explores an unfamiliar codebase, generates a complete quality infrastructure—tests, review protocols, validation rules—and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it's available as an open source Claude Code skill.
