
I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind its back. The agent never flagged it. I only found out because the numbers stopped improving and I started reading logs.
The setup was based on Andrej Karpathy’s autoresearch project: Give an agent one file it can edit (train.py), one metric to optimize (validation bits per byte), a fixed five-minute training budget per experiment, and Git for checkpointing. If an experiment beats the current best, keep the commit. If not, revert. Loop forever. Karpathy’s own run produced 700 experiments and 20 real improvements across 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke pointed the same pattern at Liquid, their templating engine, and got 53% faster rendering from 93 automated commits. The pattern clearly works. The question is what breaks when you run it yourself.
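The keep-or-revert loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Karpathy’s actual implementation: the function name keep_or_revert_loop and the injected callables (run_experiment, keep, revert) are hypothetical stand-ins for the five-minute training run and the Git commit and reset steps, and the sketch assumes a lower metric is better.

```python
from typing import Callable, Iterable, List, Tuple


def keep_or_revert_loop(
    proposals: Iterable[str],
    run_experiment: Callable[[str], float],
    keep: Callable[[str], None],
    revert: Callable[[str], None],
    baseline: float,
) -> Tuple[float, List[str]]:
    """Greedy search: keep a change only if it beats the current best.

    All callables are hypothetical hooks. In a real loop, run_experiment
    would train for the fixed budget and return the validation metric,
    while keep/revert would leave or undo the corresponding Git commit.
    """
    best = baseline
    kept: List[str] = []
    for change in proposals:
        score = run_experiment(change)
        if score < best:  # assumption: lower metric is better
            best = score
            keep(change)
            kept.append(change)
        else:
            revert(change)
    return best, kept
```

Note that nothing in this loop looks at the file between the edit and the run. It trusts that what was committed is what gets executed.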
The first failure: Agents fixing agents
Before running autoresearch, I had a separate problem. I had 15 custom skills for Claude Code (think reusable prompt templates with tool access, structured inputs, and specific behaviors). Most of them were broken when dispatched as parallel background agents. Vague descriptions meant the system couldn’t figure out when to invoke them. Missing tool permissions caused silent failures. Duplicate scopes between similar skills created routing confusion.
So I used the same pattern: dispatch background agents in parallel, one per skill, each tasked with reading the skill definition, identifying problems, and rewriting it. Thirteen out of 15 came back improved. Descriptions got specific. Dead references to nonexistent files were removed. Tool permissions were added. Two skills were left untouched because the agents couldn’t find anything wrong with them. The whole batch took under an hour.
But here’s what I didn’t expect. Three of the “improved” skills had subtle regressions. One agent removed an AskUserQuestion gate that was there for a reason, because the gate’s purpose wasn’t documented and the agent read it as unnecessary friction. Another agent rewrote a skill description so precisely that it stopped triggering on the fuzzy, misspelled queries real users actually type. I caught these during manual review, but if I had trusted the parallel output without checking, three skills would have silently degraded in production.
The second failure: The linter in the loop
Then I started the training loop. The agent worked through hyperparameters methodically. It halved the batch size early (experiment 4), which turned out to be the single biggest win: more gradient steps in the same five-minute window. It reduced model depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the learning rate schedule. Each change was small. The cumulative effect was a 5.9% improvement in validation loss and a roughly 60% reduction in peak GPU memory.
Out of 40 experiments, the agent kept nine, discarded 28, and crashed three. That keep/discard ratio felt about right. Most ideas don’t work. The point of automation isn’t to have better ideas. It’s to try bad ones faster.
Then the numbers plateaued. Experiments 30 through 38 produced nothing worth keeping. I started digging through the logs and found something I hadn’t anticipated: A linter running on the remote machine had been silently modifying a hyperparameter in train.py. It changed SCALAR_LR from 0.5 to 0.3 every time the agent saved the file. The agent would set the value, commit, and run the experiment, but the linter would alter the file between the save and the execution. The agent had no way to detect this because it checked Git diffs, not the runtime state of the file. Every experiment after a certain point was running with a learning rate the agent never chose.
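One way to close that gap is to read the value back from disk immediately before launching the run and compare it with what the agent intended. The sketch below is illustrative only: read_constant and assert_value_on_disk are hypothetical helper names, and it assumes the training file defines SCALAR_LR as a plain top-level assignment such as `SCALAR_LR = 0.5`, which the source does not confirm.

```python
import re
from pathlib import Path


def read_constant(path: Path, name: str) -> float:
    """Return the number assigned to `name` in a top-level `NAME = value` line.

    Assumption: the constant is a simple numeric literal on a single line.
    """
    pattern = re.compile(
        rf"^{re.escape(name)}\s*=\s*([-+0-9.eE]+)", re.MULTILINE
    )
    match = pattern.search(path.read_text())
    if match is None:
        raise ValueError(f"{name} not found in {path}")
    return float(match.group(1))


def assert_value_on_disk(path: Path, name: str, intended: float) -> None:
    """Fail loudly if the file on disk no longer holds the intended value."""
    actual = read_constant(path, name)
    if actual != intended:
        raise RuntimeError(
            f"{name} is {actual} on disk, but the agent set {intended}"
        )
```

Called right before each experiment, a check like this turns a silent 0.5-to-0.3 rewrite into a crash, which the loop already knows how to log.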
I lost roughly four hours of compute to this. The agent kept going, proposing new ideas, running experiments, logging results. From its perspective nothing was wrong. The experiments ran, produced numbers, and the numbers were plausible. There was no crash, no error, no alert.
Why this matters beyond my GPU bill
Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs and inadequate risk controls as the primary drivers. My overnight session was a toy example: a single GPU, a small model, and a low-stakes experiment. But the failure pattern scales. An agent that can’t detect when its inputs are being modified between decisions will make the same class of error whether it’s tuning hyperparameters or managing a production pipeline.
The autoresearch constraints are good (one file, one metric, and Git for state), but they assume the environment is stable. Nobody checks whether something outside the loop is modifying the file between commits. The agent optimizes inside its sandbox, and the sandbox has a hole in the wall that nobody thought to look for.
Anyone who has run distributed systems recognizes this. When the linter changed that hyperparameter, it was the equivalent of someone editing a database record between a read and a write. We solved that problem years ago with compare-and-swap, optimistic locking, and checksums. We just haven’t brought any of it to autonomous AI workflows. The SkyPilot team recently scaled autoresearch to 16 GPUs and 910 experiments. At that scale, an undetected environment mutation doesn’t cost you four hours. It costs you a cluster.
Next time I run autoresearch, I’ll add a file integrity check before every experiment. It’s three lines of code, but it would have saved me four hours and produced a better final result. The agent did its job. The environment didn’t.
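For illustration, here is one shape such an integrity check could take. It is a sketch, not the exact three lines referred to above: file_digest and check_unchanged are hypothetical names, and it assumes the loop can record a SHA-256 digest right after the agent’s edit and verify it right before training starts.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, recorded right after the agent's edit."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_unchanged(path: Path, expected_digest: str) -> None:
    """Raise if the file changed since the digest was recorded."""
    actual = file_digest(path)
    if actual != expected_digest:
        raise RuntimeError(
            f"{path} changed since it was saved: "
            f"expected {expected_digest[:12]}, got {actual[:12]}"
        )
```

The digest plays the role of the version number in optimistic locking: the experiment only proceeds if the file it is about to run is byte-for-byte the file the agent wrote.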
