
This is the fifth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and part four here.
I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and the other day I needed to get to the other side of the neighborhood. I thought I'd be clever: I like taking the bus, so I decided to hop on the one that goes right down 7th Avenue. I know I could check the schedule using the MTA's really helpful Bus Time app or website, but it doesn't factor in walking time from my house or give me a good sense of when to leave. This seemed like a great opportunity to vibe code an app and do some quick AI-driven development.
It took about two minutes for Claude Code to get my new app working. It made a lovely little web UI, I configured my stop and how long it takes me to walk there, and it gave me the right departure time.
When I actually walked out the door, the app perfectly predicted my wait. There was just one problem: my bus was nowhere to be seen. What I did see was a bus driving the exact opposite direction down 7th Avenue.
It was pretty obvious what had happened. I needed to go deeper into Brooklyn, not toward Manhattan, and the AI had picked the wrong direction. (Actually, as Cowork pointed out, each stop has its own ID, and it had chosen the ID for the wrong stop.) I'd been using Cowork to orchestrate everything, and I could just as easily have asked it to go check the MTA's BusTime site for me to verify the app was working. But I just trusted the AI. As a result, I had to walk. Which is fine—I like walking—but the irony was painful. I had literally just published an article about AI code quality and why you shouldn't blindly trust it, and here I was doing exactly that.
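To make the failure mode concrete: the MTA's SIRI StopMonitoring API keys everything off a stop ID, and each side of the street gets its own ID. Here's a minimal sketch of the kind of call my app was making. The stop ID is a hypothetical placeholder, the query parameters follow the public BusTime documentation, and my actual app's code differed:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BusCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical stop ID. Each physical stop, on each side of the
        // street, has its own MonitoringRef, so the ID silently encodes
        // the direction of travel.
        String stopId = "MTA_308209";
        String apiKey = System.getenv("BUSTIME_API_KEY");

        String url = "https://bustime.mta.info/api/siri/stop-monitoring.json"
                + "?key=" + apiKey
                + "&MonitoringRef=" + stopId
                + "&LineRef=MTA%20NYCT_B69";

        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());

        // The response parses cleanly whichever stop you asked about.
        // Nothing structural distinguishes the right direction from the
        // wrong one; only a requirement ("toward Kensington") can.
        System.out.println(resp.body());
    }
}
```

Query the northbound stop instead of the southbound one and everything still works perfectly. The bug lives entirely in which ID was chosen.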
The app had a bug. But it wasn't the kind of bug you'd necessarily catch with a typical AI code review prompt. It built, it ran, and it did a perfectly fine job parsing the JSON from the MTA API. But if I'd started with a simple requirement—even just a user story like "as a Park Slope resident, I want to catch the B69 headed toward Kensington so I can get deeper into Brooklyn"—the AI would have built it differently. The problem is that AI can only build the thing you tell it to build, which isn't necessarily the thing you wanted it to build. AI is really good at writing "correct" code that does the wrong thing.
My Brooklyn bus detour was a minor inconvenience. But it was a really useful, small-scale example of what I kept running into in my larger projects, too. There's an entire class of bugs that you can't find with structural analysis—no linter, no static analyzer, no AI code reviewer will catch them—because the code isn't wrong in any way that's visible from the code alone. You have to know what the code was supposed to do. You have to know the intent.
The data on why requirements matter goes back decades. Back in the 1990s, for example, the Standish CHAOS reports were a big eye-opener for me and many other people in the industry: large-scale data confirming what we'd been seeing on our own projects, that the most expensive defects trace back to misunderstood or missing requirements. Those reports really underscored the idea that poor requirements management, and especially incomplete or frequently changing specifications, was one of the primary drivers behind IT project failures. (And, as far as I can tell, it still is, and AI isn't helping things—see my O'Reilly Radar article, "Prompt Engineering Is Requirements Engineering.")
The idea that requirements problems really are the source of the most expensive kind of defects should make intuitive sense: If you build the wrong thing, you have to tear it apart and rebuild it. That's why I made requirements the foundation of the Quality Playbook, an open-source skill for AI tools like Claude Code, Cursor, and Copilot that I introduced in the previous article. I've spent decades doing test-driven development, partnering with QA teams, and welcoming the harshest code reviews from teammates who don't pull punches—and that experience led me to build a tool that uses AI to bring back quality engineering practices the industry abandoned decades ago. I've tested it against a range of open-source projects in Go, Java, Rust, Python, and C#, from small utilities to widely used libraries with tens of thousands of stars, and it's found real bugs in almost every project it's come across, including ones that have been confirmed and merged upstream.
I think there are a lot of wider lessons we can learn from my experience using requirements to help AI find bugs—especially security bugs. So in this article, I want to focus on the single most important thing I've learned from building it: Everything depends on requirements. Not just any requirements, but a specific kind of requirement that most projects don't have, that most AI tools don't ask for, and that turns out to be the key to making AI truly useful for verifying code quality.
Spec-driven development and what it misses
Developers using AI tools have been rediscovering the value of writing things down before asking the AI to build them. Spec-driven development (SDD) has become very popular, and for good reason. Addy Osmani wrote an excellent piece on this, "How to Write a Good Spec for AI Agents," and the core idea is sound: If you write a clear specification of what you want built, the AI produces dramatically better results than if you just describe it in a chat prompt and hope for the best.
I think SDD is important, and I'd encourage any developer working with AI to adopt it. But as I was building the Quality Playbook, I discovered that SDD has a blind spot that matters a lot for code quality. An SDD spec describes the how—what the implementation should look like. It tells the AI "implement a duplicate key check" or "add a retry mechanism with exponential backoff" or "create a REST endpoint that returns paginated results." That's useful for building things. But it's not enough for verifying them.
A requirement, by contrast, doesn't say "implement a duplicate key check." It says "users depend on Gson to reject ambiguous input so they don't silently accept corrupted data." The AI can reason about the second in ways it can't reason about the first, because the second has the purpose attached. When the AI knows the purpose, it can evaluate whether the code actually fulfills that purpose across all the edge cases, not just the ones the spec explicitly listed. That's how the Quality Playbook caught a bug in Google's Gson library, one of the most widely used JSON libraries in Java.
I think it's worth digging into that particular bug, because it's a great example of just how powerful requirements analysis can be for finding defects. The playbook derived null-handling requirements from Gson's own community—GitHub issues #676, #913, #948, and #1558, some dating back to 2016—then used those requirements to find that duplicate keys were silently accepted when the first value was null. It confirmed the bug by generating a failing test, then patched the code and verified the test passed. I've used Gson for years and done a lot of work with Java serialization, so I read the code and the fix myself before submitting anything—trust but verify. The fix was merged as https://github.com/google/gson/pull/3006, confirmed by Google's own test suite.
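For readers who want to see the shape of the failing case, here's a minimal sketch of my reconstruction of it, assuming Map deserialization and Gson's documented behavior of throwing JsonSyntaxException on duplicate keys. The actual test merged upstream may differ:

```java
import java.lang.reflect.Type;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.JsonSyntaxException;
import com.google.gson.reflect.TypeToken;

public class DuplicateKeyRepro {
    public static void main(String[] args) {
        Gson gson = new Gson();
        Type mapType = new TypeToken<Map<String, String>>() {}.getType();

        // Duplicate key with a non-null first value: rejected, as the
        // derived requirement demands.
        expectRejection(gson, "{\"a\":\"x\",\"a\":\"y\"}", mapType);

        // Duplicate key where the FIRST value is null: before the fix,
        // this was silently accepted, violating the requirement that
        // ambiguous input must be rejected.
        expectRejection(gson, "{\"a\":null,\"a\":\"y\"}", mapType);
    }

    static void expectRejection(Gson gson, String json, Type type) {
        try {
            Map<String, String> result = gson.fromJson(json, type);
            System.out.println("BUG: accepted " + json + " as " + result);
        } catch (JsonSyntaxException expected) {
            System.out.println("OK: rejected " + json);
        }
    }
}
```

The likely mechanism, as I read it: a duplicate check that asks whether Map.put returned a previous value sees null when the first value was JSON null, so the duplicate slips through. That's exactly the edge case a requirement ("reject ambiguous input") surfaces and a structural scan doesn't.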
That bug had been hiding in plain sight for years, through thousands of tests and countless code reviews. And it's possible that no structural analysis might ever have found it, because you needed the requirement to know it was wrong.
This distinction might sound academic, but it has very concrete consequences for whether your AI can actually find bugs in your code.
About half of all security bugs are invisible to structural analysis
The security world has known about the limits of structural analysis for a long time. The NIST SATE evaluations found that the best static analysis tools plateaued at around 50–60% detection rates for security vulnerabilities. Gary McGraw's Software Security: Building Security In (Addison-Wesley, 2006) explains why: Roughly 50% of security defects are implementation bugs, and the other 50% are design flaws. Static analysis tools target the implementation bugs—buffer overflows, SQL injection, format string vulnerabilities—because those are pattern-matchable. But design flaws are about intent: The system's architecture doesn't enforce the security properties it's supposed to enforce, and no amount of scanning the code will reveal that. A 2024 study by Charoenwet et al. (ISSTA 2024) showed this is still the case: They tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went completely undetected, and 76% of warnings in vulnerable functions were irrelevant to the actual vulnerability. The pattern is consistent across 20 years of research: There's a ceiling on what you can find by analyzing code, and it's around half.
There's a good reason for that limitation: the intent ceiling. A structural analysis tool is limited to reading the code and what it does; it has no way to evaluate what the developer meant it to do.
When an AI does a code review without requirements, it's limited to structural analysis: pattern matching, code smell detection, race condition analysis. It can ask "does this look right?" but it can't ask "does this do what it's supposed to do?" because it doesn't know what the code is supposed to do. Structural review catches genuinely important stuff—race conditions, null pointer issues, resource leaks, concurrency bugs. A structural review of a shell script will catch a missing fi, a bad variable expansion, a race condition. Structural review is useful, and structural review is what most AI code review tools do today.
But about half of all security defects are intent violations: things the code doesn't do that it was supposed to do, or things it does that it wasn't supposed to do. They're invisible without a specification to check against, and no tool will find them by reviewing code that is, structurally, perfectly sound. A structural review of a script that's, say, used to check router configuration files might find well-formed bash, correct syntax, proper quoting, and code that looks like it works and doesn't match known antipatterns. It wouldn't know the script is only validating three of the five access control rules it's supposed to enforce, because that's a requirements question, not a syntax question.
Or, more personally for me, this is what happened with my bus tracker app: The JSON parsing was flawless, the UI was correct, the timing logic worked perfectly. The only problem was that it showed buses headed toward Manhattan when I needed to go deeper into Brooklyn—and no structural analysis would ever catch that, because you need to know which direction I meant to go. That's me and my very clever AI hitting the intent ceiling.
The intent ceiling is a security problem
This is where it gets really serious, because security vulnerabilities are some of the most dangerous members of this class of invisible bugs.
Think about what a missing authorization check looks like to an AI code reviewer. Let's say you've got a web endpoint with a well-formed HTTP handler, properly sanitized inputs, and a safe database query. The code is clean, and it passes every structural check and static analysis tool you've thrown at it. Now you're testing it and, much to your dismay, you discover that the endpoint lets any authenticated user delete any other user's records, because nobody ever wrote down the requirement that says "only administrators can perform deletions." That's CWE-862: Missing Authorization, and it rose to #9 on the 2024 CWE Top 25 most dangerous software weaknesses.
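Here's what that looks like in code. Every name in this sketch is hypothetical and invented for illustration; the point is that the defect is a line that isn't there:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RecordService {
    // Hypothetical caller type, invented for illustration.
    record User(String name, boolean admin) {}

    // Structurally this method is clean: the caller was authenticated
    // upstream, the input is a validated long, and the query is
    // parameterized, so there's no SQL injection. Every pattern-based
    // check passes.
    public void deleteRecord(User caller, long recordId, Connection db)
            throws SQLException {
        // The bug is the line that ISN'T here. Nothing enforces
        // "only administrators can perform deletions," because that
        // requirement was never written down (CWE-862):
        // if (!caller.admin()) throw new SecurityException("admins only");
        try (PreparedStatement stmt =
                db.prepareStatement("DELETE FROM records WHERE id = ?")) {
            stmt.setLong(1, recordId);
            stmt.executeUpdate();
        }
    }
}
```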
That's not a coding error! It's a missing requirement.
That's McGraw's point: About half of all security defects aren't implementation bugs at all. They're design flaws—places where the system's architecture doesn't enforce the security properties it was supposed to enforce. A cross-site scripting vulnerability isn't always a failure to sanitize input. Sometimes it's a failure to define which inputs are trusted and which aren't. A privilege escalation isn't always a broken access check. Sometimes there was never an access check to begin with, because nobody specified that one was needed. These are intent violations, and they're invisible to any tool that doesn't know what the software is supposed to prevent.
AI code review tools today are very good at catching the implementation half of McGraw's split. They'll spot a SQL injection pattern, flag an unsafe deserialization, identify a buffer overflow. But they're working on the same side of the 50/50 line that static analysis has always worked on. The design half—the missing authorization checks, the unspecified trust boundaries, the security properties that were never written down—requires the same thing that catching my bus tracker bug required: knowing what the software was supposed to do in the first place.
How the Quality Playbook derives requirements (and how you can too!)
The problem most projects face is that they don't have formal requirements. What they have is code, documentation, commit messages, chat history, README files, and maybe some design docs. The question is how to get from that mess to a specification that an AI can actually use for verification.
The key insight I had while building the playbook was that every previous approach I tried asked the model to do two things at once: figure out what contracts exist AND write requirements for them. That doesn't work—the model runs out of attention trying to hold the entire behavioral surface in its head while also producing formatted requirements. So I split the work into four steps: First, have the AI read each source file and write down every behavioral contract it observes as a simple list. Second, derive requirements from those contracts plus the documentation. Third, check whether every contract is covered by a requirement. Fourth, assert completeness—and if there are gaps, go back to step one for the files with gaps.
The key idea is that the contracts file is external memory. When the model "forgets" a behavioral contract it noticed earlier, that forgetting is usually invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap.
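For illustration, here's what a contracts file might look like for a hypothetical config-loading module. This is my own sketch of the idea, not the playbook's exact format; the value is that step three can mechanically compare each line against the derived requirements:

```
# Behavioral contracts observed in ConfigLoader.java (hypothetical example)
- load() returns built-in defaults when the config file is absent
- load() throws ConfigError, never a raw IOException, when the file is unreadable
- keys are matched case-insensitively on read, preserved as written on save
- save() writes atomically; a crash mid-write never leaves a truncated file
#   ^ step 3 finds no requirement covering this last contract: a visible gap
```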
You don't need the Quality Playbook to do this—you can apply the same approach with any AI coding tool you're already using. Here's what I'd recommend:
- Write down what your software is supposed to guarantee. Not just what it does—what it's supposed to do, for whom, under what conditions. If you're practicing spec-driven development, you're already partway there. The next step is adding the why: Why does this behavior matter, who depends on it, and what goes wrong if it fails? That's the difference between a spec and a requirement, and it's the difference between an AI that can build your code and an AI that can verify it.
- Feed the AI your intent, not just your code. The intent is already sitting in your chat history, your design discussions, your Slack threads, your support tickets. Every Claude export, every Gemini conversation, every Cowork transcript contains design intent that never made it into specs: why a function was written a certain way, what failure prompted an architectural decision, what tradeoffs were discussed before choosing an approach. The design intent that used to require a human to extract and document is now sitting in your chat logs. Your AI can read the transcripts and extract the why.
- Look for the negative requirements. What should your software not do? What states should be impossible? What data should never be exposed? These negative requirements are often the most valuable because they define boundaries that structural review can't see. The missing authorization bug was a negative requirement: Users who aren't administrators must not be able to delete other users' records. The Gson bug was a negative requirement: Duplicate keys must not be silently accepted when the first value is null. If you can articulate what your software must never do, you've given the AI something powerful to check against. (A sketch of what such requirements look like written down follows this list.)
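To tie those three recommendations together, here's a hypothetical excerpt of what requirements with the why attached, including negative ones, might look like, using this article's own examples:

```
R-07: Departure predictions MUST be for the B69 toward Kensington.
      Why: the stop ID encodes direction; a rider who trusts the app
      depends on the direction being right.
R-12 (negative): Users who are not administrators MUST NOT be able to
      delete other users' records. Why: CWE-862; deletion is irreversible.
R-31 (negative): Duplicate JSON keys MUST NOT be silently accepted, even
      when the first value is null. Why: users depend on rejection of
      ambiguous input so corrupted data isn't silently parsed.
```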
In the next article, I'll talk about context management—the skill that actually determines whether your AI sessions produce good work or mediocre work. Everything I've described here depends on the AI having the right information at the right time, and it turns out that managing what the AI knows (and what it forgets) is an engineering discipline in its own right. I'll cover how I went from running 15 million tokens in a single prompt to splitting the playbook into independent phases with zero context carryover, and why that transition worked on the first try.
The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It's also available as part of awesome-copilot.
Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.
