Verification debt: the hidden cost of AI-generated code

fazy.medium.com

72 points by xfz 7 hours ago


fishtoaster - 5 hours ago

Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:

- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it

- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

- Get better AI-based review: Greptile, Bugbot, and half a dozen others

- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.
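The "deterministic verification" bullet can be sketched as a single gate script that an AI-authored change must pass before anyone (human or otherwise) reviews it. This is a minimal sketch; the tool choices (ruff, mypy, pytest) and paths are illustrative assumptions, not a prescribed stack:

```python
import subprocess
import sys

# Hypothetical pre-review gate: every deterministic check must pass
# before an AI-authored PR is even eligible for review. Tool names and
# paths below are assumptions for illustration.
CHECKS = [
    ["ruff", "check", "."],               # linter
    ["ruff", "format", "--check", "."],   # formatter drift
    ["mypy", "src"],                      # static type analysis
    ["pytest", "-q"],                     # unit and integration tests
]

def run_gate(checks):
    """Run each check; return the commands that failed (or were missing)."""
    failures = []
    for cmd in checks:
        try:
            ok = subprocess.run(cmd, capture_output=True).returncode == 0
        except FileNotFoundError:
            ok = False  # an uninstalled tool counts as a failure
        if not ok:
            failures.append(" ".join(cmd))
    return failures
```

Wired into CI as `sys.exit(1 if run_gate(CHECKS) else 0)`, the gate becomes binding rather than advisory.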

None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.

But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.

hnthrow0287345 - 6 hours ago

This still seems like technical debt to me. It's just debt with a much higher compounding interest rate and/or shorter due date. Credit cards vs. traditional loans or mortgages.

>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.

That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck deep in finding out what to build. Some developers like doing that (likely for free), but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.

If you are building POCs (and everyone understands it's a POC), then AI is actually better at getting those built, as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.

Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.

talkvoix - an hour ago

With a CS degree and 15 years of software engineering under my belt, I was initially skeptical of 'vibe coding'. But the article is right about this adolescent phase. I recently built my platform (https://voix.chat) 100% through agentic workflows. Having that much experience meant I didn't use the AI as a crutch to learn how to code; I used it as a hyper-productive junior dev while I played the paranoid senior architect. It allowed me to focus purely on the hard stuff: strict anti-flood mechanisms, brute-force protection, and overall server hardening. The AI handles the syntax; the human handles the paranoia.

jldugger - 5 hours ago

Verification debt has always been present; we just feel it acutely now, because we do verification wrong.

Claude and friends represent an increase in coders, without any corresponding increase in code reviewers. It's a break in the traditional model of reviewing as much code as you submit, and it all falls on human engineers, typically the most senior.

Well, that model kinda sucked anyways. Humans are fallible, and Ironies of Automation lays bare the failure modes. We all know the signs: 50 comments on a 5-line PR, a lonely "LGTM" on the 5000-line PR. This is not responsible software engineering or design; it is, as the author puts it, a big green "I'm accountable" button with no force behind it.

It's probably time for all of us on HN to pick up a book or course on TLA+ and elevate the state of software verification. Even if Claude ends up writing TLA+ specs too, at least that will be a smaller, simpler code base to review?
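For a taste of what a model checker like TLA+'s TLC buys you, here is a toy sketch in Python, in the same spirit: exhaustively explore every interleaving of a small concurrent protocol and check an invariant in all reachable states. The two-process lock below is deliberately broken (non-atomic check-then-set) and invented purely for illustration:

```python
from collections import deque

# Toy exhaustive state-space exploration, in the spirit of TLC.
# Per-process program counter: 0 = idle, 1 = saw lock free, 2 = critical.
# A state is the tuple (pc0, pc1, flag). The protocol is an assumption
# made up for this sketch: the flag is read and set in separate steps,
# which is the classic check-then-set race.

def successors(state):
    """Yield every state reachable in one step by either process."""
    pc, flag = list(state[:2]), state[2]
    for i in (0, 1):
        if pc[i] == 0 and flag == 0:   # read the flag, see it free
            nxt = pc.copy(); nxt[i] = 1
            yield (nxt[0], nxt[1], flag)
        elif pc[i] == 1:               # set the flag, enter critical
            nxt = pc.copy(); nxt[i] = 2
            yield (nxt[0], nxt[1], 1)
        elif pc[i] == 2:               # leave, release the flag
            nxt = pc.copy(); nxt[i] = 0
            yield (nxt[0], nxt[1], 0)

def check_mutex(init=(0, 0, 0)):
    """Breadth-first search of the full state space; return a state that
    violates mutual exclusion, or None if the invariant always holds."""
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        if state[0] == 2 and state[1] == 2:
            return state               # both in the critical section
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

Here `check_mutex()` finds the race and returns the violating state; a real spec language scales this idea to state spaces you could never eyeball, which is exactly the kind of review leverage a smaller spec gives you.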

Kerrick - 6 hours ago

> It gets 50% more pull requests, 50% more documentation, 50% more design proposals

Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased as trunk-based development, to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping for too much. :-)

ironman1478 - 6 hours ago

Verification has always been hard and always ignored, in software more than other industries. This is not specific to AI generated code.

I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.

mentalgear - 2 hours ago

> Output is mind-numbingly verbose. You ask for a focused change and get a dissertation with unsolicited comments and gratuitous refactoring.

The recent Devstral 2 (Mistral) is pretty precise and concise in its changes.

johngossman - 6 hours ago

This verification problem is general.

As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.

I prompted it to cite and footnote all sources, avoid plagiarism and AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!") but various online checkers did not detect AI writing or plagiarism. I spot checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a Uni I probably could).

Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.

But how do I know Gemini didn't hallucinate the same things Claude did?

Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.

bryanlarsen - 6 hours ago

Verification is the bottleneck now, so we have to adjust our tooling and processes to make verification as easy as possible.

When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy as possible to verify. Break your PR into palatable chunks. Document and comment to aid verification. Add tests that are easy for the reviewer to read, run, and tweak. Etc.
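One shape that reviewer-friendly tests can take is a table-driven test, where each case is a single readable row the reviewer can scan, run, or extend. A minimal sketch; `normalize_email` is a hypothetical function standing in for whatever the PR changes:

```python
# A table-driven test: inputs and expected outputs sit side by side,
# so the diff itself documents the intended behavior. `normalize_email`
# is a made-up example function, not from the thread.

def normalize_email(addr: str) -> str:
    """Strip surrounding whitespace and lowercase the domain part."""
    local, _, domain = addr.strip().partition("@")
    return f"{local}@{domain.lower()}"

CASES = [
    # (raw input,            expected output)
    ("Alice@Example.COM",    "Alice@example.com"),
    ("  bob@site.org  ",     "bob@site.org"),
    ("carol@MIXED.CaSe",     "carol@mixed.case"),
]

def test_normalize_email():
    for raw, expected in CASES:
        assert normalize_email(raw) == expected, (raw, expected)
```

A reviewer who wants to probe an edge case adds one row instead of reverse-engineering test plumbing.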

abetusk - 4 hours ago

Both empirically and theoretically, verification is often much more tractable than discovery.
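That asymmetry can be made concrete with subset-sum, a classic NP problem chosen here purely as an illustration: checking a proposed answer is a near one-liner, while discovering one by brute force enumerates exponentially many subsets.

```python
from itertools import combinations

# Verification vs. discovery for subset-sum: given nums and a target,
# does some subset of nums sum to the target?

def verify(nums, subset, target):
    """Cheap check of a proposed certificate."""
    return all(x in nums for x in subset) and sum(subset) == target

def discover(nums, target):
    """Brute-force search over all 2^n subsets for any valid certificate."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None
```

The same shape shows up in code review: confirming that a finished change does what it claims is usually far cheaper than producing the change from scratch.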

Software development is a highly complex task and verification becomes not just validation of the output but also verification that the work is solving the problem desired, not just the problem specified.

I'm empathetic to that scenario, but this was a problem with software development to begin with. I would much rather be in a situation of reducing friction to verification than reducing friction to discovery.

Cognitive load might be the same but now we get a potential boost in productivity for the same cost.

poemxo - 2 hours ago

We were verifying code before? And wouldn't AI help with verification at least for the trivial flaws?

chromaton - 5 hours ago

Historically, the cycle has been requirements -> code -> test, but with coding becoming much faster, the bottlenecks have changed. That's one of the reasons I've been working on Spark Runner to help automate testing for web apps: https://github.com/simonarthur/spark-runner

maxdo - 6 hours ago

Code is a fully disposable way to generate custom logic.

Hand-crafted, scalable code will be a very rare phenomenon.

There will be a clear distinction between the two.

bensyverson - 5 hours ago

It comes down to trust. I was not able to trust GPT 4.1 or Sonnet 3.5 with anything other than short, well-specified tasks. If I let them go too long (e.g. in long Cursor sessions), they would lose the plot and start thrashing.

With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.

I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.

Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.

ritcgab - 4 hours ago

At the end of the day, it's about liability. Whether you use AI tools to generate the code or not, you are the author of the code, and such authorship implies the liability that you are being paid to take.

VanTodi - 6 hours ago

I've come to the point where I think generated code is nothing better than a random package I install. Did I read it all, or did I just accept what was promised? The latter. Can it bite me in the butt somewhere down the road? Probably. But at least I currently have more doubt about the generated code than about a random package I picked up somewhere on git whose readme I only partly skimmed.

apical_dendrite - 6 hours ago

My company recently hired a contractor. He submits multi-thousand-line PRs every day, far faster than I can review them. This would maybe be OK if I could trust his output, but I can't. When I ask him really basic questions about the system, he either doesn't know or he gets it wrong.

This week, I asked for some simple scripts that would let someone load data in a local or staging environment, so that the system could be tested in various configurations. He submitted a PR with 3800 lines of shell scripts. We do not have any significant shell scripts anywhere else in our codebase. I spent several hours reviewing it with him - maybe more time than he spent writing it.

His PR had tons and tons of end-to-end tests of the system that didn't actually test anything - some said they were validating state, but passed if a GET request returned a 200. There were a few tests that called a create API. The tests would pass if the API returned an ID of the created object. But they would ALSO pass if the API didn't return an ID. I was trying to be a good teacher, so I kept asking questions like "why did you make this decision" to try to have a conversation about the design choices, and it was very clear that he was just making up bullshit rationalizations - he hadn't made any decisions at all.

There was one particularly nonsensical test suite - it said it was testing X but included API calls that had nothing to do with X. I was trying to figure out how he had come up with that, and then I realized: I had given him a Postman export with some example API requests, and in one of the requests I had gotten lazy and modified the request to test something, but hadn't modified the name in Postman. So the LLM had assumed the request was related to the old name and used it when generating a test suite, even though these things had nothing to do with each other. He had probably never actually read the output, so he had no idea that it made no sense.

When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.
