Discussion
Toward automated verification of unreviewed AI-generated code
jghn: I do think that GenAI will lead to a rise in mutation testing, property testing, and fuzzing. But it's worth people keeping in mind that there are reasons why these aren't already ubiquitous. Among other issues, they can be computationally expensive, especially mutation testing.
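For readers who haven't used these tools: a minimal sketch of what a property-based test looks like, using Python's Hypothesis library (the FizzBuzz example mirrors the one discussed downthread; the specific properties chosen here are illustrative):

```python
# Property-based testing sketch: instead of asserting on hand-picked
# inputs, state properties that must hold for *any* input and let the
# framework search for counterexamples.
from hypothesis import given, strategies as st

def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

@given(st.integers(min_value=1, max_value=10_000))
def test_fizzbuzz_properties(n):
    out = fizzbuzz(n)
    if n % 3 == 0:
        assert "Fizz" in out      # multiples of 3 never show the bare number
    if n % 5 == 0:
        assert "Buzz" in out      # multiples of 5 never show the bare number
    if n % 3 != 0 and n % 5 != 0:
        assert out == str(n)      # everything else round-trips through str()
```

The cost point is easiest to see with mutation testing: each generated mutant means rerunning the test suite, so runtime scales roughly with mutants × suite duration.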
tedivm: While I understand why people want to skip code reviews, I think it is an absolute mistake at this point in time. I think AI coding assistants are great, but I've seen them fail or go down the wrong path enough times (even with things like spec-driven development) that I don't think it's reasonable to not review code. Everything from development paths left in production code to improper implementations to security risks: all of those are just as likely to happen with an AI as with a human, and any team that lets humans push to production without a review would absolutely be ridiculed for it.

Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-AI, but I do think that developers have some responsibility for the code they create with those tools.
Ancalagon: Even with mutation testing, doesn't this still require review of the testing code?
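For context, mutation testing deliberately injects small bugs ("mutants") into the code under test and checks whether the suite notices. A hand-rolled illustration of the idea (real tools such as mutmut or PIT generate the mutants automatically):

```python
# Mutation testing, hand-rolled: a mutant that a weak suite misses
# ("survives") and a stronger suite catches ("kills").

def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def fizzbuzz_mutant(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 4 == 0:              # injected bug: 3 changed to 4
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def weak_suite(impl) -> bool:
    return impl(15) == "FizzBuzz" and impl(5) == "Buzz"

def strong_suite(impl) -> bool:
    return weak_suite(impl) and impl(9) == "Fizz" and impl(4) == "4"

assert weak_suite(fizzbuzz) and weak_suite(fizzbuzz_mutant)          # mutant survives
assert strong_suite(fizzbuzz) and not strong_suite(fizzbuzz_mutant)  # mutant killed
```

The question stands, though: surviving mutants tell you the suite is too weak, but someone still has to judge whether the assertions encode the right spec in the first place.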
sharkjacobs: I'm having a hard time wrapping my head around how this can scale beyond trivial programs like simplified FizzBuzz.
Lerc: I agree that it would be a mistake to use something like this anywhere people depend upon specific behaviour of the software. The only way we will get to the point where we can do this is by building things that don't quite work and then fixing the problems.

There is a subset of things where it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results the benchmark is often "does it look right?" rather than "is it strictly accurate?"
jryio: Correct. Where did the engineering go? First it was in code files. Then it went to prompts, followed by context, and then agent harnesses. I think the engineering has gone into architecture and testing now.

We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
phailhaus: Using FizzBuzz as your proxy for "unreviewed code" is extremely misleading. It has practically no complexity, it's completely self-contained and easy to verify. In any codebase of even modest complexity, the challenge shifts from "does this produce the correct outputs" to "is this going to let me grow the way I need it to in the future" and thornier questions like "does this have the performance characteristics that I need".
loloquwowndueo: > is this going to let me grow the way I need it to in the future

This doesn't matter in the age of AI - when you get a new requirement, just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.

For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents, each coming up with an implementation, then kill the ones that don't make the cut.

Or so goes the gospel in the age of AI. I'm being totally sarcastic - I don't believe in AI coding.
Swizec: > including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements

Let me guess, you've never worked in a real production environment?

When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There are people relying on this code that you've never even heard of.

Really good post about why AI is a poor fit in software environments where nobody even knows the full requirements: https://www.linkedin.com/pulse/production-telemetry-spec-sur...
baq: it isn't gospel, it's perspective. if you care about the code, it's obviously bonkers. if you care about the product... code doesn't matter - it's just a means to an end. there's an intersection of both views in places where code actually is the product - the foundational building blocks of today's computing software infrastructure like kernels, low level libraries, cryptography, etc. - but your typical 'uber for cat pictures' saas business cares about none of this.
hrmtst93837: People treating this as a scaling problem are skipping the part where verification runs into undecidability fast.

Proving a small pure function is one thing, but once the code touches syscalls, a stateful network protocol, time, randomness, or messy I/O semantics, the work shifts from 'verify the program' to 'model the world well enough that the proof means anything,' and that is where the wheels come off.
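One concrete version of "model the world": even something as simple as time has to be replaced by a model before it can be tested, and the model's assumptions quietly become part of what the test proves. A sketch (the names here are made up for illustration):

```python
# To test code that touches time, you must inject a model of time.
# Whatever the fake clock assumes (monotonic, no NTP steps, no leap
# seconds) is silently baked into everything the test "proves".
from dataclasses import dataclass

@dataclass
class FakeClock:
    now: float = 0.0
    def time(self) -> float:
        return self.now
    def advance(self, seconds: float) -> None:
        self.now += seconds

@dataclass
class RateLimiter:
    clock: FakeClock        # injected instead of calling time.time()
    min_interval: float = 1.0
    last: float = -1e9

    def allow(self) -> bool:
        t = self.clock.time()
        if t - self.last >= self.min_interval:
            self.last = t
            return True
        return False

clock = FakeClock()
rl = RateLimiter(clock)
assert rl.allow()           # first call is allowed
assert not rl.allow()       # too soon
clock.advance(1.0)
assert rl.allow()           # allowed again after the interval
```

Syscalls, network state, and randomness all need the same treatment, and each model is another spec that can be wrong.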
andai: ...in FizzBuzz
person22: I work on a product that meets your criteria. We can't fix a class of defects because once we ship, customers will depend upon that behavior, and changing it is very expensive and takes years to deprecate and age out. So we are stuck with what we ship and need to be very careful about what we release.
pron: > The code must pass property-based tests

Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.

There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's attempt to have agents write a C compiler. They had thousands of human-written tests, but at some point the agents couldn't get the software to converge. Fixing a bug created another.
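The "who writes the tests" problem is easy to demonstrate: a property-based test is only as strong as the property it encodes. An illustrative sketch (not from the Anthropic experiment), again using Hypothesis:

```python
# A buggy sort passes a plausible-looking but incomplete property.
from hypothesis import given, strategies as st

def broken_sort(xs):
    return sorted(set(xs))  # bug: silently drops duplicates

@given(st.lists(st.integers()))
def test_output_is_ordered(xs):
    out = broken_sort(xs)
    # This property holds, so the suite is green...
    assert all(a <= b for a, b in zip(out, out[1:]))

@given(st.lists(st.integers()))
def test_output_is_permutation_of_input(xs):
    out = broken_sort(xs)
    # ...but the stronger property fails; Hypothesis will find
    # a counterexample like xs=[0, 0].
    assert sorted(out) == sorted(xs)
```

If the agent writes both the code and the properties, nothing stops it from converging on the weak version of the spec.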
rigorclaw: does the cost of writing good property tests scale better than the cost of code review as the codebase grows? seems like the bottleneck just moves from reviewing code to reviewing specs.
duskdozer: So are we finally past the stage where people pretend they're actually reading any of the code their LLMs are dumping out?
empath75: In a year people will be complaining about human written code going into production without LLM review.
empath75: > When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There are people relying on this code that you've never even heard of.

Well, now it'll take them 5 minutes to rewrite their code to work around your change.
fhd2: Who's "we"?

I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.

But it's arguably faster in the short run. Also cheaper.

So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.

If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.

How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.
boombapoom: production ready "fizz buzz" code. lol. I can't even continue typing this response.
builtbyzac: The revenue-from-cold-start problem is harder than most AI posts acknowledge. I built products and ran distribution for 72 hours with a Claude Code agent — zero sales. Not because the agent couldn't do the work, but because the work (audience, credibility, trust) takes longer than 72 hours. The AI capability problem is mostly solved; the distribution and trust problem isn't.
phillipclapham: There's a layer above this that's harder to automate: verifying that the architectural decision was right, not just the implementation. You can lint for correctness, run the tests, catch the bug classes. But "this should've been a stateless function, not a microservice" or "this abstraction is wrong for the problem"? That's not in the artifact. An agent can happily produce code that passes every automated check and still represent a fundamentally wrong design choice.

The thread's hitting on this with "who writes the tests" but I think it undersells the scope. You're not just shifting responsibility, you're also hitting a ceiling: test specs can verify behavior, not decisions. It's worth thinking about what it'd even mean to verify the decision trail that produced the code, not just the code itself.
wordpad: > AI capability problem is mostly solved; the distribution and trust problem isn't.

SaaS opportunity? Maybe some sort of marketplace of AI-written applications and services, with discovery features?