Discussion
Claude Code Found a Linux Vulnerability Hidden for 23 Years
jazz9k: This does sound great, but the cost of tokens will prevent most companies from using agents to secure their code.
epolanski: I don't buy it. Inference cost has dropped 300x in 3 years; there's no reason to think this won't keep happening with improvements in models, agent architecture, and hardware. Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost. From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs the low-single-digit delta in benchmark performance. Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.
NitpickLawyer: Tokens aren't more expensive than highly trained meatbags today. There's no way they'll be more expensive "tomorrow"...
dist-epoch: > "given enough eyeballs, all bugs are shallow"

Time to update that: "given a 1-million-token context window, all bugs are shallow"
_pdp_: The title is a little misleading. It was Opus 4.6 (the model) that found the bug; you could discover this with some other coding agent harness. The other thing that bugs me, and frankly I don't have the time to try it out myself, is that they did not compare to see whether the same bug would have been found with GPT 5.4 or perhaps even an open-source model. Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for Claude Code.
cookiengineer: > Nicholas has found hundreds more potential bugs in the Linux kernel, but the bottleneck to fixing them is the manual step of humans sorting through all of Claude’s findings

No, the problem is sorting out thousands of false positives from Claude Code's reports. 5 valid reports out of 1000+ is statistically worse than running a fuzzer on the codebase. Just sayin'
mgraczyk: No, the title is correct, and you are misreading or didn't read. It was found with Claude Code; that's the quote. This isn't a model eval, it's an Anthropic employee talking about Claude Code, so comparing to other models isn't something to reasonably expect.
jason1cho: This isn't surprising. What is not mentioned is that Claude Code also found one thousand false-positive bugs, which developers spent three months ruling out.
riffraff: ..and three months to review the false positives
addandsubtract: On the other hand, some bugs take three months to find. So this still seems like a win.
up2isomorphism: But on the other hand, Claude might introduce more vulnerabilities than it discovers.
yunnpp: Code review is the real deal for these models. This area seems largely underappreciated to me. Especially for things like C++, where static analysis tools have traditionally generated too many false positives to be useful, the LLMs seem especially good. I'm no black hat, but I have found similarly old bugs at my own place. Even if shit is hallucinated half the time, it still pays off when it finds that really nasty bug. Instead, people seem to be infatuated with vibe coding technical debt at scale.
qsera: > Code review is the real deal for these models.

Yeah, that is what I have been saying as well...

> Instead, people seem to be infatuated with vibe coding technical debt at scale.

Don't blame them. That is what AI marketing pushes, and people are sheep to marketing. I understand why AI companies don't want to promote it: they understand that the lowest-common-denominator majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments they are getting...
112233: This is always overlooked. AI stories sound like "with the right attitude, you too can win $10M in the lottery, like this man just did." Running an LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them); of course, only the lottery winners who pulled the actually correct report from the bag will write an article in the Evening Post.
dist-epoch: > On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (Fridays and Tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us. ... Also it's interesting to keep thinking that these bugs are within reach of criminals so they deserve to get fixed.

https://lwn.net/Articles/1065620/
userbinator: Not "hidden", but probably more like "no one bothered to look".

> declares a 1024-byte owner ID, which is an unusually long but legal value for the owner ID.

When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.

> it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer

This is something a lot of static analysers can easily find. Of course, asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but it could be a good starting point for further inspection.
NitpickLawyer: > This is something a lot of static analysers can easily find.

And yet they didn't, for 20+ years (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives). I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.
userbinator: Remember Heartbleed in OpenSSL? That long predated LLMs, but it's the same story: some bozo forgot how long something should/could be, and no one else bothered to check either.