Discussion
Entropic Thoughts
ryanackley: I agree completely. I haven't noticed much improvement in coding ability in the last year, and I'm using frontier models. What's been the game changer are tools like Claude Code: automatic agentic tool loops purpose-built for coding. That's what I've seen as the impetus for mainstream adoption, rather than noticeable improvements in ability.
sho_hn: My anecdotal experience is rather different. I write a lot of C++ and QML code. Codex 5.3 was the first model I used that would regularly generate code that passes my expert smell test, and it has turned generative coding from a time sink into a tool I can somewhat rely on. QML is a declarative-first markup language that is a superset of the JavaScript syntax. It's niche and doesn't have a giant amount of training data in the corpus. Codex is the first model that doesn't badly botch it or prefer to write reams of procedural JS embeds (after steering). Much reduced, too, is the tendency to go overboard spamming everything with clouds of helper functions/methods in both C++ and QML. It knows when to stop, so to speak.
davecoffin: I've been able to supercharge a hobby project of mine over the last couple of months using Opus 4.6 in Claude Code. I still had to collaborate and write code, but Claude did maybe 75% of the work to add meaningful new features to an iOS/Android native mobile app, including Live Activities, which is so overly complicated I would not have been able to figure it out myself. I have it running in a folder that contains both my back-end API (Express) and my mobile app (NativeScript), so it does back-end and front-end work simultaneously to support new features. This wasn't possible 8 months ago.
mavamaarten: Maybe n=1, but I disagree? I notice that Sonnet 4.6 follows instructions much better than 4.5, and it generates code much closer to our already-in-place production code. It's just a point release, and it isn't a significant upgrade in terms of features or capabilities, but it works... better for me.
jeffnv: I don't think it's true, but am I alone in wishing it was? My world is disrupted somewhat but so far I don't think we have a thing that upends our way of life completely yet. If it stayed exactly this good I'd be pretty content.
cj: I agree with your sentiment, but I think we've yet to see the full application of the current technology. (Even if LLMs themselves don't improve, there's significant opportunity for people to use it in ways not currently being done)
Flavius: > This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?
Because it's not true. They have improved tremendously in the last year, but it looks like they've hit a wall in the last three months. Still seeing some improvements, but mostly in skills and token-use optimization.
postflopclarity: > but mostly in skills and token use optimization.
I have heard rumors that token-use optimization has been a recent focus, to tidy up the financials of these companies before they IPO. Take that with a grain of salt, though.
reedf1: Given the general consensus that a step function occurred with Opus 4.5/4.6 only three months ago, it seems like an insane omission.
jeremyjh: This has been the general consensus for about three years now. "Drastic increases in capability have happened in the last 3-6 months" has been a constant refrain. Without any data from the study past September, I think it's not unreasonable, if you want to make an argument based on evidence. For me personally, I agree with you; I'm really seeing it as well.
postflopclarity: > "Drastic increases in capability have happened the last 3-6 months" have been a constant refrain.
Well, yeah, because that's been the experience for many people. Three years ago, trying to use ChatGPT 3.5 for coding tasks was more of a gimmick than anything else, and it was basically useless for helping me with my job. Today, agentic Opus 4.6 provides more value to me than probably two more human engineers on my team would.
fluidcruft: Yeah I'm not buying the last bit about lower MSE with one term in the model vs two (Brier with one outcome category is MSE of the probabilities). That's the sort of thing that would make me go dig to find where I fucked up the calculation.
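The nesting point above can be checked numerically: with one outcome category, the Brier score is the mean squared error of the predicted probabilities, and an ordinary least-squares line can never score worse in-sample than the best constant, because the constant is the slope-zero special case of the line. A minimal sketch with made-up acceptance rates (none of these numbers come from the post):

```python
import numpy as np

# Hypothetical acceptance rates for five model releases (invented values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])       # release date, arbitrary units
y = np.array([0.62, 0.70, 0.66, 0.71, 0.68])  # acceptance probability

# Best constant fit: the mean minimizes MSE among all constants.
const_pred = np.full_like(y, y.mean())
mse_const = np.mean((y - const_pred) ** 2)

# Least-squares line: contains the constant as its slope-0 special case.
slope, intercept = np.polyfit(x, y, 1)
line_pred = slope * x + intercept
mse_line = np.mean((y - line_pred) ** 2)

# In-sample, the line can only do as well as or better than the constant.
assert mse_line <= mse_const
print(mse_const, mse_line)
```

So if the post's constant beats the slope on the same data the slope was fit to, something other than plain in-sample MSE minimization must be going on, which is exactly the commenter's suspicion.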
roxolotl: These studies are always really hard to judge the efficacy of. I would say, though, that the most surprising thing to me about LLMs in the past year is how many people got hyped about the Opus 4.5 release. Having used Claude Code at work since it was released, I haven't really noticed any step changes in improvement. Maybe that's because I've never tried to use it to one-shot things? Regardless, I'm more inclined to believe that 4.5 was the point where people started using it after having given up on copy/pasting output in 2024. If you're going from chat to an agentic level of interaction, it's going to feel like a leap.
tossandthrow: Nah, pre 4.5 it was not comfortable to use agentic coding.
sunaurus: I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.
suddenlybananas: > well, yeah. because that's been the experience for many people.
Yes, but this blog post argues that, at least from 2024 to the end of 2025, those people were mistaken.
idorozin: My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively. When you combine models with:
- tool use
- planning loops
- agents that break tasks into smaller pieces
- persistent context / repos
the practical capability jump is huge.
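The combination described above is essentially an agentic tool loop: the model repeatedly picks a tool, observes the result, and decides when it's done. A toy sketch, where `call_model` is a hypothetical stub standing in for any real LLM API and `read_file` is a fake tool (none of this is from a real harness):

```python
# Toy tool; real harnesses expose shell, file edits, search, etc.
def read_file(path: str) -> str:
    fake_fs = {"notes.txt": "TODO: rename foo to bar"}
    return fake_fs.get(path, "")

TOOLS = {"read_file": read_file}

def call_model(history):
    """Hypothetical stub for an LLM API call: a real model would plan,
    choose a tool based on the history, and eventually emit a final answer."""
    if not any(msg["role"] == "tool" for msg in history):
        return {"tool": "read_file", "args": {"path": "notes.txt"}}
    return {"final": "The file asks us to rename foo to bar."}

def agent_loop(task: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)
        if "final" in action:          # model decided it is done
            return action["final"]
        # Dispatch the requested tool and feed the result back as context.
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(agent_loop("Summarize notes.txt"))
```

The planning loop, tool dispatch, and persistent history are all harness; the model call itself is unchanged, which is the commenter's point about where the capability jump comes from.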
camdenreslink: From my personal experience, they have gotten better, but they haven't unlocked any new capabilities. They've just improved at what I was already using them for. At the end of the day, they still produce code that I need to manually review and fully understand before merging, usually with a session of back-and-forth prompting or manual edits by me. That was true two years ago, and it's true now (except two years ago I was copy/pasting from the browser chat window, and we have some nicer IDE integration now).
josephg: Yep, this has been my experience too. I tried GPT-3.5 for translating code from TypeScript to Rust. It made many mistakes in Rust: it couldn't fix borrow-checker issues, the context was so small that I could only feed it small pieces of my program at a time, and it introduced new bugs into the algorithm. Yesterday I had an idea for a simple macOS app I wanted. I prompted Claude Code, and it programmed the whole thing start to finish in 10 minutes, no problem. I asked it to optimize the program using a technique I came up with, and it did. I asked it to make a web version, and it did. (Though for some reason, the web version needed several rounds of "it doesn't work, here's the console output".) I'm slowly coming to terms with the idea that my job is fundamentally changing. I can get way more done by prompting Claude than I can by writing the code myself.
antisthenes: They are getting better, but they are also hitting diminishing returns. There's only so much data to train on, and we are unlikely to see giant leaps in performance like we did in 2023/2024. 2026-27 will be the years of primarily ecosystem/agentic improvements and cost reduction.
WithinReason: If you look at a separate trend line for the smaller Sonnet models, you can see rapid improvement.
GaggiX: How does the "constant function" result fit the data points better than a slope that has two parameters instead of one?
Havoc: As the models become more capable, people's commits will also become more ambitious. So I'd say fairly flat commit-acceptance numbers make sense even in the context of improving LLMs.
sd9: You really can't model 5 data points with a linear regression or a step function. It's just not enough data. Not to mention the models are of different sizes / use cases, and from two different labs. I feel like what we've observed generally is that similarly sized models released by different labs at similar times are pretty similar. I think the only reasonable comparison to read into is Sonnet 3.5 -> 3.7 -> 4.5. But yeah, you just can't draw a line through this thing. My experience is that Claude Opus has been ahead of the rest since about mid-2025, but that's really just vibes. I use Claude Code too, and it's hard to tell how much of the improvement is the model vs. the harness. I will die on the hill that LLMs are getting better, particularly Anthropic's releases since December. But I can't point at a graph to prove that; I'm just drawing on my personal experience.
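The small-sample concern can be made concrete with leave-one-out cross-validation: with only five noisy observations, the extra slope parameter often hurts out-of-sample even though it always helps in-sample. A sketch with invented numbers (not the post's data):

```python
import numpy as np

def loo_mse(x, y, degree):
    """Leave-one-out MSE for a least-squares polynomial fit of the given degree."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i          # hold out point i
        coeffs = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coeffs, x[i])        # predict the held-out point
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

# Five noisy, essentially flat observations (made-up values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.68, 0.61, 0.70, 0.63, 0.69])

mse_const = loo_mse(x, y, degree=0)  # constant model, one parameter
mse_line = loo_mse(x, y, degree=1)   # linear model, two parameters

print(mse_const, mse_line)
```

On data like this the constant model wins out-of-sample: the fitted slope mostly chases noise in the four remaining points, so the simpler model generalizes better, which is how a one-parameter fit can legitimately beat a two-parameter one.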
sigmar: > This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.
I mean, sure. But it's obvious in that graph that the single OpenAI model is dragging down the right side. Wouldn't it be better to stick to analyzing models from only one lab, so that this shows change over time rather than differences between models?
jwpapi: I've had this suspicion for a while: I think we just got way better at harnessing, not at the models' actual reasoning. We got better at giving the model the right context and tools to do the stuff we need done, but the actual thinking hasn't improved.
utopiah: I gave up on trying months ago; you can see the timeline at the top of https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... Truth is, I'm probably wrong. I should keep on testing... but at the same time, I gave up precisely because I didn't think the trend was fast enough to justify checking so frequently. Now I just read this kind of post and ask around (mainly arguing in comments, asking for genuine examples that should be "surprising", and being disappointed), and that seems to be a good enough proxy. I should, though, as I mentioned in another comment, keep track of failed attempts. PS: I test solely on self-hosted models (if not on my machine, at least on machines I could set up) because I do NOT trust the scaffolding around proprietary closed-source models. I can't verify that nobody is in the loop.
orwin: I think what happened with static image generation is happening with LLMs. The tools around the models are getting better, but the AI improvements themselves have stalled: the error rate stays the same (though external tools curate the results, so you won't notice unless you run your own model), and accuracy is still improving, but slower and slower, and it never reaches the 'perfect' point. Basically Stable Diffusion in early 2025.
GaggiX: Image quality has improved a lot in recent months thanks to better models. The ability of people to notice these improvements is plateauing because they are not trained to spot artifacts, which are becoming more obscure.
Zababa: From the METR study (https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs...):
> To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches, as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code, or code quality issues.
I would also advise taking a look at the rejection reasons for the PRs. For example, Figure 5 shows two rejections for "code quality" because of (and I quote) "looks like a useless AI slop comment." This is something models still do, but it is also very easily fixable. I think in that case the issue is that the expected level of commenting hasn't been properly formalized in the repo, and the model couldn't deduce it from the context it had. As for the article, I think mixing all the models together doesn't make sense. For example, maybe a slope describes the improving Claude Sonnet models better than a step function.
thomascgalvin: Anecdotally, I haven't seen any real improvement from the AI tools I leverage. They're all good-ish at what they do, but all still lie occasionally, and all need babysitting.I also wonder how much of the jump in early 2025 comes from cultural acceptance by devs, rather than an improvement in the tools themselves.
zozbot234: The latest open LLMs, released around Chinese New Year, are quite amazing, especially the smaller models that can run seamlessly even on a cheap Macbook Neo. These smaller models aren't good enough for agentic use cases or coding yet (though the Qwen A3B models are something to watch, since they might become viable even on that hardware!), but they'll be quite OK for more casual uses of AI.
aerhardt: I feel that two things are true at the same time:
1) Something happened during 2025 that made the models (or, crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.
2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture even in rather small scopes is unsound. People say AI is “good at front end”, but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack rather than properly refactoring the markup).
These two forces feel true simultaneously but in permanent tension. I still cannot make up my mind and see the synthesis in the dialectic: where this is truly going, whether we're meaningfully moving forward or mostly moving in circles.
orwin: > People say AI is “good at front end”I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.
sigbottle: LLMs have 100% gotten better, but it's hard to say if they're "intrinsically better", if that makes sense.
> OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024 [1]
That's evidence against "intrinsically better". They've also trained on the entire internet, and we only have one internet. However, late 2024 brought o1, and early 2025 brought DeepSeek R1 and o3. These were definitely significant reasoning models: this is where test-time compute and serious RL pipelines arrived. Mid-2025 was when models really started getting integrated with tool calling. Late 2025 is when they really became agentic and integrated well with the CLI (at least for me). For example, Codex would at least try to run some smoke tests for itself to test its code. In early 2026, the trend appears to be harness engineering: as opposed to the "context engineering" of 2025, where we had to preciously babysit one model's context, we now make it easier to rebuild context (classic CS trick, by the way: rebooting is easier than restoring stale state [2]) and really lean into raw CLI tool calling, subagents, etc.
[1] https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
[2] https://en.wikipedia.org/wiki/Kernel_panic
FWIW, AI programming has still been as frustrating as it was when it was just test-time compute in 2025. Maybe it's because I don't have the "full harness", but the models still have embedded programming styles such as silent fallback values and overly defensive programming, obviously gleaned from the desire to just pass all tests rather than from truly good program design. I've been able to do more, but I have to review more slop... Also, the agents are really unpleasant to work with if you're trying to have any reasonable conversation with them and not just delegate to them.
It's as if they think the entire world revolves around them and all information from the operator is BS, if you try to open a proper two-way channel. It seems like 2026 will go full zoom on AI tooling because the goal is to replace devs, but hopefully AI agents become actually nice to work with: not sycophantic, but not passive-aggressively arrogant either.
BoppreH: Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. This also explains why training on more average-quality data (if there's any left) isn't yielding improvements. But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support over a phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar. Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric. I expect future improvements to come in large discontinuous steps from harness improvements, especially around subagents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.
jygg4: The issue with LLMs is trust, and I don't see that ever going away. Humans have learned to trust other humans over a long time scale, with rules in place to control behaviour.
xyzsparetimexyz: Steet? Do you mean street? They're smarter in the same way a search engine is smarter.
codeulike: > This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?
Because hype makes money.