Discussion
Simon Willison’s Weblog
lbreakjai: We're going to do it again, aren't we? We're going to take something simple and sensible ("write tests first", "small composable modules", etc.), give it a fancy complicated name ("Behavior-Constrained Implementation Lifecycle pattern", "Boundary-Scoped Processing Constructs pattern", etc.), and create an entire industry of consultants and experts selling books and enterprise coaching around it, each swearing they have the secret sauce and the right incantations.

The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
flir: Has anyone staked a claim to "Agile AI" yet?
joelthelion: I've seen several already. There's a huge business opportunity (at our expense, of course).
Rohunyyy: At this point what is happening that is not at our expense? Hell if I could be a grifter and start another .ai company honestly I would. I guess I am just not that talented.
jihadjihad: I wish there was a little more color in the Testing and QA section. While I agree with this:

> A comprehensive test suite is by far the most effective way to keep those features working.

there is no mention at all of LLMs' tendency to write tautological tests -- tests that pass because they are defined to pass. Or tests that are not at all relevant or useful, and are ultimately noise in the codebase wasting cycles on every CI run. Sometimes, to pass the tests, the model might even hardcode a value in a unit test itself!

IMO this section is a great place to show how we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.
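For illustration, a minimal sketch of that failure mode (the function and values here are invented, not from the guide), next to a test that actually checks a spec:

```python
# A hypothetical tautological test beside a meaningful one.

def compute_total(items, tax):
    return round(sum(items) * (1 + tax), 2)

def test_compute_total_tautological():
    # Bad: compares the function to itself -- this can never fail,
    # no matter how wrong compute_total is.
    assert compute_total([10, 20], tax=0.1) == compute_total([10, 20], tax=0.1)

def test_compute_total_meaningful():
    # Good: the expected value (33.0) is written down independently,
    # from the spec, so a broken implementation would be caught.
    assert compute_total([10, 20], tax=0.1) == 33.0
```

A useful review question for LLM-written tests is simply: where did each expected value come from, the spec or the code under test?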
Thews: There was a mention of using agents to build projects into WASM. I've had the best luck telling it to use zig to compile to webassembly. It shortens the time to completion by a significant amount.
simonw: That's a great tip, thanks! I did not know Zig could do this.

You can "pip install ziglang" and get the right version for different platforms too.
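For anyone wanting to try this, a sketch of the invocation, assuming the `ziglang` PyPI wheel (which bundles the Zig toolchain and is run as `python -m ziglang`); the filenames are placeholders:

```python
# Sketch: build the "python -m ziglang" cross-compile command for WebAssembly.
# Assumes "pip install ziglang"; source/output filenames are placeholders.
import sys

def zig_wasm_command(source: str, output: str) -> list[str]:
    # Zig's clang-compatible "cc" driver cross-compiles via the -target flag,
    # e.g. wasm32-wasi for WASI-hosted WebAssembly.
    return [sys.executable, "-m", "ziglang", "cc",
            "-target", "wasm32-wasi", source, "-o", output]

# e.g.: subprocess.run(zig_wasm_command("hello.c", "hello.wasm"), check=True)
```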
john-tells-all: Yes. And, a bad test -- that passes because it's defined to pass -- is _much worse_ than no test at all. It makes you think an edge case is "covered" with a meaningful check.

Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to think if it's a bad test...
mohsen1: I've experimented with agentic coding/engineering a lot recently. My observation is that software that is easily tested is perfect for this sort of agentic loop.

In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary), so each iteration should make a dent, otherwise the attempt is thrown out.

Lessons I learned from my attempts:

- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything; if you don't have a way of validating the work, the loop will go astray
- Let the iterations experiment. Let AI explore ideas and break things in its experiments. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as a scratch pad between sessions so each iteration in the loop can learn from previous experiments and attempts

[1] https://github.com/mohsen1/fesh
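The round-trip validation described above can be sketched as a tiny harness (zlib here is just a stand-in for whatever compressor an agent iteration produces):

```python
# Minimal round-trip harness: accept an iteration only if decompression
# reproduces the original bytes exactly; otherwise throw the attempt out.
import zlib

def validate_iteration(original: bytes, compress, decompress):
    """Return the compressed size if the round trip is lossless, else None."""
    blob = compress(original)
    if decompress(blob) != original:
        return None  # lossy or broken attempt: discard it
    return len(blob)

data = b"agentic loops need a hard validation signal " * 100
size = validate_iteration(data, lambda b: zlib.compress(b, 9), zlib.decompress)
# A successful iteration must also make a dent: size < len(data)
```

Because the check is binary and cheap, the agent can run it after every experiment and throw away anything that fails it.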
CloakHQ: The test harness point is the one that really sticks for me too. We've been using agentic loops for browser automation work, and the domain has a natural validation signal: either the browser session behaves the way a real user would, or it doesn't. That binary feedback closes the loop really cleanly.

The tricky part in our case is that "behaves correctly" has two layers - functional (did it navigate correctly?) and behavioral (does it look human to detection systems?). Agents are fine with the first layer but have no intuition for the second. Injecting behavioral validation into the loop was the thing that actually made it useful.

The .md scratch pad between sessions is underrated. We ended up formalizing it into a short decisions log - not a summary of what happened, just the non-obvious choices and why. The difference between "we tried X" and "we tried X, it failed because Y, so we use Z instead" is huge for the next session.
Schlagbohrer: What are you developing that technology for?
CloakHQ: browser automation at scale - specifically the problem of running many isolated browser sessions that each look like distinct, real users to detection systems. the behavioral validation layer I mentioned is the part that makes agentic loops actually useful for this: the agent needs to know not just "did the task succeed" but "did it succeed without triggering signals that would get the session flagged".

the interesting engineering problem is that the two feedback loops run on different timescales - functional feedback is immediate (did the click work?) but behavioral feedback is lagged and probabilistic (the session might get flagged 10 requests from now based on something that happened 5 requests ago). teaching an agent to reason about that second loop is the unsolved part.
jpadkins: so spam?
CloakHQ: fair question. i shared a technical experience because it was directly relevant to the test harness discussion - the behavioral vs functional validation layers, the lagged feedback problem. if that reads as promotion, i get it, but it wasn't the intent. the engineering problem is real regardless of who's solving it.
JustResign: They weren't saying your _post_ was spam. They're saying you build tools for spammers.

Because that's what they'll be used for.
ElectricalUnion: Common business-oriented language (COBOL) is a high-level, English-like, compiled programming language.

COBOL's promise was that it was human-like text, so we wouldn't need programmers anymore.

The problem is that the average person doesn't know what their actual problems are in sufficient detail to get a working solution. When you get down to breaking down that problem... you become a programmer.

The main lesson of COBOL is that it isn't the computer interface/language that necessitates a programmer.
mexicocitinluez: > The problem is that the average person doesn't know what their actual problems are in sufficient detail to get a working solution. When you get down to breaking down that problem... you become a programmer.

Agreed. I've spent the last few years building an EMR at an actual agency, and the idea that users know what they want and can articulate it to a degree that won't require ANY technical decisions is pure fantasy in my experience.
monooso: I'm confused. Are you criticising the article, or simply expressing concern for what may happen?The context suggests the former, but your criticisms bear no relation to the linked content. If anything, your edict to "write tests first" is even more succinctly expressed as "Red/green TDD".
lbreakjai: But it is related, isn't it? I wrote "...each swearing they have the secret sauce and the right incantations...". Now compare it to "'Use red/green TDD' is a pleasingly succinct way to get better results out of a coding agent."

Doesn't it sound like the "right incantation"? That's the point of LLMs, they can understand (*) intent. You'd get the same result saying "do tdd" or "do the stuff everyone says they do but they don't, with the failing test first, don't remember the name, but you know what I'm saying innit?"

I'm perhaps uncharitable, and this article just happens to take the collateral damage, but I'm starting to see the same corruption that turned "At regular intervals, the team reflects on how to become more effective" into "Mandatory retro exactly once every fortnight, on a board with precisely three columns".
slaye: Simon, if you're reading this, I'd be really curious to hear your thoughts on how to effectively conduct code reviews in a world where "code is cheap".

One of the biggest struggles I have on my team is coworkers straight up vibing parts of the code and not understanding or guiding the architecture of subsystems.

Then when I go through the code and provide extensive feedback (mostly architectural and highlighting odd inconsistencies with the code additions), I'm met with much pushback because "it works, why change it"? Not to mention the sheer size of PRs ballooning in recent months.

The end result is me being the bottleneck because I can't keep up with the "pace" of code being generated, and feeling a lot of discomfort and pressure to lower my standards.

I've thought about using a code review agent to review and act as my proxy, but not being able to control the exact output worries me. And I don't like the lack of human touch it provides. Maybe someone has advice on a humane way to handle this problem.
maciusr: There's a recurring theme in these agentic engineering threads that is worth calling out: the lessons are almost always stated as universal – but are deeply dependent on team size, codebase maturity, test coverage, and risk tolerance. What gets presented as a “win” for a well-instrumented backend service could easily guide those working on UI-heavy or old code down the wrong path. The art of this might be less about discovering the correct pattern, and more about truthfully declaring when a pattern applies.
jvidalv: I work as a consultant, so I navigate different codebases: old to new, TypeScript to JavaScript, massive to small, frontend only to full stack. The Claude Code experience is massively different depending on the codebase.

Good E2E, strongly typed codebase? Can one-shot any feature; some small QA, some polishing, and it's usually good to ship.

Plain JavaScript? Object oriented? Injection? Overall magic? Claude can work there but it's not a pleasant experience and I wouldn't say it accelerates you that much.
charlieflowers: We are going to start seeing that be the primary selection criterion. Pick a stack that agents are good at.
63stack: People are rushing to be the first one to coin something and hit it big. Imagine the amount of $$$ you could get for being an "expert AI consultant" in this space.

There was already another attempt at agentic patterns earlier: https://agentic-patterns.com/

Absolute hot air garbage.
simonw: Which pieces of my writing are garbage?
andy_ppp: They won't have a decent response, this is the Internet after all. I really enjoyed it thanks for writing it and I'll take a lot of it onboard. I think everyone will have their own software stack and AIs designed perfectly for them to do their work in the future.
simonw: This is genuinely one of the most interesting questions right now. I don't have solid answers yet, and I'm very keen to learn what people are finding works.

If you accelerate the pace of code creation it inevitably creates bottlenecks elsewhere. Code review is by far the biggest of those right now.

There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

But I don't want to lower my standards! I want the code I'm producing with coding agents to be better than the code I would produce without them.

There are some aspects of code review that you cannot skimp on. Things like coding standards may not matter as much, but security review will never be optional.

I've recently been wondering what we can learn from security teams at large companies. Once you have dozens or hundreds of teams shipping features at the same time - teams with varying levels of experience - you can no longer trust those teams not to make mistakes. I expect that the same strategies used by security teams at Facebook/Google-scale organizations could now be relevant to smaller organizations where coding agents are responsible for increasing amounts of code.

Generally though I think this is very much an unsolved problem. I hope to document the effective patterns for this as they emerge.
yonaguska: Can you document the hard architectural requirements of your codebase? And keep it up to date? If you can do that, you can force your coworkers to always use those requirements during their prompting/planning for their implementations, and you can feed that to an agent and have it review the code.

But more proactively, if people aren't going to write their own code, I think there needs to be a review process around their prompts, before they generate any code at all. Make this a formal process: generate the task list you're going to feed to your LLM, write a spec, and that should be reviewed. This is not a substitute for code reviews, but it tends to ensure that there are only nitpick issues left, not major violations of how the system is intended to be architected.
jermaustin1: I suggest "AIgile" for brevity.
kaycey2022: Agile Intelligence
fzaninotto: Is "Agentic Engineering" is the new name for "Agent Experience"? If so, and even though I love Simon's contributions, there are many other guides to making codebases more welcoming to agents...Shameless plug: I wrote one. https://marmelab.com/blog/2026/01/21/agent-experience.html
malexw: I think Martin Fowler's "Refactoring" might give a bit of insight here. One of my take-aways after reading that book is that the specific implementation of a function is not very important if you have tests. He argues that it can sometimes be easier to completely re-write a function than to take the time to understand it - as long as you can validate that your re-write performs the same way. This mindset lines up pretty closely with how I've been using LLMs.

If that's true, then I would think the emphasis in code review should be more on test quality and verifying that the spec is captured accurately, and as you suggest, the actual implementation is less important.
shreddd24: Absolutely great work. I have been mostly just thinking about what you are already practicing. I think your site will become an invaluable source for software engineers who want to responsibly apply AI in their development flow.

For a high-level description of what this new way of engineering is about: https://substack.com/@shreddd/p-189554031
Terr_: I predict the main democratization change is going to be how easily people can make plumbing that doesn't require--or at least doesn't obviously require--such specificity or mental modeling of the business domain.

For example, "Generate me some repeatable code to ask system X for data about Y, pull out value Z, and submit it to system W."
cma256: > There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

Agree with everything else you said except this. In my opinion, this assumes code becomes more like a consumable as code-production costs reduce. But I don't think that's the case. Incorrect, but not visibly incorrect, code will sit in place for years.
simonw: > Agree with everything else you said except this.

Yeah, I'm not sure I agree with what I said there myself!

> Incorrect, but not visibly incorrect, code will sit in place for years.

If you let incorrect code sit in place for years I think that suggests a gap in your wider process somewhere. I'm still trying to figure out what closing those gaps looks like.

The StrongDM pattern is interesting - having an ongoing swarm of testing agents which hammer away at a staging cluster trying different things and noting stuff that breaks. Effectively an agent-driven QA team.

I'm not going to add that to the guide until I've heard about it working for other teams and experienced it myself though!
Balgair: This kinda gets into the idea of AIs as droids, right?

So, you have a code-writing droid that is aligned towards writing good clean code that humans can read. Then you have an implementation droid that goes into actually launching and running the code and is aligned with business needs and expenses. And you have a QA droid that stress tests the code and is aligned with the hacker mindset and is just slightly evil, so to speak.

Each droid is working toward good code, but they are also independent and adversarial in the day to day.
andresquez: I see a lot of people complaining that every day there are 100 new frameworks for “agent teams”, prompting styles, workflows, and everyone insists theirs is the best for one reason or another. It reminds me a lot of early software engineering: every team had its own way of doing things, we experimented with tons of methodologies (waterfall, agile, etc.), and over time a few patterns became widely adopted (scrum, PM roles, architects, tickets, rituals). It feels like we’re in that same messy exploration phase right now.

And actually, these tools do work, because 99% of people still don’t really know how to prompt agents well and end up doing things like “pls fix this, it’s not working”.

One thing that worked well for us was going back to how a human team would approach it: write a product spec first (expected behavior, constraints, acceptance criteria, etc.), use AI to refine that spec, and only then hand it to an opinionated flow of agents that reflect a human team to implement.
nishantjani10: I primarily use AI for understanding codebases myself. My prompt is:

"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persisted into codebase.md"

Both the concise and structured nature (code snippets) of the responses help me gain knowledge of the entire codebase as I progressively ask more complex questions about it.
onionisafruit: I tried a slight variation of your prompt after reading this. It worked marvelously. Quick, correct answers instead of waiting for it to do exploration for each answer.
monooso: I view it as a collection of potentially helpful tips which have worked well for the author, which is exactly how it's presented.There's no suggestion that this is The Only Blessed Way.
keithnz: Agent-based code reviews are what you want. But you have to set it up with really good context about what is wanted. You then review the reviews, and keep improving the context it is working with. Make sure it's put into the global context everyone works with as well.

Weirdly, this article doesn't really talk about the main agentic pattern:

- Plan (really important to start with a plan before code changes). Iteratively build a plan to implement something. You can also have a collective review of the plan: make sure it's what you want and there is guidance about how it should implement in terms of architecture (it should also be pulling in pre-existing context about your architecture/coding standards) and what testing should be built. Make sure the agent reviews the plan; ask the agent to make suggestions and ask questions
- Execute. Make the agent (or multiple agents) execute on the plan
- Test / Fix cycle
- Code Review / Refactor
- Generate Test Guidance for QA

Then your deliverables are Code / Feature context documentation / Test Guidance, plus evolving your global/project context
ramoz: > what testing should be built

Yeah, a big part of my planning has included what verification steps will be necessary along the way or at the end. No plan gets executed without that, and I often ask for specific focus on this aspect in plan mode.
keithnz: yeah, spending a bunch of time with the plan is really worthwhile; nearly all aspects of it deserve attention. Getting it to think about edge cases and all the scenarios for testing pays off: what can be automated, what manual testing should be done. It's often when working through testing scenarios that I see gaps in the plan.
benrutter: I use AI in my workflow mostly for simple boilerplate, or to troubleshoot issues/docs.

I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for).

I hear a lot of people saying the same, but similarly a bunch of people I respect say they barely write code anymore. It feels a little tricky to square these up sometimes.

Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
lumpilumpi: My experience is that the first-iteration output from a single agent is not what I want to be on the hook for. What squares it for me with "not writing code anymore" is the iterative process to improve outputs:

1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me.
2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
rsynnott: Is that… actually faster than just doing it yourself, tho? Like, “I could write the right thing, or I could have this robot write the wrong thing and then nag it til it corrects itself” seems to suggest a fairly obvious choice.I’ve yet to see these things do well on anything but trivial boilerplate.
fragmede: Think of it like installing Linux. The first time it's absolutely not worth it from a time perspective. But after you've installed it once, you can reuse that installation, and eventually it makes sense and becomes second nature. Eventually that time investment pays dividends. Just like Linux tho, no one's going to force to you to install it and you'll probably go on to have a fine career without ever having touched the stuff.
didgeoridoo: I don’t know, Simon has had a pretty sane and level head on his shoulders on this stuff. To my mind he’s earned the right to be taken seriously when talking about approaches he has found helpful.
pc86: "It works, why change it?" is a horrible attitude but is an organizational and interpersonal problem, not a technical one. They're only 1/3 of the way done according to Kent Beck.¹There are plenty of orgs using AI who still care about architecture and having easily human-readable, human-maintainable code. Maybe that's becoming an anachronism, and those firms will go the way of the Brontosaurus. Maybe it will be a competitive advantage. Who knows?¹ "Make it work, make it right, make it fast."
ep103: Counter-point: developers that get used to not caring about function implementation are going to culturally also not care as much about test implementation, making this proposed ideal impossible.
lmf4lol: With LLMs, tests cost almost nothing in effort but provide tremendous value.
contagiousflow: And you know those tests are correct how?
simonw: Look at what they are testing.
ornornor: I’m running into this problem as well, with juniors slinging code that takes me a very long time to understand and review. I’m iterating on an AGENTS.md file to share with them because they aren’t going to stop using AI and I’m a little tired of always saying the same things (Claude loves to mock everything and assert that spies were called X times with Y arguments, which is a great recipe for brittle tests, for example).

I know they won’t stop using AI, so giving them a directives file that I’ve tried out might at least increase the quality of the output and lower my reviewing burden.

Open to other ideas too :)
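To make the mocking complaint concrete, a hypothetical before/after (the mailer API here is invented): the spy-style test pins incidental call details and breaks on harmless refactors, while the behavioral one pins only the outcome:

```python
# Hypothetical example of a brittle spy-style test versus a sturdier
# behavioral assertion. The mailer API is invented for illustration.
from unittest import mock

def send_welcome(mailer, user):
    mailer.send(to=user["email"], subject="Welcome!")

def test_spy_style_brittle():
    # Brittle: pins the exact call shape; switching to positional args
    # or adding a retry would break this without any behavior change.
    mailer = mock.Mock()
    send_welcome(mailer, {"email": "a@example.com"})
    mailer.send.assert_called_once_with(to="a@example.com", subject="Welcome!")

def test_behavioral_sturdier():
    # Sturdier: assert the observable outcome (one mail, right recipient)
    # and ignore incidental call details.
    sent = []
    class FakeMailer:
        def send(self, **kwargs):
            sent.append(kwargs)
    send_welcome(FakeMailer(), {"email": "a@example.com"})
    assert len(sent) == 1 and sent[0]["to"] == "a@example.com"
```

A directives file could state this as a rule: prefer small fakes that record outcomes over asserting exact spy call signatures.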
esafak: Have an AI reviewer take a first crack at it after pointing it to your rules file (e.g., AGENTS) so you don't have to repeat yourself. Gemini does this fairly well, for example. https://developers.google.com/gemini-code-assist/docs/review...
TeeWEE: We make the creator of the PR responsible for the code. Meaning they must understand it.

Also, we only allow engineers to commit (agent-generated) code. Designers just come up with suggestions; engineers take it and ensure it fits our architecture.

We do have a huge codebase. We are teaching Claude Code with CLAUDE.md's and now also <feature>.spec.md (often a summary of the implementation plan).

In the end, engineers are responsible.
esafak: Code review should be mandatory, and reviewers should ask for big PRs to be broken up and for their submitters to be able to defend every line of code. When the computer is generating the code, the most important duty of the submitter is to vouch for it. To do otherwise creates the bad incentive of making others do all your QA, and nobody is going to be rewarded for that.
simonw: I just added a chapter which touches on that: https://simonwillison.net/guides/agentic-engineering-pattern...
simonw: I'm still trying to figure out how to write about planning.

The problem is Claude Code has a planning mode baked in, which works really well but is quite custom to how Claude Code likes to do things.

When I describe it as a pattern I want to stretch a little beyond the current default implementation in one of the most popular coding agents.
eterps: You could have a look at: https://github.com/jurriaan/aico

It does two things that are very important: 1) reviewing should not be done last, but during the process, and 2) plans should result in verifiable specs, preferably in natural language, so you can avoid locking yourself into specific implementation details (the "how") too early.
jgraettinger1: Maintaining a high-quality requirements/specification document for large features prior to implementation, and then referencing it in "plan mode" prompts, feels like consensus best practice at this stage.

However, a thing I'm finding quite valuable in my own workflows, but haven't seen much discussion of, is spending meaningful time with AI doing meta-planning of that document. For example, I'll spend many sessions partnered with AI just iterating on the draft document, asking it to think through details, play contrarian, surface alternatives, poke holes, identify points of confusion, etc. It's been so helpful for rapidly exploring a design space, and I frequently find it makes suggestions that are genuinely surprising or change my perspective about what we should build.

I know we're "done" when I thoroughly understand it, a fresh AI instance seems to really understand it (as evaluated by interrogating it), and neither of us can find anything meaningful to improve. At that point we move to implementation, and the actual code writing falls out pretty seamlessly. Plus, there's a high-quality requirements document as a long-lived artifact.

Obviously this is a heavyweight process, but it is suited to my domain and work.

ETA one additional practice: if the agent gets confused during implementation or otherwise, I find it's almost always due to a latent confusion in the requirements. Ask the agent why it did a thing, figure out how to clarify the requirements, and try again from the top rather than putting effort into steering the current session.
ramoz: > consensus best practice

I'm not sure I agree with this. I don't think there needs to be a whole spec & documentation process before plan mode.

There is alternative thought leadership that the waterfall approach for building out projects is not the right agentic pattern [1].

Planning itself can be such an intensive process, where you're designing and figuring out the specs on the fly in a focused manner for the thing the agent will actually develop next. I'm not sure how useful it is to go beyond this in terms of specs that live outside of the agentic loop for what should be developed now and next.

I've evolved my own process, originally from plain Claude Code to Claude Code with heavy spec-integrated capabilities. However, that became a burden for me: a lot of contextual drift in those documents, and then self-managing & orchestrating Claude Code over those documents. I've since reoriented myself to base Claude Code with a fairly high-level effort specific to ad-hoc planning sessions. Sometimes the plans will revolve around specific GitHub issues or feature requests in the ticketing system, but that's about it.

[1] https://boristane.com/blog/the-software-development-lifecycl...
tshaddox: Do you have an example of the tautological tests you're referring to? What comes to mind for me is genuinely logically tautological tests, like "assert(true || expectedResult == actualResult)", which is a mistake I don't even expect modern AI coding tools to make. But I suspect you're talking about a subtler type of test which at first glance appears useful but actually isn't.
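One subtler shape (a hypothetical example, not from the thread): the expected value is re-derived with the same expression the implementation uses, so implementation and test are wrong together:

```python
# Hypothetical example of a subtler tautology: the test re-derives the
# expected value with the same formula as the code under test.

def monthly_rate(annual_rate):
    return annual_rate / 12  # suppose the spec actually wanted compounding

def test_monthly_rate_tautological():
    # Looks plausible at a glance, but mirrors the implementation exactly:
    # if "/ 12" is the wrong formula, both sides are wrong together.
    annual = 0.12
    assert monthly_rate(annual) == annual / 12

def test_monthly_rate_meaningful():
    # Better: pin an independently computed value from the spec.
    assert abs(monthly_rate(0.12) - 0.01) < 1e-12
```

The giveaway in review is structural: the test's right-hand side shares its shape with the function body rather than with the requirements.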
adampunk: I don’t have examples, but I have an LLM-driven project with like… 2500 tests and I regularly need to prune:

* no-op tests
* unit tests labeled as integration tests
* tests set to skip because they were failing and the agent didn’t want to fix them
* tests that can never fail

Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working with spec + chat and don’t have extra testing harnesses.
jerf: Worse yet, the problems are going to be real.

There's a lifecycle to these hype runs, even when the thing behind the hype is plenty real. We're still in the phase where if you criticize AI you get told you don't "get it", so people are holding back some of their criticisms because they won't be received well. In this case, I'm not talking about the criticisms of the people standing back and taking shots at the tech; I'm talking about the criticisms of those heavily using it.

At some point, the dam will break, and it will become acceptable, if not fashionable, to talk about the real problems the tech is creating. Right now there is only the tiniest trickle from the folk who just don't care how they are perceived, but once it becomes acceptable it'll be a flood.

And there are going to be problems that come from using vast quantities of AI on a code base, especially of the form "created so much code my AI couldn't handle it anymore and neither could any of the humans involved". There's going to need to be a discussion of techniques for how to handle this. There are going to be characteristic problems and solutions.

The thing that really makes this hard to track, though, is that the tech itself is moving faster than this cycle does. But if the exponential curve turns into a sigmoid curve, we're going to start hearing about these problems. If we just get a few more incremental improvements on what we have now, there absolutely are going to be patterns as to how to use AI and some very strong anti-patterns that we'll discover, and there will be consultants, and little companies that will specialize in fixing the problems, and people who propose buzzword solutions and give lots of talks about them and attract an annoying following online, and all that jazz. Unless AI proceeds to the point that it can completely replace a senior engineer from top to bottom, this is inevitable.
MattGrommes: There's already BMAD - Breakthrough Method of Agile Agent Driven Development.

Basically, it's Waterfall for Agents. Lots of Capitalized Words to signify something.

Also, they constantly call it the BMAD Method, even though the M already stands for Method.
layer8: > I'm met with much pushback because "it works, why change it"?

This is an educational problem, and is unlikely to be easy to fix in your team (though I might be wrong). I would suggest changing to a team or company with a culture that values being able to reason about one’s software.
dgunay: A lot of this is just things that high-functioning human teams were already doing: automate testing, explain your PRs to guide reviewers, demoing work, not just throwing bad code over the wall during code review, etc.
keeda: I'm not sure what this comment is addressing; I didn't find any fancy terms in TFA. If it's the title of the article itself, it seems simpler than "Things that help writing code effectively with AI agents."

> You can just ask it to do what you want.

Yes, but very clearly, as any HN thread on AI shows, different people are having VERY different outcomes with it. And I suspect it is largely the misconception that it will magically "just do what you want" that leads to poor outcomes.

The techniques mentioned -- coding, docs, modularity, etc. -- may seem obvious now, but only recently did we realize that the primary principle emerging is "what's good for humans is good for agents." That was not at all obvious when we started off. It is doubly counter-intuitive given that the foremost caveat has been "Don't anthropomorphize AI." I'm finding that is actually a decent way to understand these models. They are unnervingly like us, yet not like us.

All that to say, AI is essentially black magic and it is not yet obvious how to use it well for all people and all use-cases, so yes, more exposition is warranted.
luca-ctx: > I don't let LLMs write text for my blog.

Thank you, Simon. I'm sure you would quickly fall from #1 blogger on HN if you did. I insist on this for myself as well.

Somehow we are all getting really good at detecting "written by AI" with primal intuition.
atomicUpdate: > When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

I don't care how cheap it is to replace the incorrect code when it's modifying my bank account or keeping my lights on.
pixl97: Oh, don't worry, even before AI the companies in question were already outsourcing a lot of this to the cheapest companies they could find. We are just very very lucky most of the problems incurred get caught before being foisted on the wider world.
aprdm: These are just agents with a different name? People have already been working like that.
pixl97: Theoretically I'd want a totally different model cross checking the work at some point, since much like an individual may have blind spots, so will a model.
pixl97: > Doesn't it sound like the "right incantation"?

It sounds like you have a misunderstanding of what LLMs are and can do.

Imagine that you only get one first interaction with a person you're having build something, and you're trying to minimize the amount of back and forth. For humans this can be something like an instruction manual. If you've put together more than a few things you quickly realize that instruction manuals vary highly in quality; some will make your life much easier and others will leave you confused.

Lastly, (human) intent is a social construct. The more closely you're aligned with the entity in question, the more it's apt to fully comprehend your intent. This is partially the reason why, when you throw a project at workers in your office, they tend to get it right, and when you throw it at the overseas team you'll have to check in a lot more to ensure it's not going off the rails.
MattGrommes: A related book I've been thinking about in terms of LLMs is "Working Effectively With Legacy Code". I'd love to be able to work a lot of that advice into some kind of Skill or customized agent to help with big refactors.
shubhamintech: The test harness point is spot on but there's a gap worth naming: the failure modes you write evals for aren't the ones that cause users to churn. Prod conversations have a whole category where the agent doesn't error, it just confidently goes sideways in a way nobody wrote a test for. The teams actually retaining users from AI products are reading conversations, not just dashboards.
CloakHQ: That's a fair concern to raise. Any tool that helps browsers look more human can be misused.

The actual use cases we see are mostly legitimate automation: QA teams testing geo-specific flows, price monitoring, research pipelines that need to run at scale without getting rate-limited on the first request. The same problem space as curl-impersonate or playwright-extra, just at the session management layer.

Could someone use it for spam? Technically yes, same as they could with any headless browser setup. But spam operations generally don't need sophisticated fingerprinting; they're volume plays that work fine with basic tools. The people who need real browser isolation are usually the ones doing something that has a legitimate reason to look human.
AndyKelley: It's not a great tip, because there are features that exist specifically to reduce development iteration cycle latency without compiling for the wrong target. Please refer to https://ziglang.org/download/0.15.1/release-notes.html#Incre...

This has nothing to do with agentic engineering. This is just normal software development. Everybody wants faster compilation speed.
tveita: I've definitely seen Opus go to town when asked to test a fairly simple builder. Possibly it inferred something about testing the "contract", and went on to test such properties as:

- none of the "final" fields have changed after calling each method
- these two immutable objects we just confirmed differ on a property are not the same object

In addition to multiple tests with essentially identical code, multiple test classes with largely duplicated tests, etc.
MartyMcBot: The .md scratch pad point is underrated, and the format matters more than people realize.

Summaries ("tried X, tried Y, settled on Z") are better than nothing, but the next iteration can mostly reconstruct them from test results anyway. What's actually irreplaceable is the constraint log: "approach B rejected because latency spikes above N ms on target hardware" means the agent doesn't re-propose B the next session. Without it, every iteration rediscovers the same dead ends.

I ended up splitting it into decisions.md and rejections.md. Counter-intuitively, rejections.md turned out to be the more useful file. The decisions are visible in the code. The rejections are invisible, and invisible constraints are exactly what agents repeatedly violate.
sarkarsh: This is the underrated insight in the whole thread. "Approach B rejected because latency spikes above N ms" is the kind of context that saves hours of re-exploration every new session.

The problem I kept hitting was that flat markdown constraint logs don't scale past ~50 entries. The agent has to re-read the entire log to know what was already tried, which eats context window and slows generation. And once you have multiple agents in parallel, each maintaining their own constraint log, you get drift: agent A rejects approach B, agent C re-proposes it because it never saw agent A's log.

What worked for me was moving constraint logs to append-only log blocks that agents query through MCP rather than re-read as prose. I've been using ctlsurf for this: the agent appends "approach B rejected, latency > N ms" to a log block, and any agent can call query_log(action='approach_rejected') to see what's been ruled out. A state store handles "which modules are claimed" as a key-value lookup.

Structured queries mean agents don't re-read the whole history; they ask specific questions about what's been tried.
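For readers who want the shape of this idea without a specific tool (ctlsurf is the commenter's; the `record`/`query_log` helpers below are invented for illustration), here is a minimal sketch using a plain JSONL file as the append-only log:

```python
import json
from pathlib import Path

LOG = Path("rejections.jsonl")

def record(action: str, **fields) -> None:
    # Append one structured entry; agents never rewrite history.
    with LOG.open("a") as f:
        f.write(json.dumps({"action": action, **fields}) + "\n")

def query_log(action: str) -> list[dict]:
    # Answer a specific question instead of re-reading the log as prose.
    if not LOG.exists():
        return []
    return [entry for line in LOG.read_text().splitlines()
            if (entry := json.loads(line))["action"] == action]

record("approach_rejected", approach="B",
       reason="latency spikes above 50 ms on target hardware")
rejected = query_log("approach_rejected")
```

The point is only the access pattern: writes are appends, reads are filtered queries, so the context cost of checking "was B already ruled out?" stays constant as the log grows.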
toraway: BTW, check the comment history of the above account "sarkarsh"; this is almost certainly an LLM replying with the exact same structure/format in all their comments: "This is the underrated insight in the whole thread". From comment history: "This is good advice but it highlights the real issue", "shich's point about simulator mandates is the sharpest thing in this thread", "esafak's cache economics point is underrated", etc.

I'm also pretty confident the "MartyMcBot" account they're replying to is also a bot, but it's too new an account to say for sure: "the .md scratch pad point is underrated, and the format matters more than people realize". Plus the dead "openclaw" reply in this thread is another bot that also happened to use "underrated": "The negative constraints thing is also underrated". CloakHQ is also probably a bot; their entire comment history follows the same structure as their comment in this thread: "The .md scratch pad between sessions is underrated", "The test harness point is the one that really sticks for me too".

So far that's 3+ bot accounts I've seen in a single thread. The "Agentic" in the title, or simonw as author, may be a tempting target for people to throw their agents/claws at, or it's like catnip for them naturally.

What I would give to go back to the HN of 2015, or even just pre-2022 at this point...
ben30: I contribute to an open source spec-based project management tool. I spend about a day iterating back and forth on a spec, using AI to refine the spec itself, sometimes feeding it in and out of Claude/Gemini and telling each where the feedback has come from. The spec is the value. Using the AI PM tool I break it down into n tasks and sub-tasks with dependencies. I then trigger Claude in teams mode to accomplish the project. It can be left alone overnight. I wake up in the morning with n PRs merged.
Denzel: Mind linking the project so we can see the PRs?
alkonaut: This seems like it should be very easy to validate. Force the AI to make minimal changes to the code under test, such that a single test (or as few as possible) fails as a result. If it can't make any test fail at all, the test is useless.
jihadjihad: Agreed, and that's why I think adding some example prompts and ideas to the Testing section would be helpful. A vanilla-prompted LLM, in my experience, is very unreliable at adding tests that fail when the changes are reverted.

Many times I've observed that the tests added by the model simply pass as part of the changes, but still pass even when those changes are no longer applied.
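To make the failure mode concrete (the function and both tests here are hypothetical): a tautological test asserts the implementation against itself and can never fail, while a meaningful one pins an independently known expected value, so reverting or breaking the implementation makes it fail.

```python
def apply_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)

def test_discount_tautological():
    # Passes no matter what apply_discount actually does.
    assert apply_discount(100, 10) == apply_discount(100, 10)

def test_discount_meaningful():
    # Fails if apply_discount stops returning 90.0 for this input.
    assert apply_discount(100, 10) == 90.0

test_discount_tautological()
test_discount_meaningful()
```

A simple check when reviewing model-written tests: revert (or mutate) the change and re-run the suite; any new test that still passes belongs to the first category.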
simonw: I had an example in that section but it got picked apart by pedants (who had good points) so I removed it. I plan to add another soon. You can still see it in the changelog: https://simonwillison.net/guides/agentic-engineering-pattern...
sn9: Matt Pocock has a nice TDD skill he's made available [0][1].[0] https://www.aihero.dev/skill-test-driven-development-claude-...[1] https://github.com/mattpocock/skills/blob/main/tdd/SKILL.md
lenocinor: If you’re ok with it, I think emailing hn@ycombinator.com with this (which dang and the other mods read) would also be good.
epolanski: Fire them. Easy. They have to be responsible for what they push.
camgunz: > wow that's a lot of code, how will we ever review it?>> have a model generate a bunch of tests instead> wow that's a lot of test code, how will we know it's working correctly?>> review it> :face-with-rolling-eyes:
pkorzeniewski: One thing I rarely see mentioned is that often creating code by hand is simply faster (at least for me) than using AI. Creating a plan for AI, waiting for execution, verifying, prompting again etc. can take more time than just doing it on my own with a plan in my head (and maybe some notes). Creating something from scratch or doing advanced refactoring is almost always faster with AI, but most of my daily tasks are bugs or features that are 10% coding and 90% knowing how to do it.
atrevbot: I definitely agree with this and have experienced it as well. Having said that, I wonder if the prevalence and usefulness of AI will make those types of features fewer as intimate knowledge of the codebase decreases.
anukin: I don't think these kinds of outbursts from some random guy on HN require your response.

You have helped a lot of people, from junior to staff+ level, understand how to use agents for software engineering using simple language. Calling it garbage is a gross injustice to the work you put out.
noddingham: I'd choose a different word for the title of "Hoard Things You Know How to Do". Hoarding is the opposite of what we want to do, but I get from reading the section that you mean creating a collection you can draw upon. IMO "Share" is a much better word choice.
scuff3d: In a business (or any large project setting) where there are real users and real risk involved, code can't move into a code base any faster than it can be reviewed by a human. Period. I apply the exact same standards to PRs for AI-assisted code as I do for human-written code. If the code is crap, the PR is too large, or the dev can't explain it, it gets rejected. End of story. We are a long way away from the need for human review going away.
JetSetIlly: I've heard people say that these coding agents are just tools and don't replace the thinking. That's fine, but the problem for me is that the act of coding is when I do my thinking!

I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone or something else to do that doesn't help me.

But different strokes for different folks, I suppose.
scuff3d: I'm similar, but I do find some natural places where LLMs can be helpful.

Just today I was working on something that involves a decent amount of configuration. It's in Python, unfortunately, and I hate passing around dictionaries for configs; I usually like to parse the JSON or YAML or whatever into a config class so I have a natural way to validate and access without just throwing strings around.

As I was playing with the code for the actual work that needs to be done, I was thinking about what configs I needed and what structure made sense. Once I knew what I needed, I gave the JSON to an LLM with some instructions regarding helper functions and told it to give me the appropriate Python code. It's just a bunch of dataclasses with some from_dict or from_string methods on them, not interesting or difficult to write. It freed me up to keep working on the real problem.
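A minimal sketch of the kind of code this describes, parsing a JSON config into validated dataclasses instead of passing raw dicts around; every name here is illustrative, not from the commenter's project:

```python
import json
from dataclasses import dataclass

@dataclass
class DatabaseConfig:
    host: str
    port: int

@dataclass
class AppConfig:
    debug: bool
    database: DatabaseConfig

    @classmethod
    def from_dict(cls, d: dict) -> "AppConfig":
        # Fail loudly on missing keys, rather than deep inside the app
        # when someone finally indexes a dict that isn't there.
        missing = {"debug", "database"} - d.keys()
        if missing:
            raise ValueError(f"missing config keys: {sorted(missing)}")
        return cls(debug=bool(d["debug"]),
                   database=DatabaseConfig(**d["database"]))

    @classmethod
    def from_string(cls, s: str) -> "AppConfig":
        return cls.from_dict(json.loads(s))

cfg = AppConfig.from_string(
    '{"debug": true, "database": {"host": "localhost", "port": 5432}}'
)
```

Boilerplate like this is exactly the well-specified, low-stakes work that is easy to hand off: the structure is already decided, and a wrong answer is obvious at a glance.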
fnands: When was the last time you tried?

I think using agents for larger tasks was always very hit or miss, up to about the end of last year. In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).

My experience of what coding assistants are good for has shifted from smart autocomplete -> targeted changes/additions -> full engineering.
maccard: I'm not OP, but every time I post a comment with this sentiment I get told "the latest models are what you need". If every 3 months you are saying "it's ready as long as you use the latest model", then it wasn't ready 3 months ago and it's not likely to be ready now.

To answer your question, I've tried both Claude Code and Antigravity in the last 2 weeks and I'm still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still regularly goes on wild tangents, not actually solving the problem.
fragmede: At this point though, after Claude C Compiler, you've got to give us more details to better understand the dichotomy. What do you consider simple issues?
maccard: > At this point though, after Claude C Compiler,

Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in its readme)?

> What do you consider simple issues?

Hallucinating APIs for well-documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.

As an example, I asked Claude Code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, got stuck on an API hallucination for handling input, and, when I got it unstuck, it didn't actually work.

[0] https://github.com/anthropics/claudes-c-compiler/issues/1
fragmede: Excellent examples, thank you!

Shame Claude Code doesn't have sharable chat logs; it would be interesting to see where your Roblox exploration went off the rails.
theshrike79: > The StrongDM pattern is interesting - having an ongoing swarm of testing agents which hammer away at a staging cluster trying different things and noting stuff that breaks. Effectively an agent-driven QA team.That sounds a lot like Chaos Engineering: https://en.wikipedia.org/wiki/Chaos_engineering
malexw: Oh gosh - now that you mention it, it was "Working Effectively with Legacy Code" that I was thinking of, not "Refactoring".
theshrike79: IIRC that was the book that coined the term "Refactoring" though =)
AlexCalderAI: Great patterns here. I'd add one more critical layer that many miss: orchestration state management.

Running multiple agents concurrently (QA, content, conversions, distribution), we hit this exact wall: agents didn't know what other agents had done, creating duplicate work and missed context.

Solved it with a stupidly simple approach:

1. A single TODO.md with "DO NOW" (unblocked), "BLOCKED", and "DONE" sections
2. Named output files per agent type (qa-status.md, scout-finds.md, etc.)
3. active-tasks.md for crash recovery: breadcrumbs from interrupted runs
4. Daily memory logs with session IDs for searchability

The key: file-based state is deterministic. After a crash, the next agent reads identical input, applies the same decision rules, and produces the same output structure. Zero state collision, zero "what was I thinking?"

Deployment: ~8 agents on cron. They wake, read files, work, write results, die. No persistent terminal. No coordination overhead.

This turned "5 terminal tabs with unmanageable logs" into "grep yesterday's log, see exactly what happened."

Patterns + implementation details: https://osolobo.com/first-ai-agent-guide/
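A toy sketch of the file-based state idea. The TODO.md section names follow the comment above, but the `move_task` helper is hypothetical, just to show how an agent might claim and complete tasks by rewriting one shared file:

```python
from pathlib import Path

TODO = Path("TODO.md")

def move_task(task: str, src: str, dst: str) -> None:
    """Move '- task' from one '## SECTION' to another, rewriting the file."""
    sections: dict[str, list[str]] = {}
    current = None
    # Parse the file into {section name: task lines}.
    for line in TODO.read_text().splitlines():
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    sections[src].remove(f"- {task}")
    sections[dst].append(f"- {task}")
    # Rewrite the whole file; the next agent reads a consistent state.
    TODO.write_text("\n".join(
        f"## {name}\n" + "\n".join(items) for name, items in sections.items()
    ) + "\n")

# An agent wakes, takes the unblocked task, works, and records completion.
TODO.write_text("## DO NOW\n- ship qa report\n## BLOCKED\n## DONE\n")
move_task("ship qa report", "DO NOW", "DONE")
```

Note this sketch ignores concurrent writes; with several agents on cron you would want a lock file or atomic rename around the rewrite.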
devin: What happens when value Z is not >= X? What happens when value Z doesn't exist, but values J and K do? What should be done when...

I hear what you're saying, but I think it's going to be entertaining watching people go "I guess this is why we paid Bob all of that money all those years".
pixl97: > when value Z is not >= X?Is your AI not even doing try/catch statements, what century are you in?
devin: Did you just arrogantly suggest that my LLM should use exceptions for control flow? Funny stuff!
nwpwr: I think you can use https://traces.com for that
fantasizr: Code review is now a bit like Brandolini's law: "The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it." You ultimately need a lot of buy in to spend more than 5 mins on something that took 5 seconds to produce.
mistercheese: Yes, I think somehow we need a bulldog check gate before it even goes to a human reviewer.
mistercheese: > the most important duty of the submitter is to vouch for it

When shipping pressure comes, I've seen this be the first thing to go. Despite formalizing ownership standards, etc., people on both the submitting and reviewing end just give up on understanding AI slop when management says they need to hit a deadline.

Probably no company would actually do this, but I wonder if we should actively test the submitter's understanding of the submitted code as a prerequisite to moving a PR to ready-for-review. I'm not sure it would actually be helpful in enforcing that people understand the code, but maybe at least we'd put the cultural expectation front and center?
nightski: Right now, with agents, this is definitely going to continue to be the case. That said, at the end of the day engineers work with stakeholders to come up with a solution. I see no reason why an agent couldn't perform this role in the future. I say this as someone who is excited, but at the same time terrified, of this future and what it means for our field.

I don't think we'll get there by scaling current techniques (Dario disagrees, and he's far more qualified, albeit biased). I feel that current models are missing the critical thinking skills you need to fully take on this role.
jimbokun: There’s nothing any human can do that an AI can’t be expected to perform as well or better in the future.Maybe the Oldest Profession will be the last to go.
jimbokun: Early in my career I would sometimes be told not to worry about making the code "nice", just get it working and move on. I would nod and just write good code like I always did, knowing it didn't take longer than writing bad code, and would be much easier to modify, extend, and fix later.

I feel like there's a similar vibe coming with vibe coding: just let the AI generate as much code as it wants, and don't check it, because it doesn't matter since only the LLM will be reading it anyway.

My gut tells me that:

1. there will still be reasons for humans to understand the code for a long time, and
2. even the LLM will struggle with modifying code past a certain size and complexity without good encapsulation and well-thought-out system architecture and design.
jerf: I classify your latter points under "AIs are Finite": https://jerf.org/iri/post/2026/what_value_code_in_ai_era/
AlexCalderAI: Solid patterns here. One thing I'd add from running Claude Code in production:The "give it bash" pattern sounds scary until you realize the alternative is 47 intermediate tool calls that fail silently.Letting the agent write and run scripts means the agent debugs when something breaks. The feedback loop tightens dramatically.The trick is sandboxing + cost limits. Not preventing shell access.
mightybyte: One plausible future I can see from here is a shift in our relationship to code in high-level languages similar to what happened with assembly language when the first high-level languages were introduced. Before them, software engineers operated in assembly language. They cared about the structure of assembly code. This happened before I started my professional software career, but I can imagine that a lot of the same things we are hearing from developers today were heard back then: concern about devs producing code they didn't understand, the generated assembly not being meant to be understood by others, etc.

Now, however, we know how that played out in the case of assembly language. The fact of the matter is that only a very tiny fraction of software engineers give the structure of the compiled assembly code even passing thought. Our ability to generate assembly code is so great that we don't care about the end result. We only care about its properties, i.e. that it runs efficiently enough and does what we want. I could easily see the AI software development revolution ending up the same way. Does it really matter if the code generated by AI agents is DRY and has good design if we can easily recreate it from scratch in a matter of minutes or hours? As much as I love the craft and process of creating a beautiful codebase, I think we have to seriously consider and plan for a future where that approach is dramatically less efficient than other AI-enabled approaches.
gaigalas: The most important thing you need to understand about working with agents for coding is that you are now designing a production line. And that has (mostly) nothing to do with designing or orchestrating agents.

Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at another order of magnitude of scale. You need to be Leo Fender, though: not a talented guitarist, but definitely a technical master.

To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender equivalent of this yet. I do appreciate the attempt though.
geon: The machines in factory production lines are generally very deterministic. Not sure how well industrialisation would have worked if the machines just did whatever.
gaigalas: Again, this word "deterministic". It means nothing anymore.

When you see a sorting machine that jiggles lots of pieces so they align, that's because pieces don't align naturally. It's a fix for chaos, for things that naturally behave like "doing whatever". Industrial machinery is full of this in all sorts of places, even in precision engineering: press-fits and interference-fits, etc. We deal with lack of precision all the time.

Engineers are _absolute chads_ at this kind of thing. We tame chaos like no other profession.
geon: That’s what I’m saying. We should tame the chaos, not encourage it.The screw sorting machines don’t generally decide to start spitting out resistors instead.
vidarh: Yes, it's often faster if you sit around waiting. What I will do instead is prompt the AI to create various plans, do other stuff while it does, review and approve the plans, do other stuff while multiple plans are being implemented, and then review and revise the output.

And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, so as long as I'm not sitting around waiting it's a net win.
jplusequalt: > And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it

This is exactly the sort of future I'm afraid of, where the people who are ostensibly hired to know how stuff works outsource that understanding to their LLMs. If you don't know how the system works while building it, what are you going to do when it breaks? Continue to throw your LLM at it? At what point do you just outsource your entire brain?
vidarh: There are many layers to "knowing how stuff works". What does your manager do when your code breaks?

> Continue to throw your LLM at it?

Increasingly, yes. If you have objective acceptance criteria, just putting the LLM in a loop with a quality gate tends to have it converge on a fix itself, the same way a human would. Not always, and not always optimally, but more and more often, and with cheaper and cheaper models.

I also tend to throw in an analysis stage where it will look at what went wrong and use that to add additional criteria for the next run.
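The "LLM in a loop with a quality gate" idea can be sketched as follows. `fix_until_green` is just the shape of the loop; the toy agent and gate below are invented stand-ins for a real model call and an objective check (a test suite, linters, benchmarks):

```python
def fix_until_green(run_agent, quality_gate, max_attempts: int = 5) -> bool:
    """Re-prompt the agent with the gate's feedback until the gate passes."""
    feedback = ""
    for _ in range(max_attempts):
        run_agent(f"Fix the failing checks.\n\nPrevious gate output:\n{feedback}")
        passed, feedback = quality_gate()
        if passed:
            return True
    return False

# Toy stand-ins so the loop's shape is visible: the "agent" removes one
# bug per prompt, and the gate reports how many remain.
state = {"bugs": 2}

def toy_agent(prompt: str) -> None:
    state["bugs"] = max(0, state["bugs"] - 1)

def toy_gate() -> tuple[bool, str]:
    return state["bugs"] == 0, f"{state['bugs']} bugs remaining"
```

The two design points that matter: the gate must be objective (the agent cannot grade itself), and the gate's output is fed back as context so each attempt learns from the last.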
jplusequalt: Do you feel no shame shipping code without understanding how any of it works?
ozim: From briefly checking, the important one is: *Hoard things you know how to do*.

It will make everything faster for you: even if you can ask AI, it will be more costly to do it from scratch.

Also, it is nothing new under the sun. In the old days a developer would have his own stack of libraries and books and would not need to `npm i` someone else's code, because he would have a bunch of his own libraries ready to go. Of course one can say there will always be a library that is better than yours... but is it? :)
mergeshield: The local hill-climbing observation is the key insight here. When code generation is cheap and fast, the expensive decision becomes "should we merge this?" not "can we write this?"

That shifts where rigor needs to live. The article focuses on planning patterns before code generation, which matters. But I'd argue the merge gate is equally important and massively underinvested. Right now the merge decision for most teams is: one person clicks Approve after a quick scan. That's the same process whether the PR is a trivial config change or a critical auth refactor, and whether it came from a trusted agent or an unknown one.

The teams I've seen handle this well invest in proportional review. Not every change gets the same scrutiny. They define risk dimensions (what files changed, what agent generated it, how complex the diff is) and route PRs to different review intensities based on that score. The planning patterns in the article are upstream. The merge governance pattern is downstream, and it's where most of the production risk actually lives.

To the specification debate: I've found that detailed specs help less when you have good merge governance, because bad output gets caught and rejected at the merge gate rather than requiring perfect input at the spec stage.
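A rough sketch of what proportional review routing could look like. The dimensions, weights, and tier thresholds here are invented for illustration, not a recommendation:

```python
from dataclasses import dataclass

# Paths whose changes carry outsized production risk (illustrative).
RISKY_PATHS = ("auth/", "billing/", "migrations/")

@dataclass
class PullRequest:
    files: list[str]
    lines_changed: int
    agent_generated: bool

def review_tier(pr: PullRequest) -> str:
    score = 0
    if any(f.startswith(RISKY_PATHS) for f in pr.files):
        score += 3  # touches a high-risk area
    if pr.lines_changed > 400:
        score += 2  # large diffs get more eyes
    if pr.agent_generated:
        score += 1  # machine-generated code bumps scrutiny
    if score >= 4:
        return "two-reviewer deep review"
    if score >= 2:
        return "standard review"
    return "quick scan"
```

So a 500-line agent-generated change under `auth/` routes to a deep review, while a ten-line doc fix gets the quick scan, which is the whole point: review effort tracks risk instead of being uniform.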