Discussion
armcat: I still find it incredible how much power was unleashed by surrounding an LLM with a simple state machine and giving it access to bash
stanleykm: unfortunately all the agent cli makers have decided that simply giving it access to bash is not enough. instead we need to jam every possible functionality we can imagine into a javascript “TUI”.
MrScruff: > This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.

Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common: https://docs.z.ai/scenario-example/develop-tools/claude

It doesn't perform on par with Anthropic's models in my experience.
kamikazeturtles: > It doesn't perform on par with Anthropic's models in my experience.

Why do you think that is the case? Are Anthropic's models just better, or do they train the models to somehow work better with the harness?
crustycoder: A timely link - I've just spent the last week failing to get a ChatGPT Skill to produce a reproducible management reporting workflow. I've figured out why, and this article pretty much confirms my conclusions about the strengths & weaknesses of "pure" LLMs, and how to work around them. This article is for a slightly different problem domain, but the general problems and the architecture needed to address them seem very similar.
Yokohiii: The example is really lean and straightforward. I don't use coding agents, but this is a good overview and should help everyone understand that coding agents may produce sophisticated outcomes, but the raw interaction isn't magical at all.

It's also a good example of how you can turn a useful code component that requires 1k LOC into a mess of 500k LOC.
beshrkayali: > long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info)

I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.

I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then produces a build plan toml where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3].

1: https://github.com/ossature/ossature
2: https://github.com/beshrkayali/chomp8
3: https://github.com/ossature/ossature-examples
Yokohiii: That is why I am currently looking into building my own simple, heavily isolated coding agent. The bloat is already scary, but the bad decisions should make everyone shiver. Ten years ago people would rant endlessly about anything with more than one sharp edge that required a modicum of responsibility to use. Now everyone seems to be either in panic or hype mode, ignoring all good advice just to stay somehow relevant in a chaotic timeline.
HarHarVeryFunny: If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it, though!

For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, then ask it to change the app in your "current directory" to "sort the output before printing it", or some such request.
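The model-in-a-loop-with-a-bash-tool shape being described might be sketched like this. The `BASH:`/`DONE:` protocol and the stub model are invented purely for illustration; a real harness would call an actual LLM API and sandbox the shell:

```python
import subprocess

def run_agent(model, task, max_turns=10):
    """Minimal agent loop: feed the model the transcript so far, execute any
    bash command it emits, append the output, and repeat until it stops."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_turns):
        reply = model("\n".join(transcript))
        transcript.append(reply)
        if not reply.startswith("BASH:"):
            break  # model gave a final answer instead of a command
        cmd = reply[len("BASH:"):].strip()
        # NOTE: no sandboxing here; a real harness must isolate this.
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append(f"OUTPUT:\n{result.stdout}{result.stderr}")
    return transcript

# Stub standing in for a real LLM call, just to show the shape of the loop.
def stub_model(context):
    if "OUTPUT:" not in context:
        return "BASH: echo hello"
    return "DONE: the command printed hello"

print(run_agent(stub_model, "say hello")[-1])
```

Everything beyond this loop (TUIs, planners, permission prompts) is harness, not model.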
Yokohiii: I like it a lot; I find the chat-driven workflow very tiring, and a lot of information gets lost in translation until LLMs just refuse to be useful.

How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready-to-generate state? How high is the success/error rate when you generate from tasks to code? Do LLMs forget or mess things up, or does it feel better?

The spec-driven approach is potentially better for writing things from scratch; do you have any plans for existing code?
stanleykm: i did... and that's what i use. obviously it's a little more than just a tool that calls bash, but it is considerably less than whatever they are doing in coding agents now.
esafak: Tools gave humans the edge over other animals.
Yokohiii: And those tools regularly burnt cities to ashes. Took a long time to get it under control.
esafak: They're just dumber. I've used plenty of models. The harness is not nearly as important.
vidarh: If anything, the harness matters more with those other models because of how much dumber they are. You can compensate for some of the stupidity (but by no means all) with harnesses that work around it in ways that e.g. Claude Code does not, because that isn't necessary for Anthropic's own models.
Yokohiii: I think you're misreading him? He is already concerned about "bash on steroids", and current tools add concerning amounts of steroids to everything.
peterm4: This looks great, and I've bookmarked it to give it a go.

Any reason you've opted for custom markdown formats with the @ syntax rather than using something like frontmatter? Very conscious that this would prevent any markdown rendering in GitHub etc.
beshrkayali: Thanks!

> How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state?

Yes, the flow is: you write specs, then you validate them with `ossature validate`, which parses them and checks they are structurally sound (no LLM involved). Then you'd run `ossature audit`, which flags gaps or contradictions in the content, and from that it produces a toml build plan that you can read and edit directly before anything is generated. You can reorder tasks, add notes for the LLM, adjust verification commands, or skip steps entirely. So when you run `ossature build` to generate, the structure is already something you have signed off on.

> The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?

Right now it is best for greenfield, as you said. I have been thinking about a workflow where you generate specs from existing code and then let Ossature work from those, but I am honestly not sure that is the right model either. The harder case is when engineers want to touch both the code and the specs, and keeping those in sync through that back and forth is something I want to support but have not figured out a clean answer for yet. It's on the list; if you have any thoughts, please feel free to open an issue! I want to get through some of the issues I am seeing with the spec-editing workflow (and re-audit/re-planning) first, specifically around how changes cascade through dependent tasks.

Regarding success rate, each task requires a verification command to run and pass after generation, and if it fails, a separate fixer agent tries to repair it using the error output. The number of retry attempts is configurable. I did notice that the more concise and clear the spec is, the more likely it is for capable models to generate code that works (obviously), but that's what auditing is supposed to help with.
One interesting case with the CHIP-8 emulator I mentioned above is that even mentioning the correct name of the solution to a specific problem was not enough; I had to spell out the concrete algorithm in the spec (wrote more details here[1]). But the full prompt and response for every task is saved to disk, so when something does go wrong one can read the exact prompt/response, and the fix-attempt prompts/responses, for each task.

I wrote more details in an intro post[2] about Ossature, if useful.

1: https://log.beshr.com/chip8-emulator-from-spec/
2: https://ossature.dev/blog/introducing-ossature/