Discussion
Leanstral: Open-Source foundation for trustworthy vibe-coding
jasonjmcghee: Curious if anyone else had the same reaction as me. This model is specifically trained on this task and significantly[1] underperforms Opus. Opus costs about 6x more. Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models
patall: Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e leanstral, kimi, qwen and leanstral again instead of 4x leanstral?
lefrenchy: Does Mistral come close to Opus 4.6 with any of their models?
andai: Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either, so maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.
flowerbreeze: They haven't made the chart very clear, but it seems it has configurable passes and at 2 passes it's better than Haiku and Sonnet and at 16 passes starts closing in on Opus although it's not quite there, while consistently being less expensive than Sonnet.
kittikitti: This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.
andai: This is called an "LLM alloy". You can even do it in an agentic loop, where you simply swap the model on each LLM invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724
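A minimal sketch of the alloy idea, as I understand it: rotate through a list of models so consecutive passes come from different ones. The model names and the `run_pass` callback here are hypothetical stand-ins, not a real API.

```python
from itertools import cycle

# Hypothetical model identifiers for illustration only.
MODELS = ["leanstral", "kimi", "qwen"]

def alloy_passes(task, models, n_passes, run_pass):
    """Run n_passes attempts at `task`, rotating through `models`
    so each consecutive pass uses a different model ("LLM alloy")."""
    rotation = cycle(models)
    results = []
    for _ in range(n_passes):
        model = next(rotation)  # wraps around when models run out
        results.append(run_pass(model, task))
    return results

# Stub runner standing in for a real LLM call.
def fake_run(model, task):
    return f"{model}:{task}"

print(alloy_passes("prove lemma", MODELS, 4, fake_run))
# pass 4 wraps back to "leanstral", matching the 4x example above
```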
DarkNova6: Not at the moment, but a release of Mistral 4 seems close which likely bridges the gap.
re-thc: Mistral Small 4 is already announced.
Havoc: What are these "passes" they reference here? Haven't seen that before in LLM evals.

Could definitely be interesting for having another model run over the codebase when looking for improvements.
rockinghigh: It's the number of attempts at answering the question.
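In other words, a pass@k-style setup: the task counts as solved if any of k independent attempts verifies. A toy sketch of that counting rule (the `attempt` and `verify` callbacks are hypothetical stand-ins, e.g. a model call and a Lean proof check):

```python
def pass_at_k(attempt, verify, k):
    """Make up to k independent attempts; count the task as solved
    if any attempt passes verification (early exit on success)."""
    for _ in range(k):
        candidate = attempt()
        if verify(candidate):
            return True
    return False

# With more passes, one success anywhere is enough; with k=1 a
# single failed attempt means the task is counted as unsolved.
solved = pass_at_k(lambda: "proof", lambda c: c == "proof", k=2)
```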
chucky_z: I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. In my personal opinion, Mistral is my preferred model vendor by far at this point. They're extremely consistent between releases, while each one just feels better, and I have a strong personal preference for the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, but I usually _start_ with mistral if it's an option.
patall: That sounds quite interesting. Makes me wonder if sooner or later they will have to train multiple independent models that cover those different niches. But maybe we will see that sooner or later. Thanks for the link.
lsb: The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
cyanydeez: One would think, with LoRAs being so successful in Stable Diffusion, that more people would be focused on constructing framework-based LoRAs; but the economics of all this probably preclude going niche in any direction, so everyone just keeps building the do-all models.
tjwebbnorfolk: Mistral hasn't been in the running for SOTA for quite a while now.
elAhmo: I don’t know a single person using Mistral models.
pelagicAustral: Me neither, they're not ready for prime time imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real-world stuff and I am happy with the result; Mistral is just not something I can trust right now.
skanga: TDD == Prompt Engineering, for Agentic coding tasks.
DrewADesign: It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.
miacycle: Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations it's difficult to articulate what is correct and what isn't, depending upon the situation in which the software executes.