Discussion
jmward01: Haiku not getting an update is becoming telling. I suspect we are reaching a point where the low end models are cannibalizing high end and that isn't going to stop. How will these companies make money in a few years when even the smallest models are amazing?
bicepjai: This "card" is a 272-page report. So now we are redefining names :)
albert_e: Does the model card fit in the model's context :)
100ms:
$ pbpaste | wc -w
62508
$ pbpaste | grep -oi mythos | wc -w
331
$ pbpaste | grep -oi opus | wc -w
809
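(The same mention-counting idea as the shell one-liners above can be sketched portably in Python; the sample text and counts below are illustrative, not taken from the actual model card.)

```python
import re

def count_mentions(text: str, term: str) -> int:
    # Case-insensitive occurrence count, mirroring `grep -oi term | wc -w`
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

sample = "Mythos beats Opus 4.7; Opus 4.6 trails Mythos."
print(count_mentions(sample, "mythos"))  # 2
print(count_mentions(sample, "opus"))    # 2
```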
blixt: Isn't it pretty common for the smaller models to release a little while after the bigger ones, for all the big model providers?
jmward01: The last update for Haiku was in October, or in startup land, 10 years ago.
joeumn: I'm actually surprised at how it performed compared to 4.6 and also compared to mythos. Will be fun to use.
STRiDEX: Dumb question, but why are chemical weapons always addressed as a risk with LLMs? Is the idea that they contain instructions for making chemical weapons, or that they would guide someone through it? Wouldn't there already be websites that contain that information? How is an LLM different, I guess, from some sort of anarchist-cookbook thing?
CodingJeebus: WAG but I wonder if a hijacked LLM could also assist with figuring out how to obtain required materials, not just provide the recipe.
rgbrenner: In the same way that all coding docs are available publicly
Symmetry: > The technical error that caused accidental chain-of-thought supervision in some prior models (including Mythos Preview) was also present during the training of Claude Opus 4.7, affecting 7.8% of episodes. >_>
dkhenry: The Gemma models are at this point. A 31B model that can fit on a consumer card is as good as Sonnet 4.5. I haven't put it through as much on the coding front or tool calling as I have the Claude or GPT models, but for text processing it is on par with the frontier models.
make3: absolutely not on par you're smoking
dkhenry: You make a compelling argument, but thankfully I have data to back up my anecdotal experience. This comparison shows them neck and neck: https://benchlm.ai/compare/claude-sonnet-4-5-vs-gemma-4-31b As does this one: https://llm-stats.com/models/compare/claude-sonnet-4-6-vs-ge... And the pelican benchmark even shows them pretty close: https://simonwillison.net/2026/Apr/2/gemma-4/ https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ Also, this isn't a fringe statement; you can see that most people who have done an evaluation agree with me.
Philpax: Both. There's the risk of them instructing a user on how to produce a known formulation (the anarchist-cookbook scenario, as you say), which is irritating but not that problematic. The bigger issue is that they are potentially capable of producing novel harmful formulations and guiding someone through the process. That is, consider a world in which someone with malicious intent has access to a model as capable at chemistry/biology as Mythos is at offensive cybersecurity. This is obviously limited by the fact that the models don't operate in the physical world, but there's plenty of written material out there.
rogerrogerr: The world has been blessed by two connected things:
1. Smart people have economic opportunities that align them away from being evil.
2. People who are evil tend not to be smart.
We're breaking both of these assumptions.
chrisweekly: "Smart people have economic opportunities that align them away from being evil"For some definition of evil, some of the time, ok. But as economic opportunities compound (looking at the behavior of the ultra-rich), it seems there's at least strong correlation in the other direction, if not full-on "root of all evil" causation.
bachittle: So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.
freedomben: Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also great to know because I will change how I approach long contexts knowing it struggles more with them.
film42: To be honest, I think it's just a more honest score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short term memory loss.
anonyfox: well it will saturate your 5h limit window at least
malcolmgreaves: That’s not quite true. Take a look at all the billionaires destroying society. Being evil is the surest way to get rich. In fact, it’s the only way to amass that level of capital: there’s no ethical billionaire.
mikek: This feels like a wild overgeneralization. People can become rich without resorting to evil methods, especially now with global markets and software. Case in point: Minecraft was wildly successful, and now Notch is a billionaire.
aliljet: Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now really roughly a 15x plan (no token increase on the subscription, so 20 ÷ 1.35 ≈ 14.8) or a 27x plan (more tokens given to compensate for the higher compute cost) relative to Claude Opus 4.6?
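(A quick sketch of the arithmetic behind that question, under the assumption that "1.35x usage" means each unit of work consumes 1.35x the tokens; the plan numbers are the commenter's hypotheticals, not Anthropic's published figures.)

```python
def effective_multiplier(plan_multiplier: float, usage_factor: float) -> float:
    # If each task burns `usage_factor` times as many tokens and the token
    # allotment stays fixed, the plan effectively shrinks by that factor.
    return plan_multiplier / usage_factor

# Fixed allotment: a "20x" plan behaves like ~14.8x under 1.35x usage.
print(round(effective_multiplier(20, 1.35), 1))
# Allotment scaled up to cover the cost: the "27x" reading (20 * 1.35).
print(20 * 1.35)
```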
ModernMech: Feels like buying toilet paper.
jmward01: I think one area I find hard to get around is context length. Everything self hosted is so limited on length that it is marginal to use. Additionally I think that the tools (like claude code) are clearly in the training mix for Anthropic's models so they seem to get a boost over other models pushed into that environment. That being said, open source and local inference is -really- good and only going to get better. There is no doubt that the current frontier biz model is not sustainable.
hxugufjfjf: Eeeeh not the best example maybe?
Aboutplants: It’s marketing. Fear is one of the most effective marketing tools. That, and it serves the purpose of attracting government attention.
rogerrogerr: Sure, but that’s not “slaughter a stadium of people with drones” evil or “poison the water supply” evil or “take out unprotected electrical substations” evil. So much infrastructure is very soft because the evil people aren’t smart enough to conceive of or conduct an attack.
fwip: I think that, if you reconsider who the 'evil' people are, you might find that we're already doing that sort of thing.
vessenes: This is an interesting document, in that it reads like a Claude Mythos model card that was hastily edited into an Opus 4.7 model card. I surmise that someone at the top put the Mythos release on hold, and the product team was told "ship this other interim step model instead. quickly." I wonder if 4.7 will be seen as a net step up in quality; there are some regressions noted in the document, and it's clearly substantially worse than Mythos, at least according to its own model card. Should be an interesting few months -- if I were at oAI I'd be rushing to get something out that's clearly better, and pressing on the weaknesses here.
the13: What makes you think that? "it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card"
vessenes: There are more mentions of Mythos than of 4.6. Mythos results are nearly everywhere, and vastly exceed 4.7's in almost every case. There are sections that report only research on Mythos, none on 4.7 -- e.g. user surveys about how beneficial Mythos is internally at Anthropic.
qingcharles: Google is putting a lot of research into small models. Most of my AI budget is now going to small models because I am doing lots of tiny tasks that the small models do great with. I would think a decent chunk of Goog's API revenue probably comes from their small models.
msla: PDF, because it isn't marked.
marginalia_nu: It's not 1998 any more. All browsers read PDFs now.
barneybooroo: Yeah, the section expanding on how they evaluated Mythos internally is a bit baffling considering how irrelevant it is.
RobinL: Could this be because they've found the 1M context uneconomical (i.e., it costs too much to serve, or burns through users' quota too quickly, causing complaints), and so they're no longer targeting it as a goal?
Someone1234: Opus 4.7 is also worse at 256K context. Go look at page 195 and page 196. It is across the board regression, not just 1M context.
RobinL: Thanks, interesting. Does this make it more surprising that the other benchmarks have improved? I'm not sure I understand the benchmarks well enough, but I'm wondering whether, with agentic workflows, it's possible to get away with a smaller, more focused context (and hence lower cost) while achieving the same or better performance, because of an agentic model's ability to decide what to put in context as it works.
timvb: what's all this mean in real world use?
deflator: Model Welfare? Are they serious about this? Or is it just more hype? I really don't trust anything this company says anymore. "We have a model that is too dangerous to release" is like me saying that I have a billion dollars in gold that nobody is allowed to see but I expect to be able to borrow against it.
hgoel: Maybe referring to it as welfare is odd, but these points are important. It isn't a good look to have a model that tends to get into self-deprecating loops like one of Google's older models, and it's an even worse look (and a potential legal liability) if your model becomes associated with a suicide. An overly negative chat model would also just be unpleasant to use. With the weights being mostly opaque, these kinds of evaluations are an important piece of reducing the harm an AI model can cause.
computomatic: They have communicated it as 5x is 5 x Pro, and 20x is 20 x Pro (I haven’t looked lately so not sure if that’s changed).They have also repeatedly communicated that the base unit (Pro allotment) is subject to change and does change often.As far as I can tell, that implies there is no guarantee that those subscriptions get some specific number of tokens per unit of time. It’s not a claim they make.
msikora: I think that, for the arguably more important weekly allotment, Max 5 is 10x Pro and Max 20 is 20x Pro. For the 5-hour window, it is as the names would suggest, though.
Jensson: It's not capitalists doing that, though; it's politicians, and politicians in non-capitalist countries tend to be more evil.
fwip: Correct me if I'm wrong, but there aren't any non-capitalist countries currently waging war on others.
ben_w: Capitalism is a continuum, not a binary, hence the occasional exchange: "China is communist!" "No, it is state capitalism!" Is Russia currently capitalist, or non-capitalist? Which is Myanmar? Anyway, personally I think it's the wrong axis; while capitalism, democracy, and a free press are often correlated, I think the latter two are the important ones for actually choosing the lesser evils, though capitalism does generate more options to choose between.