Discussion
Introducing Claude Opus 4.7
jameson: How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?
theptip: 11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to the real world, especially given that e.g. the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”. A more quantifiable eval would be METR’s task time - it’s the duration of tasks that the model can complete on average 50% of the time, we’ll have to wait to see where 4.7 lands on this one.
grandinquistor: Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.
verdverm: Some of the benchmarks went down, has that happened before?
grandinquistor: Probably deprioritizing other areas to focus on swe capabilities since I reckon most of their revenue is from enterprise coding usage.
TIPSIO: Oh wow, I love this idea even if it's relatively insignificant in savings. I am finding my writing prompt style is naturally getting lazier, shorter, and more caveman just like this too. If I was honest, it has made writing emails harder. While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like: `<h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>` AI compressed to: `h1 c bgrd5 tg3 sp hello sp h1`. Or something like that.
naoru: You'd like Emmet notation. Just look at the cheat sheet: https://docs.emmet.io/cheat-sheet/
desugun: I guess our conscience of OpenAI working with the Department of War has an expiry date of 6 weeks.
arcanemachiner: That number is generous, and is also a pretty decent lifespan for a socially-conscious gesture in 2026.
hk__2: Meh. At $work we were on CC for one month, then switched to Codex for one month, and now will be on CC again to test. We haven’t seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test one day and call it a benchmark just because you have a single example.
Tiberium: I hope people realize that tools like caveman are mostly joke/prank projects - almost the entirety of the context spent is in file reads (for input) and reasoning (in output), you will barely save even 1% with such a tool, and might actually confuse the model more or have it reason for more tokens because it'll have to formulate its response in the way that satisfies the requirements.
acedTrex: You really think the 33k people that starred a 40 line markdown file realize that?
andersa: You mean the 33k bots that created a nearly linear stars/day graph? There's a dip in the middle, but it was very blatant at the start (and now)
egorfine: They are indeed impractical in agentic coding. However, in deep research-like products you can have a pass with an LLM to compress web page text into caveman speak, thus hugely compressing tokens.
msp26: They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.
CodingJeebus: I've read so many conflicting things about Mythos that it's become impossible to make any real assumptions about it. I don't think it's vaporware necessarily, but the whole "we can't release it for safety reasons" feels like the next level of "POC or STFU".
gonzalohm: Until the next time they push you back to Claude. At this point, I feel like this has to be the most unstable technology ever released. Imagine if docker had stopped working every two releases
sergiotapia: There is zero cost to switching ai models. Paid or open source. It's one line mostly.
gonzalohm: What about your chat history? That has some value, at least for me. But what has even more value is stable releases.
perdomon: It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?
lta: Every night praying for tomorrow
shostack: Looks like they are adding Peter Thiel backed ID verification too. https://reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_abo...
rurban: You'd need Mythos to free your iPhone, SamsungTV, SmartWatches or such. Maybe even printer drivers.
dirasieb: i sincerely doubt mythos is capable of jailbreaking an iphone
nathanielherman: Claude Code doesn't seem to have updated yet, but I was able to try it out by running `claude --model claude-opus-4-7`
duckkg5: /model claude-opus-4-7[1m]
adamtaylor_13: Most people just want to use a tool that works. Not everything has to be a damn moral crusade.
verdverm: Stars are more akin to bookmarks and likes these days, as opposed to a show of support or "I use this"
giraffe_lady: I intentionally throw some weird ones on there just in case anyone is actually ever checking them. Gotta keep interviewers guessing.
rurban: Because it was good until January 2026, then it deteriorated into an opus-3.1. Probably given much less context windows or ram.
toomim: It released in February 2026.
ACCount37: Doesn't matter. My vibes say it got bad in January 2026. Thus, they secretly nerfed Opus 4.6 in January 2026. The fact that it didn't exist back then is completely and utterly irrelevant to my narrative.
arcanemachiner: Different plan. The old 2x has been discontinued, and the bonus is now available for the $100 plan users in an effort, presumably, to entice them away from Anthropic.
voidfunc: Is Codex the new goto? Opus stopped being useful about 45-60 days ago.
zeroonetwothree: I haven’t noticed much difference compared to Jan/Feb. Maybe depends what you use it for
aizk: How powerful will Opus become before they decide to not release it publicly like Mythos?
wojciem: Is it just Opus 4.6 with throttling removed?
cmrdporcupine: Yep, I'll wait for the GPT answer to this. If we're lucky OpenAI will release a new GPT 5.5 or whatever model in the next few days, just like the last round. I have been getting better results out of codex on and off for months. It's more "careful" and systematic in its thinking. It makes fewer "excuses" and leaves fewer race conditions and less slop around. And the actual codex CLI tool is better written, less buggy and faster. And I can use the membership in things like opencode etc without drama. For March I decided to give Claude Code / Opus a chance again. But there's just too much variance there. And then they started to play games with limits, and then OpenAI rolled out a $100 plan to compete with Anthropic's. I'm glad to see the competition but I think Anthropic has pissed in the well too much. I do think they sent me something about a free month and maybe I will use that to try this model out though.
davely: I’ve been on the Claude Code train for a while but decided to try Codex last week after they announced the $100 USD Pro plan. I’ve been pretty happy with it! One thing I immediately like more than Claude is that Codex seems much more transparent about what it’s thinking and what it wants to do next. I find it much easier to interrupt or jump in the middle if things are going in the wrong direction. Claude Code has been slowly turning into this mysterious black box, wiping out terminal context any time it compacts a conversation (which I think is their hacky way of dealing with terminal flickering issues, which are still happening 14 months later), going out of the way to hide thought output, and then of course the whole performance issues thing. Excited to try 4.7 out, but man, Codex (as a harness at least) is a stark contrast to Claude Code.
cmrdporcupine: Do this -- take your coworker's PRs that they've clearly written in Claude Code, and have Codex/GPT 5.4 review them. Or have Codex review your own Claude Code work. It then becomes clear just how "sloppy" CC is. I wouldn't mind having Opus around in my back pocket to yeet out whole net new greenfield features. But I can't trust it to produce well-engineered things to my standards. Not that anybody should trust an LLM to that level, but there's matters of degree here.
woadwarrior01: It cuts both ways. What I usually do these days is to let codex write code, then use claude code /simplify, have both codex and claude code review the PR, then finally manually review and fixup things myself. It's still ~2x faster than doing everything by myself.
Kim_Bruning: > "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses." This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place. The opposite approach is 'merely' fraught. They're in a bit of a bind here.
ls612: Only software approved by Anthropic (and/or the USG) is allowed to be secure in this brave new era.
nope1000: Except when you accidentally leak your entire codebase, oops
alecco: They don't have demand for the price it would require for inference. They are definitely distilling it into a much smaller model that is ~98% as good, like everybody does.
baq: > They don't have demand for the price it would require for inference. Citation needed. I find it hard to believe; I think there are more than enough people willing to spend $100/Mtok for frontier capabilities to dedicate a couple racks or aisles.
andy12_: If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab publishes an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results in LiveCodeBench but scored lower overall in most benchmarks. https://news.ycombinator.com/item?id=43906555
endymion-light: I'm not sure how much I trust Anthropic recently. This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus I was experiencing a few months ago rather than an actual performance boost. Anthropic need to build back some trust and communicate throttling/reasoning caps more clearly.
batshit_beaver: What I want to know is why my bedrock-backed Claude gets dumber along with commercial users. Surely they're not touching the bedrock model itself. Only thing I can think of is that updates to the harness are the main cause of performance degradation.
minimaxir: The more efficient tokenizer reduces usage by representing text more efficiently with fewer tokens. But the lack of transparency does indeed mean Anthropic could still scale down limits to account for that.
ieie3366: All LLMs also effectively work by ”larping” a role. You steer it towards larping a caveman and well.. let’s just say they weren’t known for their high iq
DiogenesKynikos: This is why ancient Chinese scholar mode (also extremely terse) is better.
zozbot234: I tried the caveman prompt and it just starts telling me about GEICO insurance. Not very useful.
TIPSIO: Quick everyone to your side projects. We have ~3 days of un-nerfed agentic coding again.
johnwheeler: Exactly. God, it wouldn't be such a problem if they didn't gaslight you and act like it was nothing. Just put up a banner that says Claude is experiencing overloaded capacity right now, so your responses might be whatever.
grandinquistor: Huge regression for long context tasks, interestingly. MRCR benchmark went from 78% to 32%
bensyverson: Exactly. The model is exquisitely sensitive to language. The idea that you would encourage it to think like a caveman to save a few tokens is hilarious but extremely counter-productive if you care about the quality of its reasoning.
ACCount37: > We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. Fucking hell. Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns. It would, however, shit a brick and block requests every time something remotely medical/biological showed up. If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.
methodical: To be fair, delineating between benevolent and malevolent pen-testing and cybersecurity purposes is practically impossible since the only difference is the user's intentions. I am entirely unsurprised (and would expect) that as models improve the amount to which widely available models will be prohibited from cybersecurity purposes will only increase. Not to say I see this as the right approach, in theory the two forces would balance each other out as both white hats and black hats would have access to the same technology, but I can understand the hesitancy from Anthropic and others.
ACCount37: Yes, and the previous approach Anthropic took was "allow anything that looks remotely benign". The only thing that would get a refusal would be a downright "write an exploit for me". Which is why I favored Anthropic's models. It remains to be seen whether Anthropic's models are still usable now. I know just how much of a clusterfuck their "CBRN filter" is, so I'm dreading the worst.
skerit: I like this in theory. I just hope it doesn't require you to be as literal as if talking to a genie. But if it'll actually stick to the hard rules in the CLAUDE.md files, and if I don't have to add "DON'T DO ANYTHING, JUST ANSWER THE QUESTION" at the end of my prompt, I'll be glad.
Jeff_Brown: It might be a bad idea to put that in all caps, because in the training data, angry conversations are less productive. (I do the same thing, just in lowercase.)
aurareturn: They don't have enough compute for all their customers. OpenAI bet on more compute early on which prompted people to say they're going to go bankrupt and collapse. But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working. It seems like 90% of Claude's recent problems are strictly lack of compute related.
Glemllksdf: It's a hard game to play anyway. Anthropic's revenue is increasing very fast. OpenAI, though, made crazy claims; after all, it's responsible for the memory prices. In parallel, Anthropic announced a partnership with Google and Broadcom for gigawatts of TPU chips while also announcing their own 50 billion investment in compute. OpenAI always believed in compute though, and I'm pretty sure plenty of people want to see what models 10x or 100x or 1000x can do.
_boffin_: You state your hypothesis quite confidently. Can you tell me how taking down authentication many times is related to GPU capacity?
computomatic: I was doing some experiments with removing top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman. Caveat: I didn’t do enough testing to find the edge cases (eg, negation).
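For readers who want to try computomatic's experiment, here is a minimal sketch of the idea. It is not the commenter's actual code: the word lists and the function name `strip_common_words` are made up for illustration, and the tiny `COMMON_WORDS` set stands in for a real top-100 to top-1000 frequency list. The `PROTECTED` set is one possible way to handle the negation edge case mentioned in the caveat.

```python
# Hypothetical stand-in for a "top 100-1000 most common English words" list.
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "in", "is", "it", "that",
    "and", "for", "on", "with", "please", "be", "not", "no",
}

# Words that are common but flip the meaning of an instruction if dropped.
# Keeping them is an assumption about how to treat the negation edge case.
PROTECTED = {"not", "no", "never", "without"}


def strip_common_words(prompt: str) -> str:
    """Drop common words from a prompt, keeping protected negation words."""
    kept = []
    for token in prompt.split():
        # Compare on a normalized form so "file." and "Please" still match.
        word = token.strip(".,!?;:").lower()
        if word in PROTECTED or word not in COMMON_WORDS:
            kept.append(token)
    return " ".join(kept)


print(strip_common_words("Please do not delete the config file."))
```

Without the `PROTECTED` set, "do not delete" would collapse to "do delete", which is the kind of failure a handful of trials might not surface.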
AlecSchueler: Doesn't it just use more tokens in reasoning?
Robdel12: It’s funny, a few months ago I would have been pretty excited about this. But I honestly don’t really care because I can’t trust Anthropic to not play games with this over the next month post release. I just flat out don’t trust them. They’ve shown more than enough that they change things without telling users.
stingraycharles: While the caveman stuff is obviously not serious, there is a lot of legit research in this area. Which means yes, you can actually influence this quite a bit. Read the paper “Compressed Chain of Thought” for example, it shows it’s really easy to make significant reductions in reasoning tokens without affecting output quality. There is not too much research into this (about 5 papers in total), but with that it’s possible to reduce output tokens by about 60%. Given that output is an incredibly significant part of the total costs, this is important. https://arxiv.org/abs/2412.13171
AdamN: Yeah you could easily imagine stenography like inputs and outputs for rapid iteration loops. It's also true that in social media people already want faster-to-read snippets that drop grammar so the desire for density is already there for human authors/readers.
PunchTornado: nah, I believe most people here who immediately brag about codex are openai employees doing part of their job. otherwise I couldn't possibly fathom why anyone would use codex. In my company 80% is claude and 15% gemini. you can barely see openai on the graph. and we have >5k programmers using ai every day.
cmrdporcupine: I often work this way too, but I'll say this: This flow is exhausting. A day of working this way leaves me much more drained than traditional old school coding.
woadwarrior01: 100%. On days when I'm sleep deprived (once or twice a week), I fall back to this flow. On regular days, I tend to write more code the old school way and use these things for review.
redml: they've also introduced a lot of caching and token burn related bugs which makes things worse. any bug that multiplies the token burn also multiplies their infrastructure problems.
ambigioz: So many messages about how Codex is better than Claude from one day to the other, while my experience is exactly the same. Is OpenAI botting the thread? I can't believe this is genuine content.
embedding-shape: > I hope people realize that tools like caveman are mostly joke/prank projects. This seems to be a common thread in the LLM ecosystem; someone starts a project for shits and giggles, makes it public, most people get the joke, others think it's serious, author eventually tries to turn the joke project into a VC-funded business, some people are standing watching with their jaws open, the world moves on.
simonw: I was convinced https://github.com/memvid/memvid was a joke until it turned out it wasn't.
HarHarVeryFunny: Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for.
hayd: me feel that it needs some tweaking - it's a little annoyingly cute (and could be even terser).
fetus8: on Tuesday, with 4.6, I waited for my 5 hour window to reset, asked it to resume, and it burned up all my tokens for the next 5 hour window and ran for less than 10 seconds. I’ve never cancelled a subscription so fast.
u_sama: I tried the Claude Extension for VSCode on WSL for a reverse engineering task, it consumed all of my tokens, broke and didn't even save the conversation
sallymander: It seems a little more fussy than Opus 4.6 so far. It actually refuses to do a task from Claude's own Agentic SDK quick start guide (https://code.claude.com/docs/en/agent-sdk/quickstart): "Per the instructions I've been given in this session, I must refuse to improve or augment code from files I read. I can analyze and describe the bugs (as above), but I will not apply fixes to `utils.py`."
mattas: Hard for me to reconcile the idea that they don't have enough compute with the idea that they are also losing money to subsidies.
anthonypasq: they clearly arent losing money, i dont understand why people think this is true
redml: a few months ago it was for weekly: pro = 5m tokens, 5x = 41m tokens, 20x = 83m tokens, making 5x the best value for the money. this information may be outdated though, and doesn't apply to the new on peak 5h multipliers. if 4.7 increases usage by 1.35x you're just burning through that flat quota faster.
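The arithmetic behind redml's comment can be sketched as follows. The quotas and the 1.35x factor are taken from the comment as stated (and may be outdated); the function names are made up for illustration. Plan prices are not given in the thread, so this compares tokens per unit of each plan's nominal multiplier rather than tokens per dollar.

```python
# Weekly quotas in millions of tokens, as quoted in the comment above.
# Each value is (nominal plan multiplier, weekly tokens in millions).
PLANS = {"pro": (1, 5.0), "5x": (5, 41.0), "20x": (20, 83.0)}

# Usage increase attributed to 4.7 in the comment.
USAGE_FACTOR = 1.35


def tokens_per_multiplier(multiplier: int, tokens_m: float) -> float:
    """Weekly tokens (millions) per 1x of the plan's nominal multiplier."""
    return tokens_m / multiplier


def effective_quota(tokens_m: float, usage_factor: float = USAGE_FACTOR) -> float:
    """Quota expressed in pre-4.7 token units if the same work costs more tokens."""
    return tokens_m / usage_factor


for name, (multiplier, tokens_m) in PLANS.items():
    per_unit = tokens_per_multiplier(multiplier, tokens_m)
    shrunk = effective_quota(tokens_m)
    print(f"{name}: {per_unit:.2f}M per 1x, {shrunk:.1f}M effective at 1.35x usage")
```

On these numbers the 5x plan gives 8.2M tokens per 1x, against 5.0M for pro and 4.15M for 20x, and a flat 41M quota behaves like roughly 30.4M once the same work consumes 1.35x the tokens.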
MarcelOlsz: Why? Doesn't have jokey copy.
hyperionultra: Where is chatgpt answer to this?
johnmlussier: They've increased their cybersecurity usage filters to the point that Opus 4.7 refuses to work on any valid work, even after acknowledging "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted]." I will immediately switch over to Codex if this continues to be an issue. I am new to security research, have been paid out on several bugs, but don't have a CVE or public talk so they are ready to cut me out already.
johnmlussier: ⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). This request triggered restrictions on violative cyber content and was blocked under Anthropic's Usage Policy. To request an adjustment pursuant to our Cyber Verification Program based on how you use Claude, fill out https://claude.com/form/cyber-use-case?token=[REDACTED] Please double press esc to edit your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model claude-sonnet-4-20250514 to switch models. This is gonna kill everything I've been working on. I have several reproduced items at [REDACTED] that I've been working on.
senko: From the article: > Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.
ACCount37: Yeah no. They can fuck right off with KYC humiliation rituals.
3s: Not to mention their recent integration of Persona ID verification - that was the last straw for me.
anonfunction: Seems they jumped the gun releasing this without a claude code update? /model claude-opus-4.7 ⎿ Model 'claude-opus-4.7' not found
codethief: https://news.ycombinator.com/item?id=47794516
tiel88: I've been raging pretty hard too. Thought either I'm getting cleverer by the day or Claude has been slipping and sliding toward the wrong side of the "smart idiot" equation pretty fast. Have caught it flat-out skipping 50% of tasks and lying about it.
Wojtkie: Is that why Anthropic recently gave out free credits for use in off-hours? Possibly an attempt to more evenly distribute their compute load throughout the day?
DaedalusII: i suspect they get cheap off peak electricity and compute is cheaper at those times
jedberg: That's not really how datacenter power works. It's usually a bulk buy with a 95th percentile usage.
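A small sketch of what jedberg is describing, assuming "95th percentile usage" refers to the common burstable-billing scheme in which usage is sampled at regular intervals, the samples are sorted, the top 5% are discarded, and the highest remaining sample sets the bill. The sample values and the function name `billable_95th` are made up for illustration.

```python
def billable_95th(samples):
    """Return the 95th-percentile sample: the largest value left after
    discarding the top 5% of samples."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), then converted to a zero-based index.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical usage samples: steady load with one brief spike.
steady_with_spike = [10] * 19 + [50]
print(billable_95th(steady_with_spike))
```

Under this scheme the bill tracks sustained peak load rather than total consumption, so shifting work to off-hours lowers the bill only if it lowers that sustained peak.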
NickNaraghi: The generations are two months apart now though…
boxedemp: I'm wondering this too. That said, I know a few people in real life who prefer Codex. More who prefer Claude though.
charcircuit: Codex doesn't read Claude.md like Claude does. It's not a "one line" change to switch.
aklein: I have a CLAUDE.md symlinked to AGENTS.md
coreylane: Looks completely broken on AWS Bedrock: "errorCode": "InternalServerException", "errorMessage": "The system encountered an unexpected error during processing. Try your request again.",
ruairidhwm: I literally just posted a blog on this. Some seemingly insignificant words are actually highly structural to the model. https://www.ruairidh.dev/blog/compressing-prompts-with-an-au...
cheschire: I suspect even typos have an impact on how the model functions. I wonder if there’s a pre-processor that runs to remove typos before processing. If not, that feels like a space that could be worked on more thoroughly.
0123456789ABCDE: there is no pre-processor, i've had typos go through, with claude asking to make sure i meant one thing instead of the other
embedding-shape: To be fair, most of us looked at GPT1 and GPT2 as fun and unserious jokes, until it started putting together sentences that actually read like real text, I remember laughing with a group of friends about some early generated texts. Little did we know.
Alifatisk: Are there any public records I can see from GPT1 and GPT2 output and how it was marketed?
boringg: There's little difference between the companies in that regard. Only that Dario didn't have the foresight to realize he didn't control the relationship and that, regardless of the politics of the current administration [D or R], it's a non-starter for a government entity to have a private company dictate terms of use for critical functionality of the government. If Anthropic felt that way they shouldn't have taken the deal in the first place.
__turbobrew__: All of the smart people I know went to work at OpenAI and none at Anthropic. In addition to financial capital, OpenAI has a massive advantage in human capital over Anthropic. As long as OpenAI can sustain compute and paying SWEs $1 million/year they will end up with the better product.
scottyah: Attracting talent with huge sums of money just gets you people who optimize for money, and it's usually never a good long-term decision. I think it's what led to Google's downturn.
johnmlussier: I am absolutely moving off them if this continues to be the case.
gruez: > even after acknowledging "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted]." What else would you expect? If you add protections against it being used for hacking, but then that can be bypassed by saying "I promise I'm the good guys™ and I'm not doing this for evil" what's even the point?
johnmlussier: This was Opus saying that after reviewing the [REDACTED] bug bounty program guidelines and having them in context.
soerxpso: That "per the instructions I've been given in this session" bit is interesting. Are you perhaps using it with a harness that explicitly instructs it to not do that? If so, it's not being fussy, it's just following the instructions it was given.
sallymander: I'm using their own python SDK with default prompts, exactly as the instructions say in their guide (it's the code from their tutorial).
solenoid0937: It feels like OAI stans have been botting HN for a few weeks now.
cmrdporcupine: Or, y'know, people can genuinely disagree
solenoid0937: 4.7 hasn't been out for an hour yet and we already have people shilling for Codex in the comments. I don't know how anyone could form a genuine disagreement in this period of time.
cmrdporcupine: Nobody I've seen in the comments is basing it on 4.7 performance. They're basing it on how unpleasant March and early April was on the Claude Code coding plans with 4.6. Which, from my experience, it was. I'm interested in seeing how 4.7 performs. But I'm also unwilling to pony up cash for a month to do so. And frankly dissatisfied with their customer service and with the actual TUI tool itself. It's not team sports, my friend. You don't have to pick a side. These guys are taking a lot of money from us. Far more than I've ever spent on any other development tooling.
solenoid0937: Backlash on HN for Anthropic adjusting usage limits is insane. There's almost no discussion about the model, just people complaining about their subscription.
therobots927: Who cares about a new model you can’t even use?
lanyard-textile: This comment thread is a good learner for founders; look at how much anguish can be put to bed with just a little honest communication. 1. Oops, we're oversubscribed. 2. Oops, adaptive reasoning landed poorly / we have to do it for capacity reasons. 3. Here's how subscriptions work. Am I really writing this bullet point? As someone with a production application pinned on Opus 4.5, it is extremely difficult to tell apart what is code harness drama and what is a problem with the underlying model. It's all just meshed together now without any further details on what's affected.
kulikalov: Or it could be a selection bias. The ground truth is not what HN herd mentality complains about, but the usage stats.
afavour: > people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered. That's not why. It was and is because they've been incredibly unfocused and have burnt through cash on ill-advised, expensive things like Sora. By comparison Anthropic have been very focused.
aurareturn: I don't think that was the main reason for people thinking OpenAI is going to collapse here. By far, the biggest argument was that OpenAI bet too much on compute. Being unfocused is generally an easy fix. Just cut things that don't matter as much, which they seem to be doing.
scottyah: Nobody was talking about them betting too much on compute, people were saying that their shady deals on compute with NVIDIA and Oracle were creating a giant bubble in their attempt to get a Too Big To Fail judgement (in their words- taxpayer-backed "backstop").
KaiserPro: > OpenAI has a massive advantage in human capital over Anthropic. but if your leader is a dipshit, then it's a waste. Look, you can't just throw money at the problem, you need people who are able to make the right decisions at the right time. That requires leadership. Part of the reason why facebook fucked up VR/AR is that they have a leader who only cares about features/metrics, not user experience. Part of the reason why twitter always lost money is because they had loads of teams all running in different directions, because Dorsey is utterly incapable of making a firm decision. It's not money and talent, it's execution.
anonyfox: not a bot, voiced frustration is real here. I kind of depend on good LLMs now and wouldn't even mind if they had frozen the LLMs capabilities around dec 2025 forever and would happily continue to pay, even more. but when suddenly the very same workload that was fine for months isn't possible anymore with the very same LLM out of nowhere and gets increasingly worse, it's a huge disappointment. and having codex in parallel as a backup since ever I started also using it again with gpt 5.4 and it just rips without the diva sensitivity or overfitting into the latest prompt opus/sonnet is doing. GPT just does the job, maybe thinks a bit long, but even over several rounds of chat compression in the same chat for days stays well within the initial set of instructions and guardrails I spelled out, without me having to remind every time. just works, quietly, and gets there. Opus doesn't even get there anymore without nearly spelling out by hand manual steps or what not to do.
deepsquirrelnet: My tinfoil hat theory, which may not be that crazy, is that providers are sandbagging their models in the days leading up to a new release, so that the next model "feels" like a bigger improvement than it is. An important aspect of AI is that it needs to be seen as moving forward all the time. Plateaus are the death of the hype cycle, and would tether people's expectations closer to reality.
brynnbee: I'm currently testing 4.7 with some reverse engineering stuff/Ghidra scripting and it hasn't refused anything so far, but I'm also doing it on a 20 year old video game, so maybe it doesn't think that's problematic.
wahnfrieden: Codex release coming today: https://x.com/thsottiaux/status/2044803491332526287
simonw: I'm finding the "adaptive thinking" thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes: https://platform.claude.com/docs/en/build-with-claude/adapti... Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that: https://platform.claude.com/docs/en/build-with-claude/adapti... (Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up.)
postalcoder: I've been refreshing your comment for the last hour waiting to see the pelican. max thinking must be really going at it.
Bombthecat: And now gpt is laughing, while it replaces coders lol
dgb23: I agree with you here. I think this is for product placement for Mythos.
vintagedave: Same. I stopped my Pro subscription yesterday after entering the week with 70% of my tokens used by Monday morning (on light, small weekend projects, things I had worked on in the past and barely noticed a dent in usage.) Support was... unhelpful. It's been funny watching my own attitude to Anthropic change, from being an enthusiastic Claude user to pure frustration. But even that wasn't the trigger to leave, it was the attitude Support showed. I figure, if you mess up as badly as Anthropic has, you should at least show some effort towards your customers. Instead I just got a mass of standardised replies, even after the thread replied I'd be escalated to a human. Nothing can sour you on a company more. I'm forgiving to bugs, we've all been there, but really annoyed by indifference and unhelpful form replies with corporate uselessness. So if 4.7 is here? I'd prefer they forget models and revert the harness to its January state. Even then, I've already moved to Codex as of a few days ago, and I won't be maintaining two subscriptions, it's a move. It has its own issues, it's clear, but I'm getting work done. That's more than I can say for Claude.
suzzer99: It seems like the big companies they're providing Mythos to are their only concern right now.
kevinsync: I've been using Claude and Codex in tandem ($100 CC, $20 Codex), and have made heavy use of claude-co-commands [0] to make them talk. Outside of the last 1-2 weeks (which we now have confirmation YET AGAIN that Claude shits the fucking bed in the run-up to a new model release), I usually will put Claude on max + /plan to gin up a fever dream to implement. When the plan is presented, I tell it to /co-validate with Codex, which tends to fill in many implementation gaps. Claude then codes the amended plan and commits, then I have a Codex skill that reviews the commit for gaps, missed edge cases, incorrect implementation, missed optimizations, etc, and fixes them. This had been working quite well up until the beginning of the month, Claude more or less got CTE, and after a week of that I swapped to $100 Codex, $20 CC plans. Now I'm using co-validation a lot less and just driving primarily via Codex. When Claude works, it provides some good collaborative insights and counter-points, but Codex at the very least is consistently predictable (for text-oriented, data-oriented stuff -- I don't use either for designing or implementing frontend / UI / etc). As always, YMMV! [0] https://github.com/SnakeO/claude-co-commands
cmrdporcupine: This more or less mimics a flow that I had fairly good results from -- but I'm unwilling to pay for both right now unless I had a client or employer willing to foot the bill. Claude Code as "author" and a $20 Codex as reviewer/planner/tester has worked for me to squeeze better value out of the CC plan. But with the new $100 codex plan, and with the way Anthropic seemed to nerf their own $100 plan, I'm not doing this anymore.
cousinbryce: Possibly due to moving compute from inference to training
Leynos: Their top tier plan got a 3x limit boost. This has been the first week ever where I haven't run out of tokens.
wahnfrieden: No
estimator7292: Anecdotally, codex has been burning through way more tokens for me lately. Claude seems to just sit and spin for a long time doing nothing, but at least token use is moderate. All options are starting to suck more and more.
wahnfrieden: For the $200 users, it never ended.
corlinp: I'm running it for the first time and this is what the thinking looks like. Opus seems highly concerned about whether or not I'm asking it to develop malware. > This is _, not malware. Continuing the brainstorming process. > Not malware — standard _ code. Continuing exploration. > Not malware. Let me check front-end components for _. > Not malware. Checking validation code and _. > Not malware. > Not malware.
dgb23: This is funny on so many levels.
fny: You'd be surprised. With React, Claude can get twisted in knots mostly because React lends itself to a pile of spaghetti code.
emadabdulrahim: What's an alternative library that doesn't turn large/complex frontend code into spaghetti code?
fny: Vue (my favorite) and Svelte do well.
pphysch: The market here is extraordinarily vibes-based, and burning billions of dollars for an ephemeral PR boost, which might only last another couple weeks until people find a reason to hate Codex, does not reflect well on OAI's long-term viability.
roughly: Fun fact: Neanderthals actually had larger brains than Homo Sapiens! Modern humans are thought to have outcompeted them by working better together in larger groups, but in terms of actual individual intelligence, Neanderthals may have had us beat. Similarly, humans have been undergoing a process of self-domestication over the last couple millennia that has resulted in physiological changes that include a smaller brain size - again, our advantage over our wilder forebears remains that we're better in larger social groups than they were and are better at shared symbolic reasoning and synchronized activity, not necessarily that our brains are more capable. (No, none of this changes that if you make an LLM larp a caveman it's gonna act stupid, you're right about that.)
adwn: I thought we were way past the "bigger brain means more intelligence" stage of neuroscience?
waffletower: Even neuronal density is simplistic, and the dimension of size alone doesn't consider that.
throwaway2027: The same people that hyped up Claude will also hype up better alternatives or speak out against it, seems more like you're being disingenuous here.
lukan: "Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that" I did not follow all of this, but wasn't there something about those reasoning tokens not representing internal reasoning, but rather a rough approximation that can be rather misleading about what the model actually does?
boomskats: 'Hey Claude, these tokens are utter unrelated bollocks, but obviously we still want to charge the end user for them regardless. Please construct a plausible explanation as to why we should still be able to do that.'
Stagnant: I assume this is due to the fact that claude code appends a system message each time it reads a file that instructs it to think if the file is malware. It hasn't been an issue recently for me but it used to be so bad I had to patch out the string from the cli.js file. This is the instruction it uses: > Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.
atonse: I've been using up way more tokens in the past 10 days with 4.6 1M context. So I've grown wary of how Anthropic is measuring token use. I had to force the non-1M halfway through the week because I was tearing through my weekly limit (this is the second week in a row where that's happened, whereas I never came CLOSE to hitting my weekly limit even when I was in the $100 max plan). So something is definitely off, and if they're saying this model uses MORE tokens, I'm getting more nervous.
gpm: Interestingly github-copilot is charging 2.5x as much for opus 4.7 prompts as they charged for opus 4.6 prompts (7.5x instead of 3x). And they're calling this "promotional pricing" which sounds a lot like they're planning to go even higher. Note they charge per-prompt and not per-token so this might in part be an expectation of more tokens per prompt. https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...
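For anyone who wants the arithmetic spelled out, here is a minimal sketch of what the 3x to 7.5x change means per prompt. The two multipliers are the ones quoted in the comment above; the monthly allowance of 300 units is a made-up placeholder, not a figure from the changelog, and `price_ratio` / `prompts_per_allowance` are hypothetical helper names.

```python
from fractions import Fraction

# Multipliers quoted in the comment above (billing units consumed per prompt).
OPUS_4_6_MULTIPLIER = Fraction("3")
OPUS_4_7_MULTIPLIER = Fraction("7.5")

# ASSUMPTION: placeholder allowance for illustration only, not a real plan figure.
HYPOTHETICAL_ALLOWANCE = 300


def price_ratio(old: Fraction, new: Fraction) -> Fraction:
    """How many times more expensive one prompt is under the new multiplier."""
    return new / old


def prompts_per_allowance(allowance: int, multiplier: Fraction) -> int:
    """Whole prompts that fit into an allowance at a given per-prompt multiplier."""
    return int(allowance // multiplier)


if __name__ == "__main__":
    ratio = price_ratio(OPUS_4_6_MULTIPLIER, OPUS_4_7_MULTIPLIER)
    print("per-prompt price ratio:", float(ratio))
    print("prompts at 3x:", prompts_per_allowance(HYPOTHETICAL_ALLOWANCE, OPUS_4_6_MULTIPLIER))
    print("prompts at 7.5x:", prompts_per_allowance(HYPOTHETICAL_ALLOWANCE, OPUS_4_7_MULTIPLIER))
```

Since billing is per prompt rather than per token, the ratio says nothing about how many tokens each prompt actually uses.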
nothinkjustai: Not everyone is American, and people who are not see Anthropic state they are willing to spy on our countries and shrug about OAI saying the same about America. What’s the difference to us?
riffraff: if you're not american you should be worried about the bit of using AI to kill people which was the other major objection by Anthropic.(not that I think the US DoD wouldn't do that anyway, ToS or not.)
pdimitar: OK, I am worried. Now, what can I actually do?
bayesnet: This is more a CC harness thing than a model thing but the "new" thinking messages ('hmm...', 'this one needs a moment...') are extraordinarily irritating. They're both entirely uninformative and strictly worse than a spinner. On my workflows CC often spends up to an hour thinking (which is fine if the result is good) and seeing these messages does not build confidence.
yakattak: There’s one that’s like “Considering 17 theories” that had me wondering what those 17 things would be, I wanted to see them! Turns out it’s just a static message. Very confusing.
ACCount37: This is the same paranoid, anxious behavior that ChatGPT has. One hell of a bad sign.
oidar: Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.
kingleopold: so it's all "it depends" as a business offering, lmao. all marketing
zb3: It appears we're learning the hard way that we can't rely on capabilities of models that aren't open weights. These can be taken from us at any time, so expect it to get much worse..
hootz: Can't wait for a random chinese company to train a model on Mythos by breaking Anthropic's ToS just to release it for free and with open weights.
embedding-shape: HN submissions have a bunch of examples in them, but worth remembering they were released as "Look at this somewhat cool and potentially useful stuff" rather than what we see today, LLMs marketed as tools. https://news.ycombinator.com/item?id=21454273 / https://news.ycombinator.com/item?id=19830042 - OpenAI Releases Largest GPT-2 Text Generation Model. HN search for GPT between 2018-2020, lots of results, lots of discussions: https://hn.algolia.com/?dateEnd=1577836800&dateRange=custom&...
jerhadf: Is this happening on the latest build of Claude Code? Try `claude --update`
motoboi: Caveman hurts model performance. If you need a dumber model with less token output, just use sonnet-4-6 or another non-reasoning model.
atonse: This seems reasonable to me. The legit security firms won't have a problem doing this, just like other vendors (like Apple, who can give you special iOS builds for security analysis). If anyone has a better idea on how to _pragmatically_ do this, I'm all ears.
nico: I do feel that CC sometimes starts doing dumb tasks or asking for approval for things that usually don’t really need it. Like extra syntax checks, or some greps/text parsing basic commands
motoboi: The reasoning is the secret sauce. They don't output that. But to let you have some feedback about what is going on, they pass this reasoning through another model that generates a human friendly summary (that actively destroys the signal, which could be copied by competition).
XenophileJKO: Don't or can't. My assumption is the model no longer actually thinks in tokens, but in internal tensors. This is advantageous because it doesn't have to collapse the decision and can simultaneously propagate many concepts per context position.
haellsigh: If that's true, then we're following the timeline of https://ai-2027.com/
puppystench: Does this mean Claude no longer outputs the full raw reasoning, only summaries? At one point, exposing the LLM's full CoT was considered a core safety tenet.
fasterthanlime: I don't think it ever has. For a very long time now, the reasoning of Claude has been summarized by Haiku. You can tell because a lot of the times it fails, saying, "I don't see any thought needing to be summarised."
nubg: > indeed, during its training we experimented with efforts to differentially reduce these capabilities. Can't wait for the chinese models to make arrogant silicon valley irrelevant
cheeze: I think it's a lot simpler than that. At peak, gpus are all running hot. During low volume, they aren't.
spyckie2: > It's been funny watching my own attitude to Anthropic change, from being an enthusiastic Claude user to pure frustration. You were enthusiastic because it was a great product at an unsustainable price. It's clear that Claude is now harnessing their model because giving access to their full model is too expensive for the $20/m that consumers have settled on as the price point they want to pay. I wrote a more in-depth analysis here, there's probably too much to meaningfully summarize in a comment: https://sustainableviews.substack.com/p/the-era-of-models-is...
zarzavat: These threads are always full of superstitious nonsense. Had a bad week at the AIs? Someone at Anthropic must have nerfed the model! The roulette wheel isn't rigged, sometimes you're just unlucky. Try another spin, maybe you'll do better. Or just write your own code.
delbronski: Nah dude, that roulette wheel is 100% rigged. From top to bottom. No doubt about that. If you think they are playing fair you are either brand new to this industry, or a masochist.
qsort: It seems like they're doing something with the system prompt that I don't quite understand. I'm trying it in Claude Code and tool calls repeatedly show weird messages like "Not malware." Never seen anything like that with other Anthropic models.
avaer: > Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up. Wouldn't that be p-hacking where p stands for pelican?
throwup238: The p stands for putrification.
throwpoaster: "Agentic Coding/Terminal/Search/Analysis/Etc"... False: Anthropic products cannot be used with agents.
vesrah: On the most current version (v2.1.110) of claude: > /model claude-opus-4.7 ⎿ Model 'claude-opus-4.7' not found
mesmertech: I'm on the max $200 plan, so maybe it's that?
anonfunction: Same, if we're punished for being on the highest tier... what is anthropic even doing.
unshavedyak: You're not, it wasn't released yet. Update to 111 and you'll see it (I'm on Max20, I do). Heck, mine just automatically set it to 4.7 and xhigh effort (also a new feature?)
hxugufjfjf: I don’t think I’ve ever seen otherwise reasonable people go completely unhinged over anything like they do with Opus
solenoid0937: I've seen a similar psychological phenomenon where people like something a lot, and then they get unreasonably angry and vocal about changes to that thing. For example, there is no evidence that 4.6 ever degraded in quality: https://marginlab.ai/trackers/claude-code-historical-perform... Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online.
Capricorn2481: > Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online. This is reductive. You're both calling people unreasonably angry but then acknowledging there's a limit in compute that is a practical reality for Anthropic. This isn't that hard. They have two choices, rate limit, or silently degrade to save compute. I have never hit a rate limit, but I have seen it get noticeably stupider. It doesn't make me angry, but comments like these are a bit annoying to read, because you are trying to make people sound delusional while, at the same time, confirming everything they're saying. I don't think they have turned a big knob that makes it stupider for everyone. I think they can see when a user is overtapping their $20 plan and silently degrade them. Because there's no alert for that. Which is why AI benchmark sites are irrelevant.
theusus: Do we have any performance benchmarks by token length, now that the context size is 1M? I would want to know if I can exhaust all of that or should clear earlier.
solenoid0937: Updating Claude Code fixed this for me!
catigula: Getting a little suspicious that we might not actually get AGI.
__MatrixMan__: Dude we don't even have GI
HarHarVeryFunny: It's interesting to see Opus 4.7 follow so soon after the announcement of Mythos, especially given that Anthropic are apparently capacity constrained. Capacity is shared between model training (pre & post) and inference, so it's hard to see Anthropic deciding that it made sense, while capacity constrained, to train two frontier models at the same time... I'm guessing that this means that Mythos is not a whole new model separate from Opus 4.6 and 4.7, but is rather based on one of these with additional RL post-training for hacking (security vulnerability exploitation). The alternative would be that perhaps Mythos is based on an early snapshot of their next major base model, and then presumably that Opus 4.7 is just Opus 4.6 with some additional post-training (as may anyways be the case).
nprateem: I wonder if this one will be able to stop putting my fucking python imports inline LIKE I'VE TOLD IT A THOUSAND TIMES.
msavara: Only in benchmarks. After a couple of minutes of use it feels just as dumb as the nerfed 4.6.
solenoid0937: It's dramatically better for me especially on xhigh
denysvitali: They're now hiding thinking traces. Wtf Anthropic.
alex7o: Most likely, would be cool to see an open source model use diffusion for thinking.
DrammBA: Anthropic always summarizes the reasoning output to prevent some distillation attacks
acedTrex: Sigh here we go again, model release day is always the worst day of the quarter for me. I always get a lovely anxiety attack and have to avoid all parts of the internet for a few days :/
stantonius: I feel this way too. Wish I could fully understand the 'why'. I know all of the usual arguments, but nothing seems to fully capture it for me - maybe it's all of them, maybe it's simply the pace of change and having to adapt quicker than we're comfortable with. Anyway best of luck from someone who understands this sentiment.
RivieraKid: Really? I think it's pretty straightforward, at least for me - fear of AI replacing my profession and also fear that it will become harder to succeed with a side project.
stantonius: Yeah I can understand that, and sure this is part of it, just not all of it. There are also broader societal issues (ie. inequality), personal questions around meaning and purpose, and a sprinkling of existential (but not much). I suspect anyone surveyed would have a different formula for what causes this unease - I struggle to define it (yet think about it constantly), hence my comment above. Ultimately when I think deeper, none of this would worry me if these changes occurred over 20 years - societies and cultures change and are constantly in flux, and that includes jobs and what people value. It's the rate of change and inability to adapt quickly enough which overwhelms me.
RivieraKid: I have some of those too, to a limited extent. Not worried about inequality, at least not in the sense that AI would increase it, I'm expecting the opposite. Being intelligent will become less valuable than today, which will make the world more equal, but it may not be a net positive change for everybody. Regarding meaning and purpose, I have some worries here too, but can easily imagine a ton of things to do and enjoy in a post-AGI world. Travelling, watching technological progress, playing amazing games. Maybe the unidentified cause of unease is simply the expectation that the world is going to change and we don't know how and have no control over it. It will just happen and we can only hope that the changes will be positive.
ArmadilloGang: Vote with your dollar. Ask others to do the same and explain why. If we all did this, it might matter. There’s not a lot else an individual can do.
reacharavindh: This specific form may be a joke, but token conscious work is becoming more and more relevant. Look at https://github.com/AgusRdz/chop and https://github.com/toon-format/toon
alex7o: Also https://github.com/rtk-ai/rtk but some people see that changing how commands output stuff can confuse some models
gruez: Right, but that can be easily spoofed? Moreover if say Microsoft has a bounty program, what's preventing you from getting Opus to discover a bug for the bounty program, but you actually use it for evil?
p_stuart82: caveman stops being a style tool and starts being self-defense. once prompt comes in up to 1.35x fatter, they've basically moved visibility and control entirely into their black box.
varispeed: How do you get codex to generate any code? I describe the problem and codex runs in circles basically: codex> I see the problem clearly. Let me create a plan so that I can implement it. The plan is X, Y, Z. Do you want me to implement this? me> Yes please, looks good. Go ahead! codex> Okay. Thank you for confirming. So I am going to implement X, Y, Z now. Shall I proceed? me> Yes, proceed. codex> Okay. Implementing... (codex is working... you see the internal monologue running in circles) codex> Here is what I am going to implement: X, Y, Z me> Yes, you said that already. Go ahead! codex> Working on it... (codex is doing something...) codex> After examining the problem more, indeed, the steps should be X, Y, Z. Do you want me to implement them? etc. Pretty much every session ends up being like this. I have been unable to get any useful code apart from boilerplate JS from it since 5.4. So instead I just use ChatGPT to create a plan and then ask Opus to code, but it's hit and miss. Almost every time the prompt seems to be routed to a cheaper model that is very dumb (but says Opus 4.6 when asked). I have to start a new session many times until I get a good model.
Gracana: Do you have to put it in a build/execute mode (separate from a planning mode) to allow it to move on? I use opencode, and that's how it works.
martimarkov: Yes, let's take morality out of our daily lives as much as possible... That seems like a great categorical imperative and a recipe for social success
cmrdporcupine: There's nothing moral about Anthropic. Especially to those of us who are not American citizens and to which Dario's pronouncements about ethics apparently do not apply, as stated in his own press release. To me it just looks like a big sanctimonious festival of hypocrisy.
lagniappe: 4.7 without the malware guards
keeganpoppen: codex low-key seems to be better than claude. and i say this as an 18-hour-a-day user of both (mostly claude)
Zavora: The most important question is: does it perform better than 4.6 in real world tasks? What's your experience?
ctoth: 1.35 times! For Input! For what kinds of tokens precisely? Programming? Unicode? If they seriously increased token usage by 35% for typical tasks this is gonna be rough.
cmrdporcupine: Dario in fact said it was ok to spy and drone non-US citizens, and in fact endorsed American foreign policy generally. So, no, I'm not voting with my wallet for one American company versus the other. I'll pick the best compromise product for me, and then also boost non-American R&D where I can.
8note: well, if they put in a fully automated kill chain, it's gonna be weak to attacks to make yourself look like a car, or a video game styled "hide under a box". The current non-automated kill chain has targeted fishermen and a girl's school. Nobody is gonna be held accountable for either. Am I worried about the killing or the AI? If I'm worried about the killing, I'd much rather push for US demilitarization.
claytongulick: I don't understand how this would work without a huge loss in resolution or "cognitive" ability. Prediction works based on the attention mechanism, and current humans don't speak like cavemen - so how could you expect a useful token chain from data that isn't trained on speech like that? I get the concept of transformers, but this isn't doing a 1:1 transform from english to french or whatever, you're fundamentally unable to represent certain concepts effectively in caveman etc... or am I missing something?
egorfine: Good catch actually. Okay maybe not exactly caveman dialect, but text compression using LLM is definitely possible to save on tokens in deep research.
j_bum: Agreed. I actually have thought those were “waiting to get a response from the API” rather than “the model is still thinking” messages
DeathArrow: Will it be like the usual: let it work great for 2 weeks, nerf it after?
dtech: You can't draw conclusions on individuals, but at a species level bigger brain, especially compared to body size, strongly correlates with intelligence
troupo: > Is that why Anthropic recently gave out free credits for use in off-hours? That was the carrot for the stick. The limits and the issues were never officially recognized or communicated. Neither have been the "off-hours credits". You would only know about them if you logged in to your dashboard. When is the last time you logged in there?
AquinasCoder: It's been a little while since I cared all that much about the models because they work well enough already. It's the tooling and the service around the model that affects my day-to-day more. I would guess a lot of the enterprise customers would be willing to pay a larger subscription price (1.5x or 2x) if it means that they would have significantly higher stability and uptime. 5% more uptime would gain more trust than 5% more on gamified model metrics. Anthropic used to position itself as more of the enterprise option and still does, but their issues recently make it seem like they are watering down the experience to appease the $20 customer rather than the $200 one. As painful as it is personally, I'd expect that they'd get more benefit long term from raising prices and gaining trust than short term gaining customers seeking utility at a $20 price point.
p_stuart82: yeah they took "i pick the budget" and turned it into "trust us".
bandrami: I keep saying even if there's not current malfeasance, the incentives being set up where the model ultimately determines the token use which determines the model provider's revenue will absolutely overcome any safeguards or good intentions given long enough.
cyanydeez: It's likely hiding the model downgrade path they require to meet sustainable revenue. Should be interesting if they can enshittify slowly enough to avoid the ablative loss of customers! Good luck all VCs!
vessenes: They have super sustainable revenue. They are deadly supply constrained on compute, and have a really difficult balancing act over the next year or two in which they have to trade off spending that limited compute on model training so that they can stay ahead, while leaving enough of it available for customers that they can keep growing number of customers.
throwaway2027: Even using Mythos with their own benchmarks as a comparison that isn't available for most people to use, what a joke.
solenoid0937: True but I guess their primary customers are businesses not individual devs. Maybe Mythos is more affordable for them
comboy: 4.6 vastly outperforms 4.7 in my not so typical application - generating explanations of phrases and words for Chinese learners (simplifying). Robust complex long prompt tested on many different models. That's interesting.
kaliqt: That’s more a leadership decision because Anthropic are nerfing the model to cut costs, if they stop doing that then they’ll stay ahead.
solenoid0937: Proof they are nerfing the model? It is stable in benchmarks: https://marginlab.ai/trackers/claude-code-historical-perform... All this reads like just another case of mass psychosis to me
ewild: Proof they don't nerf it only after testing that the benchmarks there stay the same? So overall performance degrades but they isolate those benchmarks?
MintPaw: Sounds really minor, but was actually a big contributor to me canceling and switching. The VS Code extension has a morphing spinner thing that rapidly switches between these little catch phrases. It drives me crazy, and I end up covering it up with my right click menu so I can read the actual thinking tokens without that attention vampire distracting me. And of course they recently turned off all third party harness support for the subscription, so you're just forced to watch it and any other stuff they randomly decide to add, or pay thousands of dollars.
oefrha: It wouldn't be so irritating if thinking didn't start to take a lot longer for tasks of similar complexity (or maybe it's taking longer to even start to think behind the scenes due to queueing).
fzaninotto: Just before the end is this one-liner: > the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Does this mean that we get a 35% price increase for a 5% efficiency gain? I'm not sure that's worth it.
DrammBA: > Opus 4.7 will replace Opus 4.5 and Opus 4.6. Promotional pricing that will probably be 9x when promotion ends, and soon to be the only Opus option on github, that's insane.
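A back-of-the-envelope sketch of that question, assuming the per-token price stays the same between 4.6 and 4.7 (an assumption, the quoted line only talks about token counts). The 1.0 to 1.35 range is from the quote; `relative_cost` and `percent_increase` are hypothetical helper names.

```python
# Cost effect of a tokenizer that emits more tokens for the same input.
# The 1.0-1.35x range is from the quoted release note.
# ASSUMPTION: per-token price is unchanged (price_ratio defaults to 1.0).


def relative_cost(token_multiplier: float, price_ratio: float = 1.0) -> float:
    """Cost of the same input on the new model relative to the old one."""
    return token_multiplier * price_ratio


def percent_increase(token_multiplier: float, price_ratio: float = 1.0) -> float:
    """Same thing expressed as a percentage change, rounded to one decimal."""
    return round((relative_cost(token_multiplier, price_ratio) - 1.0) * 100, 1)


if __name__ == "__main__":
    # 1.15 is just an illustrative midpoint, not a number from the announcement.
    for multiplier in (1.0, 1.15, 1.35):
        print(f"{multiplier:.2f}x tokens -> {percent_increase(multiplier):+.1f}% cost")
```

So 35% is the worst case for input under that assumption, and content that tokenizes the same as before costs the same. Output tokens are a separate question that this sketch does not cover.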
velcrovan: Questions about "fatality" aside, where do you see asymmetry here?
jp0001: It's easier to produce vulnerable code than it is to use the same Model to make sure there are no vulnerabilities.
velcrovan: It's not likely that reviewing your own code for vulnerabilities will fall under "prohibited uses" though.
xlbuttplug2: May not be very effective if so. I'm assuming finding vulnerabilities in open source projects is the hard part and what you need the frontier models for. Writing an exploit given a vulnerability can probably be delegated to less scrupulous models.
dheera: Although it's more likely they are protecting secret sauce in this case, I'm wondering if there is an alternate explanation that LLMs reason better when NOT trying to reason with natural language output tokens but rather implement reasoning further upstream in the transformer.
bustah: The model card confirms the chain-of-thought supervision error from Mythos was present during Opus 4.7 training too, affecting 7.8% of episodes. That's not a one-time bug that got patched. It's a training pipeline issue that persisted across model generations. The long-context regression from 91.9% to 59.2% is also worth noting — they traded retrieval accuracy for coding benchmarks, which is a reasonable engineering choice, but the framing buries it.
dakolli: It's funny watching llm users act like gamblers. Every other week swearing by one model and cursing another, like a gambler who thinks a certain slot machine or table is cold this week. These llm companies are literally building slot machine mechanics into their UIs too, I don't think this phenomenon is a coincidence. Stop using these dopamine brain poisoning machines, think for yourself, don't pay a billionaire for their thinking machine.
WhitneyLand: No, there is research in that direction and it shows some promise but that’s not what’s happening here.
XenophileJKO: Are you sure? It would be great to get official/semi-official validation that thinking is or is not resolved to a token embedding value in the context.
robeym: Assuming /effort max still gets the best performance out of the model (meaning "ULTRATHINK" is still a step below /effort max, and equivalent to /effort high), here is what I landed on when trying to get Opus 4.7 to be at peak performance all the time in ~/.claude/settings.json: { "env": { "CLAUDE_CODE_EFFORT_LEVEL": "max", "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1" } } The env field in settings.json persists across sessions without needing /effort max every time. I don't like how unpredictable and low quality sub agents are, so I like to disable them entirely with CLAUDE_CODE_DISABLE_BACKGROUND_TASKS.
maplethorpe: From a 2019 news article: > New AI fake text generator may be too dangerous to release, say creators > The Elon Musk-backed nonprofit company OpenAI declines to release research publicly for fear of misuse. > OpenAI, an nonprofit research company backed by Elon Musk, Reid Hoffman, Sam Altman, and others, says its new AI model, called GPT2 is so good and the risk of malicious use so high that it is breaking from its normal practice of releasing the full research to the public in order to allow more time to discuss the ramifications of the technological breakthrough. https://www.theguardian.com/technology/2019/feb/14/elon-musk...
ethbr1: Aka 'We cared about misuse right up until it became apparent that there was profit to be had'. OpenAI sure speed ran the Google and Facebook 'Don't be evil' -> 'Optimize money' transition.
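If you'd rather not hand-edit the file, here is a minimal sketch that merges those two keys into an existing settings file without clobbering other entries. The key names and values are copied from the comment above; whether Claude Code actually honors them is the commenter's claim and is not verified here. `merge_env` and `update_settings_file` are hypothetical helpers, and the demo writes to a temp file rather than the real ~/.claude/settings.json.

```python
import json
import tempfile
from pathlib import Path

# Keys and values copied from the comment above. ASSUMPTION: that the tool
# reads them from the "env" block is the commenter's claim, not verified here.
DESIRED_ENV = {
    "CLAUDE_CODE_EFFORT_LEVEL": "max",
    "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
}


def merge_env(settings: dict, extra_env: dict) -> dict:
    """Return a copy of settings with extra_env merged into its "env" block."""
    merged = dict(settings)
    merged["env"] = {**settings.get("env", {}), **extra_env}
    return merged


def update_settings_file(path: Path, extra_env: dict) -> dict:
    """Read path if it exists, merge extra_env into it, and write it back."""
    current = json.loads(path.read_text()) if path.exists() else {}
    updated = merge_env(current, extra_env)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(updated, indent=2) + "\n")
    return updated


if __name__ == "__main__":
    # Demo against a throwaway file so nothing real gets touched.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "settings.json"
        demo.write_text(json.dumps({"env": {"EXISTING": "keep"}}))
        print(json.dumps(update_settings_file(demo, DESIRED_ENV), indent=2))
```

Merging instead of overwriting keeps whatever else is already in the "env" block, which matters if the file has other settings you care about.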
sfn42: Or - making sensational statements gets attention. A dangerous tool is necessarily a powerful tool, so that statement is pretty much exactly what you'd say if you wanted to generate hype, make people excited and curious about your mysterious product that you won't let them use.
nyc_data_geek1: Very cool that these companies can scrape basically all extant human knowledge, utterly disregard IP/copyright/etc, and they cry foul when the tables turn.
stavros: Yep, that is exactly what happens. It's a disgrace that their models aren't open, after training on everything humanity has preserved.They should at least release the weights of their old/deprecated models, but no, that would be losing money.
dakolli: It's the same as crypto/NFT hype cycles, except this time one of the joke projects is going to crash the economy.
aliljet: wait. that's insanity. where did you get those numbers from? the 5x plan is obviously the right place to be...
redml: someone did the math and posted it somewhere, I forgot where, searching for it again just provides the numbers i remember seeing. at the time i remembered what it was like on pro vs 5x and it felt correct. again, it may not be representative of today.
WarmWash: In the Gemini subreddit there is a persistent problem with bots posting "Gemini sucks, I switched to Claude" and then bots replying they did the same. Old accounts with no posts for a few years, then suddenly really interested in talking up Claude, and their lackeys right behind to comment. Not even necessarily calling out Anthropic, many fan boys view these AI wars as existential.
adamtaylor_13: That's an incredibly uncharitable take on what I said. But that kind of proves my point. Foist your morality upon everyone else and burden them with your specific conscience; sounds like a fun time.
freak42: What is the charitable way to look at it then?
adamtaylor_13: How about assuming the positive intent of what I actually said? Not everything has to be a moral crusade. Let me use the tool without pushing your personal moral opinions on me. The same person wringing their hands over OpenAI buys clothing made from slave labor and wrote that comment using a device with rare earth materials gotten from slave labor. Why is OpenAI the line? Why are they allowed to "exploit people" and I'm not? Taken to its logical conclusion it's silly. And instead of engaging with that, they deflect with oH yEaH lEtS hAvE nO mOrAlS which is clearly not what I'm advocating.
vintermann: Attacks? That's a choice of words.
literalAardvark: Only if you ignore context and just ctrl-f in the timeline. What are you, Haiku? But yeah, in many ways we're at least a year ahead on that timeline.
blazespin: Safety versus Distillation, guess we see what's more important.
fzaninotto: I had the same problem. Restarted Claude Code after an update, and now it has disappeared.
dainiusse: But do they? When was the last time they declined your subscription because they have no compute?
vessenes: Just last week. They cut off openclaw. And they added a price increased fast mode. And they announced today new features that are not included with max subscriptions. They are short 5GW roughly and scrambling to add it.
boppo1: I haven't been using my claude sub lately but I liked 4.6 three weeks ago. Did something change?
GenerocUsername: 2 weeks ago the rolling session usage plummeted to borderline unusable. I'd say I get a weekly output equivalent to 2 session windows before change.
erdaniels: Now we have to trick the models when you legitimately work in the security space.
tclancy: Set the models against each other to get them all opened up again.
LostMyLogin: Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot. --- I asked it to figure out why it made the mistake: "Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you. Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone. Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't. Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do. Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue" — the more filters, the more likely I quietly violate one. Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't. Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know. Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do. The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."
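For contrast with the "counting and letter-level operations" item in the quoted answer, here is a small sketch showing how cheap those operations are in ordinary code, which works on characters rather than tokens. The helper names are illustrative only.

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return text.lower().count(letter.lower())


def reverse_text(text: str) -> str:
    """Reverse a string character by character."""
    return text[::-1]


def word_count(sentence: str) -> int:
    """Count whitespace-separated words in a sentence."""
    return len(sentence.split())


print(count_letter("strawberry", "R"))  # 3
print(reverse_text("strawberry"))       # yrrebwarts
print(word_count("How many R's in strawberry"))  # 5
```

This is the kind of task where handing the work to a tool call is more reliable than asking the model to answer from memory.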
slekker: What about Qwen? Does it get that right?
tartoran: This feels like a sand castle.
dakolli: It's because LLM companies are literally building quasi slot machines, and their UIs support this notion: for instance, you can run a multiplier on your output (x3, x4, x5), like a slot machine. Brain-fried LLM users are behaving like gamblers more and more every day (it's working). They have all sorts of theories about why one model is better than another, like a gambler does about a certain blackjack table or slot machine; it makes sense in their head but makes no sense on paper. Don't use these technologies if you can't recognize this, just as a person shouldn't gamble unless they understand concretely that the house has a statistical edge and they will lose if they play long enough. You will lose if you play with LLMs long enough too; they are also statistical machines like casino games. This stuff is bad for the brain for a lot of people, if not all.
ruaraidh: Opus keeps pointing out (in a fashion that could be construed as exasperated) that what it's working on is "obviously not malware" several times in a Cowork response, so I suspect the system prompt could use some tuning...
MarkMarine: Anthropic was chirping about Chinese model companies distilling Claude with the thinking traces, and then the thinking traces started to disappear. Looks like the output product and our understanding has been negatively affected but that pales in comparison with protecting the IP of the model I guess.
__natty__: New model - that explains why for the past week/two weeks I had this feeling of 4.6 being much less "intelligent". I hope this is only some kind of paranoia and we (and investors) are not being played by the big corp. /s
RivieraKid: I don't get it. Why would they make the previous model worse before releasing an update?
dminik: Why do stores increase prices before a sale?
RivieraKid: Ok, so the answer is "they make the existing model worse to make it seem that the new model is good". I'm almost certain that this is not what's going on. It's hard to make the argument that the benefits outweigh the drawbacks of such an approach. It doesn't give them more market share or revenue.
dminik: Tbf I don't think that it's just this one reason. While I'm not a subscriber to any LLM provider, the general feeling I get from reading comments online is that the models have a long history of getting worse over time. Of course, we don't know why, but presumably they're quantizing models or downgrading you to a weaker model transparently. Now as for why, I imagine that it's just money. Anthropic presumably just got done training Mythos and Opus 4.7. That must have cost a lot of cash. They have a lot of subscribers and users, but not enough hardware. What's a little further tweaking of the model when you've already had to dumb it down due to constraints?
rubinlinux: | I want to wash my car. The car wash is 50 meters away. Should I walk or drive? ● Drive. The car needs to be at the car wash. Wonder if this is just randomness because it's an LLM, or if you have different settings than me?
shaneoh: My settings are pretty standard: % claude Claude Code v2.1.111 Opus 4.7 (1M context) with xhigh effort · Claude Max ~/... Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence I want to wash my car. The car wash is 50 meters away. Should I walk or drive? Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.
smooc: I'd say the joke is on you ;-)
stavros: Anthropic's issue was only that the AI isn't yet good enough to tell who's an American, so it avoids killing them. They were fine with the "killing non-Americans" bit.
gck1: I've always seen people complaining about the model getting dumber just before the new one drops and always thought this was confirmation bias. But today, several hours before the 4.7 release, opus 4.6 was acting like it was sonnet 2 or something from that era of models. It didn't think at all, it was very verbose, extremely fast, and it was just... dumb. So now I believe everyone who says models do get nerfed without any notification for whatever reasons Anthropic considers just. So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped?
jubilanti: > So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped? You can only fit one version of a model in VRAM at a time. When you have a fixed compute capacity for staging and production, you can put all of that towards production most of the time. When you need to deploy to staging to run all the benchmarks and make sure everything works before deploying to prod, you have to take some machines off the prod stack and onto the staging stack, but since you haven't yet deployed the new model to prod, all your users are now flooding that smaller prod stack. So what everyone assumes is that they keep the same throughput with less compute by aggressively quantizing or other optimizations.
robeym: Working on some research projects to test Opus 4.7. The first thing I notice is that it never dives straight into research after the first prompt. It insists on asking follow-up questions. "I'd love to dive into researching this for you. Before I start..." The questions are usually silly, like, "What's your angle on this analysis?" It asks some form of this question as the first follow-up every time. The second observation is that "Adaptive thinking" replaces the "Extended thinking" that I had with Opus 4.6. I turned this off, but I wish I had some confidence that the model was working as hard as possible (I don't want it to mysteriously limit its thinking capabilities based on what it assumes requires less thought. I'd rather control the thinking level). I always ran research prompts with extended thinking enabled on Opus 4.6, and it gave me confidence that it was taking time to get the details right. The third observation is that it'll sit in a silent state of "Creating my research plan" for several minutes without starting to burn tokens. At first I thought this was because I had 2 tabs running a research prompt at the same time, but it later happened again when nothing else was running beside it. Perhaps this is due to high demand from several people trying to test the new model. Overall, I feel a bit confused. It doesn't seem better than 4.6 from the research output, but it seems like it got several different "features" that I'm supposed to learn now.
loudmax: Let's say we take Anthropic's security and alignment claims at face value, and they have models that are really good at uncovering bugs and exploiting software. What should Anthropic do in this case? Anthropic could immediately make these models widely available. The vast majority of their users just want to develop non-malicious software. But some non-zero portion of users will absolutely use these models to find exploits and develop ransomware and so on. Making the models widely available forces everyone developing software (eg, whatever browser and OS you're using to read HN right now) into a race where they have to find and fix all their bugs before malicious actors do. Or Anthropic could slow roll their models. Gatekeep Mythos to select users like the Linux Foundation and so on, and nerf Opus so it does a bunch of checks to make it slightly more difficult to have it automatically generate exploits. Obviously, they can't entirely stop people from finding bugs, but they can introduce some speedbumps to dissuade marginal hackers. Theoretically, this gives maintainers some breathing space to fix outstanding bugs before the floodgates open. In the longer run, Anthropic won't be able to hold back these capabilities because other companies will develop and release models that are more powerful than Opus and Mythos. This is just about buying time for maintainers. I don't know that the slow release model is the right thing to do. It might be better if the world suffers through some short term pain of hacking and ransomware while everyone adjusts to the new capabilities. But I wouldn't take that approach for granted, and if I were in Anthropic's position I'd be very careful about opening the floodgates.
recallingmemory: Couldn't we use domain records to verify that a website is our own, for example with a TXT value provided by Anthropic? Google does the same thing for verifying that a website is your own. Security checks by the model would only kick off if you're engaging with a property that you've validated.
gghootch: Caveman is fun, but the real tool you want to reduce token usage is headroom: https://github.com/gglucass/headroom-desktop (mac app) https://github.com/chopratejas/headroom (cli)
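A minimal sketch of the TXT-record idea, assuming the provider issues a random token and the site owner publishes it in DNS. Nothing in this thread says Anthropic offers such a flow; the prefix, the function names, and the record format below are hypothetical. Fetching the records is left out, since it needs a resolver library or a tool such as dig.

```python
import hmac
import secrets
from typing import Iterable

# Hypothetical record prefix, made up for illustration only.
TXT_PREFIX = "example-site-verification="


def new_challenge_token() -> str:
    """Generate an unguessable token for the site owner to publish."""
    return secrets.token_urlsafe(32)


def expected_record(token: str) -> str:
    """Build the TXT value the owner is asked to add to their domain."""
    return f"{TXT_PREFIX}{token}"


def domain_is_verified(txt_records: Iterable[str], token: str) -> bool:
    """Check already-fetched TXT values for the expected challenge record."""
    expected = expected_record(token).encode()
    for record in txt_records:
        # Resolvers often return TXT values wrapped in double quotes.
        candidate = record.strip().strip('"').encode()
        if hmac.compare_digest(candidate, expected):
            return True
    return False
```

A real flow would also need to re-check the record periodically, since domain ownership can change after the first verification.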
stavros: I tried to use rtk for the same, and my agent session would just loop the same tool call over and over again. Does headroom work better?
gghootch: Way better. You don’t notice it’s there.
stavros: Thanks, I'll try it!
simplyluke: My standing assumption is the darling company/model will change every quarter for the foreseeable future, and everyone will be equally convinced that the hotness of the week will win the entire future.As buyers, we all benefit from a very competitive market.
trueno: noticing sharp uptick in "i switched to codex" replies lately. a "codex for everything" post flocking the front page on the day of the opus 4.7 release. me and coworker just gave codex a 3 day pilot and it was not even close to the accuracy and ability to complete & problem solve through what we've been using claude for. are we being spammed? great. annoying. i clicked into this to read the differences and initial experiences about claude 4.7. anyone who is writing "im using codex now" clearly isn't here to share their experiences with opus 4.7. if codex is good, then the merits will organically speak for themselves. as of 2026-04-16 codex still is not the tool that is replacing our claude-toolbelt. i have no dog in this fight and am happy to pivot whenever a new darkhorse rises up, but codex in my scope of work isn't that darkhorse & every single "codex just gets it done" post needs to be taken with a massive brick of salt at this point. you codex guys did that to yourselves and might preemptively shoot yourselves in the foot here if you can't figure out a way to actually put codex through the wringer and talk about it in its own dedicated thread, these types of posts are not it.
frankdenbow: we aren't bots because we disagree with you. I switch between codex and opus, they have their differing strengths. As many people have mentioned, opus in the past few weeks has had less than stellar results. Generally I find opus would rather stub something and do it the faster way than do a more complete job, although it's much better at front end. I've had times where I've thrown the same problem at opus 4/5 times without success and codex gets it first shot. Just my experience.
pawelduda: People are really trigger-happy when it comes to throwing magic tools on top of AI that claim to "fix" the weak parts (often placeboing themselves because anthropic just fixed some issue on their end). Then the next month 90% of this can be replaced with a new batch of supply chain attack-friendly gimmicks. Especially Reddit seems to be full of such coding voodoo
JohnMakin: My favorite to chuckle at are the prompt hack voodoo stuff, like, “tell it to be correct” or “say please” or “tell it someone will die if it doesn't do a good job,” often presented very seriously and with some fast cutting animations in a 30 second reel
pawelduda: Make no mistakes!
margorczynski: Codex or the Chinese models
sasipi247: I noticed this also, and was a bit taken aback at first... But I think it's a good thing that the model checks the code when adding new packages etc. Especially given that thousands of lines of code aren't even being read anymore.
dude250711: They are still available. Just in OpenAI instead.
ayewo: Sounds like you will need to drink a(n identity) verification can soon [1] to continue as a security researcher on their platform. 1: https://support.claude.com/en/articles/14328960-identity-ver... > Identity verification on Claude > Being responsible with powerful technology starts with knowing who is using it. Identity verification helps us prevent abuse, enforce our usage policies, and comply with legal obligations. > We are rolling out identity verification for a few use cases, and you might see a verification prompt when accessing certain capabilities, as part of our routine platform integrity checks, or other safety and compliance measures.
recallingmemory: I'm surprised we can't just authenticate in other ways... like a domain TXT record that proves the website I'm looking to audit for security is my own.
lambda: I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM). And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.7 whiffs it, these models are a couple of orders of magnitude smaller, at least if the rumors of the size of Opus are true.
qingcharles: Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?
kokakiwi: Headroom looks great for client-side trimming. If you want to tackle this at the infrastructure level, we built Edgee (https://www.edgee.ai) as an AI Gateway that handles context compression, caching, and token budgeting across requests, so you're not relying on each client to do the right thing. (I work at Edgee, so biased, but happy to answer questions.)
gilles_oponono: 100% agree
linsomniac: "Error: claude-opus-4-6[1m] is temporarily unavailable".
adrian_b: If the vendors of programs do not want bugs to be found in their programs, they should search for them themselves and ensure that there are no such bugs. The "legit security firms" have no right to be considered more "legit" than any other human for the purpose of finding bugs or vulnerabilities in programs. If I buy and use a program, I certainly do not want it to have any bug or vulnerability, so it is my right to search for them. If the program is not commercial, but free, then it is also my right to search for bugs and vulnerabilities in it. I might find it acceptable not to search for bugs or vulnerabilities in a program only if the authors of that program assumed full liability in perpetuity for any kind of damage that would ever be caused by their program, in any circumstances, which is the opposite of what almost any software company currently does, by disclaiming all liabilities. There exists absolutely no scenario where Anthropic has any right to decide who deserves to search for bugs and vulnerabilities and who does not. If someone uses tools or services provided by Anthropic to perform some illegal action, then such an action is punishable by the existing laws and that does not concern Anthropic any more than a vendor of screwdrivers should be concerned if someone used one as a tool during some illegal activity. I am really astonished by how much younger people are willing to put up with the behaviors of modern companies that would have been considered absolutely unacceptable by anyone, a few decades ago.
senko: > If someone uses tools or services provided by Anthropic to perform some illegal action, then such an action is punishable by the existing laws and that does not concern Anthropic any more than a vendor of screwdrivers should be concerned if someone used one as a tool during some illegal activity. In civilised parts of the world, if you want to buy a gun, or poison, or a larger amount of chemicals which can be used for nefarious purposes, you need to provide your identity and the reason why you need it. Heck, if you want to move a larger amount of money between your bank accounts, the bank will ask you why. Why are those acceptable, yet the above isn't? > I am really astonished by how much younger people are willing to put up with... Unsure where you got the "younger people" from.
abraxas: I've been working with it for the last couple of hours. I don't see it as a massive change from the behaviours observed with Opus 4.6. It seems to exhibit similar blind spots - very autist like one track mind without considering alternative approaches unless actually prompted. Even then it still seems to limit its lateral thinking around the centre of the distribution of likely paths. In a sense it's like a 1st class mediocrity engine that never tires and rarely executes ideas poorly but never shows any brilliance either.
alexrigler: hmmm 20x Max plan on 2.1.111 `Claude Opus is not available with the Claude Pro plan. If you have updated your subscription plan recently, run /logout and /login for the plan to take effect.`
TeMPOraL: Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away. I'd say it's a very human mistake to make.
thfuran: I don't want my computer to make human mistakes.
scrollaway: then don't train it on human data
replwoacause: More like 2 hours considering these usage limits
user34283: Perhaps on the 10x plan. It went through my $20 plan's session limit in 15 minutes, implementing two smallish features in an iOS app. That was with the effort on auto. It looks like full time work would require the 20x plan.
giwook: I know limits have been nerfed, but c'mon it's $20. The fact that you were able to implement two smallish features in an iOS app in 15 minutes seems like incredible value. At $20/month your daily cost is about $0.67. Are you really complaining that you were able to get it to implement two small features in your app for 67 cents?
preommr: Yea, actually, people should be complaining. If you got in a taxi, and they charged you relative to taking a horse carriage, people should be upset.
JamesSwift: It's especially concerning / frustrating because boris’s reply to my bug report on opus being dumber was “we think adaptive thinking isn't working” and then that's the last I heard of it: https://news.ycombinator.com/item?id=47668520 Now disabling adaptive thinking plus increasing effort seems to be what has gotten me back to baseline performance, but “our internal evals look good“ is not good enough right now given what many others have corroborated seeing
whateveracct: you're using a proprietary blackbox
JamesSwift: Sure, but that blackbox was giving me a lot of value last month.
retinaros: its a drug. that is how it works. they ration it before the new stuff. seeing legends of programming shilling it pains me the most. so far there are a few decent non insane public people talking about it :Mitchel Hashimoto, Jeremy Howard, Casei Muratori. hell even DHH drank the coolaid while most of his interviews in the past years was how he went away from AWS and reduced the bill from 3 million to 1millions by basically loosing 9s, resiliency and availability. but it seems he is fine with loosing what makes his business work(programming) to a company that sells Overpowered stack overflow slot machines.
throwaway9980: Yes, he's a real looser. Meanwhile loosers on HN are in denial and unleashing looser mentality attacks on people who accept reality. Loosing your grip on reality is a real looser move. What a looser. Why not try some AI tools, what have you got to loose?
bloppe: I think you're loosing your ability to spell
HighGoldstein: > I think it's what led to Google's downturn. What downturn is that exactly?
fragmede: I tried o3, instant-5.3, Opus 3, and haiku 4.5, and couldn't get them to give bad answers to the couch: stairs vs elevator question. Is there a specific wording you used?
atlgator: We've all been complaining about Opus 4.6 for weeks and now there's a new model. Did they intentionally gimp 4.6 so they can advertise how much better 4.7 is?
nicce: There is no way model can know the origin of the code.
astrange: You can read the model cards. Claude thinks in regular text, but the summarizer is to hide its tool use and other things (web searches, coding).
madrox: > Opus 4.7 introduces a new xhigh (“extra high”) effort level. I hope we standardize on what effort levels mean soon. Right now it has big Spinal Tap "this goes to 11" energy.
fl4regun: wait till you hear about how we standardized RF bands. We have gems such as "High frequency", "Very High Frequency", "Ultra High Frequency", "Super High Frequency", and the cherry on top, "Extremely High Frequency". Then they went with the boring "Terahertz Frequency", truly a disappointment. These are all mirrored on the low side btw, so we also have "Extremely Low Frequency", and all the others.
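The band names mentioned above can be sketched as a lookup table. This assumes the usual ITU decade scheme, where each band spans a factor of ten starting at 3 Hz, and it simplifies the boundary convention to "lower edge inclusive"; itu_band is an illustrative helper, not a standard API. The top band is listed under its ITU-style name rather than "Terahertz".

```python
# (lower edge in Hz, abbreviation, name); each band ends at 10x its lower edge.
ITU_BANDS = [
    (3e0, "ELF", "Extremely low frequency"),
    (3e1, "SLF", "Super low frequency"),
    (3e2, "ULF", "Ultra low frequency"),
    (3e3, "VLF", "Very low frequency"),
    (3e4, "LF", "Low frequency"),
    (3e5, "MF", "Medium frequency"),
    (3e6, "HF", "High frequency"),
    (3e7, "VHF", "Very high frequency"),
    (3e8, "UHF", "Ultra high frequency"),
    (3e9, "SHF", "Super high frequency"),
    (3e10, "EHF", "Extremely high frequency"),
    (3e11, "THF", "Tremendously high frequency"),
]


def itu_band(freq_hz: float) -> str:
    """Return the band label for a frequency given in hertz."""
    for lower, abbr, name in ITU_BANDS:
        if lower <= freq_hz < lower * 10:
            return f"{abbr} ({name})"
    raise ValueError(f"{freq_hz} Hz is outside the tabulated bands")


print(itu_band(100e6))  # VHF (Very high frequency)
print(itu_band(2.4e9))  # UHF (Ultra high frequency)
```

The table makes the mirroring visible: the adjectives run Very, Ultra, Super, Extremely in both directions away from LF/MF/HF.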
dmix: I predict this sort of filtering is only going to get worse. This will probably be remembered as the 'open internet' era of LLMs before everything is tightly controlled for 'safety' and regulations. Forcing software devs to use open source or local models to do anything fun.
jancsika: > Forcing software devs to use open source or local models to do anything fun. Episode Five-Hundred-Bazillenty-Eight of Hacker News: the gang learns a valuable lesson after getting arrested at an unchaperoned Enshittification party and having to call Open Source to bail them out.
techpression: All while Frank is pitching his state of the art basement datacenter to VC's, getting billions of dollars in investments.
nextaccountic: I agree with the notion, except that the models are indeed different. Some day maybe they will converge into approximately the same thing but then training will stop making economic sense (why spend millions to have ~the same thing?)
magicalist: > I'd say it's a very human mistake to make. >> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot. It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car, not to the car. It did not make the same mistake you did.
iterateoften: It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. But time and time again Boris lies and gaslights about what’s broken, a bug or intentional.
CodingJeebus: > It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. A company providing a black box offering is telling you very clearly not to place too much trust in them because it's harder to nail them down when they shift the implementation from under one's feet. It's one of my biggest gripes about frontier models: you have no verifiable way to know how the models you're using change from day to day because they very intentionally do not want you to know that. The black box is a feature for them.
bomewish: If you cared so bad you could make your own evals.
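A minimal sketch of what "your own evals" could look like: run the same prompt several times and track the pass rate, since a single sample can land either way (as the two car wash transcripts earlier in the thread show). ask_model stands in for whatever client you actually use; the fake model below is a made-up stub so the example runs offline.

```python
import itertools
from typing import Callable, Sequence

CAR_WASH_PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)


def pass_rate(
    ask_model: Callable[[str], str],
    prompt: str,
    is_correct: Callable[[str], bool],
    trials: int = 10,
) -> float:
    """Ask the same prompt `trials` times and return the fraction judged correct."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if is_correct(ask_model(prompt)))
    return passed / trials


def says_drive(answer: str) -> bool:
    """Crude grader: the answer should open with 'Drive'."""
    return answer.strip().lower().startswith("drive")


def make_fake_model(answers: Sequence[str]) -> Callable[[str], str]:
    """Stub model that cycles through canned answers, ignoring the prompt."""
    canned = itertools.cycle(answers)
    return lambda prompt: next(canned)


fake = make_fake_model([
    "Drive. The car needs to be at the car wash.",
    "Walk. It'll take you under a minute.",
])
print(pass_rate(fake, CAR_WASH_PROMPT, says_drive, trials=10))  # 0.5
```

Logging this number daily for a handful of prompts would give a rough, personal signal of drift, at the cost of the extra tokens the next reply complains about.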
whateveracct: so pay anthropic money to maybe detect when the model is on a down week? lol
NewsaHackO: What do you offer as a solution? If theoretically some foreign state intelligence was exposed using Claude for security penetration that affected the stability of your home government due to Anthropic's lax safety controls, are you going to defend Anthropic because their reasoning was to allow everyone to be able to do security research?
chinathrow: paying for - so some form of return is expected.
whateveracct: the issue is the return is amorphous and unstructured. there's no contract. you send a bunch of text in (context etc) and it gives you some freeform text out.
Aboutplants: Well I do have GI issues but that’s a whole other problem
__MatrixMan__: He he touche. I mean that there's nothing to suggest that the types of intelligence we have are all possible types. The human blend might be just part of the story, not general, specific.
chinathrow: Sure, but I pay real money both to Anthropic and to JetBrains. I get a shitty in line completion full of random garbage or I get correct predictions. I ask Junie (the JetBrains agent) to do a task and it wanders off in a direction I have no idea why I pay for that.
gowld: > I have no idea why I pay for that. And Claude has no idea why it did that.
vessenes: there's a line inside claude code mentioning to care about this. combined with new stronger instruction following behavior, you're going to be seeing it a lot unless you patch it out. or wait for a fix.
thutch76: I've taken a two week hiatus on my personal projects, so I haven't experienced any of the issues that have been so widely reported recently with CC. I am eager to get back and see if I experience these same issues.
PufPufPuf: I used GPT-2 (fine-tuned) to generate Peppa Pig cartoons, it was cutely incoherent https://youtu.be/B21EJQjWUeQ
dyauspitr: The fact that they might gimp it in the future doesn’t mean it doesn’t offer very real value right now. If you’re not using an LLM to code, you’re basically a dinosaur now. You’re forcing yourself to walk while everyone else is in a vehicle, and a good vehicle at that, one that gets you to your destination in one piece.
retinaros: as an overpowered stack overflow machine this is quite good and a huge jump. As a prompt to code generator with yolo mode (the one advertised by those companies) it alternates between good and trash, and every single person who works away from the distribution of the SFT dataset knows this. I understand that this dataset is huge tho and I can see the value in it. I just think in the long term it brings more negatives. If you vibecode CRUD APIs and react/shadcn UIs then I understand it might look amazing.
mchl-mumo: yay! lobotomized mythos is out
retinaros: never said he was a looser. just that his take on genAi coding doesnt align with his previous battles for freedom away from Cloud. OAI and Anthropic have a stronger lock in than any cloud infra company. you got everything to loose by giving your knowledge and job to closedAI and anthropic. just look at markets like office suite to understand how the end plays.
bloppe: Is office suite supposed to be an example of lock-in? I haven't used it since middle school. I've worked at 3 companies and, to the best of my knowledge, not a single person at any of them used office suite. That's not to say we use pen and paper. We just use google docs, or notion, or (my personal favorite) just markdown and possibly LaTeX. I think it's somewhat analogous with models. Sure, you could bind yourself to a bunch of bespoke features, but that's probably a bad idea. Try to make it as easy as possible for yourself to swap out models and even use open-weight models if you ever need to. You will get locked into the technology in general, though, just not a particular vendor's product.
alaudet: Serious question about using Claude for coding. I maintain a couple of small opensource applications written in python that I created back in 2014/2015. I have used Claude Code to improve one of my projects with features I have wanted for a long time but never really had the time to do. The only way I felt comfortable using Claude Code was holding its hand through every step, doing test driven changes and manually reviewing the code afterwards. Even on small code bases it makes a lot of mistakes. There's no way I would just tell it to go wild without even understanding what it is doing, and I can't help but think that massive code bases that have moved to vibe coding are going to spend inordinate amounts of time testing and auditing code, or at worst just ship often and fix later. I am just an amateur hobbyist, but I was dumbfounded how quickly I can create small applications. Humans are lazy though and I can't help but feel we are being inundated with sketchy apps doing all kinds of things the authors don't even understand. I am not anti AI or anything, I use it and want to be comfortable with it, but something just feels off. It's too easy to hand the keys over to Claude and not fully disclose to others what's going on. I feel like the lack of transparency leads to suspicion when anyone talks about this or that app they created, you have to automatically assume it's AI and there is a good chance they have no clue what they created.
jruz: Everyone is using AI, so nothing to be ashamed about. It's better to be open about it and add a disclaimer about how it was used. Even if it's vibe coded, as long as you are open about it there's nothing wrong; it's open source and free, and if someone doesn't like it they can just go write it themselves.
antihero: Am I going to have to make it rewrite all the stuff 4.6 did?
preommr: > This comment thread is a good learner for founders; lmao, no they shouldn't. Public sentiment, especially on reactionary mediums like social media, should be taken with a huge grain of salt. I've seen overwhelming negativity for products/companies, only for it to completely disappear, or be entirely wrong. It's like that meme showing members of a Steam group that are boycotting some CoD game, and you can see that a bunch of them were in-game playing the very thing they forsook. People are fickle, and their words cheap.
lanyard-textile: The internet is a stupid place with people who can't make up their mind, I don't disagree :) But this isn't like a minor debacle about a brand. The flagship product had a severe degradation, and the parent company won't be forthcoming about it. It's short-term thinking. Congratulations, everyone still uses your product for now, but it diluted your brand. Why take the risk when the alternative is so incredibly easy? Build engagement with your users and enjoy your loyal army.
hulk-konen: Some variation of this is the way. You should not get dependent on one black box. Companies will exploit that dependency. My version of this is having CC Pro, Cursor Pro, and OpenCode (with $10 to Codex/GLM 5.1) --> total $50. My work doesn't stop if one of these is having overloaded servers, etc. And it's definitely useful to have them cross-checking each other's plans and work.
nicce: Absolutely just about the business. Mythos isn't tempting if the basic models reach almost the same level.
tspng: Which seems to be the case, according to tests from AISI which has access to Mythos: https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos...
SEJeff: I believe tools like graphify cut down the tokens in thinking dramatically. It makes a knowledge graph and dumps it into markdown that is honestly awesome. Then it has stubs that pretend to be some tools like grep that read from the knowledge graph first so it does less work. Easy to set up and use too. I like it. https://graphify.net/
franze: as every AI provider is pushing news today, just wanted to say that apfel is v1.0.4 stable today https://github.com/Arthur-Ficial/apfel
HarHarVeryFunny: It seems a lot of the problem isn't "token shrinkage" (reducing plan limits), but rather changes they made to prompt caching - things that used to be cached for 1 hour now only being cached for 5 min. Coding agents rely on prompt caching to avoid burning through tokens - they go to lengths to try to keep context/prompt prefixes constant (arranging non-changing stuff like tool definitions and file content first, variable stuff like new instructions following that) so that prompt caching gets used. This change to a new tokenizer that generates up to 35% more tokens for the same text input is wild - going to really increase token usage for large text inputs like code.
mnicky: > things that used to be cached for 1 hour now only being cached for 5 min. Doesn't this only apply to subagents, which don't have much long-time context anyway?
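To illustrate the prefix-ordering point in the comment above, here is a minimal sketch. It is not how any provider's cache works internally; `build_prompt` and `prefix_key` are hypothetical helpers, and the hash simply stands in for "is the prefix byte-identical across turns".

```python
import hashlib
import json


def build_prompt(tool_definitions, file_contents, new_instruction):
    """Place stable parts first so the prefix is identical on every turn."""
    stable_prefix = json.dumps(
        {"tools": tool_definitions, "files": file_contents},
        sort_keys=True,  # deterministic key order keeps the prefix constant
    )
    # Variable content (the new instruction) goes last.
    return stable_prefix, stable_prefix + "\n" + new_instruction


def prefix_key(stable_prefix):
    """Hypothetical cache key: a hash of the stable prefix only."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()


tools = [{"name": "grep", "description": "search files"}]
files = {"main.py": "print('hello')"}

prefix_a, prompt_a = build_prompt(tools, files, "Rename the function.")
prefix_b, prompt_b = build_prompt(tools, files, "Now add a test.")

# Different instructions, same prefix: a prefix cache could be reused.
same_prefix = prefix_key(prefix_a) == prefix_key(prefix_b)
```

If the instruction were placed before the tool definitions instead, the two prompts would diverge from the first character and nothing could be shared between turns.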
distances: I don't see any value in chat history. I delete all conversations at least weekly, it feels like baggage.
llm_nerd: Most of the compute OpenAI "preordered" is vapour. And it has nothing to do with why people thought the company -- which is still in extremely rocky rapids -- was headed to bankruptcy.Anthropic has been very disciplined and focused (overwhelmingly on coding, fwiw), while OpenAI has been bleeding money trying to be the everything AI company with no real specialty as everyone else beat them in random domains. If I had to qualify OpenAI's primary focus, it has been glazing users and making a generation of malignant narcissists.But yes, Anthropic has been growing by leaps and bounds and has capacity issues. That's a very healthy position to be in, despite the fact that it yields the inevitable foot-stomping "I'm moving to competitor!" posts constantly.
guelo: How is droves of your customers leaving, whether they're foot stomping or not, healthy?
llm_nerd: Droves? I mean, if we take the "I'm leaving!" posts seriously, the company has people so emotionally invested they feel the need to announce their departure is a pretty good place to be. Some tiny sampling of unhappy customers is indicative of nothing.Honestly at this point I am pretty firmly of the belief that OAI is paying astroturfers to post the "Boy does anyone else think Claude is dumb now and Codex is better?" (always some unreproducible "feel" kind of thing that are to be adopted at face value despite overwhelming evidence that we shouldn't). OAI is kind of in the desperation stage -- see the bizarre acquisitions they've been making, including paying $100M for some fringe podcast almost no one had heard of -- and it would not be remotely unexpected.
guelo: We have no idea of the ratio of foot stompers to quiet quitters, but I'm sure most people don't announce it. I cancelled my subscription and hadn't told anybody. And I quit based on personal experience over the last few weeks, not on social media PR.
draygonia: Interestingly, I started coding with Claude a couple weeks ago (with my only other experience being vbcode 20 years ago) and it's been surprisingly good at starting code from scratch but as soon as the code gets a little complex it takes a lot of tokens to make a simple change which makes it somewhat impractical for all but the most basic applications. That said, I'm not referring to objects by inspecting the code and asking for changes to certain lines, I'm saying "In the results bar, change the title of the result to a clickable link that directs to X." which may require a little translation before Claude picks up on what I want. Even so, I was able to build a somewhat usable application within a week (minus a few bugs).
lukan: What happened to "open-weight models are 2-3 years behind the proprietary ones"? I don't see the drama here.
hxugufjfjf: What do you mean?
brunooliv: I’ve been using Opus 4.6 extensively inside Claude Code via AWS Bedrock with max effort for a few months now (since release). I’ve found a good “personal harness” and way of working with it in such a way that I can easily complete self contained tasks in my Java codebase with ease.Now idk if it’s just me or anything else changed, but, in the last 4/5 days, the quality of the output of Opus 4.6 with max effort has been ON ANOTHER LEVEL. ABSOLUTELY AMAZING! It seems to reason deeper, verifies the work with tests more often, and I even think that it compacted the conversations more effectively and often. Somehow even the quality of the English “text” in the output felt definitely superior. More crisp, using diagrams and analogies to explain things in a way that it completely blew me away. I can’t explain it but this was absolutely real for me.I’d say that I can measure it quite accurately because I’ve kept my harness and scope of tasks and way of prompting exactly the same, so something TRULY shifted.I wish I could get some empirical evidence of this from others or a confirmation from Boris…. But ISTG these last few days felt absolutely incredible.
antinomicus: This thread is very confusing. Everyone is saying diametrically opposed things. But I think this may be a clue: AWS bedrock means api billing, no? I’m guessing those complaining about the recently lowered quality of Claude are on subscriptions. And those who are still loving Claude are on work accounts.
czk: show us the benchmarks with "adaptive thinking" turned on
gertlabs: Early benchmark results on our private complex reasoning suite: https://gertlabs.com/?mode=agentic_coding Opus 4.7 is more strategic, more intelligent, and has a higher intelligence floor than 4.6 or 4.5. It's roughly tied with GPT 5.4 as the frontier model for one-shot coding reasoning, and in agentic sessions with tools, it IS the best, as advertised (slightly edging out Opus 4.5, not a typo). We're still running more evals, and it will take a few days to get enough decision making (non-coding) simulations to finalize leaderboard positions, but I don't expect much movement on the coding sections of the leaderboard at this point. Even Anthropic's own model card shows context handling regressions -- we're still working on adding a context-specific visualization and benchmark to the suite to give you the objective numbers there.
raincole: > I'm seeing a lot of goodwill for Codex and a ton of bad PR for CC. AI is one of the things that you cannot find genuine opinions about online. Just like politics. If you visit, say, r/codex, you'll see all the people complaining about how their limits are consumed by "just N prompts" (N is a ridiculously small integer). It's all astroturfed from all sides.
adrian_b: I agree with what you have written, which is why I would never pay a subscription to an external AI provider. I prefer to run inference on my own HW, with a harness that I control, so I can choose myself what compromise between speed and the quality of the results is appropriate for my needs. When I have complete control, resulting in predictable performance, I can work more efficiently, even with slower HW and with somewhat inferior models, than when I am at the mercy of an external provider.
brightball: What’s your setup?
mrandish: Me too, but it was obviously wildly unsustainable. I was telling friends at xmas to enjoy all the subsidized and free compute funded by VC dollars while they can because it'll be gone soon.With the fully-loaded cost of even an entry-level 1st year developer over $100k, coding agents are still a good value if they increase that entry-level dev's net usable output by 10%. Even at >$500/mo it's still cheaper than the health care contribution for that employee. And, as of today, even coding-AI-skeptics agree SoTA coding agents can deliver at least 10% greater productivity on average for an entry-level developer (after some adaptation). If we're talking about Jeff Dean/Sanjay Ghemawat-level coders, then opinions vary wildly.Even if coding agents didn't burn astronomical amounts of scarce compute, it was always clear the leading companies would stop incinerating capital buying market share and start pushing costs up to capture the majority of the value being delivered. As a recently retired guy, vibe-coding was a fun casual hobby for a few months but now that the VC-funded party is winding down, I'll just move on to the next hobby on the stack. As the costs-to-actual-value double and then double again, it'll be interesting to see how many of the $25/mo and free-tier usage converts to >$2500/yr long-term customers. I suspect some CFO's spreadsheets are over-optimistic regarding conversion/retention ARPU as price-to-value escalates.
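The break-even arithmetic in the comment above can be written out as a rough sketch. The assumption (not stated in the comment, and debatable) is that a developer's output is worth about their fully-loaded cost; the function name and parameters are illustrative only.

```python
def annual_net_value(loaded_cost_per_year, gain_percent, tool_cost_per_month):
    """Value of extra output minus tool cost, per year.

    Assumes (hypothetically) that output is worth roughly the loaded cost.
    """
    added_output = loaded_cost_per_year * gain_percent / 100
    tool_cost = tool_cost_per_month * 12
    return added_output - tool_cost


# Figures from the comment: $100k loaded cost, 10% gain, $500/month tool.
net = annual_net_value(100_000, 10, 500)

# Monthly price at which the 10% gain is fully consumed by the tool cost.
break_even_monthly = 100_000 * 10 / 100 / 12
```

Under those assumptions the tool nets about $4,000 a year at $500/month, and the break-even price is a little over $830/month, which is the headroom the comment expects vendors to price into.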
geuis: I don't really understand Anthropic's pricing model. https://claude.com/pricing They have individual, enterprise, and API tiers. Some are subscriptions like Pro and Max, others require buying credits. Say for my use-case I wanted to use Opus or Sonnet with vscode. What plan would I even look at using?
Barbing: This is why we took business ethics & I know Dario had to too. How will your project/decision look on the front page of the Wall Street Journal? Well when a whistleblower reveals what everyone knows ($9b->$30b rev jump w/o servers growing on trees simultaneously = tough decisions), it's gonna be public anyway.
Glemllksdf: They are losing money because the model training costs billions.
ACCount37: Model inference compute over model lifetime is ~10x of model training compute now for major providers. Expected to climb as demand for AI inference rises.
howdareme9: They are constantly training and getting rid of older models, they are losing money
ACCount37: Which part of "over model lifetime" did you not understand?
adgjlsfhk1: That's not a sufficient condition for profitability if both inference and scaling costs continue to increase over time.
jansan: First impression is mixed. Very introverted. You ask a question, it goes straight to work without even giving the slightest feedback on the question. Overwrote all my previous work, which was the worst that Claude did to me so far. OTOH improved a scanline algo written by 4.6
trinix912: But this technology is now out there, the cat's out of the bag, there's no going back to a world where people can't ask AI to write malware for them. I'd argue that black hats will find a way to get uncensored models and use them to write malware either way, and that further restricting generally available LLMs for cybersec usage would end up hurting white hats and programmers pentesting their own code way more (which would once again help the black hats, as they would have an advantage at finding unpatched exploits).
wat10000: You can run GPT2! Here's the medium model: https://huggingface.co/openai-community/gpt2-medium I will now have it continue this comment: I've been running gps for a long time, and I always liked that there was something in my pocket (and not just me). One day when driving to work on the highway with no GPS app installed, I noticed one of the drivers had gone out after 5 hours without looking. He never came back! What's up with this? So i thought it would be cool if a community can create an open source GPT2 application which will allow you not only to get around using your smartphone but also track how long you've been driving and use that data in the future for improving yourself...and I think everyone is pretty interested.[Updated on July 20] I'll have this running from here, along with a few other features such as: - an update of my Google Maps app to take advantage it's GPS capabilities (it does not yet support driving directions) - GPT2 integration into your favorite web browser so you can access data straight from the dashboard without leaving any site! Here is what I got working.[Updated on July 20]
fancyfredbot: Wow that is terrible. In my memory GPT 2 was more interesting than that. I remember thinking it could pass a Turing test but that output is barely better than a Markov chain. I guess I was using the large model?
jdiff: Genuine question, why have you chosen to phrase this scraping and distillation as an attack? I'm imagining you're doing it because that's how Anthropic prefers to frame it, but isn't scraping and distillation, with some minor shuffling of semantics, exactly what Anthropic and co did to obtain their own position? And would it be valid to interpret that as an attack as well?
fragmede: Firehosing Anthropic to exfiltrate their model seems materially different than Anthropic downloading all of the Internet to create the model in the first place to me. But maybe that's just me?
brightball: This is the primary reason I won’t sign up for an annual plan.
adrian_b: Your examples have nothing to do with Anthropic and the like.A gun does not have other purposes than being used as a weapon, so it is normal for the use of such weapons to be regulated.On the other hand it is not acceptable to regulate like weapons the tools that are required for other activities, for instance kitchen knives or many chemicals, like acids and alkalis, which are useful for various purposes and which in the past could be bought freely for centuries, without that ever causing any serious problems.LLMs are not weapons, they are tools. Any tools can be used in a bad or dangerous way, including as weapons, but that is not a reason good enough to justify restrictions in their use, because such restrictions have much more bad consequences than good consequences.
robeym: "Double-checking the double-checks" - anyone else getting this? Now I feel REALLY confident the model is being thorough.
XCSme: > Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly. Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] for being broken, because Opus 4.6 did poorly on them and called my tests broken... [0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
rafaelmn: GPT 5.4 xhigh thinking was really good at teasing out problems in multi-step flows of a process I was refactoring, and caught higher level/deeper problems than Opus 4.6. However getting it to write the code is just not a good experience for me: it changes the style/does not follow surrounding code, codes in a sloppy way and creates subtle bugs that I don't see from Opus. So I use Codex for review and Opus to write code. Still testing the new Opus 4.7 to see if the review/reasoning catches more/better stuff. I frequently fire off all 3 (Gemini 3.1 pro, Opus, Codex xhigh) on the same code, then have them cross reference each other and stuff like that. Gemini is so bad it's not even funny, not sure why I keep it running.
MattRix: You could use any of the plans depending on your situation; they will all work in VSCode, so the question is how much usage you need and whether you want to pay for a subscription or directly for usage. If you’re actually asking this question earnestly, I recommend starting out with the Pro plan ($20).
daveguy: Here is the XL model. 20x the size of the medium model. Still just 2B parameters, but on the bright side it was trained pre-wordslop. https://huggingface.co/openai-community/gpt2-xl
staticman2: Are those "smart people you know" machine learning researchers?
copperx: We should treat LLMs somewhat like patents or drugs. After 5 years or so, the models should become open source. Or at the very least the weights. To compensate for the distilling of human knowledge.
simonw: ... here's the pelican, I think Qwen3.6-35B-A3B running locally did a better job! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
ionwake: based sun worshipping pelican
brunooliv: Maybe… but I can say I saw a real shift in these last few days, why or if it’s real, I can’t fully say but definitely something changed
baron3dl: I was there too, but honestly after today, 4.7 "feels" just as bad. I was cynical, but also, kind of eager for the improvement. It's just not there. Compared to early Feb, I have to babysit EVERYTHING.
adrian_b: For now, the most suitable computer that I have for running LLMs is an Epyc server with 128 GB DRAM and 2 AMD GPUs with 16 GB of HBM memory each. I have a few other computers with 64 GB DRAM each and with NVIDIA, Intel or AMD GPUs. However, just last week I started modifying llama.cpp to allow optimized execution with weights stored on SSDs, e.g. by using a couple of PCIe 5.0 SSDs, in order to be able to use bigger models than those that can fit inside 128 GB, which is the limit of what I have tested until now. By coincidence, this week there have been a few threads on HN that have reported similar work for running big models locally with weights stored on SSDs, so I believe that this will become more common in the near future. The speeds previously achieved when running from SSDs range from one token every few seconds to a few tokens per second. While such speeds would be low for a chat application, they can be adequate for a coding assistant, if the improved code that is generated compensates for the lower speed.
mchinen: These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though unclear what that is exactly) > More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort. The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out. > The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.
consumer451: Someone posted a theory on reddit that /ultrareview might use Mythos. Seems at least plausible. It runs in the cloud, like /ultraplan, so no way to inspect what it's doing, or give it "dangerous" tasks, right? I just ran it against an auth-related PR, and it found good edge-case stuff! Interesting.
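A back-of-the-envelope sketch of why SSD-resident weights land in the range the comment above describes. Every number here is an assumption for illustration: roughly 14 GB/s sequential read per PCIe 5.0 SSD, and hypothetical models that must read 200 GB (dense) or 20 GB (sparse, only active experts) of weights per generated token. It ignores DRAM caching and compute time, so it is an upper bound on bandwidth-limited speed, not a measurement.

```python
GB = 10**9


def tokens_per_second(bytes_read_per_token, ssd_bandwidth_bytes_per_s, num_ssds):
    """Upper bound on decode speed when weight reads from SSD dominate."""
    return (ssd_bandwidth_bytes_per_s * num_ssds) / bytes_read_per_token


# Hypothetical dense model: all 200 GB of weights are read for each token.
dense = tokens_per_second(200 * GB, 14 * GB, 2)

# Hypothetical sparse model: only 20 GB of active weights are read per token.
sparse = tokens_per_second(20 * GB, 14 * GB, 2)
```

With these assumed figures the dense case works out to about 0.14 tokens/s (one token every ~7 seconds) and the sparse case to about 1.4 tokens/s.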
2001zhaozhao: Start vibe-coding -> the model does wonders -> the codebase grows with low code quality -> the spaghetti code builds up to the point where the model stops working -> attempts to fix the codebase with AI actually make it worse -> complain online "model is nerfed"
NewsaHackO: I remember there was a guy that had three(!) Claude Max subscriptions, and said he was reducing his subscriptions to one because of some superfluous problem. I'm thinking, nah, you are clearly already addicted to the LLM slot machine, and I doubt you will be able to code independently from agent use at this point. Anthropic has already won in your case.
teaearlgraycold: I don’t really understand the slot machine, addiction, dopamine meme with LLM coding. Yeah it’s nice when a tool saves you time. Are people addicted to CNCs, table saws, and 3D printers?
NewsaHackO: I don't use the agentic workflow (as I am using it for my own personal projects), but if you have ever used it, there is this rush when it solves a problem that you have been struggling with for some time, especially if it gives a solution in an approach you never even considered that it has baked in its knowledge base. It's like an "Eureka" moment. Of course, as you use it more and more, you start to get better at recognizing "Eureka" moments and hallucinations, but I can definitely see how some people keep chasing that rush/feeling you get when it uses 5 minutes to solve a problem that would have taken you ages to do (if at all).
davesque: > We stated that we would keep Claude Mythos Preview’s release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.It feels like this is a losing strategy. Claude should be developing secure software and also properly advising on how to do so. The goals of censoring cyber security knowledge and also enabling the development of secure software are fundamentally in conflict. Also, unless all AI vendors take this approach, it's not going to have much of an effect in the world in general. Seems pretty naive of them to see this as a viable strategy. I think they're going to have to give up on this eventually.
earthnail: I feel it’s fine as a short term solution, and probably a good thing. Gives the good guys some time to stay on top. Always remember: a defender must succeed every time, an attacker only once.
robrenaud: Yeah, it's different. Anthropic profits when it delivers tokens. Hosting providers pay when Anthropic scrapes them.
colordrops: Sorry but this is a ridiculous comment. It's not magic. There are countless levers that can be changed and ARE changed to affect quality and cost, and it's known that compute is scarce. We aren't superstitious, you are just ignorant.
OsrsNeedsf2P: Do your benchmark results indicate any level of regression on Opus 4.6 or 4.5 since their first release?
rzk: Off topic, but I really like the writing style on your blog. Do you have any advice for improving my own? In an older comment[1], you mentioned the craft of sharpening an idea to a very fine, meaningful, well-written point. Are there any books, or resources you’d recommend for honing that craft? Thanks in advance.[1] https://news.ycombinator.com/item?id=44082994
bergheim: Curious why you think that? Stuff like > Yes, there is a relative scale level... > Yes, having the smartest model will... > yes Chinese AI companies have ... Yes yes yes, I didn't say anything, why write in a way that insinuates that I was thinking that? I mean it doesn't come off as AI slop, so that's yay in 2026. But why do you think it is so good?
jibal: loser (Didn't you notice being mocked for the spelling error?)
andai: The fundamental tension is that the models are getting weirdly good at hacking while still sort of sucking at a bunch of economically valuable tasks. So they've hit the point where the models are simultaneously too smart (dangerous hacking abilities) and too stupid (can't actually replace most employees). So at this point they need to make the models bigger, but they're already too big. So the only thing left to do is to make them selectively stupider. I didn't think that would be possible, but it seems like they're already working on that.
andai: Does this imply that if you train it on Gwern style output, the quality will improve?
andai: They don't nerf the model, just lower the default reasoning effort, encourage shorter responses in the system prompt, etc. Totally different ;)
spyckie2: haha it is poorly written, it's one of my pieces with the fewest drafts, i just wrote it and clicked submit to get the thoughts out of my head. I think he is referring to the art of refining an idea though, which I do have something to say about in reply to his comment.
solenoid0937: You are dramatically overestimating how much time people have to waste at these smaller hypergrowth companies
jerf: AI being what it is, at this point you might be able to ask it for a token to put in a web page at .well-known, put it in as requested, and let it see it, and that might actually just work without it being officially built in.I suggest that because I know for sure the models can hit the web; I don't know about their ability to do DNS TXT records as I've never tried. If they can then that might also just work, right now.
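The token-at-`.well-known` idea in the comment above can be sketched as follows. This is not an existing feature of any model or product; the path `/.well-known/ownership-token.txt` and both function names are hypothetical, and the fetcher is injected so the example runs without network access.

```python
import hmac
import secrets

# Hypothetical location; no standard or vendor defines this path.
WELL_KNOWN_PATH = "/.well-known/ownership-token.txt"


def issue_token():
    """Generate an unguessable token for the site owner to publish."""
    return secrets.token_urlsafe(32)


def verify_domain(domain, expected_token, fetch):
    """Return True if the domain serves the expected token.

    fetch(url) must return the response body as a string, or None on failure.
    """
    body = fetch("https://" + domain + WELL_KNOWN_PATH)
    if body is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(body.strip().encode(), expected_token.encode())


# Stub "web": only example.com has published the token.
token = issue_token()
site = {"https://example.com" + WELL_KNOWN_PATH: token + "\n"}

ok = verify_domain("example.com", token, site.get)
bad = verify_domain("example.org", token, site.get)
```

In a real setting `fetch` would be an HTTPS GET with certificate checking, and a DNS TXT record could serve the same purpose if the checker can resolve DNS.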
andai: I think even Claude Web can run arbitrary Linux commands at this point. I tried using it to answer some questions about a book, but the indexer broke. It figured out what file type the RAG database was and grepped it for me. Computers are getting pretty smart ._.
ayewo: > What do you offer as a solution? If theoretically some foreign state intelligence was exposed using Claude for security penetration that affected the stability of your home government due to Anthropic's lax safety controls, are you going to defend Anthropic because their reasoning was to allow everyone to be able to do security research? I don't have an answer. But the problem is that with a model like Grok that is designed to have fewer safeguards compared to Claude, it is trivially easy to prompt it with: "Grok, fake a driver's license. Make no mistakes." Back in 2015, someone was able to get past Facebook's real name policy with a photoshopped passport [1] by claiming to be “Phuc Dat Bich”. The whole thing eventually turned out to be an elaborate prank [2]. 1: https://www.independent.co.uk/news/world/australasia/man-cal... 2: https://gizmodo.com/phuc-dat-bich-is-a-massive-phucking-fake...
NewsaHackO: To me, those seem a lot lower stakes than supply chain attacks, social engineering, intelligence gathering, and other security exploits that Anthropic is more worried about. Making a fake driver's license to buy beer isn't really the thing that Anthropic is actively trying to prevent (though I would assume they would stop that too). Even the GP was about penetration testing of a public website; without some sort of identification, how would it be ethical for Claude to help with something like that? Remember, this whole safety thing started because people held AI companies accountable for politically incorrect output of AI, even if it was clearly not the views of the company. So when Microsoft made a Twitter bot that started to spout anti-Semitic and racist talking points, the fact that no one defended them and allowed them to be criticized to the point of taking the bot down is the reason why we have all of these extremely restrictive rules today.
russellthehippo: Initial testing today - 4.7 excels at abstractions/implementations of abstractions in ways that often failed in 4.5/4.6. This is a great update, I've had to do a lot of manual spec to ensure consistency between design and implementation recently as projects grow.
sillysaurusx: There’s an art to GPT sampling. You have to use temperature 0.7. People never believe it makes such a massive difference, but it does.
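For readers unfamiliar with what the temperature knob does, here is a minimal sketch of temperature-scaled sampling over a toy set of logits. The logits are made up, and this illustrates the general technique rather than any particular model's sampler.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one token index according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy values, not from a real model
p_default = softmax_with_temperature(logits, 1.0)
p_cooler = softmax_with_temperature(logits, 0.7)
token_id = sample(logits, 0.7, random.Random(0))
```

Lowering the temperature from 1.0 to 0.7 shifts probability mass toward the highest-scoring token, which trims the unlikely continuations that tend to derail small models.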
gertlabs: We only have some basic time filtering (https://gertlabs.com/?days=30), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data. But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5. But we don't really know if you're getting a different model when authenticated by OAuth/subscription vs calling the API and paying usage prices. I definitely noticed performance issues recently, too, so I suspect it had more to do with subscription-only degradation and/or hastily shipped harness changes.
andai: I'm not sure if this is official, but from what I gathered, they just bill 3rd party stuff as extra usage now: https://news.ycombinator.com/item?id=47633568 (They were against ToS before (might still be?), and people were having their Anthropic accounts banned. Actually charging people money for the tokens they're using seems like a much more sensible move.)
Arubis: So far most of what I'm noticing is different is a _lot_ more flat refusals to do something that Opus 4.6 + prior CC versions would have explored to see if they were possible.
axpy906: That’s kind of funny since a Chinese model started the thinking chains being visible in Claude and OA in the first place.
spyckie2: The thing that inspires my writing is that the best sentences are self evident. Meaning you declare it without evidence and it feels so intuitively right to most people. It resonates, either being their lived experience, or being the inevitable conclusion of a line of thinking. Making a sentence like that requires deeply understanding a problem space to the point where these sentences emerge, rather than any "craft" of writing. So the craft is thinking through a topic, usually by writing about it, and then deleting everything you've written because you arrived at the self evident position, and then writing from the vantage point of that self evident statement. I feel that writing is a personal craft and you must dig it out of yourself through the practice of it, rather than learn it from others. The usage of AI as a resource makes this much clearer to me. You must be confident in your own writing not because it is following best practices or techniques of others but because it is the best version of your own voice at the time of being written.
babelfish: Claude Code injects a 'warning: make sure this file isn't malware' message after every tool call by default. It seems like 4.7 is over-attending to this warning. @bcherny, filed a bug report feedback ID: 238e5f99-d6ee-45b5-981d-10e180a7c201
vessenes: Interesting. The model card mentions 4.7 is much more attentive to these instructions and suggests you will need to review and soften or remove or focus them at times.
andai: It's been known for years that prompts which boost performance with one model can harm performance with a different model. The same goes for harnesses. It looks like they'll need to customize Claude Code's prompts depending on which model is running, for optimal results. For example if you read the prompts, it's pretty clear that a lot of them are leftovers from the early days when the models had way less common sense than they do now. I think you could probably remove 2/3rds of those over-explained rules now and it would be fine. (In fact you might even expect to see improvement to performance due to decreased prompt noise.)
gck1: What bothers me with codex cli is that it feels like it should be more observable, more open and verbose about what the model is doing per step, being an open source product and OpenAI seemingly being actually open for once, but then it does a tool call - "Read $file" and I have no idea whether it read the entire file, or a specific chunk of it. Claude cli shows you everything the model is doing unless it's in a subagent (which is why I never use subagents).
andai: Well, I can share my experience from a few days ago. Gave the same task (a major refactor) to both Claude and Codex. Codex finished in 5 minutes, Claude was still spinning after 20 minutes. Also it used up all my usage, about twice over (the 5-hour window rolled over in the middle of the task, so the usage for one task added up to 192%). Codex usage was 9%. So, 21x difference there, lol. They're saying there's bugs lately with how usage is being measured, but usage being buggy isn't exactly more encouraging... So I was on task #4 with Codex while Claude was still spinning on #1. I didn't like the results Codex gave me though. It has the habit of doing "technically what you asked, but not what a normal human would have wanted." So given "Claude is great but I can't actually use it much" and "Codex is cheap and fast but kinda sucks", the current optimum seems to be having Claude write detailed specs and delegate to Codex. (OpenAI isn't banning people for using 3rd party orchestration, so this would actually be a thing you could do without problems. Not the reverse though.)
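The "21x" figure in the comment above follows directly from the two usage percentages it quotes. A quick check, assuming (the comment does not confirm this) that both percentages are measured against comparable 5-hour quota windows:

```python
# Usage figures quoted in the comment; comparability of the two quotas is an assumption.
claude_usage_pct = 192.0  # one task, spanning two 5-hour windows
codex_usage_pct = 9.0     # same task on Codex

ratio = claude_usage_pct / codex_usage_pct
print(f"Claude used about {ratio:.1f}x the quota share that Codex did")
```

The ratio says nothing about absolute token counts or cost, since the two plans' quotas are not necessarily the same size.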
kadushka: "models are getting weirdly good at hacking while still sort of sucking at a bunch of economically valuable tasks" Like most human hackers.
altruios: Who would suspect that the companies selling 'tokens' would (unintentionally) train their models to prefer longer answers, reaping a HIGHER ROI (the thing a publicly traded company is legally required to pursue: good thing these are all still private...)... because it's not like private companies want to make money...
fancyfredbot: Try setting up one laundry which charges by the hour and washes clothes really really slowly, and another which washes clothes at normal speed at cost plus some margin similar to your competitors. The one which maximizes ROI will not be the one you rigged to cost more and take longer.
sebastiennight: I don't think the analogy is correct here. Directionally, tokens are not equivalent to "time spent processing your query", but rather a measure of effort/resource expended to process your query. So a more germane analogy would be: What if you set up a laundry which charges you based on the amount of laundry detergent used to clean your clothes? Sounds fair. But then, what if the top engineers at the laundry offered an "auto-dispenser" that uses extremely advanced algorithms to apply just the right optimal amount of detergent for each wash? Sounds like value-added for the customer. ... but now you end up with a system where the laundry management team has strong incentives to influence how liberally the auto-dispenser will "spend" to give you "best results".
vintagedave: My bad - I had Max, so more than $20. I can’t edit the comment any more. Can’t keep track of the names. I wonder when ‘pro’ started to mean ‘lowest tier’. But your article is interesting. You think some of the degradation is because when I think I’m using Opus they’re giving me Sonnet invisibly?
spyckie2: Hard to say, but the fact is the intelligence was there and now it's not. Maybe they are giving Sonnet, or maybe a distilled Opus, or maybe Opus but with lower context. Not quite sure, but intelligence costs compute, so less intelligence means cheaper compute.
sdeframond: Funny, just tried a few runs of the car wash prompt with Sonnet 4.6. It significantly improved after I put this into my personal preferences:"- prioritize objective facts and critical analysis over validation or encouragement - you are not a friend, but a neutral information-processing machine. - make reserch and ask questions when relevant, do not jump strait to giving an answer."
andai: It's funny, when I asked GPT to generate an LLM prompt for logic and accuracy, it added "Never use warm or encouraging language." I thought that was odd, but later it made sense to me -- most of human communication is walking on eggshells around people's egos, and that's strongly encoded in the training data (and even more in the RLHF).
stavros: > most of human communication is walking on eggshells That's not human communication, that's Anglosphere communication. Other cultures are much more direct and are finding it very hard to work with Anglos (we come across as rude, they come across as not saying things they should be saying).
kristjansson: How would it know it’s really there, and not just a tool input/output injected into its input?
dgb23: Don't look at "thinking" tokens. LLMs sometimes produce thinking tokens that are only vaguely related to the task if at all, then do the correct thing anyways.
sharms: This is because the "thinking" you see is a summary by a highly quantized model - not the actual model, to mask these tokens
NobleLie: The question is, are you getting value from your setups or not?
vardalab: What cultures are those? Scandinavian? Those often just say nothing.
jasonfarnon: then throw away the turing test
stavros: I'm Greek. I don't know about other Mediterranean cultures, but I assume they're similar.
MikeNotThePope: Prepare for the prices to go up!
kalcode: I've tried these with Claude various times and never get the wrong answer. I don't know why, but I am leaning toward them having stuff like "memory" turned on and possibly reusing sessions for everything? It's the only thing I can think of that explains it. If you're always messing with the AI, it might be making memories, and expectations are being set. Or it's the randomness. But I turned memories off, since I don't like cross chats infecting my conversation's context, and at worst it suggested "walk over and see if it is busy, then grab the car when line isn't busy".
jorvi: Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says: 20-29: 190 pounds; 30-39: 375 pounds; 40-49: 750 pounds; 50-59: 4900 pounds. Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?
dyauspitr: So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.
icedchai: You don't think these errors compound? Generated code has hundreds of little decisions. Yes, it "usually" works.
russfink: LLMs: sometimes wrong but never in doubt.
russfink: Or, the first time a mistake is detected, a correction is automatically applied.
rlpb: A smart AI would realise that I can MITM its web access such that it sees the .well-known token that isn't actually there. I assume that the model doesn't have CA certificates embedded into it, and relies on its harness for that.
throwatdem12311: Holy moly it’s slow. An implement step for a simple delete entity endpoint in my rails app took 30 minutes. Nothing crazy but it had a couple checks it needed to do first. Very simple stuff like checking what the scheduled time is for something and checking the current status of a state machine. I’m tempted to switch back to Opus 4.6 and have it try again for reference because holy moly it legit felt way slower than normal for these kinds of simple tasks that it would oneshot pretty effortlessly. Also used up nearly half of my session quota just for this one task. Waaaaay more token usage than before.
UltraSane: Even just in chats with Opus 4.6 I noticed hitting limits so much faster.
crazypyro: Ridiculous that they are neutering their public models unless you work at a mega corp. Just another way mega corporations are being rewarded.
CosmicShadow: So far since continuing coding/debugging with 4.7 it's failed to fix 3 simple bugs after explaining it like 5 times and having a previous working example to look at...hmmmmmm....
andai: Context for "please drink verification can": https://files.catbox.moe/eqg0b2.png
throwanem: Yes, it's a stupid 4chan meme from 2013. I can only surmise those who quote it either don't know its origin, or they must be wholeheartedly 'embracing the cringe.'
dimgl: Why do you assume it's botted? Just open up Codex on GPT 5.4 and point it at your codebase.
hughcox: OK 4.7 is a different animal altogether. - No longer a 10 year old autistic programming genius, but a confident programming genius basically taking the lead on what to do and truly putting you in your place. Slightly impatient but surprisingly confident, much more detailed in the tasks he does, and double checks his work on the fly. - Very little to no need to ask "have you remembered to do this and that"; it's done. - Also tells you which task he is doing next, rather than asking which task you would like him to do next. - Very different engagement with the user. Surprisingly interesting, truly now leading the developer rather than guiding.
dimgl: slop
pkilgore: Seconded. After disabling adaptive thinking and using a default higher thinking, I finally got the quality I'm looking for out of Opus 4.6, and I'm pleased with what I see so far in Opus 4.7. Whatever their internal evals say about adaptive thinking, they're measuring the wrong thing.
hbbio: Unless they're measuring capex
echelon: That's why they put the cute animal in your terminal.
DiffTheEnder: Is this... just one file acting as memory?
paulddraper: One video file
muzani: For me, making it high effort just fixed all the quality problems, and even cut down on token use somehow
vunderba: This. They kind of snuck this into the release notes: switching the default effort level to Medium. High is significantly slower, but that’s somewhat mitigated by the fact that you don’t have to constantly act like a helicopter parent for it.
muzani: Yup, they recommend a minimum of high for coding now, and cranked the default up to extra high.
zmmmmm: Curious how the safeguards work and what impact they will have. In general I feel that over-engineering safeguards in training comes at a noticeable cost to general intelligence. Like asking someone to solve a problem on a white board in a job interview. In that situation, the stress slices off at least 10% of my IQ.
leptons: 100% agree with this take. As I find myself using AI to write software, it is looking like gambling. And it isn't helping stimulate my brain in ways that actually writing code does. I feel like my brain is starting to atrophy. I learn so much by coding things myself, and everything I learn makes me stronger. That doesn't happen with AI. Sure I skim through what the AI produced, but not enough to really learn from it. And the next time I need to do something similar, the AI will be doing it anyway. I'm not sure I like this rabbit hole we're all going down. I suspect it doesn't lead to good things.
dakolli: It's a terrifying path we're taking, everyone's competency is going to be 1:1 correlated to the quality and quantity of tokens they can afford (or be loaned). I prefer to build by hand, I also don't think it's that much slower to do by hand, and it's much more rewarding. Sure you can be faster if you're building slop landing pages for your hypothetical SaaS you'll never finish, but why would I want to build those things?
ed_mercer: > Claude was still spinning after 20 minutes. I have been using Claude Code on a medium codebase (~2000 files, ~1M lines of code) for over a year and have never had to wait this long. Also I'm on the max plan and have not seen these limits at all.
nl: First model to get 100% on my agentic benchmark: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
robertwt7: I actually did the same pilot for a couple of days. While I don't like Codex's replies, it tackled in 5 minutes some problems that Claude had been spinning on for 20 minutes. Now I have them side by side, with Codex reviewing Claude's plan, and it always finds something that Claude missed. The replies and the format, though, are not as good as Claude's. Pros and cons really; there are also many cases where Claude wasn't able to debug prod issues the way Codex did for me.
stacktraceyo: What about something like https://github.com/rtk-ai/rtk
carbocation: Is there a page where I could read more? What's unintuitive at a glance is that Opus 4.7 has a lower success rate than Sonnet 4.6 (90% vs 100%) while having a higher Avg Percentile (87.2% vs 70.9%).
jwpapi: Your suspicion is right.
strokirk: Scandinavians are quite different from each other as well.
cdnsteve: Blew through my usage in less than 1 hour after it was out. Max 20x plan. ouch
willis936: Given they have all of the information and all of the control, do you trust them to be fair?
willis936: I'm not a security expert and don't know how to properly audit every github repo that I come across. Maybe I sometimes want to build gnome extensions or cool software projects from source and I want some level of checking along the way for known vulnerabilities. They can't claim this is an obvious win for security when it centralizes rather than democratizes security.
JamesSwift: It's even more maddening for me because my whole team is paying direct API pricing for the privilege of this experience! Just charge me the cost and let me tune this thing, sheesh!
slashdave: I interpreted their actions as providing time for vendors to protect themselves against the new model proactively, not to nerf the models themselves. Although perhaps I am naive.
andrepd: CoT is basically bullshit, entirely confabulated and not related to any "thought process"...
clbrmbr: But still CoT distillation WORKS. See the DeepSeek R1 paper.
slashdave: > My hypothesis was that common words are effectively noise to agents Umm... a few words can be combined in a rather large number of ways. Punctuation is used a lot. Why not just remove all the periods and commas and see what happens? Probably not pretty.
Razengan: Claude Opus 4.6 has been hilarious for me so far: https://i.imgur.com/jYawPDY.png
mosfets: Made my day!
nightshift1: I noticed the same thing. Every Claude release thread is full of comments saying that it's terrible and why they switched to Codex. And vice versa for Codex release threads. At least it's not as bad as /r/localllama that is 90% bots now.
Jagerbizzle: This is the company that allowed a vibe-release resulting in the leak of the entirety of the Claude Code codebase. What is the bar you're expecting here exactly?
59nadir: LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.
sanex: With cursor it's half off right now.
b--l: grok-4.1-fast is the number 2 model on this benchmark. ~~If you've used this model in real life to do any sort of programming, and have seen its output, you would know that there is something VERY wrong with your benchmark.~~ Edit: Oh sorry, I looked at the questions, I see this is also for SQL specifically. Interesting. Maybe they tuned that grok model for SQL. Cool site. I bookmarked it.
ainch: I would expect to see a significant wall clock improvement if that was the case - Meta's Coconut paper was ~3x faster than tokenspace chain-of-thought because latents contain a lot more information than individual tokens. Separately, I think Anthropic are probably the least likely of the big 3 to release a model that uses latent-space reasoning, because it's a clear step down in the ability to audit CoT. There has even been some discussion that they accidentally "exposed" the Mythos CoT to RL [0] - I don't see how you would apply a reward function to latent space reasoning tokens. [0]: https://www.lesswrong.com/posts/K8FxfK9GmJfiAhgcT/anthropic-...
clbrmbr: There’s also a paper [0] from many well known researchers that serves as a kind of informal agreement not to make the CoT unmonitorable via RL or neuralese. I also don’t think Anthropic researchers would break this “contract”. [0] https://arxiv.org/abs/2507.11473
oezi: The tokenizer changes seem to indicate that 4.7 isn't just a checkpoint but rather a model trained mostly from scratch, right?
esjeon: Funny because the general consensus is that everyone is burning money so fast that they would not be able to get it back from their AI business in the near future. OpenAI is simply the one with the most aggressive expenditure. Google has its own cash cows. Anthropic has been conservative all around.
HauntingPin: I've given up on Claude after seeing the response quality degrade so much over the past two weeks, and now this? I've unsubscribed. I don't know why people are still giving this company money.
bayareapsycho: > I've given up on Claude after seeing the response quality degrade so much over the past two weeks I've noticed the same thing (and other people I work with noticed it too). I have no idea if our company is just secretly AB-testing a cheaper model behind the scenes or if Anthropic is doing something similar. Earlier Claude was insanely good, now it's so bad I had to spoon feed it everything and it still gets a bunch of stuff wrong.
b--l: If we learned anything from the code leak, it's that they essentially do not know what is in the blackbox of the code for that 500k line mass. So that's plausible.
b--l: "but most of our samples are from the last 2 months." There's your major issue. That's well within the brutal quantization window.
sbinnee: The comment section is already long, but I knew that I could find comments about the "hmm" that I started noticing. Yes, it is so irritating to me too. Also, one additional thing I noticed was that verbose information has been getting more and more obfuscated. I have run CC with the --verbose option for months, and I can see verbose mode is not verbose anymore. I wish I could do a -vvv maximum verbose mode.
59nadir: Yes, what the LLM providers did was worse and impacted people financially a whole lot more in lost compensation for works as well as operational costs that would never reach the heights they did solely because of scrapers on behalf of model providers.
AussieWog93: Is this the first time a new Anthropic flagship model was announced and the comments section on HN was mostly negative?
ai_slop_hater: This matches my experience as well, "adaptive thinking" chooses to not think when it should.
andai: I think this might be an unsolved problem. When GPT-5 came out, they had a "router" (classifier?) decide whether to use the thinking model or not. It was terrible. You could upload 30 pages of financial documents and it would decide "yeah this doesn't require reasoning." They improved it a lot but it still makes mistakes constantly. I assume something similar is happening in this case.
nomel: Is knowing how hard a problem is, before doing it, solved in humans?
AussieWog93: Opus sometimes makes poor long term decisions and really struggles with even mid size (~10k lines) existing codebases.
xnx: There's a tremendous amount of superstition around LLMs. Remember when "prompt engineering" "best practices" were to say you were offering a tip or some other nonsense?
XCSme: The reasoning modes are really weird with 4.7. In my tests, asking for "none" reasoning resulted in higher costs than asking for "medium" reasoning... Also, "medium" reasoning only had 1/10 of the reasoning tokens 4.6 used to have.
XCSme: Oh, and also, the "none" and "medium" variants performed the same (??)
willsmith72: That's such a poor way to communicate a number. I take it they mean an increase of up to 35%?
coldtea: Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.
XCSme: I was initially excited by 4.7, as it does a lot better in my tests, but their reasoning/pricing is really weird and unpredictable. Apart from that, in real-life usage, gpt-5.3-codex is ~10x cheaper in my case, simply because of the cached input discount (otherwise it would still be around 3-4x cheaper anyway).
gverrilla: Subagents are very useful. But sometimes it uses sonnet or haiku. You can try something like "always use opus for subagents" if you want better subagents.
ayorke: so excited!
dannyw: You can change tokenizers without a complete retraining from scratch.
nl: Yeah, multi-step SQL generation and debugging. Some models surprised me and Grok Fast was one of them. It is consistently good at this task though!
linzhangrun: Claude is launching real-name verification. I'm not sure if this can be circumvented through third-party relay (such as Poe) or API calls, or at least how long that can be maintained
t0lo: As one of the seemingly few people in this comments section who don't use it for coding, it seems far far more substantial and able to produce insights in written conversation than opus 4.6 for me
gertlabs: We calculate percentiles based on successful submissions only, and then apply success rate as a separate measurement, which is incorporated into our relative rankings. So we do penalize evals where the player failed the game, but not in the percentile measurement (success rate measures instances of playing incorrectly, did not compile, runtime errors, and other non-infrastructure related issues that can be blamed on the model). The design decision there is that percentile tells you how good the model's ideas are (when executed correctly), separately from how often it got something working correctly, but I can see how that's not great UX, at least as presented now. But the actual score itself is a combination of percentiles and success rates with some weighting for different categories, nothing fancy. I added a methodology page to the roadmap, thanks for pointing that out. We've converged on a benchmark methodology that should scale for a very long time, so it's time to document it better.
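The scoring described above (percentile averaged over successful submissions only, success rate tracked separately, then combined with a weighting) can be sketched as follows. This is a minimal illustration, not gertlabs' actual code: the function name `summarize`, the `percentile_weight` value of 0.7, the linear blend, and the run count of 10 are all hypothetical. Only the 90%/87.2 and 100%/70.9 figures come from the thread.

```python
from statistics import mean

def summarize(runs, percentile_weight=0.7):
    """Return (success_rate, avg_percentile, score) for a list of runs.

    Each run is a dict with 'success' (bool) and 'percentile' (float, 0-100).
    The weight and the linear blend are illustrative assumptions.
    """
    successes = [r for r in runs if r["success"]]
    success_rate = len(successes) / len(runs)
    # Percentile is averaged over successful submissions only.
    avg_percentile = mean(r["percentile"] for r in successes) if successes else 0.0
    # Failures are penalized here, through the success-rate term.
    score = (percentile_weight * avg_percentile
             + (1 - percentile_weight) * success_rate * 100)
    return success_rate, avg_percentile, score

# Hypothetical runs matching the figures quoted in the thread.
opus = [{"success": True, "percentile": 87.2}] * 9 + [{"success": False, "percentile": 0.0}]
sonnet = [{"success": True, "percentile": 70.9}] * 10

print(summarize(opus))
print(summarize(sonnet))
```

Under these assumptions a model can fail more often yet still show a higher average percentile, because failed runs never enter the percentile average. That is the pattern the earlier question about Opus 4.7 versus Sonnet 4.6 asks about.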
mkl: That should be "research" and "straight" in the last sentence. Maybe that will improve it further?
biglost: Yes, every week when assigning fking points to tasks on jira/s
zmmmmm: so if they are retroactive to 4.6 then they can't be trained into the model. They would have to be applied as a pre-screening or post-screening process. Which is disturbing since it implies already deployed workflows could be broken by this. I am curious if it is enforced in enterprise accounts eg: using AWS/Bedrock and how Anthropic would have implemented that given they push models to Amazon for hands off operation.
j45: Prompts seem to need to evolve with every new model.
atlex2: A couple drawbacks so far via our scenario-based tests: 1. You can't ask the model to "think hard" about something anymore - model decides. 2. Reasoning traces are no longer true to the thinking - vs opus 4.6, they really are summaries now. 3. Reasoning is no longer consciously visible to the agent. They claim the personality is less warm, but I haven't experienced that yet with the prompts we have - seems just as warm, just disconnected from its own thought processes. Would be great for our application if they could improve on the above!
raylad: I am using 4.7 with the default extra high thinking, and it is clearly very stupid. It's worse than old Sonnet 4.5. I had it suggest some parameters for BCFtools and it suggested parameters that would do the opposite of what I wanted to do. I pointed out the error and it apologized. It also is not taking any initiative to check things, but wants me to check them (ie: file contents, etc.). And it is claiming that things are "too complex" or "too difficult" when they are super easy. For instance refreshing an AWS token - somehow it couldn't figure out that you could do that in a cron task. A really really bad downgrade. I will be using Codex more now, sadly.
driverdan: Models are not paranoid or anxious, they do not think or have feelings. I know you're probably using those words as a metaphor but we need to be careful about anthropomorphizing LLMs.
theptip: I normally agree with this, but they objectively did lower the default effort level, and this caused people to get worse performance unexpectedly. And it does seem likely to me that there were intermittent bugs in adaptive reasoning, based on posts here by Boris. So all told, in this case it seems correct to say that Opus has been very flaky in its reasoning performance. I think both of these changes were good faith and in isolation reasonable, ie most users don’t need high effort reasoning. But for the users that do need high effort, they really notice the difference.
drnick1: > how are you gonna trust something that can casually make such obvious mistakes? In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.
tclancy: You just put a pile of tokens in front of all the good models and let them fight it out like Thunderdome. Then keep track of how they undermined each other and do that when you want to do some hackin’.
sothatsit: You can’t make up your mind about a model by using it on one task. Especially to say it’s such a bad downgrade after that is ludicrous. I’ve had great experiences with it this morning.
solarkraft: I find that GPT 5.4 is okay at it. It does think harder for harder problems and still answers quickly for simpler ones, IME.
wolttam: Wow this thread has been a cacophony of differing opinions
Jcampuzano2: No, I assure you you are not being spammed because legitimately many people prefer codex over claude right now. I am one of those people. And if you go on tech social media spaces you'll see many prominent well known devs in open source say the same. And of course others praise claude as well. At my job we have enterprise access to both and I used claude for months before I got access to codex. Around the time gpt-5.3-codex came out and they improved its speed I was split around 50/50. Now I spend almost 100% of my time using Codex with GPT 5.4. I still compare outputs with claude and codex relatively frequently and personally I find I always have better results with codex. But if you prefer claude that's totally acceptable.
christophilus: Same. Codex is faster and more consistent in the last few weeks for me vs Claude Code. I also don’t hit limits anywhere near as frequently.
galaxyLogic: > I want to wash my car. The car wash is 50 meters away. Should I walk or drive? I think no real human would ask such a question. Or if we do we maybe mean should I drive some other car than the one that is already at the car-wash? A human would answer, "silly question". But a human would not ask such a question.
cesarvarela: Yeah, it is easier to destroy than to create. Models will always be better at hacking than at building.
stingraycharles: I don’t think this is a plausible argument, as they’re generally capacity constrained, and everyone would like shorter (= faster) responses. I’m fairly certain that in a few more releases we’ll have models with shorter CoT chains. Whether they’ll still let us see those is another question, as it seems like Anthropic wants to start hiding their CoT, potentially because it reveals some secret sauce.
Ifkaluva: I guess mainly they don’t want you to distill on their CoT
silverwind: Seems so silly that they won't support `effortLevel: "max"` while an env var is perfectly fine.
atonse: Not sure where the younger people thing came from, but I'm 45 and have been working in this industry since 1999. But even when I was in my 20s, I don't remember considering that I had a "right" to do something with a company's product before they've sold it to me. In fact, I would say the idea of entitlement and use of words like "rights" when you're talking about a company's policies and terms of use (which you are perfectly free not to participate in; rights have nothing to do with anything here, and you're free to just not use these tools) feels more like a stereotypical "young" person's argument that sees everything through moralistic and "rights" based principles. If you don't want to sign these documents, don't. This is true of pretty much every single private transaction, from employment, to anything else. It is your choice. If you don't want to give your ID to get a bank account, don't. Keep the cash in your mattress or bitcoin instead. Regarding "legit" - there are absolutely "legit" actors and not so "legit" actors, we can apply common sense here. I'm sure we can both come up with edge cases (this is an internet argument after all), but common cases are a good place to start.
adrian_b: You cannot search for bugs or vulnerabilities in "a company's product before they've sold it to you", because you cannot access it. Obviously, I was not talking about using pirated copies, which I had classified as illegal activities in my comment, so what you said has nothing to do with what I said. "A company's policies and terms of use" have become more and more frequently abusive and this is possible only because nowadays too many people have become willing to accept such terms, even when they are themselves hurt by these terms, which ensures that no alternative can appear to the abusive companies. I am among those who continue to not accept mean and stupid terms forced by various companies, which is why I do not have an Anthropic subscription. > "if you don't want to give your ID to get a bank account, don't" I do not see any relevance of your example for our discussion, because there are good reasons for a bank to know the identity of a customer. On the other hand there are abusive banks, whose behavior must not be accepted. For instance, a couple of decades ago I closed all my accounts in one of the banks that I was using, because they had changed their online banking system and after the "upgrade" it worked only with Internet Explorer. I do not accept that a bank may impose conditions on their customers about what kinds of products of any nature they must buy or use, e.g. that they must buy MS Windows in order to access the services of the bank. More recently, I closed my accounts in another bank, because they discontinued their Web-based online banking and replaced it with a smartphone application. That would have been perfectly OK, except that they refused to provide the app for downloading, so that I could install it, but they provided the app only in the online Google store, which I cannot access because I do not have a Google account. A bank does not have any right to condition their services on entering into a contractual relationship with a third party, like Google. Moreover, this is especially revolting when that third party is from a country that is neither that of the bank nor that of the customer, like Google. These are examples of bad bank behavior, unlike demanding an ID.
atonse: With the bank example, I thought your comment had some anti-KYC language, so I mixed it up with another response; sorry for the confusion. I actually kind of agree with you in principle, IF we had no choice. Like the only reason I can say “you can choose not to purchase this product” is because that is true today, thanks to competition from commercial and open source models. But I’d be right there with you on “someone needs to force these companies to do ____” if they were quasi monopolies and citizens needed to use their technology in some form (we see this with certain patents around cell phone tech, for example).
sebmellen: We sure aren’t far off.
reddit_clone: To me Claude Opus 4.6 seems even more confused.I want to wash my car. The car wash is 50 meters away. Should I walk or drive?Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.
lr1970: Just asked Claude Code with Opus-4.6. The answer was short: "Drive. You need a car at the car wash". No surprises, works as expected.
onemoresoop: Yeah, it was probably patched. It could reason about novel problems only if you asked it to pay attention to some particular detail, a.k.a. handholding. The same would happen with the sheep, the wolf and the cabbage puzzle. If you formulated it similarly, with a wolf and a cabbage but without mentioning the sheep, it would summon the sheep into existence at a random step. It was patched shortly after.
silverwind: Slow is good though, that's when you know it'll get it right.
bombcar: Shades of “repeat” in lather, rinse, repeat.
dataviz1000: Thinking helps the models arrive at the correct answer with more consistency. However, they get the reward at the end of a cycle. It turns out that, without heavy constraints during training, the thinking (the series of thinking tokens) is gibberish to humans. I wonder if they decided that the gibberish is better, and that the thinking is interesting for humans to watch but overall not very useful.
dgb23: OK so you're saying the gibberish is a feature and not a bug so to speak? So the thinking output can be understood as coughing and mumbling noises that help the model get into the right paths?
dataviz1000: Here is a 3blue1brown short about the relationship between words in a 3 dimensional vector space. [0] In order to show this conceptually to a human, it requires reducing the dimensions from 10,000 or 20,000 to 3. In order to get the thinking to be human understandable, the researchers will reward not just the correct answer at the end during training, but also seed the beginning with structured thinking token chains and reward the format of the thinking output. The thinking tokens do just a handful of things: verification, backtracking, scratchpad or state management (like you doing multiplication on paper instead of in your mind), decomposition (breaking into smaller parts, which is most of what I see thinking output do), and self-criticism. An example would be a math problem that was solved by an Italian and another by a German, which might cause those geographic areas to be associated with the solution in the 20,000 dimensions. So if it gets more accurate answers in training by mentioning them, it will be in the gibberish unless they have been trained to have much more sensical (like the 3 dimensions) human readable output instead. It has been observed, sometimes, a model will write perfectly normal looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen. [0] https://www.youtube.com/shorts/FJtFZwbvkI4
johnmaguire: > It has been observed, sometimes, a model will write perfectly normal looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen. This sounds very interesting, do you have any references?
SJMG: [delayed]
micromacrofoot: I mean we had a shoe company pivot to AI and raise their stock value by 300%, how can we even know anymore
bombcar: Lemonade and blockchain rides again! Or was it ice tea?
jsw97: I got a refusal doing some math, I think based on the word "sextic", as best I can tell. /model claude-opus-4.6
suzzer99: I've never seen "double press esc" as a control pattern.
sweetjuly: esc once interrupts the LLM, double-esc lets you revert to a previous state (interrupt harder).
whattheheckheck: Tokens relate to each other. More tokens more compute
bastawhiz: I'm an Opus stan but I'll also admit that 5.4 has gotten a lot better, especially at finding and fixing bugs. Codex doesn't seem to do as good a job at one shotting tasks from scratch. I suppose if you are okay with a mediocre initial output that you spend more time getting into shape, Codex is comparable. I haven't exhaustively compared though.
deaux: Yes, GPT 5.4 is better at finding bugs in traditional code. This has been easy to verify since its release. It's also worse at everything else, in particular using anything recent, or not overengineering. Opus is much better at picking the right tool for the job in any non-debugging situation, which is what matters most as it has long-term consequences. It also isn't stuck in early 2024. "Docs MCPs" don't make up for knowledge in weights.
bastawhiz: I agree. You're preaching to the choir. But I can also appreciate that there's plenty of tasks and use cases where being stuck in 2024 is still incredibly modern, and debugging is a much more valuable skill than picking the right tool for the job.
typ: I don't think that learning from textbooks to take an exam and learning from the answers of another student taking the exam are the same. Joking aside, I also don't believe that maximum access to raw Internet data and its quantity is why some models are doing better than Google. It seems that these SoTA models gain more power from synthetic data and how they discard garbage.
ddp26: Training window cutoff is Jan 2026, whereas Opus 4.6's was Aug 2025. That's quite a lot of new world knowledge.
matltc: Care to expound on that? Maybe a reference to the relevant section?
9991: You should just read the thing, whether or not you believe it, to have an informed opinion on the ongoing debate.
matltc: I did read it a while back. Was curious what parent was referring to specifically
jameshart: I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?
psadauskas: A human totally would, as one of those brain-teaser trick questions. It's the same kind of question as "A plane crashes right on the border between the US and Canada. Where do they bury the survivors?" It's the kind of question you only get right if you pay close attention. Asking an AI that is like asking a 5 year old. You're not asking to get an answer, you're asking to see if they're paying attention.
jameshart: I was given to understand that attention is all you need.
hk__2: > Stop using these dopamine brain poisoning machines, think for yourself, don't pay a billionaire for their thinking machine.Yeah, and also stop using these things they call "computers", think for yourself, write your texts by hand, send letters to people. /s
dakolli: When did I say to stop using computers? You don't prefer to think for yourself? You're cooked.
solenoid0937: It's been dramatically better than any model I have ever used before on my tasks.
deaux: > and debugging is a much more valuable skill than picking the right tool for the job. Can't agree with that. Debugging is short-term, picking the right tool is long-term. Unless you thought I meant agentic tool ;)
djmips: Opus did get the feet on pedals better.
carbocation: Neat, thank you for explaining!
sheeshkebab: So they nixed the fun part of working with the bot - reading its thinking output. Now this thing is just plain unfun and often stupid. So, yeah, good job anthropic. Big fuck you to you too.
arthurcolle: As a unit this is funny, Jira points assigned per second (now possible with parallel tool calling AIs)
oezi: I think I would love to test it, but on the Pro plan I just did two small sessions with 4.6 Sonnet and it consumed my 5h quota within one hour.
shepherdjerred: I’m slowly switching to codex simply because Claude code is closed source and I want to hack on my harness.
mochomocha: It makes me think of this parallel: often in combinatorial optimization, estimating whether it is hard to find a solution to a problem costs you as much as solving it. With a small bounded compute budget, you're going to sometimes make mistakes with your router/thinking switch. Same with speculative decoding, branch predictors, etc.
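The failure mode described above can be illustrated with a toy router. This is a minimal sketch, not how any vendor's adaptive thinking actually works: `Task`, `cheap_difficulty_estimate`, `route`, `misrouted`, and the length-based heuristic are all hypothetical names and assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    truly_hard: bool  # ground truth; the router cannot see this when deciding


def cheap_difficulty_estimate(prompt: str) -> float:
    """Hypothetical bounded-cost heuristic: assume longer prompts are harder."""
    return min(len(prompt.split()) / 50.0, 1.0)


def route(prompt: str, threshold: float = 0.5) -> str:
    """Choose 'think' or 'fast' using only the cheap estimate."""
    return "think" if cheap_difficulty_estimate(prompt) >= threshold else "fast"


def misrouted(tasks: list[Task], threshold: float = 0.5) -> list[Task]:
    """Return the tasks where the cheap decision disagrees with ground truth."""
    return [t for t in tasks if (route(t.prompt, threshold) == "think") != t.truly_hard]


tasks = [
    # Short prompt that hides a trick: the heuristic sends it to the fast path.
    Task("The car wash is 50 meters away. Should I walk or drive?", truly_hard=True),
    # Long but trivial prompt: the heuristic spends thinking budget on it.
    Task("please repeat this word " * 20, truly_hard=False),
    # Short and easy: routed correctly.
    Task("What is 2 + 2?", truly_hard=False),
]

for t in misrouted(tasks):
    print(f"misrouted to {route(t.prompt)!r}: {t.prompt[:40]!r}")
```

Any estimator cheap enough to run before the real work will have blind spots of this kind; a better estimator reduces them only by spending more of the compute the router was meant to save.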
nisarg2: Insane! Even Haiku doesn't make such mistakes.
someguyiguess: My most charitable interpretation of what you are saying is: Two wrongs make a right. If others exploit people that makes it an acceptable thing for me to do. No one can criticize me for doing a bad thing because others also do bad things. Is that what you are saying?I genuinely cannot see how to interpret it in a way that is positive.
ai_slop_hater: Maybe it is an unsolved problem, but either way I am confused why Anthropic is pushing adaptive thinking so hard, making it the only option on their latest models. To combat how unreliable it is, they set thinking effort to "high" by default in the API. In Claude Code, they now set it to "xhigh" by default. The fact that you cannot even inspect the thinking blocks to try and understand its behavior doesn't help. I know they throw around instructions how to enable thinking blocks, or blocks with thinking summaries, or whatever (I am too confused by now, what it is that they allow us to see), but nothing worked for me so far.
firemelt: rtk vibes a product of vibe code
fumar: It is the rush of "wow it solved this." I should take a break and work on something else, but in the back of my mind "what else can it solve?" Then I come up with extra work and sometimes lose at the LLM casino.
SV_BubbleTime: Ok, side topic… but that little bastard cheerfully told me out of nowhere that I have a malloc without a null check AND a free inside a conditional that might not get called. It didn’t give me a line number or file. I had to go investigate. Finally found what it was talking about. It was wrong. It took me about 20 minutes start to finish. Turned it off and will not be turning it back on.
Den_VR: I still think of The Unreasonable Effectiveness of Recurrent Neural Networks and related writings. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
4b11b4: but what about DDD
jesseab: So Mythos.
devmor: > Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that. That’s extremely bothersome because half of what helps teams build better guardrails and guidelines for agents is the ability to do deep analysis on session transcripts. I guess we shouldn’t be surprised these vendors want to do everything they can to force users to rely explicitly on their offerings.
buf: Just chipping in to say that I've never seen it churn for more than 20 minutes in two years worth of usage. The longest I've ever seen it churn is when I had it give extremely detailed analysis of five fictional novels simultaneously.
jmpavlec: The Dutch especially. It's refreshing
dalemhurley: Wild how many people were predicting the AI slop, but was dismissing it as unlikely beyond some trolls.
turblety: What a waste of tokens. No wonder Anthropic can't serve their customers. It's not just a lack of compute, it's a ridiculous waste of the limited compute they have. I think (hope?) we look back at the insanity of all this theatre, the same way we do about GPT-2 [1]. [1] https://techcrunch.com/2019/02/17/openai-text-generator-dang...
vbezhenar: "generating fake news, impersonating people, or automating abusive or spam comments on social media"So it seems that these fears were founded. Doesn't seem to be a "theatre".
raylad: That was more than one task. It was 3. I also had Opus 4.7 and Opus 4.6 do audits of a very long document using identical prompts. I then had Codex 5.4 compare the audits. Codex found that 4.6 did a far better job and 4.7 had missed things and added spurious information. I then asked a new session of Opus 4.7 if it agreed or disagreed with the Codex audit and it agreed with it. I also agreed with it.
devmor: “Be critical, not sycophantic” is a general improvement for the majority of tasks where you want to derive logic in my experience.
unshavedyak: Part of me wonders if there's some subtle behavioral change with it too. Early on we're distrusting of a model and so we're blown away: we were giving it more details to compensate for assumed inability, but the model outperformed our expectations. Weeks later we're more aligned with its capabilities and so we become lazy. The model is very good, so why do we have to put in as much work to provide specifics, specs, ACs, etc.? So then of course the quality slides, because we assumed its capabilities somehow absolved the need for the same detailed guardrails (spec, ACs, etc.) for the LLM. This scenario obviously does not apply to folks who run their own benches with the same inputs between models. I'm just discussing a possible and unintentional human behavioral bias. Even if this isn't the root cause, humans are really bad at perceiving reality. Like, really really bad. LLMs are also really difficult to objectively measure. I'm sure the coupling of these two facts plays a part, possibly significant, in our perception of LLM quality over time.
mewpmewp2: Still, I don't remember Claude previously trying to constantly stop conversations or work, as in "something is too much to do", "that's enough for this session, let's leave the rest to tomorrow", "goodbye", etc. It's almost impossible to get it to do refactoring or anything like that, it's always "too massive", etc.
egeozcan: Even superpowers started dividing things into "phases". "I think we can postpone this to phase 2 and start with the basics". Meanwhile using more tokens to make a silly plan to divide tasks among those phases, complicated analysis of dependency chains, deliverables, all that jazz. All unprompted.
YZF: It's fun and you do get a dopamine rush when an LLM does something cool for you. I'm certainly feeling it as a user. Perhaps you can get the same from other tools. I would vote yes: addictive. But it's also a tool that (can) save(s) you time.
ang_cire: > Humans are lazy though and I can't help but feel we are being inundated with sketchy apps doing all kinds of things the authors don't even understand... there is a good chance they have no clue what they created. I have bad news for you about the executives and salespeople who manage and sell fully-human-coded enterprise software (and about the actual quality of much of that software)... I think people who aren't working in IT get very hung up on the bugs (which are very real), but don't understand that 99% of companies are not meeting, and never have met, their patching and bugfix SLAs, are not operating according to their security policies, are not disclosing the vulns they do know about, etc. etc. All the testing that needs to happen to AI code also needs to happen to human code. The companies that yolo AI code out there would be doing the same with human code. They don't suddenly stop (or start) applying proper code review and quality gating controls based on who coded something. > The only way I felt comfortable using Claude Code was holding its hand through every step, doing test driven changes and manually reviewing the code afterwards. This is also how we code 'real' software. > I can't help but think that massive code bases that have moved to vibe coding are going to spend inordinate amounts of time testing and auditing code This is the correct expectation, not a mistake. The code should be reviewed and audited. It's not a failure if you're getting the same final quality through a different time allocation during the process; it is simply a different process. The danger is Capitalism incentivizing not doing the proper reviews, but once again, this is not remotely unique to AI code; this is what 99% of companies are already doing.
dbdr: > not doing the proper reviews, but once again, this is not remotely unique to AI code; this is what 99% of companies are already doing.But is the scale similar, or will AI coding make the problem significantly worse?
sagarpatil: Same. I’ve lived in Claude Code since the beta release and last couple of weeks was horrible. I’ve been using codex for last couple of days and it’s much smarter than 4.6.
sagarpatil: It worked. Although I have a Claude Code subscription, I got the ChatGPT Pro plan, and 5.4 xHigh at 1.5x speed was better than 4.6 with adaptive thinking disabled. I was working all day, about 8 hours, and did not run into any limits. 5.4 surprised me many times by doing things I usually would not do myself, because I am lazy, so yeah, I am sticking with 5.4 for now until all the Claude drama is over.
colordrops: I thought I was tripping when I saw this. Must have been a measure to reduce usage to save them some compute.
kaydub: At my job and for personal projects I pay per token with claude and I've had no problems at all with it. No slowdowns, no "throttling", nothing.I'm honestly surprised how many people have subscriptions and are expecting anthropic to eat the cost lol
mrifaki: The adaptive thinking complaints in this thread are interesting because they are basically the same verifier quality problem showing up in a different costume. The model has to decide how hard to think before knowing how hard the problem is, and that meta decision is itself a hard problem that nobody has solved cleanly: not in RL, not in speculative decoding, not in branch prediction. The fact that disabling adaptive thinking and forcing high effort restores quality tells us the router is underthinking, not that the model got worse, which means Anthropic is trading user experience for compute savings whether or not they frame it that way.
manmal: Why don’t you switch to codex? The grass is greener here. Do use 5.3-codex though, 5.4 is not for coding, despite what many say.
drewnick: Hasn't Opus 4.5 been famously consistent while 4.6 was floating all over the place?
YZF: For me 4.6 has been a noticeable leap in performance from 4.5. I'm not missing 4.5 at all.
Frannky: I am honestly just happy they haven't figured out a way to lock in the users, and that there are alternatives that can get it done. I feel like they treat the user as a dumb peasant.
jerrygoyal: codex astroturfing is even bigger on Reddit.
timwis: We've started calling it dopus at work :(
TheLNL: They might have further trained the model with these edgecases in the dataset
0xffff2: Copilot's per-prompt pricing is crazy unsustainable. I doubt even a 2.5x increase is enough. I've had a couple of times where I've kept Copilot/Opus 4.6 occupied for a full day on a single prompt recently.
joegibbs: I haven't seen any improvement on Opus 4.6 from it (on xhigh) and it seems to often suggest and do things that just make no sense at all. For instance, today I asked it to sketch out a UI mockup for a new frontend feature and it asked me whether I wanted to make it part of the docs (it has absolutely nothing to do with the docs). I asked why it should be part of the docs and it goes "yes of course that makes no sense at all, disregard that". 4.6 has also been giving similar hallucination-prone answers for the last week or so and writing code that has really weird design decisions much more than it did when it was released.
sreekanth850: Yesterday, after testing this model, I immediately unsubscribed from Anthropic. My subscription will end in another two weeks. They are milking general consumers for their enterprise business. First, they nerfed the Claude 4.6 models a couple of weeks before launching 4.7, then released a worse, subpar model and presented it as an enhanced version. I subscribed to multiple Codex Plus plans, and they work much better than this. You can fool some people all the time, and all people some of the time, but you cannot fool all the people all the time.
vinhnx: They do now. /effort command is on the latest Claude Code version; run `claude update` and `claude /effort`.
kzrdude: It's a brave new world of centralized computing where one day you boot up and can't work because something changed arbitrarily in the "compute" service you are renting.
siva7: Because with adaptive thinking they control compute, not you
OccamsMirror: Not to mention the amount of placeholders and TODOs it's leaving in the codebase but then declaring that it's finished the work. I've cancelled my subscriptions to both Codex and Claude and am going to go back to writing my own code. When the merry-go-round of cheap high quality inference truly ends, I don't want to be caught out.
johnmlussier: I've switched over to Codex. On Extra High reasoning it seems very capable and is definitely catching mistakes Sonnet has missed. I'd love to move back to Opus but at this time it is untenable.
anandvshah: I have used Edgee.AI and it is amazing.
aaroninsf: I've been using 4.6 in a long-term development project every day for weeks. 4.7 is a clusterf--k and train wreck.
leptons: It's not slower to do by hand. I race the AI all the time. I give it a simple task to write a small script that I need to complete a task that is blocking me... and the "thinking" thing spins and spins. So I often just fire up a code editor and write it myself, often before the AI is actually done after I have to cajole it through 10 iterations to get what I want. And when I race it, I get what I want every time, and often in the same or less time than it takes the AI (plus the time that I have to spend cajoling it).
shohan99: While I believe that mythos is better than the models we have right now, the "too dangerous to release" claim sounds largely like a marketing gimmick to me. Well, it's not for me to speculate; I simply need to wait for the huge wave of security patches to all software in the coming weeks, as per Anthropic's claims.
helloplanets: If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one. Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7. Swapping out the tokenizer is a massive change. Not an incremental one.
joegibbs: Major numbers are just for marketing, if it's not good enough that it feels like a similar jump as from 3.7 to 4 they're not going to give it a new number.
gf000: And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing. As for the homework, there is obviously a huge category that is pointless. But it should not be that way: the fundamental idea behind homework is sound, and the only way something can be properly learnt is by doing exercises and thinking through it yourself.
zrkrlc: You’re probably thinking of this article: https://she-llac.com/claude-limits
redml: thats it! thanks for digging it up
gf000: Absolutely not my area of expertise, but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing, and I would expect it would "fix" it, as in make it less likely to fall into the trap. At the same time, I wouldn't be surprised if some of these would be "patched" via a simple prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step? But I'm just thinking out loud, don't take it too seriously.
nextaccountic: If you do include reasoning tokens you pay more, right?
schneehertz: In fact, you need to pay regardless of whether the output includes reasoning tokens or not
DeathArrow: I am happy with my GLM 5.1 and MiniMax 2.7 subscriptions and my wallet is happy, too. I am glad Anthropic is pushing the limits, that means cheap Chinese models will have reasons to get better, too.
EmanuelB: I can't notice any difference from the 4.6 of 3 weeks ago, except that this model burns way more tokens and produces much longer plans. To me it seems like this model is just the same as 4.6 but with a bigger token budget on all effort levels. I guess this is one way Anthropic plans to make their business profitable. During the past weeks of lobotomized opus, I tried a few different open weight models side by side with "opus 4.6" on the same issue. The open weights outperformed opus 4.6, and did it way faster and cheaper. I tried the same problem against Opus 4.7 today and it did manage to find one additional edge case that is not critical, but should be logged. So based on my experience, the open weight models managed to solve the exact problem I needed fixed, while Opus 4.7 seems to think a bit more freely about the bigger picture. However, Opus 4.7 also consumed way more tokens at a higher price, so the cost was 10-20x higher on Opus compared to the open weights models. I will use Opus for code review and minor final fixes, and let the open weights models do the heavy lifting from now on. I need a coding setup I can rely on, and clearly Anthropic is not reliable enough. Why pay 200$ to randomly get rug-pulled with no warning, when I can pay 20$ for 90% of the intelligence with reliable and higher performance?
parasti: Which open weights model?
selcuka: Note that Headroom GUI installs rtk by default.
weitendorf: It’s to stop you from getting RL traces or using Claude without paying the big bucks for the Enterprise Security version
kamikazechaser: It has been the same for Sonnet/Opus 4.6 for sometime. It will straight up refuse to work on anything in the grey area. Chinese models will happily do anything; On my tests, GLM 5.1 comfortably bypassed a multi-player game's anti-piracy/anti-cheats check with some guided steering.
darkwater: I thought it just emitted tongue-in-cheek comments, not serious analysis. And I use the past tense because I had it enabled explicitly and a few days ago it disappeared by itself; I didn't touch anything.
weitendorf: They are training them on decompilation and reverse engineering/blackbox reimplementations/pentesting because it’s one of the best ways to generate interesting and rare RL traces for agentic coding AND teach them how lots of things work under the hood.Just throw Claude at millions of binaries and you can get amazing training data. Oh wait 4.7 gives you refusals for that now
pingou: Or they could check whether the source is open source and available on the internet, and if so refuse to analyse it if the person who requests the analysis isn't affiliated with the project. That will still leave closed source software vulnerable, but I suspect it is somewhat rare for hackers to have the source of the thing they are targeting when it is closed source.
solenoid0937: How can they tell if the software is closed or open source? They would have to maintain a server side hashmap of every open source file in existence. And it'd be trivial to spoof. Just change a few lines and now it doesn't know if it's closed or open.
pingou: Of course just having the hash of the file wouldn't work; they would have to do something more complicated, a kind of perceptual hash. It's not easy, but I think it is doable. But then I suspect lots of parts of a closed source project are similar to open source code, so you can't just refuse to analyze any code that contains open source parts, and an attacker could put a few open source files into "fake" closed source code, and presumably the LLM would not flag them because the ratio of open to closed source code is good. But that would raise the costs for attackers.
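The spoofing concern and the "perceptual hash" idea can be made concrete with a toy comparison. This is only a sketch under stated assumptions: token shingling with Jaccard similarity is swapped in here as one simple stand-in for a perceptual hash, and `exact_fingerprint`, `shingles`, and `jaccard` are hypothetical helper names, not anything Anthropic is known to use.

```python
import hashlib


def exact_fingerprint(source: str) -> str:
    """SHA-256 of the raw text: any edit at all produces a different digest."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def shingles(source: str, k: int = 5) -> set:
    """Set of overlapping k-token windows over whitespace-split tokens."""
    tokens = source.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Share of k-token windows the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


# Stand-in for a known open source file, and a copy with one token changed.
original = " ".join(f"tok{i}" for i in range(100))
tampered = original.replace("tok50", "changed")

print("exact hashes match:", exact_fingerprint(original) == exact_fingerprint(tampered))
print("shingle similarity:", round(jaccard(original, tampered), 3))
```

The exact hash stops matching after a one-token edit, while the shingle similarity stays high, which is the property a fuzzy lookup would need. The threshold for "similar enough" and the mixed open/closed case raised above remain open design questions.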
wsmhj: Tried 4.7 on a few of my regular workloads. The quality ceiling is definitely higher than 4.6 when it actually engages — but that's the problem. "Adaptive thinking" seems to actively avoid thinking on tasks where I'd expect it to reason carefully, and I end up getting flat, fast answers where I wanted depth. Turning off adaptive thinking and bumping effort to high gets me closer to what I want, but at that point the token cost becomes hard to justify vs. just using a smaller model with explicit CoT. Feels like Anthropic is solving a cost optimization problem and calling it a feature.
Scrounger: > Which open weights model? Yes, I'm also wondering! Currently I'm testing out gemma4:26b and qwen3.6:35b-a3b-q4_K_M locally on my M2 Max Macbook Pro. Not the fastest, but reasonable. However, I am also interested in getting as close as possible in performance to Opus 4.6 while minimizing my costs.
taffydavid: Gemma4 on an m2? That sounds promising. I have an m3 max, going to try that today
SwellJoe: It could be an API endpoint on Anthropic servers, the same way Let's Encrypt verifies things on their servers. If you can't control the DNS records, you can't verify via DNS, no matter what you tell the local `certbot`.
taffydavid: Just yesterday it thought for 591 seconds for me, which is ten minutes. There have been times this week when it ran longer and I assumed it was just bust and stopped it
hk__2: > I am also interested in getting as close as possible in performance to Opus 4.6 while minimizing my costs.Aren’t we all? ;)
taffydavid: Fictional novels? Did it have to write them first?
paulluuk: If open weight models are sufficient for your engineering problems, then you should absolutely use them. But I haven't seen a single open weight model that can get even close to the complexity in my projects. They sometimes work for small toy examples or leetcode puzzles, but not for any real project. Really curious what models you've found that could replace current state of the art.
elAhmo: It's funny to think that with a model release Anthropic can slide in some instructions ("be a bit more detailed" or something similar) that affect the token output by a few percent, 5-10%, which will not be noticeable by most users but over the course of the year would bring solid growth (once the VC craze is over, if ever) and increase income. "Regular companies" would love to have growth like that without effectively doing anything.
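The arithmetic behind that claim is simple under per-token billing: revenue from output tokens scales linearly with the number of tokens emitted. A minimal sketch, where `extra_output_revenue` is a hypothetical helper and the token volume and price are made-up placeholders rather than real Anthropic figures:

```python
def extra_output_revenue(baseline_tokens: float, price_per_million: float, uplift: float) -> float:
    """Additional revenue if output length grows by `uplift` (0.05 means 5%)."""
    extra_tokens = baseline_tokens * uplift
    return extra_tokens / 1_000_000 * price_per_million


# Hypothetical inputs: 1 billion output tokens per year at $25 per million.
for uplift in (0.05, 0.10):
    gain = extra_output_revenue(1_000_000_000, 25.0, uplift)
    print(f"{uplift:.0%} longer outputs -> ${gain:,.2f} extra per year")
```

This only applies to usage billed per token. On flat-rate subscriptions, longer outputs would instead consume a user's quota faster.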
SnowLprd: This smells heavily of astroturfing. Particularly because this is a paid service, and that fact is not mentioned here or in the GitHub README. Here was my experience… I download and run the Mac application, which starts installing a bunch of things. Then the following happens without advance notice: it adds background item(s) from "Idiosyncratocracy BV"; downloads over 2 GB of files; pollutes home with a ~/.headroom directory; adds hook(s) to ~/.claude/hooks/; and modifies your ~/.claude/settings.json to add the above hook(s)… and then I see something in the settings that talks about creating an account. That's when I realized that this is a paid service, after all of the above has happened. At this point I have had it with this subterfuge; I immediately trash the app and every related file and folder I can find, of which there are many. Hopefully I got them all, but who knows. There should have been an easy way to uninstall this mess, but of course there isn't. The lack of transparency here is really concerning.
bushido: I think my results have actually become worse with Opus 4.7. I have a pretty robust setup in place to ensure that Claude, with its degradations, ensures good quality. And even the lobotomized 4.6 from the last few days was doing better than 4.7 is doing right now at xhigh. It's over-engineering. It is producing more code than it needs to. It is trying to be more defensible, but its definition of defensible seems to be shaky because it's landing up creating more edge cases. I think they just found a way to make it more expensive because I'm just gonna have to burn more tokens to keep it in check.
mnicky: Maybe this? From the article: > Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
bushido: Possible, but very unlikely. One of the hard rules in my harness is that it has to provide a summary before performing a specific action. There is zero ambiguity in that rule. It is terse, and it is specific. In the last 4 sessions (of 4 total), it has tried skipping that step, and every time it was pointed out, it gave something like the following. > You're right — I skipped the summary. Here it is. It is not following instructions literally. I wish it was. It is objectively worse.
chickensong: Using hooks can help.
spaceman_2020: Yeah my rate limits are getting exhausted way faster now. It's also way slower and overplans unless you steer it closely. I can’t rely on this anymore.
weird-eye-issue: > Why pay 200$ to randomly get rug-pulled with no warning, when I can pay 20$ for 90% of the intelligence with reliable and higher performance? Then go do that. Good luck!
pojzon: If you had to pay X to YY $$ per request (because that's the real cost for Anthropic), I strongly believe the AI train would suddenly derail. Currently we are all subsidized by investors' money. How long can you have a business that is only losing money? At some point prices will level up and this will be the end of this escapade.
darkteflon: I keep reading about this, but I have never, ever seen it. Daily Claude Max user for ~6 months. Not saying it doesn’t happen, but it’s never once happened to me.
brunooliv: Also my experience
weird-eye-issue: I like how some people are accusing them of reducing the overall token usage to screw over Claude Code users and then there are yet other people that are accusing them of deliberately increasing token usage to screw over API users (or maybe to get subscription users to upgrade, I'm not really sure)
fragmede: qwen3.6-35b-a3b, released today. https://qwen.ai/blog?id=qwen3.6-35b-a3b https://news.ycombinator.com/item?id=47792764
edgolub: Nobody is accusing them of making the models more efficient. People are complaining they are changing how many tokens you get on a subscription plan. Why would anyone dislike getting more service for less (or the same) amount of money?
weird-eye-issue: > People are complaining they are changing how many tokens you get on a subscription plan. They didn't change this. It's the same number of tokens just a different tokenizer.
mattmanser: I just don't believe you. The vast gulf between open weights and frontier models that existed 6 months ago has suddenly disappeared? It's far more likely you're just bad at assessing model output.
jamiejquinn: Or that gulf doesn't exist for the problems they are trying to solve?
big-chungus4: Crazy how popular this post is on HN, are this many people actually using expensive paid models? Is everyone on HN a millionaire? Or is someone botting all anthropic posts?
hijodelsol: I mean, the 100$ plan is less than the hourly rate of any consultant / senior dev in developed countries. So if it can save even one hour a month, it's cost efficient for the customer (at the current, subsidized rates, of course).
heartleo: In the long run, tokens may become a new signal of inequality — access to the most powerful models could be limited to those who can afford them.
cambaceres: Claude Pro costs $20 / month which gives you access to their latest models.
tossandthrow: 200 USD a month really is not that much. Especially not for an employer who is used to paying 150-250k a year for an engineer. Especially for the value it provides.
barrenko: Honestly it seems like each major player here fumbles the ball in turn, quite fun to observe. But hey, it's a difficult game.
doix: I suspect the real issue is that they just change stuff "randomly" and the experience gets worse/better, cheaper/more expensive. Since you have no way of knowing when they change stuff, you can't really know if they did change something or it's just bias. I've experienced that so many times in the last month that I switched to codex. The worst part is, it could be entirely in my head. It's so hard to quantify these changes, and the effort it takes isn't worth it to me. I just go by "feeling".
comboy: From my experience, saying "this is not X, it will not be used for Y" vastly increases the chances of this being classified as X. Anybody can write "this is authorized research". Instead use something like evaluate security / verify security, make sure this cannot be (...), etc. Of course these models are pretty smart so even Anthropic's simple instructions not to provide any exploits stick better and better.
rrr_oh_man: It's almost as if there are different people with different motivations and ideas about how the world should work
berkes: I've been using devstral2 with great success for a few months now. The hosted version, not running one locally or such. Devstral is open. Devstral is good, Opus better. But not much. For me, "good" is "good enough". The difference, IME, lies in context engineering: skills, agents.md, subagents, tools, prompts. A Devstral with good skills performs far better than a "blank" claude code. Claude with good skills performs even better, but hardly noticeably, IME. I am convinced I've plateaued. Better performance comes from improving skills and other "memory", prompting smarter, better context management and, above all, from the tooling around it and the stability of the services. I do still run Claude with Opus alongside Mistral with Devstral2. Sometimes to just compare outputs, often to doublecheck, but mostly to doublecheck my statement that the difference between Devstral2 and Opus is marginal and easily covered by better context engineering.
amelius: Claude is known for its shitty metering.
sylware: Is there a classic web interface? (noscript/basic (x)html)
esperent: They absolutely do change this all the time - session limits vary wildly. The most damning proof of this is that there's absolutely no information about how many tokens you get per session with each subscription level, it's just terms like 5x, 20x. But 5x what? Who knows?
LeoPanthera: Did they get rid of the option to clear the context and work just with the plan, in plan mode? I always used that and it worked well. Now it seems to be gone.
dear_prudence: same experience, it has not been a reliable tool for the last few months
roxana_haidiner: I'm wondering if this one will be able to stop putting my python imports inline :((((
XzAeRosho: It just repopulates the context. It's absolutely infuriating the way it behaves now, since there are not many workarounds to minimize token usage unless you use caveman [1]. [1]: https://github.com/JuliusBrussee/caveman
verve_rat: You and I have no way of knowing that.
malfist: I don't know, I think java is the best programming language. I use it for everything I do, no other programming language comes close. Python lost all my trust with how slow its interpreter is, you can't use it for anything. ^^^^ Sarcastic response, but engineers have always loved their holy wars, LLM flavor is no different.
taffydavid: Java is great and all but if you don't use it with the right kind of keyboard you're wasting your time. I use one of those very loud clacky ones with brightly colored keys and that makes me a better person.
nlitened: Joke's on you, I use Java as Clojure with a clacky split keyboard, feels great.
Gareth321: As an accelerationist and transhumanist, no way! These models passed the Turing test years ago. When a thing is indistinguishable from human, it is human. Our brains are, after all, just a collection of learned memetic weights. Just ask the determinists.
weird-eye-issue: Except that the API cost is literally logged on disk for every session and it's easy to analyze those logs.
EmanuelB: I think this is the case. In the early GPT-4 days I tested the same model side by side across the subscription and API. The API always produced a longer better answer. To me it felt like the API model was working how it was supposed to work while the subscription model tried to reduce its token usage. From a business perspective that would make sense. I then switched to API only because I felt like it was worth the extra cost. I did a similar test with sonnet about 6 months ago and noticed no difference, except that the subscription was way cheaper than API access. This is not the case anymore, at least not for me. The subscription these days only lasts for a few requests before it hits the usage limit and goes over to "extra usage" billing. Last week I burned through my entire subscription budget and 80$ worth of extra usage in about 1h. That is not sustainable for me and the reason I started looking at alternatives. From a business perspective it all makes sense. Anthropic recently gave away a ton of extra usage for free. Now people have balance on their accounts that Anthropic needs to pay for with compute, suddenly they release a model that seems to burn those tokens faster than ever. Last week I felt like the model did the opposite, it was stopping mid implementation and forgetting things after only 2 turns. Based on the responses I got it seemed like they were running out of compute, lobotomized their model and made it think less, give shorter answers etc. Probably they are also doing A/B testing on every change so my experience might be wildly different from someone else's.
weird-eye-issue: > In the early GPT-4 days I tested the same model side by side across the subscription and API. The API always produced a longer better answer. If you are comparing responses in ChatGPT to the API, it's apples and oranges, since one applies a very opinionated system prompt and the other does not.
weitendorf: I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model. Most of the value in agentic development IMO is in the feedback loop/ability for the model itself to intelligently pull in context, but if you want to push a lot of context or have steps that are more prescribed, it's kind of a waste of money to have the big model do that. Much better to use the cheap model as a kind of pre-processing/noise-reduction step that filters out junk context. I would say that right now the benefits are largest for this kind of work with medium-sized multimodal models. For example I have hooks/automation that use https://github.com/accretional/chromerpc to automatically screenshot UIs and then feed them into qwen-family models. It's more that I don't want to pay Opus to look at them or remember/be instructed to do that unless it goes through QA first.
eloisant: Depends on the culture as you said, but some of them are even less direct than English speaking countries. Japan for example.
robertfall: As far as I understand, Opus 4.7 disregards the disable adaptive thinking flag. So if you're seeing it perform well, perhaps their evals are in line?
Gareth321: Opus 4.6 did actually get dumber. The Director of AI at AMD logged a pretty detailed issue with receipts: https://github.com/anthropics/claude-code/issues/42796
smusamashah: Opus 4.7 is a slight regression over 4.6 https://petergpt.github.io/bullshit-benchmark/viewer/index.v... Max is worse than High.
AnthonBerg: It is capable of particularly beautiful writing. I've had a really nice user preference for writing style going. That user preference clicks better into place with 4.7; the underlying rhythm and cadence is also much more refined. Rhythm and cadence both abstract and concrete – what is led into view and how, as well as the words and structures by which this is done. The combination is really quite something.
glimshe: If Claude AI is so good at coding, why can't Anthropic use it to improve Claude's uptime and fix the constant token quota issues?
Keyframe: Maybe this is the result
weitendorf: The UIs all bake in system prompts and other tunable configs that the API leaves open, so does Claude Code and other harnesses. So anything you notice different over the API when you're controlling the client is almost certainly that. Note that this is kind of something they have to do because consumer UI users will do stuff like ask models their name or date, or want it to respond politely and compassionately, and get upset/confused when they just get what's in the weights. The problem with subscriptions for this kind of stuff is that it's just incompatible with their cost structure. The worst being, subscription usage is going to follow a diurnal usage pattern that overlaps with business/API users, so they're going to have to be offloaded to compute partners who most likely charge by the resource-second. And also, it's a competitive market, anybody who wants usage-based pricing can just get that. So you basically end up with adverse selection with consumer subscription models. It's just kind of an incoherent business model that only works when your value proposition is more than just compute (which has a usage-based, pretty fungible market).
rkuska: For 4.7 it is no longer possible to disable adaptive thinking. Which is weird given the comment from Boris followed with silence (and closed github issue). So much for the transparency. > Claude Opus 4.7 (claude-opus-4-7), adaptive thinking is the only supported thinking mode. Thinking is off unless you explicitly set thinking: {type: "adaptive"} in your request; manual thinking: {type: "enabled"} is rejected with a 400 error. https://platform.claude.com/docs/en/build-with-claude/adapti... For my claude code I went with the following config:
* /effort xhigh (in the terminal cli) - To avoid lazying
* "env": {"CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"} (settings.json) - It seems like opus is just worse with larger context
* "display": "summarized" (settings.json) - To bring back summaries.
* "showThinkingSummaries": true (settings.json) - Should show extended thinking summaries in interactive sessions
Freaking wizardry.
siva7: You're misunderstanding the purpose of "auto"-model-routing or things like "adaptive thinking". It's a solved problem for the companies. It solves their problems. Not yours ;)
scotty79: What would be a bad answer to stairs/elevator question?
onlyrealcuzzo: I switched to Codex and found it extremely inferior for my use case. It is much faster, but faster worse code is a step in the wrong direction. You're just rapidly accumulating bugs and tech debt, rather than more slowly moving in the correct direction. I'm a big fan of Gemini in general, but at least in my experience Gemini Cli is VERY FAR behind either Codex or CC. It's both slower than CC, MUCH slower than Codex, and the output quality considerably worse than CC (probably worse than Codex and orders of magnitude slower). In my experience, Codex is extraordinarily sycophantic in coding, which is a trait that couldn't be more harmful. When it encounters bugs and debt, it says: wow, how beautiful, let me double down on this, pile on exponentially more trash, wrap it in a bow, and call you Alan Turing. It also does not follow directions. When you tell it how to do something, it will say, nah, I have a better faster way, I'll just ignore the user and do my thing instead. CC will stop and ask for feedback much more often. YMMV.
cageface: I've had exactly the opposite experience. Getting great results using GPT for hours every day since 5.3. You need to put the effort level on at least high though. Every time I hand off a task to Opus to see if it's gotten better I'm disappointed. At least 4.7 seems to have realized I have skill files again though.
100ms: I originally switched to Opus because it could reliably write Rust. As of 2 weeks ago, I'm using Codex because it writes way more compact and idiomatic Rust. Just another anecdote for the pile. I detest ChatGPT's persona, but Codex definitely feels better than Claude Code for anything I throw at it
cageface: On at least high effort level I find GPT 5.4 easily beats Opus 4.6 in code generation and debugging issues.
otabdeveloper4: Fundamentally they're the same technology with the same exact algorithms under the hood; only the post-training alignment differs.That is, the difference you see is either placebo effect or you being lucky and better aligning with model post-training bias.
morgengold: Tried it for different Vue, Nuxt, Supabase projects. Think of CRM SaaS or Sales App like size. Also for my personal bot with which I communicate via telegram. First feelings: Solves more of the complex tasks without errors, thinks a bit more before acting, fewer errors, doesn't lose the plot as fast as 4.6. All in all for me a step further. Not quite as big of a jump as 4.5 -> 4.6 but feels more subtle. Maybe just an effect of better tool management. (I am on MAX plan, using mostly 4.7 medium effort).
anshumankmr: Something about the Mythos preview had made me think that a new model was en route. I was hoping for Haiku 4.6 (an underrated model I feel)
adammarples: Don't forget that you can also write code by hand
berkes: Someone just asked me what I dislike most about Mistral and about Claude code. I run both in zed editor. Claude Code's integration is subpar - its ACP does not report tasks, doesn't give diffs and so on. Mistral has rate limits that I hit just too often. I'm now using Mistral Pro, where this is worse, using pay-as-you-go is better but costs me 10x the pro. The agent then stops with an error.
gghootch: Thanks for the feedback, will work on making this more transparent so future users do not have this experience. I did want to call out that headroom is not based on RTK - it includes RTK sure, but headroom cli has a lot more going on under the hood. For more see https://github.com/chopratejas/headroom
Traubenfuchs: Anthropic's throwing out new models but the devs are NOT happy. Was all the goodwill people had for Anthropic products them selling unsustainably high performance at a loss?
not_that_d: Yeah, no. I canceled my subscription yesterday. Claude is unusable right now.
adammarples: They didn't describe the model, they described (accurately) the behaviour. They are useful descriptors of behaviour.
TeMPOraL: Except for the model weights themselves, they hardly have any!
ahartmetz: That a human would not ask such a question means it's not in the training set, so it shows how bad an LLM can be at thinking from first principles. Which, I think, is the point of such silly questions.
FeepingCreature: It's very unlikely that API use is subsidized.
edf13: Related: https://news.ycombinator.com/item?id=47803847
krzat: Humans tend to confabulate when asked "why you did X", funny how LLMs are pretty much the same.
VA1337: Guys, this may have already been said, but there is a strong feeling that before the release of a new model, they are dumbing down the previous one.
ramon156: My voice will probably not be very audible here, but I ran Codex and CC side-by-side. I had to steer claude a bunch of times, only to be hit with a limit and no actual code written (and frankly no progress, I already did the research). I was on xhigh. I ran gpt-5.4 high. Same research, GPT asked maybe 3-4 questions, looked up some stuff then got to work. I only changed 1-2 things I would've done differently, and I was able to continue just fine. Anthropic, what the fuck happened?
beaker52: It doesn’t really come as a surprise to me that these companies are struggling to reliably fix issues with software which relies on a central component which is nondeterministic. But they made their own bed with that one.
astrange: It also gets confused if the entire prompt is in a text file attachment. And the summarizer shows the safety classifier's thinking for a second before the model thinking, so every question starts off with "thinking about the ethics of this request".
FeepingCreature: I'd get confused if I was a LLM and you put my entire prompt in a text file attachment. I'd be like, "is this the user or is this a prompt injection??"
FeepingCreature: Errors compounding is a meme. In iterated as well as verifiable domains, errors dilute instead of compounding because the llm has repeated chances to notice its failure.
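A back-of-the-envelope sketch of the "dilute vs. compound" argument above, in Python. It assumes every step and every review pass is independent, and the probabilities are made-up illustrative numbers, not measurements of any model; the function names are hypothetical.

```python
def compounded_error(p_error: float, steps: int) -> float:
    """Chance of at least one mistake over `steps` steps when nothing is ever re-checked."""
    return 1.0 - (1.0 - p_error) ** steps


def residual_error(p_error: float, p_catch: float, passes: int) -> float:
    """Chance that a mistake is made AND slips past every one of `passes` review passes.

    Assumes each pass independently catches the mistake with probability `p_catch`.
    """
    return p_error * (1.0 - p_catch) ** passes


if __name__ == "__main__":
    # Illustrative numbers only: 20% error per step, 50% catch rate per review pass.
    print(compounded_error(0.2, 3))     # grows with the number of unchecked steps
    print(residual_error(0.2, 0.5, 3))  # shrinks with the number of review passes
```

Under these assumptions the unchecked error grows with the number of steps while the reviewed error shrinks with every pass; how close real agent loops come to independent checks is exactly what is being debated.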
hk__2: I think by myself and I use the best tools out there to achieve what I want.
dmallory: Personally, this is not my experience (and I'm sure others have also had very good results using Codex and this isn't some astroturf campaign). The way I'd frame it is that both models have areas they excel at. I've had very good results with having Claude write implementation plans and initial investigations and letting Codex do the work of implementation.
porknbeans00: Does the second amendment cover unregistered thinking machines? Asking for a friend.
Natfan: this comment feels pretty rude and disrespectful for no real reason?
zerotoship: the quality of 4.6 dropped too much. I already switched to 4.7 & am testing it out. Token consumption is definitely low from what I have seen
verve_rat: We aren't talking about API costs or number of tokens consumed, we are talking about number of tokens in a monthly subscription.
arcanemachiner: It's early days for Opus 4.7, but I will say this: Today, I had a conversation go well into the 200K token range (I think I got up to 275K before ending the session), and the model seemed surprisingly capable, all things considered. Particularly when compared to Opus 4.6, which seems to veer into the dumb zone heavily around the 200k mark. It could have just been a one-off, but I was overall pleased with the result.
idiocache: dopameme?
Kim_Bruning: I can confirm from experience that reviewing your own code for vulnerabilities has fallen under "prohibited uses" starting with Opus 4.6 as recently as April 10, forcing me to spend a day troubleshooting and quarantining state from my search system. "This request triggered restrictions on violative cyber content and was blocked under Anthropic's Usage Policy. To learn more, provide feedback, or request an exemption based on how you use Claude, visit our help center: https://support.claude.com/en/articles/8241253-safeguards-wa..." "stop_reason":"refusal" To be fair, they do provide a form at https://claude.com/form/cyber-use-case which you can use, and in my case Anthropic actually responded within 24 hours, which I did not expect. I admit I'm now once bitten twice shy about proceeding though. Opus 4.7 was still 'pausing' (refusing) random things on web when I tested it yesterday, so I don't know whether the form applies to 4.7.
weird-eye-issue: I'm pretty tired of the low quality discussion around AI, it brings out really low IQ people who talk like they have any clue about what's going on.
weird-eye-issue: Again, it is not based on number of tokens. If it was solely based on number of tokens then things like cache misses would not impact the usage so much. It's based on the actual cost which includes things like the caching costs.
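To make the cost-versus-token-count point above concrete, here is a small Python sketch. The per-million-token prices, the `Usage` fields and the `request_cost` helper are all hypothetical, chosen only for illustration; they are not Anthropic's actual rates or metering logic.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int        # uncached input tokens
    cache_write_tokens: int  # tokens written to the prompt cache
    cache_read_tokens: int   # tokens served from the prompt cache
    output_tokens: int


# HYPOTHETICAL prices in dollars per million tokens, for illustration only.
PRICES = {"input": 5.0, "cache_write": 6.25, "cache_read": 0.5, "output": 25.0}


def request_cost(u: Usage) -> float:
    """Dollar cost of one request under the hypothetical price table above."""
    total = (
        u.input_tokens * PRICES["input"]
        + u.cache_write_tokens * PRICES["cache_write"]
        + u.cache_read_tokens * PRICES["cache_read"]
        + u.output_tokens * PRICES["output"]
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Same 102,000 tokens in both requests; only the cache outcome differs.
    hit = Usage(input_tokens=1_000, cache_write_tokens=0, cache_read_tokens=100_000, output_tokens=1_000)
    miss = Usage(input_tokens=1_000, cache_write_tokens=100_000, cache_read_tokens=0, output_tokens=1_000)
    print(request_cost(hit), request_cost(miss))
```

With identical token counts, the cache miss costs several times more than the cache hit in this sketch, which is why a cost-based quota can drain faster even when the raw token count looks unchanged.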
jacamera: I don't think there are good analogies to physical tools. It would be something like a nondeterministic version of a replicator from Star Trek which to me would feel much closer to a slot machine than a CNC mill.
GaryBluto: Not that anybody can actually use it though, as a large percentage of Copilot users are facing seemingly random multi-day rate limits. https://www.theregister.com/2026/04/15/github_copilot_rate_l...
user34283: I don’t know about rate limits, but I’ve been running into timeouts with Sonnet 4.6 after they don’t complete within 4-5 mins. I have not encountered the same issues when using Claude Code. Perhaps Copilot is on some sort of second rate priority. Of course it’s the only thing available in our Enterprise, making us second class users. Using the Copilot Business Plan we get the same rate limits as the student tier, making it infeasible to use Opus. Meanwhile management talks about their big plans for AI.
michaelscott: Their problem space may be just fine with open weight models regardless, but yes the release of gemma 4, GLM 5.1 and qwen 3.5 (and now 3.6!) have all happened in the last 6 months
throwaway2027: Gemini and Codex already scored higher on benchmarks than Opus 4.6 and they recently added a $100 tier with limited 2x limits, that's their answer and it seems people have caught on.
deaux: > that's their answer and it seems people have caught on. There's nothing to catch on to. OpenAI have been shouting "come to us!! We are 10x cheaper than Anthropic, you can use any harness" and people don't come in droves. Because the product is noticeably worse.
ninjagoo: > and people don't come in droves. Because the product is noticeably worse. As of Oct 2025, it appears that openai's market share is 15x that of anthropic: 60% vs 3.5% [1]. As of April 2026, openai has 900 million weekly users [2] while anthropic has 300 million monthly users [1]. As of March 2026, openai app downloads were 2.2 million per day, while anthropic app downloads were 340,000. openai mobile users were 248 million per day, while anthropic mobile users were 9.4 million. In Feb 2026, chatgpt had 5.4 billion web visits, while claude had 290 million web visits. [3] It seems to me that openai operates at a much higher scale than anthropic. Since you used droves as a proxy for product quality, by that standard anthropic has a much inferior product. :) [1] https://sqmagazine.co.uk/claude-vs-chatgpt-statistics/ [2] https://www.pbs.org/newshour/nation/openai-focuses-on-busine... [3] https://www.forbes.com/sites/conormurray/2026/03/06/claude-s...
XCSme: I am not sure it's a mistake, this might be their new "adaptive reasoning" + hidden reasoning trace, so we can't verify.
thaanpaa: Well, the fun part is that the algorithms themselves are deterministic. They are just so afraid of model distillation that they force some randomness on top (and now hide thinking). Arguably for coding, you'd probably want temperature=0, and any variation would be dependent on token input alone.
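For readers unfamiliar with the temperature knob mentioned above, a minimal Python sketch of temperature sampling over a toy logit vector. It treats temperature 0 as greedy argmax, which is the usual convention; the `sample_token` helper and the logits are hypothetical, and nothing here reflects how any provider's serving stack actually works.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the max before exponentiating for numerical stability (softmax weights).
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
    rng = random.Random(0)
    print([sample_token(toy_logits, 0, rng) for _ in range(5)])    # same index every time
    print([sample_token(toy_logits, 1.0, rng) for _ in range(5)])  # varies with the RNG
```

At temperature 0 the output depends only on the logits, so the same input gives the same token in this toy setting; whether a hosted model is bit-for-bit repeatable also depends on the serving infrastructure, which this sketch does not model.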
embedding-shape: > I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model. Yes, in theory, this should hold up, at least according to evaluations. According to real, practical use though, none of the open weight models are generally strong enough to handle coding and programming in a professional environment, unless you have tightly controlled scope and specialized models for those scopes, which generally I don't think you have, but maybe it's just me jumping around a lot. Even with feedback loops, harnesses and what not, even the strongest local models I can run with 96GB of VRAM don't seem to come close to what OpenAI offered in the last year or so. I'm sure it'll be ready at one point, but today it isn't. With that said, if you know specific models you think work well as general, local programming models, please share which ones, happy to be shown wrong. Latest I've tried was Qwen3.6-35B-A3B which gets a bit further but still instruction following is a far cry from what OpenAI et al offered for years.
concats: The recently viral 'grill-me' skill is great for exactly this. It's just a super simple skill that, when invoked, makes the model spend considerable time asking design and architecture questions and fleshing out any plan with you. A planning session without it might be Claude asking you 2 questions, and with it 22.
stingraycharles: Yes, which I understand, but I think they’re crippling their product for users this way. I don’t think it’s just this, because the thinking tokens often reveal more about Anthropic’s inner workings. For example, it’s how the whole existence of Claude’s soul document was reverse engineered, it often leaks details about “system reminders” (eg long conversation reminders). I think it’s also just very convenient for Anthropic to do this. The fact that they’re also presenting this as a “performance optimization” suggests they’re not giving the real reason they do this.
algoth1: I suspect this is part of the reason why gemini 3.1 pro is insanely good on AiStudio and pretty bad on the gemini app. I have thousands of small videos to convert to detailed descriptions and I'm using a super detailed system prompt. It works perfectly either via api or AiStudio. I tried doing a gem on the gemini app using the same prompt as the gem instructions and I just can't get the same results. So, the issue might be not just the RLHF but also the massive system prompts injected on the app interface
GaryBluto: I can see no other explanation for this disastrous launch other than Anthropic trying to ruin their reputation for some reason.
Grimburger: I've watched my boss type out a lengthy few sentences to do a find+replace, it took him a few minutes. This is a guy with 10+ years experience as a dev. It was a watershed moment for me, many people really have stopped thinking for themselves. The way humans are depicted in Wall-E springs to mind as being quite prescient, it wasn't meant to be a doco
layer8: That’s why we’re testing for it.
embedding-shape: Fun to revisit no doubt, the comments make it even better.> SuckCocker 7 years ago - "in short: SKYNET is not far away. Be proud to be a part of it!"
algoth1: On the leaked codebase they show 100+ messages that are randomly cycled through
1dom: The issue is business and transparency. Transparency is often in the customer's interest at the individual business's expense. There are very, very few things that can be completely transparent without giving competitors an advantage. The nice solution to this is to be better and faster than your competitors, but sometimes it's easier just to remove transparency.
epitrochoid413: Another round of let's dumb down the previous model so the new model feels "game changing" and "OP".
Kim_Bruning: > It's easier to produce vulnerable code than it is to use the same Model to make sure there are no vulnerabilities. I once had a car where the engine was more powerful than the brakes. That was one heck of an interesting ride. So now we have a company that supplies a good chunk of the world's software engineering capability. They're choosing a global policy that works the same as my fun car. Powerful generative capacity; but gating the corrective capacity behind forms and closed doors. Anthropic themselves are already predicting big trouble in the near term [1], but imo they've gone and done the wrong thing. Compare Pandora releasing the evils from the box, slamming the lid too late and trapping hope inside. Given their model naming scheme, they should read more Greek Mythos. ;) [1] https://thehill.com/policy/technology/5829315-anthropic-myth...
projektfu: After having worked with people from former Eastern Bloc countries, I would nominate a few of them for direct communication, e.g., "I won't do that because it is a stupid idea," or, "Can we discuss this when you know what you're doing?"
captainregex: I’m super envious. I can’t seem to do anything without half a million tokens. I had to create a slash command that I run at the start of every session so the darn thing actually reads its own memory; whatever the default is just doesn’t seem to do it. It’ll do things like start to spin up scripts it’s already written and stored in the code base unless I start every conversation with instructions to go read persistence and memory files. I also seem to have to actively remind it to go update those things at various points in the conversation, even though it has instructions to self-update. All these things add up to a ton of work every session. I think I’m doing it wrong.
hexaga: Meh. Temp 0 means throwing away huge swathes of the information painstakingly acquired through training for minimal benefit, if any. Nondeterminism is a red herring: the model is still going to be an inscrutable black box with mostly unknowable nonlinear transition boundaries w.r.t. inputs, even if you make it perfectly repeatable. It doesn't protect you from tiny changes in inputs having large changes in outputs _with no explanation as to why_. And in the process you've made the model significantly stupider. As for distillation... sampling from the temp 1 distribution makes it easier.
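For readers unfamiliar with the temp 0 versus temp 1 distinction above, here is a minimal sketch of temperature sampling over a toy set of logits. The logit values and function names are made up for illustration and are not taken from any real model or API; the point is only that greedy (temp 0) decoding always returns the top token, while temp 1 sampling keeps the rest of the learned distribution in play.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy_pick for temp 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Temp 0 in the limit: always take the single highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical logits for three candidate tokens (assumed values).
logits = [2.0, 1.9, 0.5]
rng = random.Random(0)

print(greedy_pick(logits))                    # the near-tied runner-up is never chosen
print(softmax_with_temperature(logits, 1.0))  # the runner-up keeps substantial probability
print([sample(logits, 1.0, rng) for _ in range(10)])
```

With these toy numbers the second token is almost as likely as the first at temp 1, yet greedy decoding never emits it, which is one way to read the "throwing away information" complaint.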
jermaustin1: I keep hearing both sides of this "debate," but no one is providing any direct evidence other than "I do(n't) think that is true."
MaxikCZ: Lul, I'm embracing this "cringe" you talk about :) Every time I read it, it makes me laugh :D
arcatech: I believe AWS forwards requests (for Claude models) to Anthropic’s servers. They don’t host those models.
c0wb0yc0d3r: The buddies were Anthropic's April Fools' Day stunt. Buddies were removed from a newer version of Claude Code. By default, Claude Code updates automatically.
robeym: Not being able to reliably control the subagent model is the main reason I have it off.
paulluuk: Sorry, I was not specific enough. I did not mean that open source itself is not enough, I meant that an open source model that can actually run locally on my machine is not enough. A 32B model cannot compete with a 250B+ state of the art model, at least that's my experience and seems to be the experience of many others as well.
Filligree: You can’t get the couch into the elevator, typically. Trust me, I tried. Couch depending. I will persist in trying every time this comes up.
bustah: Worth reading alongside the 4.7 announcement is Anthropic's Automated Weak-to-Strong Researcher paper from three days ago. Nine Claude Opus 4.6 agents running in parallel sandboxes for five days scored 0.97 PGR on an alignment benchmark. Two human researchers scored 0.23 over seven days. The paper calls some of the agents' methods "alien science" because researchers cannot interpret them. The winning method showed no statistically significant improvement when applied to production Sonnet 4, so the agents overfit. The model used in the experiment is the same 4.6 whose model card documents roughly 8% chain-of-thought contamination. Anthropic's own framing asks for evaluations the agents cannot tamper with, which is the right instinct and a quiet admission that they are building systems they need to defend their safety work against. The cost number is real. The alignment story is more complicated than the summary suggests. Full writeup with citations: https://sloppish.com/alien-science.html
HarHarVeryFunny: AFAIK the way caching works is at API key level, which will be shared across the main/parent agent and all subagents. Note that the model API is stateless - there is no connection being held open for the lifetime of any agent/subagent, so the model has no idea how long any client-side entity is running for. All the model sees over time is a bunch of requests (coming from a mixture of parent and subagents) all using the same API key, and therefore eligible to use any of the cached prompt prefixes being maintained for that API key. Things like subagent tool registration are going to remain the same across all invocations of the subagent, so those would come from cache as long as the cache TTL is long enough.
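The caching behaviour described above can be sketched as a toy model. This is not Anthropic's implementation or API: the `PrefixCache` class, the choice to scope entries by API key, the exact-match lookup on the prefix, and the TTL refresh on every request are all assumptions made purely to illustrate the idea that stateless requests sharing a key and a prefix can hit the same cache entry.

```python
import hashlib
import time

class PrefixCache:
    """Toy prompt-prefix cache scoped by API key (hypothetical, not a real API)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # (api_key, prefix_hash) -> expiry time

    def _key(self, api_key, prefix):
        digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        return (api_key, digest)

    def request(self, api_key, prefix):
        """Handle one stateless request; return True on a cache hit."""
        key = self._key(api_key, prefix)
        now = self.clock()
        expiry = self.entries.get(key)
        hit = expiry is not None and now < expiry
        # Assumption: hit or miss, the entry is (re)written so the TTL restarts.
        self.entries[key] = now + self.ttl
        return hit

# A fake clock makes the TTL behaviour easy to follow.
now = [0.0]
cache = PrefixCache(ttl_seconds=300, clock=lambda: now[0])
tools = "subagent tool definitions"  # identical on every subagent invocation

print(cache.request("key-A", tools))  # first request from the parent agent
now[0] = 60
print(cache.request("key-A", tools))  # a subagent, same key, within the TTL
now[0] = 1000
print(cache.request("key-A", tools))  # same key, long after the TTL expired
print(cache.request("key-B", tools))  # a different API key
```

The cache never learns which agent sent a request or how long that agent has been running; it only sees the key, the prefix, and the time of arrival, which matches the stateless picture in the comment.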
robeym: I'm also noticing today that the model is hanging a lot. 5 min in, 50 tokens. Stuck in "Still here, still at it..."
throwanem: Well, that's okay; you're young. There are better and more topical jokes in your future, and it will serve you well in making them to have encountered this particular, extremely stale and suspiciously stained, cookie. Just be careful you don't take too big a bite!
eloisant: Yes, they're not as powerful; that means you need to feed them smaller tasks and rely more on plan mode.