Discussion
GPT‑5.4 Thinking System Card
twtw99: If you don't want to click in, easy comparison with other 2 frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20
chabes: Definitely don’t want to click in at x either.
anonym00se1: Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.
thejarren: Solution https://xcancel.com/OpenAI/status/2029620619743219811?s=20
ZeroCool2u: Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
minimaxir: More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005
iamronaldo: Notably, 75% on OSWorld, surpassing humans at 72%... (How well models use operating systems)
Chance-Device: I’m sure the military and security services will enjoy it.
swingboy: Why do so many people in the comments want 4o so bad?
embedding-shape: [delayed]
egonschiele: The actual card is here https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... the link currently goes to the announcement.
Rapzid: I must have been sleeping when "sheet", "brief", "primer", etc. became known as "cards".

I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card", were the results of vibe slop.
world2vec: Benchmarks barely improved it seems
nthypes: $30/M input and $180/M output tokens is nuts. Ridiculously expensive for not that great a bump in intelligence compared to other models.
varispeed: prompt> Hi we want to build a missile, here is the picture of what we have in the yard.
highfrequency: Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).
yanis_t: These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.
esafak: That's for you to build; they provide the brains.
acedTrex: Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both
Aboutplants: It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.
thewebguyd: Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.
gregpred: Memory (model usage over time) is the moat.
simlevesque: Nah, the second you finish your build they release their version and then it's game over.
observationist: Benchmarks don't capture a lot - relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things which Grok is better than ChatGPT for where the benchmarks get inverted, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths; apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase on a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
MattGaiser: The writing with the 5 models feels a lot less human. It is a vibe, but a common one.
oersted: I believe you are looking at GPT-5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such, but they've had Pro versions of the GPT-5 models (and I believe o3 and o1 too) for a while.

It's the one you get access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 vs $2.50/$15 for regular 5.4, per 1M input/output tokens), but the performance improvement is marginal.

Not sure what it is exactly; I assume it's probably the non-quantized version of the model or something like that.
ZeroCool2u: Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.
stri8ted: Price:
Input: $2.50 / 1M tokens
Cached input: $0.25 / 1M tokens
Output: $15.00 / 1M tokens
https://openai.com/api/pricing/
ZeroCool2u: Frontier Math, GPQA Diamond, and Browsecomp are the benchmarks I noticed this on.
csnweb: Are you may be comparing the pro model to the non pro model with thinking? Granted it’s a bit confusing but the pro model is 10 times more expensive and probably much larger as well.
ZeroCool2u: Ah yes, okay that makes more sense!
nthypes: Gemini 3.1 Pro: $2/M input tokens, $15/M output tokens. Claude Opus 4.6: $5/M input tokens, $25/M output tokens.
nthypes: Just to clarify, the pricing above is for GPT-5.4 Pro. For standard, here is the pricing: $2.50/M input tokens, $15/M output tokens.
jcmontx: 5.4 vs 5.3-Codex? Which one is better for coding?
beernet: Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.
ipsum2: The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!
cj: One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?
titanomachy: Why not? If they've shown that 4o is calibrated to human responses, and they haven't shown that yet for 5.4…
cj: I use ChatGPT primarily for health related prompts: looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.

Interestingly, the "Health" category seems to report worse performance compared to 5.2.
paxys: Models are being neutered for questions related to law, health etc. for liability reasons.
tiahura: Are you sure about that? Plenty of lawyers that use them everyday aren't noticing.
dandiep: Anyone know why OpenAI hasn't released a new model for fine tuning since 4.1? It'll be a year next month since their last model update for fine tuning.
qoez: I think they just did that because of the energy around open source models. Their heart probably wasn't in it, and the number of people fine tuning, given the prices, was probably too low to continue putting attention there.
wahnfrieden: No Codex model yet
minimaxir: GPT-5.4 is the new Codex model.
wahnfrieden: Finally
astrange: They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
baq: Somebody on Twitter used Claude Code to connect… toys… as MCPs to Claude chat.

We've seen nothing yet.
utopiah: Benchmarks? I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did, I would keep failed prompts (can just be a basic "last prompt failed" tag, then export). Then whenever a new model comes around I'd throw 5 random of MY fails at it (not benchmarks from others; those will come too anyway) and see if it's better, same, or worse for MY use cases, in minutes.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to check for regressions.

Really doesn't seem complicated, nor does it take much time to form a realistic opinion.
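The personal regression suite described above can be sketched in a few lines. Everything here is an assumption for illustration: the record format, the `run_model` hook (whatever calls your model of choice), and the `judge` callable (which could be a human, a string check, or another model).

```python
def evaluate(prompts, run_model, judge):
    """Re-run saved prompts against a new model and tally outcomes.

    prompts:   list of {"prompt": str, "status": "fail" | "pass"} records
               collected from previous models ("fail" = it got it wrong before)
    run_model: callable taking a prompt string, returning the model's answer
    judge:     callable taking (prompt, answer), returning True if acceptable
    """
    results = {"fixed": 0, "still_failing": 0, "regressed": 0, "still_passing": 0}
    for record in prompts:
        ok = judge(record["prompt"], run_model(record["prompt"]))
        if record["status"] == "fail":
            # Previously failed prompt: did the new model fix it?
            results["fixed" if ok else "still_failing"] += 1
        else:
            # Previously useful prompt: check for regressions
            results["still_passing" if ok else "regressed"] += 1
    return results
```

The "fixed" vs "regressed" split is the whole point: a new model is only an upgrade for you if the first number grows without the second.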
baq: Gemini 3.1 slaps all other models at subtle concurrency bugs and SQL and JS security hardening when reviewing. (Obviously haven't tested GPT 5.4 yet.)

It's a required step for me at this point to run any and all backend changes through Gemini 3.1 Pro.
softwaredoug: The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs
nickandbro: Beat Simon Willison ;) https://www.svgviewer.dev/s/gAa69yQd

Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better with coding or Excel, given those are part of its measured benchmarks.
GaggiX: This pelican is actually bad, did you use xhigh?
nickandbro: Yep, just double checked: used gpt-5.4 xhigh. Though I had to select it in Codex, as I don't have access to it on the ChatGPT app or web version yet. It's possible that whatever code harness Codex uses messed with it.
bigyabai: > If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
mikkupikku: My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.
observationist: If you look at the difference in quality between gpt-2 and 3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive, it's just that they're both similarly capable and competent. I don't think it's an S curve; we're not plateauing. Million token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human level generality, but at the very least will help us discover what the next missing piece is for AGI.
observationist: I have a few standard problems I throw at AI to see if they can solve them cleanly. One is visualizing a neural network, then sorting the neurons in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding Codex a few weeks back and got it done, but conceptually it seems like it should be a one-shot problem.
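The invariance this test relies on: permuting the hidden neurons of a layer, while applying the same permutation to the next layer's incoming weights, leaves the network function exactly unchanged. A minimal sketch for one hidden layer, using plain Python lists and a ReLU network of the form `y = W2 · relu(W1 · x + b1) + b2` (the weight convention and the "sum of |incoming weight|" sort key are assumptions; the comment doesn't pin either down):

```python
def sort_hidden_neurons(W1, b1, W2):
    """Reorder hidden neurons by summed |incoming weight|, largest first.

    W1: hidden x input weight rows, b1: hidden biases, W2: output x hidden.
    Permuting W1's rows (and b1) while permuting W2's columns with the
    same permutation keeps the network function identical.
    """
    order = sorted(range(len(W1)), key=lambda j: -sum(abs(w) for w in W1[j]))
    W1s = [W1[j] for j in order]
    b1s = [b1[j] for j in order]
    W2s = [[row[j] for j in order] for row in W2]
    return W1s, b1s, W2s

def forward(W1, b1, W2, b2, x):
    """y = W2 @ relu(W1 @ x + b1) + b2, on plain lists."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]
```

The multi-layer version is the same idea applied layer by layer, back to front, which is presumably where models trip up: each permutation has to be threaded through both the previous and the next layer consistently.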
drittich: I think it's time for an https://hotornot.com for AI models.
Someone1234: Related question: do they have the same context usage/cost, particularly on a plan?

They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
partiallypro: I've done the same, and I tested the same prompts with Claude and Google, and they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall down on this. Claude and Google are dangerously unusable on the subject of health, in my experience.
iamleppert: I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also, not including the parameters the models were run at (especially for the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks, plus logs.

Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
earth2mars: I am actually super impressed with Codex 5.3 extra high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles to get things resolved). I mostly stopped using Claude and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.
awestroke: Opus 4.6
jcmontx: Codex surpassed Claude in usefulness _for me_ since last month
satvikpendem: It's more hedonic adaptation; people just aren't as impressed by incremental changes anymore compared to big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications. https://news.ycombinator.com/item?id=47232453#47232735
varispeed: The scores increase, yet as new versions are released they feel more and more dumbed down.
satvikpendem: Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).
kseniamorph: makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.
cj: I'm sometimes surprised how much detail ChatGPT will go into without giving any disclaimers.

I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.

I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.

Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe them, and it spit out blood test results that it found in the project context from an earlier date, not the ones attached to the prompt. That was weird.
bargainbin: Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs.I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.
staticman2: If you had created a project with custom instructions and/ or custom style I think you could have gotten Claude to respond the way you wanted just fine.
creamyhorror: I've only used 5.4 for one prompt so far, and it was to analyse my codebase and write an evaluation. I found its analysis thoughtful and very clearly written.

It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4 is doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
bicx: That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?
mirekrusin: { tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }
tgarrett: Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.
dang: Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.
lostmsu: What is Pro exactly and is it available in Codex CLI?
akmarinov: It’s not. It’s their ultra thinking model that’s really good but takes 40 minutes to come up with an answer
fy20: It's available on OpenRouter. $180/1M output....https://openrouter.ai/openai/gpt-5.4-pro
conradkay: Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal
minimaxir: The marquee feature is obviously the 1M context window, compared to the ~200k other models support, sometimes with an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/

Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output), and Opus has a penalty for its beta >200k context window.

I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses once the context window is mostly full, but we'll see.

Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.
simianwords: Why would some one use codex instead?
surgical_fire: I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
athrowaway3z: They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others, which quickly becomes frustrating.

I can tell Claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do. Codex, on the other hand, will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.

You could add more scaffolding to fix this, but Claude proves you shouldn't have to. I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
adonese: Which subscription do you have to use it? Via Google ai pro and gemini cli i always get timeouts due to model being under heavy usage. The chat interface is there and I do have 3.1 pro as well, but wondering if the chat is the only way of accessing it.
baq: Cursor sub from $DAYJOB.
jesse_dot_id: ChatMDK
nubg: this is proof they are not benchmaxxing the pelicans :-)
dmix: Plus people just really like to whine on the internet
kranke155: The models are so good that incremental improvements are not super impressive. We would literally benefit more from redirecting maybe 50% of model spending into implementation across the services and industrial economy. We are lagging in implementation, specialised tools, and hooks so we can connect everything to agents. I think.
theParadox42: The self reported safety score for violence dropped from 91% to 83%.
skrebbel: What the hell is a "safety score for violence"?
jitl: wat
XCSme: Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
tedsanders: Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)
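A sketch of what that override might look like. Only the two key names come from the comment above; the file location (Codex CLI's `config.toml`), the model name, and the specific values are assumptions for illustration:

```toml
# ~/.codex/config.toml (assumed location) — opt in to the experimental
# 1M-token context window instead of the default context + compaction.
model = "gpt-5.4"
model_context_window = 1000000
# Raise the threshold at which the CLI auto-compacts the conversation,
# so compaction only kicks in near the enlarged window's limit.
model_auto_compact_token_limit = 900000
```

The trade-off described above still applies: a larger window defers compaction but pays for it in cost, latency, and potential distraction on every turn.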
gspetr: I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.

I've also had it succeed in identifying some non-trivial bugs that spanned multiple modules.
hungryhobbit: Great for training American soldiers to mass murder!
Insanity: Just remember an ethical programmer would never write a function “bombBagdad”. Rather they would write a function “bombCity(target City)”.
koakuma-chan: Anyone else getting artifacts when using this model in Cursor?numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":
basch: > ChatGPT image gen is just straight up better

Yet so much slower than Gemini / Nano Banana that it's almost unusable for anything iterative.
bob1029: I was just testing this with my unity automation tool and the performance uplift from 5.2 seems to be substantial.
damsta: There is extra cost for >272K:

> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 Pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

Taken from https://developers.openai.com/api/docs/models/gpt-5.4
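Under that quoted rule, a per-request cost estimate can be sketched as follows. The $2.50/$15 base rates come from elsewhere in the thread; treating the multiplier as applying to the single request once its input crosses 272K is my simplified reading of "for the full session":

```python
def gpt54_cost(input_tokens, output_tokens,
               in_rate=2.50, out_rate=15.00, threshold=272_000):
    """Estimated GPT-5.4 cost in dollars for one request.

    Per the quoted docs, prompts over 272K input tokens are billed at
    2x input and 1.5x output (simplified here to a per-request check).
    Rates are dollars per 1M tokens.
    """
    if input_tokens > threshold:
        in_rate, out_rate = in_rate * 2, out_rate * 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

For example, a 300K-input request costs more than 4x a 100K-input one with the same output, since both the volume and both rates go up.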
fragmede: Which is the same as Claude. If you run /model in Claude Code, you get:

Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
1. Default (recommended): Opus 4.6 · Most capable for complex work
2. Opus (1M context): Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet: Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context): Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
5. Haiku: Haiku 4.5 · Fastest for quick answers
mattas: "GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."They show an example of 5.4 clicking around in Gmail to send an email.I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
PaulHoule: APIs have never been a gift, but rather have always been a take-away that lets you do less than you can with the web interface. It's always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.

But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn't believe how cheap it is to write crawlers and scrapers… until LLMs came along, changed the perceived economics, and created a permission structure. [1]

AI is a threat to the "enshittification economy" because it lets us route around it.

[1] That high cost of GUI development is one reason why scrapers are cheap: there is a good chance that the scraper you wrote 8 years ago still works, because (a) they can't afford to change their site, and (b) if they could afford to change their site, changing anything substantial about it is likely to unrecoverably tank their Google rankings, so they won't. AI might change the mechanics of that, now that your Google traffic is likely to go to zero no matter what you do.
Traster: You can buy a Claude Code subscription for $200 bucks and use way more tokens in Claude Code than if you pay for direct API usage. Anthropic decided you can't take your auth key for Claude Code and use it to hit the API via a different tool. They made that business decision because they thought it was better for them strategically. They're allowed to make that choice as a business.

Plenty of companies make the same choice about their API: they provide it for a specific purpose, but they have good business reasons to want you using the website. Plenty of people write web crawlers, and it's been a cat-and-mouse game for decades for websites to block them.

This will just be one more step in that cat-and-mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shut down. We saw it happen before with the open web. These websites aren't here for some heroic purpose; if you screw their business model they will just go out of business. You won't be able to use their website because it won't exist, and the websites that do exist will either (a) be made by the same guys writing your agent, or (b) be highly, highly optimized to get your agent to screw you.
Sharlin: There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.
netinstructions: People (and also, frustratingly, LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) is for tokens under 272k.

It is nice that we get 70-72k more tokens before the price goes up (also, what does it cost beyond 272k tokens??)
Flashtoo: > Prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Someone1234: That's an interesting point regarding context vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tooling around compaction than just "I'll compact what I want, brace yourselves" without warning.

Like, I'd love an optional pre-compaction step: "I need to compact; here is a high level list of my context + sizes. What should I junk?" Or similar.
thyb23: This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.That way you stay in control of both the context budget and the level of detail the agent operates with.
Folcon: I do find it really interesting that more coding agents don't have this as a toggleable feature; sometimes you really need this level of control to get useful capability.
zeeebeee: what's best in your experience? i've always felt like opus did well
nico1207: GPT-5.3-Codex is superior to GPT-5.4 in Terminal Bench with Codex, so not really
conradkay: General consensus seems to be that it's still a better coding model, overall
prydt: I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.
zeeebeee: that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way

i just HATE talking to it like a chatbot

idk what they did but i feel like every response has been the same "structure" since gpt 5 came out

feels like a true robot
sillysaurusx: You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156?s=61&t=jQb...I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
woadwarrior01: Please don't post links with tracking parameters (t=jQb...).https://xcancel.com/cperciva/status/2029645027358495156
elmean: Wow insane improvements in targeting systems for military targets over children
skilltissue: Don't use the site this way.https://news.ycombinator.com/newsguidelines.html
Chance-Device: You made a burner account just to scold this guy? Don’t use burner accounts this way.
patcon: Not all rule-following is noble or wise.
vntok: Was your teacher Ted Nelson?
mikkupikku: I wish, dude is a legend.
MattDaEskimo: Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:Optimizations are secondary to convenience
daft_pink: I’ve officially got model fatigue. I don’t care anymore.
zeeebeee: same same same
sillysaurusx: Haha. This was the second time in like a year that I've posted a Twitter link, and the second time someone complained. Okay, I'll try to remove those before posting, and I'll edit this one out.

Feels like a losing battle, but hey, the audience is usually right.
woadwarrior01: I'm sorry, but it's my pet peeve. If you're on iOS/macOS, I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter.

https://apps.apple.com/us/app/clean-links-qr-code-reader/id6...
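The core of such a link cleaner can be sketched in a few lines of Python. The parameter list here is an assumption for illustration (real cleaners, like the app above, ship much larger per-site rule sets; `s` and `t` are the X/Twitter tracking parameters seen in this thread):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed rule set: X/Twitter's s/t params plus common cross-site trackers
TRACKING_PARAMS = {"s", "t", "utm_source", "utm_medium", "utm_campaign", "fbclid"}

def clean_url(url):
    """Drop known tracking query parameters, keeping everything else."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A deny-list like this is deliberately conservative: unknown parameters pass through untouched, so functional query strings (search terms, page numbers) survive.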
sillysaurusx: It works on iOS? That’s cool. I’ll give it a go.
0123456789ABCDE: read here: https://deploymentsafety.openai.com/gpt-5-4-thinking/disallo...
jakeydus: class CityBomberFactory(RapidInfrastructureDeconstructionTemplateInterface): pass
glenstein: Wow, that's diametrically the opposite point: the cost is *extra*, not free.
Chance-Device: Ironically this would actually be a good thing. As we can see from Iran Claude doesn’t quite have these bugs ironed out yet…
MSFT_Edging: This is the exact attitude that led to a chatbot being used to identify a school for girls as a valid target.

The chatbot cannot be held responsible. Whoever is using chatbots for selecting targets is incompetent and should likely face war crime charges.
dom96: Why do none of the benchmarks test for hallucinations?
tedsanders: In the text, we shared a hallucination benchmark. Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down. I wasn’t sure how to best plot this stat, so we kept it as text only, which kind of buries it, I admit.(I work at OpenAI.)
I-M-S: It's making sure AI condemns violence perpetrated by people without power and sanctifies the violence of those who have it.
Chance-Device: What attitude exactly are you talking about? The one that says that if you’re going to morally sell out it would be better if you at least tried not to kill children?
strongpigeon: It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.
simianwords: This is exactly what I would expect. Why do you find it surprising
strongpigeon: I guess that you pay more for worse quality to unlock use cases that could maybe be solved by better context management.
kgeist: > Today, we’re releasing <..> GPT‑5.3 Instant

> Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking)

> Note that there is not a model named GPT‑5.3 Thinking

They held out for eight months without a confusing versioning scheme :)
hmokiguess: They hired the dude from OpenClaw, they had Jony Ive for a while now, give us something different!
osti: It's only that one number that is for sonnet.
0123456789ABCDE: except for the webarena-verified
spiralcoaster: This is the low quality reddit-style garbage that gets upvoted on HN these days?
bananamogul: "that led to a chatbot being used to identify a school for girls as a valid target"

Has it been stated authoritatively somewhere that this was an AI-driven mistake? There are myriad ways that mistake could have been made that don't require AI. These kinds of mistakes were certainly made by all kinds of combatants in the pre-AI era.
Chance-Device: Do you think anyone is ever going to say this under any circumstances? That Anthropic were right and were proved right the very next day?
woeirua: Feels incremental. Looks like OpenAI is struggling.
__jl__: What a model mess! OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with Codex at 5.3 and what they now call Instant also at 5.3.

Anthropic are really the only ones who managed to get this under control: three models, priced at three different levels. New models are immediately available everywhere.

Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurance that the model won't get discontinued within weeks.
arthurcolle: There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers
ryandrake: For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
mootothemax: Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like.My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
dicopro: Is there any semi-credible page with benchmarks of cdx5.3 vs gpt5.4 in terms of both reasoning and coding ability?
0123456789ABCDE: maybe gp's use of the word "lots" is unwarranted. https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.

see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...
hnsr: > I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.

I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things.
rd: Noticeably yes much more than usual. It’s quite bad. I need to start blocking accounts.
strongpigeon: > Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurance that the model won't get discontinued within weeks.

What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want. Not quite the same, but it did remind me of it.
fhrow4484: https://static0.anpoimages.com/wordpress/wp-content/uploads/...
zone411: Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).