Discussion
charcircuit: Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
staticshock: LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
throwuxiytayq: > This is surprising given the excellent capabilities of GPT-5.2.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
dontlikeyoueith: Nope. It's only surprising to people who still think they're going to build God out of LLMs.
BugsJustFindMe: Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10, it's through memorization, not understanding. So, ok, don't trust a system on tasks that require understanding quantity when it hasn't been trained to understand quantity. But there may be other tasks that don't require the understanding the system lacks, and for those the system could still be trustworthy.
irishcoffee: > Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count to even 10 it's through memorization and not understanding.
I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it. AI is here!
bigstrat2003: Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
kenjackson: Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress. It would be interesting to actively track how far each successive model gets...
wg0: Actually, almost all LLMs get the numbering wrong when they write numbered sections in markdown. They skip numbers in between and such. So yes. And the valuations: trillion-dollar grifter industry.
hu3: > we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.
I think there is a valid insight here which many already know: LLMs are much more reliable at writing scripts and automation for certain tasks than at doing those tasks themselves. For example, if I give an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job. But if I tell the LLM to write a python or nodejs script to do the same, I get significantly better results. It's also often faster to generate and run the script than to let the LLM process large SQL files.
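Both checks quoted from the paper are a few lines of Python each, which is part of why the script route works so well. A minimal sketch (not the paper's code; the function names are mine):

```python
def parity(bits: str) -> str:
    """Parity of a bit string: 'even' if the count of 1s is even."""
    return "even" if bits.count("1") % 2 == 0 else "odd"

def balanced(s: str) -> bool:
    """Check parentheses balance with a running depth counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '('
                return False
    return depth == 0

print(parity("11000"))          # two 1s -> even
print(balanced("((((())))))"))  # 5 opens, 6 closes -> False
```

Running the script is deterministic, so the model only has to get the code right once rather than simulate the counting token by token.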
pants2: Doesn't this just look like another case of "count the r's in strawberry" ie not understanding how tokenization works?This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
wahnfrieden: It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
cr125rider: Seems like it’s maybe also a tool-steering problem. These models should be reaching for tools to solve factual problems like these; the LLM itself should stick to prose.
emp17344: I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?
dwa3592: Nice! Although I tried the parenthesis-balancing question with gemini and it gave the right answer on the first attempt.
dwa3592: But it's a tricky question for LLMs: it shows that if something isn't in the training set, LLMs can trip, which kinda shows that the intelligence is not generalized yet. I tried this with gemini - (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re) - and it tripped.
Lerc: The r's in strawberry presents a different level of task than people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.
A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same colour as that word. You are even given a book showing every combination to memorise. You learn the colours well enough that you can read and write coherently using them.
Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit, and you know that the colour can also be constructed as crimson followed by Disney-blond. Now, do both of those contain chocolate-brown or just one of them, and how many?
It requires exercising memory to do a task that is underrepresented in the training data, because humans simply never have to do the task at all when the answer can be derived from the question representation. Humans also don't need the ability that LLMs need here, because with a letter representation the answer falls out of the question itself.
revachol: I just tried it in ChatGPT "Auto" and it didn't work:
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this:
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
It would be interesting to test a couple of different models without a harness, so that no tool calls are possible.
kenjackson: Weird. I tried in ChatGPT Auto and it worked perfectly. I tried like 10 variations. I also did the letters-in-words tests and it got all of them right. The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no, and then realized I had asked for the "sound", not the letters. It subsequently got the rest of the "sounds-like" tests I did. Clearly, my ChatGPT is just better than yours.
grey-area: To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them, etc. This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they're having a conversation with them, i.e. 90% of users of LLMs.
orbital-decay: Quick sanity check: you're susceptible to pretty terrible optical illusions that would never fool a model; does that mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. a "dumb" tool, to make sure.
Reminder that "thinking" is an ill-defined term like the others, and the question of whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have a zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans: either treat the errors as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.
simianwords: There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counter example?
pton_xd: "in this paper we primarily evaluate the LLM itself without external tool calls."Maybe this is a factor?
simianwords: No tools were used.
revachol: Heh, interesting. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it just likes you better than me.
graemefawcett: It's not just an issue of tokenization; it's almost a category error. Lisp, accounting, and the number of r's in strawberry are all operations that require state. Balancing ((your)((lisp)(parens))) requires a stack, counting the r's in strawberry requires a register, and counting to 5 requires an accumulator to hold 4. An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much state in a single pass.
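The kinds of state named above are easy to make concrete. A minimal sketch (function names are mine) in which the loop variables are literally the stack and the register the comment describes:

```python
def parens_balanced(s: str) -> bool:
    """Balancing needs a stack (trivially so here, with only one bracket type)."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:  # closing paren with nothing to match
                return False
            stack.pop()
    return not stack  # balanced iff nothing left open

def count_letter(word: str, letter: str) -> int:
    """Counting r's needs a register, incremented once per match."""
    register = 0
    for ch in word:
        if ch == letter:
            register += 1
    return register

print(parens_balanced("((your)((lisp)(parens)))"))  # True
print(count_letter("strawberry", "r"))              # 3
```

Each loop iteration reads and updates state carried over from the previous one, which is exactly the accumulation a single stateless next-token pass struggles to do.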