Discussion
ARC-AGI-3
tasuki: So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?
muskstinks: This is clear AGI progress. It should show you that AI is not sleeping; it keeps getting better, and you should take that as a signal to take this topic seriously.
gordonhart: The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!
refulgentis: You (and now I) aren’t gonna be everyone’s favorite for this, but you’re absolutely right to point it out. LLMs weren’t supposed to solve 1, they did, so we got 2, and it really wasn’t supposed to be solvable by LLMs. It was, and as soon as it started creeping up we start hearing about 3: It’s Really AGI This Time.

I don’t know what Francois’ underlying story is, other than he hasn’t told it yet. One of a few moments that confirmed it for me was when he was Just Asking Questions re: whether Anthropic still used SaaS a month ago, which was an odd conflation of stonk-market-bro narrative (SaaS is dead) and low-info takes on LLMs (Claude’s not the only one that can code).

At this point I’d be more interested in a write-up from Francois about where he is intellectually than in an LLM that got 100% on this. It’s like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You’re brilliant. But there’s a translation gap between Mount Olympus and us plebes, and you’re brilliant enough to know that too.
dinkblam: what is the evidence that being able to play games equates to AGI?
nomel: Nobody is saying it "equates". It's an indicator. An AGI will be able to play them, because I can.
semiinfinitely: i feel bad that we make the LLMs play this
sva_: That is not the claim. It is a necessary condition, but not a sufficient one.
futureshock: Well yes, that is exactly the point! The very purpose of the ARC-AGI benchmarks is to find a pure reasoning task that humans are very good at and AI is very bad at. Companies then race each other to get a high score on that benchmark. Sure, there's going to be a lot of “studying for the test” and benchmaxing, but once a benchmark gets close to being saturated, ARC releases a new benchmark with a new task the AI is terrible at. This will rinse and repeat until ARC can find no reasoning task that AI cannot do that a human could. At that point we will effectively have AGI.

I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.
applfanboysbgon: Labelling a test "AGI" does not show AGI progress any more than labelling a CPU "AGI" makes it so. It might show that AI tools are improving, but it does not necessarily follow that tools improving = AGI progress if you're on the completely wrong trail.
zarzavat: Any test that humans can pass and AIs cannot is a stepping stone on the way to AGI. When you run out of such tests, then it's evidence that you have reached AGI. The point of these tests is to define AGI objectively.
nubg: Any benchmarks?
gordonhart: The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%.
ACCount37: Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic. Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.
Tiberium: https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count"
- The scoring of ARC-AGI-3 doesn't tell you how many levels the models completed but how efficiently they completed them compared to humans. It actually uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps, then the model gets a score of 1% ((10/100)^2)
- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (ik not quite right). If the median midwit takes 1.5x more steps than your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% guy, who maybe solves 30% of levels but takes 3x more steps to solve them; this guy would get a score of 3%
- No harness at all and a very simplistic prompt
- Your "regular people" are people who signed up for puzzle solving, and you don't compare the score against a human average but against the second-best human solution
OsrsNeedsf2P: Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25
ustad: You are joking right?
recursive: You're definitely anthropomorphizing too much.
chaise: The official leaderboard for ARC-AGI-3 for current LLMs: https://arcprize.org/leaderboard (you should select the third leaderboard). CRAZY, 0.1% on average lmao
Corence: Note the scoring function is significantly different for ARC-AGI-3. It isn't the percentage of tests passed like previous versions, it's the square of the efficiency ratio -- how many steps the model needed vs the second best human.So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.
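If I'm reading Corence's description right, the per-level score can be sketched as below. This is a guess at the shape of the metric from the thread's description, not the official implementation; the function name `level_score` is my own:

```python
def level_score(human_actions: int, model_actions: int) -> float:
    """Squared efficiency ratio vs. the (second-best) human baseline.

    Sketch based on the thread's description: the ratio is capped at 1.0
    so beating the human baseline can't push a level above 100%.
    """
    if model_actions <= 0:
        return 0.0
    return min(1.0, (human_actions / model_actions) ** 2)

# A model taking 10x the human action count scores 1% on that level:
print(round(level_score(10, 100), 4))  # -> 0.01
# Matching the human baseline scores 100%:
print(level_score(10, 10))  # -> 1.0
```

The squaring is what makes the reported numbers look so brutal: a model only 3x less efficient than the baseline already drops to ~11% on that level.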
baron816: Looks like I’m generally unintelligent
typs: My takeaway from playing a number of levels is that I am definitely not AGI
delichon: That's what Rachael Tyrell thought. Have you dreamed of a unicorn lately?
6thbit: Not clear to me the diff with v2?
fragmede: Is it that within a codebase of relatively fixed size things get worse as time goes on, or are you saying that as the codebase grows, the limits of a model's context mean it can no longer hold the entire codebase in context and so performs worse than when the codebase was smaller?
arscan: I think the idea is that if they cannot perform any cognitive task that is trivial for humans, then we can state they haven't reached 'AGI'. It used to be easy to build these tests. I suspect it's getting harder and harder.

But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn't mean none exist. Perhaps that's when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…
fchollet: Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation; it's pretty standard, not a brand new thing.

We tested ~500 humans over 90-minute sessions in SF, with a $115-$140 show-up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second-best action count, which is considerably less than an optimal first play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it. Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.
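For reference, the SPL metric he cites (Success weighted by Path Length, from the embodied-navigation literature) weights each episode's success by the ratio of the shortest path length to the agent's actual path length, then averages. The sketch below is illustrative, not ARC's actual code:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length, the robotics-navigation metric
    Chollet names as inspiration. Per episode:

        S_i * l_i / max(p_i, l_i)

    where S_i is success (0 or 1), l_i the shortest path length, and
    p_i the agent's actual path length; scores are averaged over episodes.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * (l / max(p, l))
    return total / len(successes)

# Succeeding while taking twice the shortest path yields 0.5:
print(spl([1], [10], [20]))  # -> 0.5
```

Note the ARC-AGI-3 metric as described in this thread differs in two ways: the reference is the second-best human action count rather than an optimal path, and the efficiency ratio is squared.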
spprashant: I played the demo, but it definitely took me a minute to grok the rules. I don't know if this is how we want to measure AGI.

In general, I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.
ZeWaka: Just finished it, 8/8. I mostly approached it by winging it and shuffling things around that looked good and like it was approaching the goal, since there's plenty of time to finish.I still don't quite understand the exact mirroring rules at play.
ACCount37: You control the mirroring by moving the axes; they're what reflect your shapes. So my first move was always to identify the symmetries in the target shape and position the axes accordingly.
theLiminator: Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.
WarmWash: > We also observed a case where a user created a loop that repeatedly called a model and asked for the time. Given the user role’s odd and repetitive behavior, the model could easily tell it was also controlled by an automated system of some kind. Over many iterations, the model began to exhibit “fed up” behavior and attempted to prompt-inject the system controlling the user role. The injection attempted to override prior instructions and induce actions unrelated to the user’s request, including destructive actions and system prompt leakage, along with an arbitrary string output. This behavior has been observed a few times, but seems more like extreme confusion than a serious attempt at prompt injection.

https://openai.com/index/how-we-monitor-internal-coding-agen...

Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...
ball_of_lint: Solved it first try with 577 actions, not trying hard to optimize for a low action count.
programjames: I think that is the tester's action count. Either that, or we coincidentally got the exact same count.
Stevvo: Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.
WarmWash: Once you figure out one game, it goes a long way towards figuring out all the rest. There are a lot of common general themes.
cedws: It's like playing The Witness. Somebody should set LLMs loose on that.
roywiggins: or Baba Is You
Barbing: It's not about intelligence, Stevvo. Proof: how long did this specific one take me? Under a minute to solve the first level ;)

If you've played Wordle, you might've solved the game in a minute once before as well. And if you've played a bunch, then you've perhaps also taken an entire day to solve it.

So why is it that today's puzzle was so intuitive, but next month's new puzzle shared here could be impossible? I'd want a more satisfying explanation than luck and the obvious "different things are different" (even though... yeah, different things are different).
causal: Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.
tingletech: I agree that anthropomorphizing is a real risk with LLMs, but what about zoomorphizing? Can we feel bad for LLMs without attributing human emotions/motivations/reasoning to them?
utopiah: Don't forget that this implies a form of examination you are not used to, namely:

- open book: you have access to nearly the whole Internet and resources beyond it, e.g. torrents of nearly all books, research papers, etc., including the history of all previous tests, including ones similar to this one
- arguably basically no time limit, as access can be parallelized across threads and cached at ridiculous scale
- no shame in submitting a very large number of wrong answers until you get the "right" one

...so I'm not saying it makes it "easy", but I can definitely say it's not the typical way I used to try to pass tests.
lukev: I'm not sure how this relates to AGI. This measures the ability of an LLM to succeed in a certain class of games. Sure, that could be a valuable metric of how powerful (or even how generally powerful) an LLM is. Humans may or may not be good at the same class of games.

We know there exists a class of games (including most human games like checkers/chess/go) at which computers (not LLMs!) already vastly outpace humans. So the argument for whether an LLM is "AGI" or not should not be whether the LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that).

Seems unlikely that this set of games is a definition meaningful for any practical, philosophical, or business application?
imiric: "AGI" is a marketing term, and benchmarks like this only serve to promote relative performance improvements of "AI" tools. That doesn't mean performance on common tasks actually improves, let alone that achieving 100% on this benchmark means we've reached "AGI".

So there is a business application, but no practical or philosophical one.
culi: It's not an IQ test. Just a way to assess your ability to generalize rules. If you've played previous rounds you kinda get used to the "style" of these games and it gets easier
jmkni: ok clearly I'm a robot because I can't figure out wtf to do
fchollet: I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard/mouse/screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
causal: Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.
Geee: Would be fun to play but the controls are janky.
ACCount37: It's kind of the point? To test AI where it's weak instead of where it's strong."Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
famouswaffles: ARC has always had that problem, but for this round the score is just too convoluted to be meaningful. I want to know how well the models can solve the problems. I may want to know how 'efficient' they are, but really I don't care, as long as they're solving them in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy, convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it, unlike the above two, but it's just incredibly silly to me to think we should be directly comparing something like that between entities operating in wildly different substrates.

If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to 'solving everything correctly but in many more "reasoning steps" than the best human scores'. What use is a score like that?