ARC-AGI-3
arcprize.org
481 points by lairv a day ago
https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
Tiberium - a day ago

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and you don't compare the score against a human average but against the second-best human solution.
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve a level and the model took 100 steps, then the model gets a score of 1% ((10/100)^2).
- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x more steps than your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% guy, who maybe solves 30% of levels but takes 3x more steps to solve them: this guy would get a score of about 3%.
- The scoring is designed so that even if AI performs at a human level it will score below 100%.
- No harness at all and a very simplistic prompt.
- Models can't use more than 5x the steps that a human used.
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES".
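To make the arithmetic in that list concrete, here is a minimal sketch of the scoring rule as the tweets describe it. The function names (level_score, game_score) and the example weights are assumptions for illustration, not the actual ARC-AGI-3 implementation:

    # Hypothetical sketch of the per-level scoring described above;
    # names and weighting are assumptions, not ARC-AGI-3's real code.

    def level_score(human_actions, agent_actions):
        """Squared efficiency against the second-best first-run human
        action count. agent_actions=None means the level was unsolved.
        (The 5x action cutoff mentioned above would additionally zero
        out any run with agent_actions > 5 * human_actions.)"""
        if agent_actions is None:
            return 0.0
        return min(1.0, human_actions / agent_actions) ** 2

    def game_score(level_scores, weights):
        """Weighted average over levels; the tweets say later levels
        carry higher weight, but the exact weights are not public."""
        return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)

    # Worked examples from the thread:
    print(level_score(10, 100))       # 0.010 -> the "1%" example
    print(0.6 * level_score(10, 15))  # 0.267 -> median human: solves 60%, takes 1.5x steps
    print(0.3 * level_score(10, 30))  # 0.033 -> bottom-10%: solves 30%, takes 3x steps

Under these assumptions, the "25% is about the average human level" reading in the reply below falls straight out of the middle example.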
fc417fc802 - a day ago

Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.

red75prime - 14 hours ago

No, those aren't issues. But it's good to know the meaning of the numbers we get. For example, 25% is about the average human level (on this category of problems). 100% is either top human level, superhuman level, or the information-theoretically optimal level.

rolandhvar - 4 hours ago

Sure, but aim for the stars and you hit the moon, right? Like, fundamentally, who cares? For the purposes of an AGI benchmark I'd argue you'd rather err on the side of counting something more intelligent as less intelligent than vice versa.

girvo - 21 hours ago

Yeah, I'm quite surprised at how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?

andy12_ - 21 hours ago

I think that any logic-based test that your average human can "fail" (i.e., score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, count as AGI under that definition).

chillfox - 19 hours ago

If I had a puzzle I really needed solved, I would not ask a rando on the street; I would ask someone I know is really good at puzzles. My point is: for AGI to be useful, it really should be able to perform at the top-10%-or-better level in as many professions as possible (ideally all of them). An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.

versteegen - 19 hours ago

> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.

Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.

sillysaurusx - 18 hours ago

It seems they don't test for that, since they use the second-best human solution as a baseline. And that's the right way to go. When computers were about to become superhuman at chess, few people cared that they could beat random people for many years prior to that. They cared when Kasparov was dethroned. Remember, the point here is marketing as well as science. And the results speak for themselves: after all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.

swiftcoder - 12 hours ago

> The only reason you remember is because it beat Kasparov

There is an additional fascinating aspect to these matches: Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book. It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent the way he did the various human grandmasters of the time.

cubefox - 14 hours ago

This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for humans, not just for top humans.

computably - 12 hours ago

> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.

Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.

benjaminl - 20 hours ago

The issue here is that people have different definitions of AGI. From the description, getting 100% on this benchmark would be more than AGI and would qualify as ASI (Artificial Super Intelligence), not just AGI.

fc417fc802 - 19 hours ago

If you only outdo humans 50% of the time, you're never going to get consensus on whether you've qualified. Whereas outdoing 90% of humans on 90% of all the most difficult tasks we could come up with is going to be difficult to argue against. This benchmark is only one such task; after this one there's still the rest of that 90% to go.

nearbuy - 13 hours ago

Beating humans isn't anywhere near sufficient to qualify as ASI.
That's an entirely different league, with criteria that are even more vague. Even dumb humans are considered to have general intelligence. If the bar is having to outdo the median human, then 50% of humans don't have general intelligence.

fc417fc802 - 12 hours ago

Not true. We don't have a good definition for intelligence; it's very much an "I'll know it when I see it" sort of thing. Frontier models are reliably providing high-undergraduate to low-graduate-level customized explanations of highly technical topics at this point. Yet I regularly catch them making errors that a human never would, and which betray a fatal lack of any sort of mental model. What are we supposed to make of that? It's an exceedingly weird situation we find ourselves in. These models can provide useful assistance to literal mathematicians yet simultaneously show clear evidence of lacking some sort of reasoning, the details of which I find difficult to articulate. They also can't learn on the job whatsoever. Is that intelligence? Probably. But is it general? I don't think so, at least not in the sense that "AGI" implies to me. Once humanity runs out of examples that reliably trip them up, I'll agree that they're "general" to the same extent that humans are, regardless of whether we've figured out the secrets behind things such as cohesive world models, self-awareness, active learning during operation, and theory of mind.

nearbuy - 3 hours ago

> Not true.

It's certainly true, by definition. If the bar for general intelligence is being smarter than the median human, 50% of people won't reach the threshold for general intelligence. (And if the bar is beating the median in every cognitive test, then a much smaller fraction of people would qualify.) People don't have a consistent definition of AGI, and the definitions have changed over the past couple of years, but I think most people have settled on it meaning at least as smart as humans in every cognitive area. But that has to be compared to dumb people, not the median. We don't want to say that regular people don't have general intelligence.

tadfisher - 2 hours ago

You are using terms like "smart" and "dumb" as if they have universally accepted definitions. You can make up as many definitions of intelligence as you like (I would argue that is a sign of intelligence), but using those terms is certainly going to lead to circular reasoning.

LordDragonfang - 11 hours ago

> Yet I regularly catch them making errors that a human never would

I have yet to see an "error" that modern frontier models make that I could not imagine a human making. Average humans are way more error-prone than the kind of person who posts here thinks, because the social sorting effects of intelligence are so strong that you almost never actually interact with people more than a half standard deviation away. (The one exception is errors in spatial reasoning with things humans are intimately familiar with, for example clothing, because LLMs live in literary space, not physics space, and only know about these things secondhand.)

> and which betray a fatal lack of any sort of mental model.

This has not been a remotely credible claim for at least the past six months, and it seemed obviously untrue for probably a year before that. They clearly do have a mental model of things; it's just not one that maps cleanly to the model of a human who lives in 3D space. In fact, their model of how humans interact is so good that you forget you're talking to something that has to infer rather than intuit how the physical world works, and then attribute failures of that model to not having one.
fc417fc802 - 6 hours ago

> you almost never actually interact with people more than a half standard deviation away

I wasn't talking about the average person there, but rather those who could also craft the high-undergrad to low-grad-level explanations I referred to.

> This has not been a remotely credible claim for at least the past six months

Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me, yet there were issues with it that you simply could not produce if you had a working mental model of the subject matching the level of the explanation provided (IMO, obviously). Imagine a toddler spewing a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird and I'm not sure what to make of it nor how to properly articulate the details. I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.