Study identifies weaknesses in how AI systems are evaluated

oii.ox.ac.uk

407 points by pseudolus 2 days ago


Paper: https://openreview.net/pdf?id=mdA5lVvNcU

Related: https://www.theregister.com/2025/11/07/measuring_ai_models_h...

bubblelicious - 2 days ago

I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want benchmarking to end up being their whole job. Even if you could, and even if you had the right background, you could do benchmarks full time and they would still be a mess.

Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale.
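For a rough sense of what that looks like in practice, here's a minimal sketch (the binary "session solved" metric and the numbers are made up for illustration) of comparing two model variants with a two-proportion z-test:

  import math

  def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
      """Did variant B move a binary product metric (e.g. 'user accepted the answer') vs. A?"""
      p_a, p_b = successes_a / n_a, successes_b / n_b
      p_pool = (successes_a + successes_b) / (n_a + n_b)
      se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
      z = (p_b - p_a) / se
      # two-sided p-value from the standard normal CDF
      p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
      return p_a, p_b, z, p_value

  # e.g. variant A solved 4,120 of 10,000 sessions, variant B solved 4,310 of 10,000
  print(two_proportion_ztest(4120, 10_000, 4310, 10_000))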

I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.

instagraham - 2 days ago

I've written about Humanity's Last Exam, which crowdsources tough questions for AI models from domain experts around the world.

https://www.happiesthealth.com/articles/future-of-health/hum...

It's a shifting goalpost, but one of the things that struck me was how some questions could still be trivial for a fairly qualified human (a doctor in this case) but difficult for an AI model. Reasoning, whether visual or logical, is built on a set of assumptions that are better gained through IRL experience than by crawling datasets and matching answers.

This leads me to believe that much of the future of training AI models will lie in exposing them to "meatspace" and annotating their inferences, much like how we train a child. This is a long, long process, and one that is already underway at scale. But it's what might give us emergent intelligences rather than just a basket of competing yet somehow-magic thesauruses.

jstummbillig - 2 days ago

Benchmarks are like SAT scores. Can they guarantee you'll be great at your future job? No, but we are still roughly okay with what they signify. Clearly LLMs are getting better in meaningful ways, and benchmarks correlate with that to some extent.

calpaterson - 2 days ago

Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.

And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.

I dunno what to do about it and am tending to just pick Gemini as a result.

shanev - 2 days ago

This is solvable at the level of an individual developer. Write your own benchmark from code problems that you've solved. Verify that the tests pass and that it satisfies your metrics, like tok/s and TTFT (time to first token). Create a harness that works with API keys or local models (if you're going that route).
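A minimal sketch of what such a harness might look like, assuming an OpenAI-compatible streaming chat endpoint (the URL, model name, and the crude length-based token estimate are placeholders, not anything standard):

  import json, time, requests

  API_URL = "http://localhost:8000/v1/chat/completions"  # hosted API or local server
  API_KEY = "sk-..."                                      # many local servers ignore this
  MODEL = "my-model"

  def run_case(prompt: str) -> dict:
      """Stream one benchmark prompt and record TTFT, rough tok/s, and the answer."""
      body = {"model": MODEL, "stream": True,
              "messages": [{"role": "user", "content": prompt}]}
      start = time.monotonic()
      first_token_at, chunks = None, []
      with requests.post(API_URL, json=body, stream=True, timeout=300,
                         headers={"Authorization": f"Bearer {API_KEY}"}) as r:
          r.raise_for_status()
          for line in r.iter_lines():
              if not line.startswith(b"data: "):
                  continue
              payload = line[len(b"data: "):]
              if payload == b"[DONE]":
                  break
              choices = json.loads(payload).get("choices") or []
              delta = (choices[0].get("delta", {}).get("content") or "") if choices else ""
              if delta and first_token_at is None:
                  first_token_at = time.monotonic()
              chunks.append(delta)
      elapsed = time.monotonic() - start
      answer = "".join(chunks)
      approx_tokens = max(1, len(answer) // 4)  # crude estimate; swap in a real tokenizer
      return {"ttft_s": round((first_token_at or start) - start, 3),
              "tok_per_s": round(approx_tokens / elapsed, 1),
              "answer": answer}

Run each saved problem through something like run_case, apply the tests you already wrote when you solved it, and keep the timing numbers next to the pass/fail results.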

bee_rider - 2 days ago

> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."

When models figure out how to exploit a shortcut that every clever college student exploits, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.

riskable - 2 days ago

We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took like a week and a half and many, many attempts with Claude Code (Sonnet 4.5), GPT-5-Codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... Just came up with a successful workaround (which is good enough for me, but still...).

Aside: You know what really moved the progress bar on finding and fixing the bug? When I had a moment of inspiration and made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend (near real-time). Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!
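If anyone wants to copy the trick, the receiving side can be tiny. A rough sketch of what I mean (Flask and the /client-logs route here are illustrative, not the exact thing I hacked together): the frontend wraps console.log/warn/error and POSTs each entry as JSON, and the backend appends it to a file the agent can tail:

  from flask import Flask, request

  app = Flask(__name__)

  @app.route("/client-logs", methods=["POST"])
  def client_logs():
      # each POSTed entry: {"level": "...", "message": "..."} from the wrapped console methods
      entry = request.get_json(silent=True) or {}
      with open("frontend.log", "a") as f:
          f.write(f"{entry.get('level', 'log')}: {entry.get('message', '')}\n")
      return "", 204

  if __name__ == "__main__":
      app.run(port=5001)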

I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:

SkyPuncher - 2 days ago

Benchmarks are nothing more than highly contextual specs (like tests in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.

pahae - 2 days ago

I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e. CLI tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.

My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / VictoriaLogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana OAuth token to authenticate queries by injecting matchers via prom-label-proxy and forwarding to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream, as I'm not using any of the big cloud providers, but the provider I use nonetheless has a Terraform provider.

As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the alerting rules, etc. It also seems to do better working with provided documentation / links.

I've been using Claude for a couple of weeks now but recently switched to Codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it, but I gotta say, so far, I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain). The results it produces take much more effort to clean up than Claude's. Probably at a level where I could just invest the time myself. It might be that I don't yet know how to correctly prompt GPT-5, but given the same prompt, Claude does a better job 90% of the time.

Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely and is worth testing several models with. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?

proc0 - 2 days ago

This wasn't that hard to see.

> Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena

Intelligence has an element of creativity, and as such the true measurement would be on metrics related to novelty, meaning tasks that have very little resemblance to any other existing task. Otherwise it's hard to parse out whether it's solving problems based on pattern recognition instead of actual reasoning and understanding. In other words, "memorizing" 1000 of the same type of problem, and solving #1001 of that type is not as impressive as solving a novel problem that has never been seen before.

Of course this presents challenges for creating the tests, because you have to avoid the however-many petabytes of training data these systems are trained on. That's where some of the illusion of intelligence arises from (illusion not because it's artificial, since there's no reason to think the brain's algorithms cannot be recreated in software).

lysace - 2 days ago

Tech companies/bloggers/press/etc. are perpetually bad at benchmarks. For browsers they kept pushing simplistic JavaScript-centric benchmarks even when it had been clear for at least 15 years that layout/paint/network/etc. were the dominant bottlenecks in real-world usage.

It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.

It gets really weird when engineering priorities shift because of these mostly irrelevant benchmarks.

lielvilla - a day ago

I’m working a lot with TTS (Text-to-Speech), and it’s also a total wild west - even worse than LLMs in some ways. The demos are always perfect, but once you generate hundreds of minutes you start seeing volume drift, pacing changes, random artifacts, and occasional mispronunciations that never show up in the curated clips.

The big difference from LLMs is that we don’t really have production-grade, standardized benchmarks for long-form TTS. We need things like volume-stability across segments, speech-rate consistency, and pronunciation accuracy over a hard corpus.
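To make that concrete, here's a rough sketch of the first two checks, assuming you already have each generated segment as a float waveform plus its transcript (numpy only; the sample rate and the words-per-second proxy for speech rate are just placeholders):

  import numpy as np

  def rms_dbfs(wave: np.ndarray) -> float:
      """RMS level of one segment in dBFS (float samples in [-1, 1])."""
      rms = np.sqrt(np.mean(np.square(wave)) + 1e-12)
      return float(20 * np.log10(rms + 1e-12))

  def long_form_report(segments, transcripts, sample_rate=24_000):
      """Volume drift and speech-rate consistency across an ordered list of segments."""
      levels = np.array([rms_dbfs(s) for s in segments])
      rates = np.array([len(t.split()) / (len(s) / sample_rate)     # words per second
                        for s, t in zip(segments, transcripts)])
      return {
          "volume_std_db": float(levels.std()),                      # volume drift
          "volume_range_db": float(levels.max() - levels.min()),
          "speech_rate_mean_wps": float(rates.mean()),
          "speech_rate_std_wps": float(rates.std()),                 # pacing changes
      }

Pronunciation accuracy is harder to automate; in practice it probably needs a hard-word list plus an ASR pass or human spot-checks, which is exactly the part the curated demo clips never exercise.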

I wrote up what this could look like here: https://lielvilla.com/blog/death-of-demo/

doctorpangloss - 2 days ago

The problem with the LLM benchmarks is that if you see one that shows high performance by something that isn’t from Anthropic, Google or OpenAI, you don’t believe it, even if it were “true.” In that sense, benchmarks are a holistic social experience in this domain, less a scientific endeavour.

SurceBeats - 2 days ago

Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.

wolttam - 2 days ago

I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks.

Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.

Just having the benchmark in the first place is what gives model makers something to optimize for.

dehrmann - 2 days ago

This might explain the zeitgeist that new models feel same-ish, despite model developers saying they're getting spectacularly better.

AbrahamParangi - 2 days ago

A test doesn't need to be objectively meaningful or rigorous in any sense in order to still be useful for comparative ranking.

twilightzone - 2 days ago

"Measuring money turns out to be easier than measuring intelligence." Don't ever change, El Reg.

zeroonetwothree - 2 days ago

Humans are much better at out of sample prediction than LLMs. And inherently benchmarks cannot be out of sample. So I believe that leads to the disconnect between LLMs getting better and better at in sample prediction (benchmarks) while not improving nearly as much at out of sample (actual work).

RA_Fisher - 2 days ago

For statistical AI models, we can use out-of-sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models have a pre-utility step in which it can be shown that out-of-sample prediction error is minimized).
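As a toy illustration of that pre-utility step (numpy only; the cubic data-generating process and polynomial degrees are made up): fit competing models on one split, and compare them purely on error over the held-out split.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.uniform(-3, 3, 40)
  y = x**3 - 2 * x + rng.normal(0, 2, 40)            # unknown "true" process + noise

  train, test = np.arange(30), np.arange(30, 40)     # simple holdout split

  def oos_mse(degree: int) -> float:
      """Fit a polynomial on the training split, report squared error on the held-out split."""
      coeffs = np.polyfit(x[train], y[train], degree)
      pred = np.polyval(coeffs, x[test])
      return float(np.mean((pred - y[test]) ** 2))

  for degree in (1, 3, 8):
      print(degree, oos_mse(degree))   # the cubic should usually win; the degree-8 fit chases noise

With LLMs there is no agreed-upon analogue of that held-out error, so every comparison smuggles in a judgment about which tasks matter.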

dupdup - 2 days ago

For me, the definition of AGI is the tool to measure it: https://arxiv.org/html/2510.18212v2

Havoc - 2 days ago

I'd hope anyone using LLMs in production is testing them against their use case directly.

Benchmarks make for a good first pass, though, to figure out which ones to test.

gradus_ad - 2 days ago

AI detractors can say whatever. As a developer Claude Code is almost an unfair cheat code. AI valuations may be absurd but the hype is justified.

inavida - 2 days ago

They should laugh while they can ;) Still waiting for the crash and to see what lives on and what gets recycled. My bet is that grok is here to stay ;)

(Don't hurt me, I just like his chatbot. It's the best I've tried at, "Find the passage in X that reminded me of the passage in Y given this, that, and the other thing." It has a tendency to blow smoke if you let it, and they all seek to affirm more than I'd like, but ain't that the modern world? It can also be hilariously funny in surprisingly apt ways.)

bbor - 2 days ago

I'm already quite put off by the title (it's science -- if you have a better benchmark, publish it!), but the contents aren't great either. It keeps citing numbers about "445 LLM benchmarks" without confirming whether any of the ones they deem insufficiently statistical are used by any of the major players. I've seen a lot of benchmarks, but maybe 20 are used regularly by large labs, max.

  "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
For a math-based critique, this seems to ignore a glaring problem: is it even possible to randomly sample all natural numbers? As another comment pointed out, we wouldn't even want to ("LLMs can't accurately multiply 6-digit numbers" isn't something anyone cares about or expected them to do in the first place), but regardless: this seems like a vacuous critique dressed up in a costume of mathematical rigor.

  At least some of those who design benchmark tests are aware of these concerns.
In related news, at least some scientists studying climate change are aware that their methods are imperfect. More at 11!

If anyone doubts my concerns and thinks this article is in good faith, just check out this site's "AI+ML" section: https://www.theregister.com/software/ai_ml/

naasking - a day ago

Clearly we need tests that check for effectiveness at applying general mathematical, logical, and relational operations, e.g. set theory, relational algebra, first- and second-order logic, type theory, the lambda calculus, recurrence and induction, etc., and for the ability to use these to abstract over specifics and generalize.

The upside is that these can all be generated and checked synthetically, so large data sets are possible, in both formal and natural languages.
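A toy sketch of the generate-and-check idea for one narrow slice (set algebra over small integer sets; the prompt wording and the crude answer extraction are just illustrative):

  import random, re

  def make_item(seed: int):
      """Generate one set-algebra question with a programmatically known answer."""
      rng = random.Random(seed)
      a, b, c = (set(rng.sample(range(1, 10), 4)) for _ in range(3))
      question = (f"Let A = {sorted(a)}, B = {sorted(b)}, C = {sorted(c)}. "
                  f"List the elements of A ∪ (B ∩ C) in increasing order.")
      answer = sorted(a | (b & c))
      return question, answer

  def score(model_output: str, answer) -> bool:
      """Crude exact-match check on the numbers appearing in the model's reply."""
      return [int(n) for n in re.findall(r"\d+", model_output)] == answer

  question, answer = make_item(42)
  print(question)     # hand this to the model
  print(answer)       # grade the reply with score(reply, answer)

The same pattern extends to relational-algebra queries over tiny generated tables, propositional entailment, typing judgments, and so on, since the checker can be exact.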

qustrolabe - 2 days ago

Technically true but also a very dumb take and manipulative phrasing

moritzwarhier - 2 days ago

When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.

AI (and humans!) aside, the question of whether there could be an oracle that "answers all questions" is settled: such a thing cannot exist.

But this is going already too deep IMO.

When people start talking about percentages or benchmark scores, there has to be some denominator.

And there can be no bias-free such denominator for

- trivia questions

- mathematical questions (oh, maybe I'm wrong here; intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems, etc.)

- historical or political questions

I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this; I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include this part here. It also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.

Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word).

Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...

That's where the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kinds of AI they are examining and the ones I know exist.

Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit. Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?

I'm just a complete layman commenting on this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation, and the typical non-LLM CNNs that are used today, GANs, etc.

I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snake-oil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".

I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.

Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.

Because I think there are impressive results; it's just becoming very hard to see through the bullshit as an average person.

I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].

[1] https://arxiv.org/pdf/2507.20208

[2] https://www.mattmahoney.net/dc/text.html

[3] https://arxiv.org/abs/2410.21352

dang - 2 days ago

Url changed from https://www.theregister.com/2025/11/07/measuring_ai_models_h..., which points to this.

mikert89 - 2 days ago

[flagged]

Marshferm - 2 days ago

Don’t get high on your own supply.

jennyholzer - 2 days ago

I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.

What changed to make "the inevitable AI bubble" the dominant narrative in the last week or so?