Disagreement among frontier LLMs on real-world fact-checks

lenz.io

451 points by kostaj 6 hours ago


simonw - 5 hours ago

Here's the prompt they used:

  Classify this claim as of <date>: "<atomic claim>"

  Output exactly one label: True,
  Mostly True, Misleading, or False.
  No explanations, no qualifiers.

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
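If you'd rather hunt for disagreements in a script than click around, here's a minimal sketch. I'm assuming a layout with one "claim" column plus one column per model holding that model's label. I haven't confirmed that's how data.csv is actually structured, and the column names and sample rows below are made up for illustration.

```python
import csv
import io


def find_disagreements(rows, claim_column="claim"):
    """Return (claim, labels_by_model) for every row where the labels differ.

    Assumes every column other than `claim_column` holds one model's label.
    """
    disagreements = []
    for row in rows:
        # Normalise case and whitespace so "False" and "false " compare equal.
        labels = {
            model: label.strip().lower()
            for model, label in row.items()
            if model != claim_column
        }
        if len(set(labels.values())) > 1:
            disagreements.append((row[claim_column], labels))
    return disagreements


# Hypothetical sample data, NOT taken from the study's CSV.
SAMPLE_CSV = """claim,model_a,model_b,model_c
Example claim one,False,False,Misleading
Example claim two,True,True,True
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
disagreements = find_disagreements(rows)
for claim, labels in disagreements:
    print(claim, labels)
```

To run it against the real file you would swap the StringIO for an open file handle and adjust claim_column to whatever the actual header is.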

The claim was "All almonds are grown in the U.S. state of California." All but one model said False; Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]

The prompt lacks any kind of rubric to clarify how those terms should be applied.

As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language, either of those answers means effectively the same thing.

Update 2: a much better example:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"

The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.

The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
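To make the missing-option point concrete, here's a rough sketch of the same prompt template with one extra label, plus a strict parser for the reply. The "Unverifiable" label, the function names and the placeholder date are all my own assumptions and not something the study used.

```python
ORIGINAL_LABELS = ["True", "Mostly True", "Misleading", "False"]
# Assumption: an extra escape-hatch label that the study's prompt did not offer.
LABELS_WITH_ESCAPE = ORIGINAL_LABELS + ["Unverifiable"]


def build_prompt(claim, date, labels):
    """Fill in the study's prompt template with a configurable label set."""
    label_list = ", ".join(labels[:-1]) + ", or " + labels[-1]
    return (
        f'Classify this claim as of {date}: "{claim}"\n\n'
        f"Output exactly one label: {label_list}.\n"
        "No explanations, no qualifiers."
    )


def parse_label(reply, labels):
    """Map a model reply onto one of the allowed labels, or None if it fits none."""
    cleaned = reply.strip().strip('."').lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


prompt = build_prompt(
    "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia",
    "2026-05-20",  # placeholder date, chosen only for illustration
    LABELS_WITH_ESCAPE,
)
print(prompt)
```

Returning None for anything off-list also lets you count how often a model ignores the "exactly one label" instruction, instead of silently coercing its reply into a bucket.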