Disagreement among frontier LLMs on real-world fact-checks
lenz.io
451 points by kostaj 6 hours ago
Here's the prompt they used:
Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True,
Mostly True, Misleading, or False.
No explanations, no qualifiers.
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
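One way to check how much of the headline disagreement comes from label semantics rather than facts is to recompute it with near-synonymous labels merged. A rough sketch, assuming each claim is a list of per-model labels; the merge map, the function names and the sample verdicts are all made up for illustration and are not from the study:

```python
# Hypothetical merge map: treats "mostly true" as "true" and leaves the rest alone.
MERGE_TRUE = {"mostly true": "true"}


def is_split(labels, merge=None):
    """Return True if the models did not all give the same label for one claim."""
    merge = merge or {}
    cleaned = [label.strip().lower() for label in labels]
    normalized = {merge.get(label, label) for label in cleaned}
    return len(normalized) > 1


def disagreement_rate(claims, merge=None):
    """Fraction of claims on which at least two models differ."""
    if not claims:
        raise ValueError("need at least one claim")
    return sum(is_split(labels, merge) for labels in claims) / len(claims)


# Made-up verdicts for three claims from three models (illustrative only).
claims = [
    ["True", "Mostly True", "True"],
    ["False", "False", "False"],
    ["True", "False", "True"],
]
strict = disagreement_rate(claims)
merged = disagreement_rate(claims, MERGE_TRUE)
print(f"strict: {strict:.2f}, merged: {merged:.2f}")
```

On this toy data the strict rate is 0.67 and the merged rate is 0.33. If the real dataset shows a similar drop, a good share of the "disagreement" is about what "mostly" means.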
Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.
Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.
Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.
>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.
do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.
Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in.
Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".
Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.
There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source.
If you already know the country Paris belongs to, there's no point in asking, anyway.
ask the black box to search for the original source and verify it yourself?
Sure, I like using LLMs in this way, and it often shows that it's very important to verify, because often a claim is "sourced" by what appears to be more of a fuzzy text or semantic match, sometimes even ignoring logical negations.
Especially in niche subjects.
For factual claims, I've fared better with Wikipedia and looking up the sources linked there.
Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?
This problem existed before already, but it boils down to a simple fact:
logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts.
The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.
There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.
@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.
Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.
If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.
Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
This is a good pattern because it allows all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turned on. Just make sure you have the reasoning output before the final answer. A mistake I see all the time is having the answer output first and the explanation after, which leaves more room for models to rationalize bad answers.
Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
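For anyone wiring this into a harness, here is a minimal sketch of parsing the "good pattern" and rejecting responses that put the answer first. The function name and the allowed-label set are my own assumptions. It relies on Python's json module keeping object keys in the order they appear in the text:

```python
import json

# Assumed label set, taken from the "good pattern" example above.
ALLOWED = {"true", "false", "i don't know"}


def parse_verdict(raw):
    """Parse a model response and return (explanation, answer).

    Raises ValueError if the JSON is not an object, if the keys are not
    exactly "explanation" followed by "answer", or if the answer is not
    one of the allowed labels.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Dicts preserve insertion order, so this reflects the order in the text.
    if list(data) != ["explanation", "answer"]:
        raise ValueError("expected 'explanation' before 'answer'")
    answer = str(data["answer"]).strip().lower()
    if answer not in ALLOWED:
        raise ValueError(f"unexpected answer: {answer!r}")
    return data["explanation"], answer


good = '{"explanation": "Other regions also grow almonds.", "answer": "False"}'
print(parse_verdict(good))
```

The order check only confirms that the explanation was emitted before the answer. It says nothing about whether the explanation is any good.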
Good point. Processing the substance of the answer might be too labor-consuming (1,000 claims x 5 models), but "thinking out loud" might improve the quality of the answers indeed. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.
If you have the model use a tool, you can define the schema as a free-text rationale field followed by one of the set of possible answers, so everything comes back nicely formatted as JSON.
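A sketch of what such a tool schema could look like, written as a JSON Schema style dict in Python. The field names ("rationale", "verdict") and the checker function are hypothetical, and how you attach the schema to a tool depends on the provider, so that part is left out. JSON Schema itself does not guarantee property order, so putting the rationale first is a nudge to the model, not an enforced rule:

```python
import json

# Labels taken from the suggestion earlier in the thread; adjust to your rubric.
LABELS = ["True", "Mostly True", "Misleading", "False", "Abstain"]

# Hypothetical tool input schema: free-text rationale first, then the verdict.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "rationale": {
            "type": "string",
            "description": "Short reasoning written before choosing a verdict.",
        },
        "verdict": {"type": "string", "enum": LABELS},
    },
    "required": ["rationale", "verdict"],
    "additionalProperties": False,
}


def check_arguments(args, schema=VERDICT_SCHEMA):
    """Minimal hand-rolled check of tool arguments against the schema above.

    A real harness would likely use a proper JSON Schema validator instead.
    """
    if not isinstance(args, dict) or set(args) != set(schema["required"]):
        return False
    if not isinstance(args["rationale"], str):
        return False
    return args["verdict"] in schema["properties"]["verdict"]["enum"]


print(json.dumps(VERDICT_SCHEMA, indent=2))
```

Storing the rationale alongside the verdict also makes it possible to go back later and see why two models landed in different buckets.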