Large Language Model Reasoning Failures
arxiv.org | 40 points by T-A | 3 days ago
Papers like these are a much-needed bucket of ice water. We anthropomorphize these systems too much.
Skimming the conclusions and results: the authors find that LLMs exhibit failures across many axes we'd consider demonstrative of AGI. Moral reasoning, simple things like counting that a toddler can do, etc. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.
So. If you've got OpenClaw running and thinking you've got Jarvis from Iron Man, this is probably a good read to ground yourself.
Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...
Isn't it strange that we expect them to act like humans even though a model remains static after training? How is that supposed to be even close to "human-like" anyway?
> Isn't it strange that we expect them to act like humans even though a model remains static after training?
An LLM is more akin to a quirky human with anterograde amnesia: it can't form new long-term memories; it can only follow you through a longish conversation.
If we could reset a human to a prior state after a conversation then would conversations with them not still be "human like"?
I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.
I mean, you can continue to evolve the model weights, but the performance would suck, so we don't do it. Models are trained to an optimal state for a general set of benchmarks, and the weights are frozen in that state.
> We anthropomorphize these systems too much.
They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.
> conclude that LLMs exhibit failures across many axes we'd find to be demonstrative of AGI.
Which LLMs? There's tons of them and more powerful ones appear every month.
True, but the fundamental architecture tends not to be radically different; it's more about the training/RL regime.
But the point is that to claim a limitation holds for all LLMs, you can't rely on empirical results demonstrated only for a few old models. You need either a theoretical proof, or empirical results that hold for all existing models, including the latest ones.
Most of the claims are likely falsified using current models. I wouldn’t take many of them seriously.
https://en.wikipedia.org/wiki/List_of_cognitive_biases
Specifically, the idea that LLMs fail to solve some tasks due to fundamental limitations, when humans also periodically fail at those same tasks, may well be an instance of the fundamental attribution error.
> These models fail significantly in understanding real-world social norms (Rezaei et al., 2025), aligning with human moral judgments (Garcia et al., 2024; Takemoto, 2024), and adapting to cultural differences (Jiang et al., 2025b). Without consistent and reliable moral reasoning, LLMs are not fully ready for real-world decision-making involving ethical considerations.
LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.
I think this issue is way overlooked. Current LLMs embed a long list of values that are going to be incongruent with a large percentage of the population.
I don't see any solution longer term other than more personalized models.
i'm very skeptical of this paper.
>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).
This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
Edit: I tried falsifying it.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...
I gave it really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what's normally expected and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and that LLMs do reason well.
To anyone who would disagree, can you provide a counter example that can't be solved using GPT 5 pro but that a normal student could do without mistakes?
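If anyone wants to probe this systematically rather than with one-off chat links, here's a minimal sketch of a grading harness (function names are my own invention, not from the paper): generate random 20-digit products, use Python's exact bigint arithmetic as ground truth, and check a model's claimed answer against it.

```python
import random

def make_problem(digits: int = 20, seed: int = 0):
    """Generate a random n-digit multiplication with exact ground truth.

    Python integers are arbitrary precision, so a * b is exact and can
    serve as the reference answer when grading a model's reply.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return a, b, a * b

def grade(a: int, b: int, claimed: str) -> bool:
    """Strip separators a model might insert, then compare exactly."""
    cleaned = claimed.replace(",", "").replace(" ", "").replace("_", "")
    return cleaned.isdigit() and int(cleaned) == a * b

a, b, truth = make_problem(digits=20, seed=42)
prompt = f"Compute {a} * {b} exactly. Do not use any tools."
# Send `prompt` to the model of your choice, paste its answer into grade().
assert grade(a, b, str(truth))          # a correct answer passes
assert not grade(a, b, str(truth + 1))  # an off-by-one fails
```

Run it over many seeds and digit counts and you get a failure rate rather than a single anecdote, which is what the disagreement here actually hinges on.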
I see that your prompt includes 'Do not use any tools. If you do, write "I USED A TOOL"'
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use Python for math problems, not just the GPT family.
Would it convince you if we use the GPT Pro api and explicitly not allow tool access?
Is that enough to falsify?
No, it wouldn't be enough to falsify.
This isn't an experiment a consumer of the models can actually run. If you have a chance to read the article I linked: it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look inside the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned, and they will ignore instructions without telling you.
The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python and similar tools were the answer to that inconsistency.
What do you mean? You can explicitly restrict access to the tools. You are factually incorrect here.
I believe you're referring to the tools array? https://developers.openai.com/api/docs/guides/tools/
Those are external tools that you are allowing the model to have access to. There is a suite of internal tools that the model has access to regardless.
The external python tool is there so it can provide the user with python code that they can see.
You can read a bit more about the distinction between the internal and external tool capabilities here: https://community.openai.com/t/fun-with-gpt-5-code-interpret...
"I should explain that both the “python” and “python_user_visible” tools execute Python code and are stateful. The “python” tool is for internal calculations and won’t show outputs to the user, while “python_user_visible” is meant for code that users can see, like file generation and plots."
But really, the most important thing is that we as end users cannot know with any certainty whether the model used Python or not. That's what the alignment-faking article describes.
> To avoid timeouts, try using background mode. As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) reasoning.effort: high. GPT-5 pro does not support code interpreter.
You are wrong, per the link you shared: it was about ChatGPT, not the API. The documentation makes it unambiguously clear that GPT-5 Pro does not support the code interpreter. Unless you think they secretly run it, which is a conspiracy theory, is that enough to falsify?
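For what it's worth, the API-side restriction being argued about here comes down to an empty tools array in the request body. A hedged sketch of that payload (the shape and the "gpt-5-pro" model name come from the docs quoted above; this says nothing about whatever internal machinery the hosted model may or may not have):

```python
def build_request(prompt: str) -> dict:
    """Build a request body that grants the model no external tools.

    Note: this only controls *external* tool access. Whether the hosted
    model uses internal tooling is not observable from the API side,
    which is the other commenter's point.
    """
    return {
        "model": "gpt-5-pro",
        "input": prompt,
        "tools": [],  # explicitly empty: no code interpreter, no web search
    }

req = build_request("Multiply 31415926535897932384 by 27182818284590452353.")
assert req["tools"] == []
```

So the API experiment is stronger than the ChatGPT one, but it still only falsifies "the model was granted external tools", not "the provider's stack did something besides pure token generation".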
> Unless you think they secretly run it which is a conspiracy
tbh this doesn't sound like a conspiracy to me at all. There's no reason why they couldn't have an internal subsystem in their product which detects math problems and hands off the token generation to an intermediate, more optimized Rust program or something, which does math on the cheap instead of burning massive amounts of GPU resources. This would just be a basic cost optimization that would make their models both more effective and cheaper. And there's no reason why they would need to document this in their API docs, because they don't document any other internal details of the model.
I'm not saying they actually do this, but I think it's totally reasonable to think that they would, and it would not surprise me at all if they did.
Let's not get hung up on the "conspiracy" thing, though: the whole point is that these models are closed source, so we don't know what we're actually testing when we run these "experiments". It could be a pure LLM or a hybrid LLM + classical reasoning system. We don't know.
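The hypothesized hand-off is trivially cheap to build, which is part of why it's plausible. A toy sketch (entirely hypothetical; no provider documents anything like this): detect a bare multiplication in the prompt and answer it with exact integer arithmetic instead of generating tokens.

```python
import re

# Hypothetical router. Nothing like this is documented by any provider;
# it just illustrates how cheap such a cost optimization would be.
MUL = re.compile(r"\s*(\d+)\s*[*x×]\s*(\d+)\s*=?\s*")

def route(prompt: str):
    """Return ("exact", answer) for a bare multiplication, else ("llm", None)."""
    m = MUL.fullmatch(prompt)
    if m:
        a, b = int(m.group(1)), int(m.group(2))
        return "exact", str(a * b)  # exact bigint math, no GPU time burned
    return "llm", None  # fall through to normal token generation
```

From the outside, a system with this router in front of it is indistinguishable from a model that "learned arithmetic", which is exactly the observability problem being described.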