Large Language Model Reasoning Failures

arxiv.org

40 points by T-A 3 days ago


sergiomattei - 3 days ago

Papers like these are a much-needed bucket of ice water. We anthropomorphize these systems too much.

Skimming the conclusions and results, I see the authors find that LLMs exhibit failures along many axes we'd consider demonstrative of AGI: moral reasoning, simple things like counting that a toddler can do, etc. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.

So. If you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.

Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...

Lapel2742 - 2 days ago

> These models fail significantly in understanding real-world social norms (Rezaei et al., 2025), aligning with human moral judgments (Garcia et al., 2024; Takemoto, 2024), and adapting to cultural differences (Jiang et al., 2025b). Without consistent and reliable moral reasoning, LLMs are not fully ready for real-world decision-making involving ethical considerations.

LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.

simianwords - 2 days ago

i'm very skeptical of this paper.

>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).

This is very misleading and I think flat-out wrong. What's the best way to falsify this claim?

Edit: I tried falsifying it.

https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...

https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...

I provided really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what is normally expected and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and LLMs do reason well.
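If anyone wants to repeat this without trusting a shared chat link, here's a minimal sketch of a checking harness: Python's arbitrary-precision ints give exact ground truth for 20-digit products, and a per-digit score shows whether a model's errors cluster in the middle digits, as the paper claims. The `claimed` value is a placeholder you'd paste a model answer into; no part of this calls any model API.

```python
import random

def make_problem(digits=20, seed=None):
    """Generate a random digits x digits multiplication with exact ground truth."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return a, b, a * b  # exact product via Python big ints

def digit_accuracy(truth, claimed):
    """Fraction of digit positions where the claimed answer matches the truth.

    Printing the per-position mismatches instead would show whether errors
    concentrate in the middle digits, which is the specific failure mode
    the paper describes.
    """
    t, c = str(truth), str(claimed).zfill(len(str(truth)))
    return sum(x == y for x, y in zip(t, c)) / len(t)

a, b, truth = make_problem(seed=42)
print(f"What is {a} * {b}?")   # paste this prompt into the model
# claimed = int("<model answer here>")  # hypothetical: fill in by hand
# print(digit_accuracy(truth, claimed))
```

A correct answer scores 1.0; a pattern-matched answer that gets the leading and trailing digits right but garbles the middle would score noticeably lower in a way that's easy to see by eye.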

To anyone who would disagree: can you provide a counterexample that can't be solved using GPT-5 Pro but that a normal student could do without mistakes?