Reasoning models don't always say what they think

anthropic.com

392 points by meetpateltech a day ago


lsy - a day ago

The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM bespeaks the lack of rigor in this field. The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it! They aren't references to internal concepts; the model is not aware that it's doing anything, so how could it "explain itself"?

CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the chance of resolving some syllogism in the training data: one inference cycle tells you that "man" has something to do with "mortal" and "Socrates" has something to do with "man", but two cycles will put both of those into the context window and let you get statistically closer to "Socrates" having something to do with "mortal". But given that the training/RLHF for CoT revolves around generating long chains of human-readable "steps", the chain can't really be explanatory for a process which is essentially statistical.
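
To make the syllogism point concrete, here is a toy sketch. It is not how an LLM works internally; it only illustrates the claim that writing intermediate terms into the context lets a system that knows nothing but pairwise associations reach a two-hop conclusion. The names `ASSOCIATIONS`, `inference_cycle`, and `run` are made up for this example.

```python
# Toy model: the "model" only knows direct pairwise associations.
# Hypothetical data, chosen to mirror the Socrates example above.
ASSOCIATIONS = {
    "Socrates": ["man"],
    "man": ["mortal"],
}


def inference_cycle(context):
    """One cycle: append every term directly associated with a term already in context."""
    new_terms = []
    for term in context:
        for linked in ASSOCIATIONS.get(term, []):
            if linked not in context and linked not in new_terms:
                new_terms.append(linked)
    return context + new_terms


def run(prompt, cycles):
    """Run several cycles, feeding each cycle's output back in as context."""
    context = list(prompt)
    for _ in range(cycles):
        context = inference_cycle(context)
    return context


print(run(["Socrates"], 1))  # ['Socrates', 'man']
print(run(["Socrates"], 2))  # ['Socrates', 'man', 'mortal']
```

After one cycle "mortal" is still out of reach; it only shows up once "man" has been written into the context. Note that the list of intermediate terms is just the state the next cycle reads from, which is a separate question from whether it faithfully describes how each association was computed.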