Dispelling misconceptions about RLHF

aerial-toothpaste-34a.notion.site

120 points by fpgaminer 4 days ago


josh-sematic - 4 days ago

The mechanisms the author describes are used for RLHF, but are not sufficient for training the recent slew of “reasoning models.” To do that, you have to generate rewards not based on proximity to some reference full-answer transcript, but rather based on how well the final answer (i.e. the part after the “thinking tokens”) meets your reward criteria. This turns out to be a lot harder to do than the mechanisms used for RLHF, which is one reason why we had RLHF for a while before we got the “reasoning models.” It’s also the only way you can understand the Sutskever quote “You’ll know your RL is working when the thinking tokens are no longer English” (a paraphrase, pulled from my memory).
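To make that concrete, here is a minimal sketch of a final-answer-only reward, assuming the model wraps its reasoning in <think> tags and that exact match against a reference answer is an acceptable check (real pipelines use unit tests, math verifiers, etc.):

    import re

    def final_answer_reward(completion: str, reference_answer: str) -> float:
        """Score only the final answer that follows the thinking tokens."""
        # Drop the thinking block; the reasoning itself is never scored directly.
        final = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
        # Reward the final answer against a verifiable criterion (here: exact match),
        # not its proximity to a reference transcript.
        return 1.0 if final == reference_answer.strip() else 0.0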

einrealist - 4 days ago

> “Successful” is importantly distinct from “correct.”

This is the most important sentence describing the fundamental issue that LLMs have. This severely limits the technology's useful applications. Yet OpenAI and others constantly lie about it.

The article very clearly explains why models won't be able to generalise unless RL is performed constantly. But that's not scalable and has other problems of its own. For example, it still runs into a paradox where the training mechanism has to know the answer in order to formulate the question. (This is precisely where the concept of World Models comes in, or why symbolism becomes more important.)

LLMs perform well in highly specialised scenarios with a well-defined and well-known problem space. It's probably possible to increase accuracy and correctness by using lots of interconnected models that can perform RL with each other. Again, this raises questions of scale and feasibility. But I think our brains (together with the other organs) work this way.

vertere - 4 days ago

I'm confused about their definition of RL.

> ... SFT is a subset of RL.

> The first thing to note about traditional SFT is that the responses in the examples are typically human written. ... But it is also possible to build the dataset using responses from the model we’re about to train. ... This is called Rejection Sampling.

I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?
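For reference, the rejection sampling described there is essentially best-of-k sampling from the model itself; a sketch, where model.generate and reward_model.score are assumed interfaces rather than any particular library's API:

    def build_rejection_sampling_dataset(model, reward_model, prompts, k=8):
        """Best-of-k rejection sampling: the SFT targets come from the model itself."""
        dataset = []
        for prompt in prompts:
            # Sample k candidate responses from the current model.
            candidates = [model.generate(prompt) for _ in range(k)]
            # Keep only the highest-scoring candidate.
            best = max(candidates, key=lambda r: reward_model.score(prompt, r))
            dataset.append((prompt, best))
        # These pairs are then used as ordinary SFT training data.
        return dataset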

macleginn - 4 days ago

Everything the post says about the behaviour of OpenAI models seems to be based on pure speculation.

thinkzilla - 3 days ago

While the post uses DPO to illustrate RL and RLHF, in fact DPO is an alternative to RLHF that does not use RL. See the abstract of the DPO paper https://arxiv.org/abs/2305.18290, and Figure 1 in the paper: "DPO optimizes for human preferences while avoiding reinforcement learning".
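Concretely, DPO replaces the reward-model-plus-policy-gradient loop with a single classification-style loss over preference pairs (y_w preferred to y_l), straight from the paper:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
            \log \sigma\!\left(
                \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
            \right)
        \right]

Since this is just maximum likelihood on a fixed preference dataset, there is no sampling from the policy during training, which is the sense in which it "avoids reinforcement learning".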

The confusion is understandable. The definition of RL in the Sutton/Barto book extends over two chapters iirc, and after reading it I did not see how it differed from other learning methods. Studying some of the academic papers cleared things up.

m-s-y - 4 days ago

RLHF -> Reinforcement Learning from Human Feedback

It’s not defined until the 13th paragraph of the linked article.

williamtrask - 4 days ago

Nit: the author says that supervised fine tuning is a type of RL, but it is not. RL is about delayed reward. Supervised fine tuning is not in any way about delayed reward.
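A toy way to see the distinction, in rough PyTorch where model, model.sample, and reward_fn are placeholder interfaces rather than real APIs: in SFT every token has an immediate target, while in a policy-gradient update the single reward only shows up after the whole sequence is generated.

    import torch.nn.functional as F

    def sft_step(model, input_ids, target_ids):
        # Supervised fine-tuning: a per-token loss against known targets,
        # so there is nothing delayed to credit back through time.
        logits = model(input_ids)                        # (seq_len, vocab_size), assumed
        return F.cross_entropy(logits, target_ids)

    def reinforce_step(model, input_ids, reward_fn):
        # Policy gradient (REINFORCE-style): the scalar reward arrives only
        # once the whole sequence is finished, then is credited to every
        # sampled token via its log-probability.
        sampled_ids, logprobs = model.sample(input_ids)  # assumed interface
        reward = reward_fn(sampled_ids)                  # delayed, end-of-episode signal
        return -(reward * logprobs.sum())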

mehulashah - 4 days ago

This article is really two articles. One describes RL; the other describes how they applied it. The former was quite helpful because it demystified much of the jargon I find in AI. All branches of science have jargon, but I find AI's especially impenetrable.

Nevermark - 4 days ago

Another way to do reinforcement learning is to train a model to judge the quality of its own answers, matching judgements from experts or synthetically created ones, until it develops an ability to judge its answer quality even if it can’t yet use that information to improve its responses.

It can be easier to recognize good responses than generate them.

Then feed it queries and have it generate both responses and judgements. Instead of training the responses to match response data, train it to output a high positive judgement, while holding its “judgement” weight values constant. The model is now being trained to give better answers, since the frozen judgement weights that the gradients are back-propagated through act as a distributor of information, from the judgement back to how the responses should change to improve.

Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.
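In rough pseudo-PyTorch, with judge, generator, and the datasets all stand-ins (and glossing over how gradients get through discrete token sampling), the two phases look like:

    import torch

    # Phase 1: train the judge to match expert (or synthetic) quality labels.
    judge_opt = torch.optim.Adam(judge.parameters())
    for query, response, expert_score in judgement_data:     # assumed dataset
        loss = (judge(query, response) - expert_score) ** 2
        judge_opt.zero_grad()
        loss.backward()
        judge_opt.step()

    # Phase 2: freeze the judge and train the generator to maximize its score.
    for p in judge.parameters():
        p.requires_grad_(False)                               # hold judgement weights constant
    gen_opt = torch.optim.Adam(generator.parameters())
    for query in queries:                                     # assumed query set
        response = generator(query)
        loss = -judge(query, response)                        # push toward high judgements
        gen_opt.zero_grad()
        loss.backward()
        gen_opt.step()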

This technique is closer to traditional human/animal reinforcement learning.

We learn to predict situations that will cause us pain or positive affect, then learn to choose actions that minimize our predictions of the bad and maximize our predictions of the good. That is a much more efficient way to learn than having to actually experience everything and always get explicit external feedback.

There are many, many ways to do reinforcement learning.

byyoung3 - 4 days ago

This seems to disagree with a lot of research showing that RL is not necessary for reasoning -- I'm not sure about alignment.

schlipity - 4 days ago

The site is designed poorly and is stopping me from reading the article. I use NoScript, and it immediately redirects me to a "Hey you don't have javascript enabled, please enable it to read" page that is on a different domain from the website the article is on. I tried to visit notion.site to try and whitelist it temporarily, but it redirects back to notion.so and notion.com.

Rather than jump through more hoops, I'm just going to give up on reading this one.