Dispelling misconceptions about RLHF

aerial-toothpaste-34a.notion.site

120 points by fpgaminer 4 days ago


josh-sematic - 4 days ago

The mechanisms the author describes are used for RLHF, but are not sufficient for training the recent slew of “reasoning models.” To do that, you have to generate rewards not based on proximity to some reference full-answer transcript, but rather based on how well the final answer (i.e. the part after the “thinking tokens”) meets your reward criteria. This turns out to be a lot harder to do than the mechanisms used for RLHF, which is one reason why we had RLHF for a while before we got the “reasoning models.” It’s also the only way you can understand the Sutskever quote “You’ll know your RL is working when the thinking tokens are no longer English” (a paraphrase, pulled from my memory).
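To make that concrete, here is a minimal sketch of a final-answer-only reward, assuming the model wraps its reasoning in <think> tags and that exact match against a reference answer is an acceptable check (real pipelines use unit tests, math verifiers, etc.):

    import re

    def final_answer_reward(completion: str, reference_answer: str) -> float:
        """Score only the final answer that follows the thinking tokens."""
        # Drop the thinking block; the reasoning itself is never scored directly.
        final = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
        # Reward the final answer against a verifiable criterion (here: exact match),
        # not its proximity to a reference transcript.
        return 1.0 if final == reference_answer.strip() else 0.0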

einrealist - 4 days ago

> “Successful” is importantly distinct from “correct.”

This is the most important sentence describing the fundamental issue that LLMs have. This severely limits the technology's useful applications. Yet OpenAI and others constantly lie about it.

The article very clearly explains why models won't be able to generalise unless RL is performed constantly. But that's not scalable and has other problems of its own. For example, it still runs into a paradox where the training mechanism has to know the answer in order to formulate the question. (This is precisely where the concept of World Models comes in, or why symbolism becomes more important.)

LLMs perform well in highly specialised scenarios with a well-defined and well-known problem space. It's probably possible to increase accuracy and correctness by using lots of interconnected models that can perform RL with each other. Again, this raises questions of scale and feasibility. But I think our brains (together with the other organs) work this way.

vertere - 4 days ago

I'm confused about their definition of RL.

> ... SFT is a subset of RL.

> The first thing to note about traditional SFT is that the responses in the examples are typically human written. ... But it is also possible to build the dataset using responses from the model we’re about to train. ... This is called Rejection Sampling.

I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?
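For reference, the rejection sampling described there is essentially best-of-k sampling from the model itself; a sketch, where model.generate and reward_model.score are assumed interfaces rather than any particular library's API:

    def build_rejection_sampling_dataset(model, reward_model, prompts, k=8):
        """Best-of-k rejection sampling: the SFT targets come from the model itself."""
        dataset = []
        for prompt in prompts:
            # Sample k candidate responses from the current model.
            candidates = [model.generate(prompt) for _ in range(k)]
            # Keep only the highest-scoring candidate.
            best = max(candidates, key=lambda r: reward_model.score(prompt, r))
            dataset.append((prompt, best))
        # These pairs are then used as ordinary SFT training data.
        return dataset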

macleginn - 4 days ago

Everything the post says about the behaviour of OpenAI models seems to be based on pure speculation.

thinkzilla - 3 days ago

While the post uses DPO to illustrate RL and RLHF, in fact DPO is an alternative to RLHF that does not use RL. See the abstract of the DPO paper https://arxiv.org/abs/2305.18290, and Figure 1 in the paper: "DPO optimizes for human preferences while avoiding reinforcement learning".
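Concretely, DPO replaces the reward-model-plus-policy-gradient loop with a single classification-style loss over preference pairs (y_w preferred to y_l), straight from the paper:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
            \log \sigma\!\left(
                \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
            \right)
        \right]

Since this is just maximum likelihood on a fixed preference dataset, there is no sampling from the policy during training, which is the sense in which it "avoids reinforcement learning".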

The confusion is understandable. The definition of RL in the Sutton/Barto book extends over two chapters iirc, and after reading it I did not see how it differed from other learning methods. Studying some of the academic papers cleared things up.

m-s-y - 4 days ago

RLHF -> Reinforcement Learning from Human Feedback

It’s not defined until the 13th paragraph of the linked article.

williamtrask - 4 days ago

Nit: the author says that supervised fine tuning is a type of RL, but it is not. RL is about delayed reward. Supervised fine tuning is not in any way about delayed reward.
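A toy way to see the distinction, in rough PyTorch where model, model.sample, and reward_fn are placeholder interfaces rather than real APIs: in SFT every token has an immediate target, while in a policy-gradient update the single reward only shows up after the whole sequence is generated.

    import torch.nn.functional as F

    def sft_step(model, input_ids, target_ids):
        # Supervised fine-tuning: a per-token loss against known targets,
        # so there is nothing delayed to credit back through time.
        logits = model(input_ids)                        # (seq_len, vocab_size), assumed
        return F.cross_entropy(logits, target_ids)

    def reinforce_step(model, input_ids, reward_fn):
        # Policy gradient (REINFORCE-style): the scalar reward arrives only
        # once the whole sequence is finished, then is credited to every
        # sampled token via its log-probability.
        sampled_ids, logprobs = model.sample(input_ids)  # assumed interface
        reward = reward_fn(sampled_ids)                  # delayed, end-of-episode signal
        return -(reward * logprobs.sum())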

mehulashah - 4 days ago

This article is really two articles. One describes RL; the other describes how they applied it. The former was quite helpful because it demystified much of the jargon I find in AI. All branches of science have jargon, but I find AI's especially impenetrable.

Nevermark - 4 days ago

Another way to do reinforcement learning is to train a model to judge the quality of its own answers, matching judgements from experts or synthetically created ones, until it develops an ability to judge its answer quality even if it can’t yet use that information to improve its responses.

It can be easier to recognize good responses than generate them.

Then feed it queries and have it generate both responses and judgements. Instead of training the responses to match response data, train it to output a high positive judgement, while holding its “judgement” weight values constant. The model is now being trained to give better answers, since the frozen judgement weights that the gradients are back-propagated through act as a distributor of information, from the judgement back to how the responses should change to improve.

Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.
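In rough pseudo-PyTorch, with judge, generator, and the datasets all stand-ins (and glossing over how gradients get through discrete token sampling), the two phases look like:

    import torch

    # Phase 1: train the judge to match expert (or synthetic) quality labels.
    judge_opt = torch.optim.Adam(judge.parameters())
    for query, response, expert_score in judgement_data:     # assumed dataset
        loss = (judge(query, response) - expert_score) ** 2
        judge_opt.zero_grad()
        loss.backward()
        judge_opt.step()

    # Phase 2: freeze the judge and train the generator to maximize its score.
    for p in judge.parameters():
        p.requires_grad_(False)                               # hold judgement weights constant
    gen_opt = torch.optim.Adam(generator.parameters())
    for query in queries:                                     # assumed query set
        response = generator(query)
        loss = -judge(query, response)                        # push toward high judgements
        gen_opt.zero_grad()
        loss.backward()
        gen_opt.step()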

This technique is closer to traditional human/animal reinforcement learning.

We learn to predict situations that will cause us pain or positive affect, then learn to choose actions that minimize our predictions of the bad and maximize our predictions of the good. That is a much more efficient way to learn than having to actually experience everything and always get explicit external feedback.

There are many, many ways to do reinforcement learning.

byyoung3 - 4 days ago

This seems to disagree with a lot of research showing that RL is not necessary for reasoning -- I'm not sure about alignment.

schlipity - 4 days ago

The site is designed poorly and is stopping me from reading the article. I use NoScript, and it immediately redirects me to a "Hey you don't have javascript enabled, please enable it to read" page that is on a different domain from the website the article is on. I tried to visit notion.site to try and whitelist it temporarily, but it redirects back to notion.so and notion.com.

Rather than jump through more hoops, I'm just going to give up on reading this one.