The inefficiency of RL, and implications for RLVR progress

dwarkesh.com

118 points by cubefox 6 days ago


bogtog - 3 days ago

The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning per rollout, from rewarding success/failure.

However, the way I'm seeing this is that an RL rollout may involve, say, 100 small decisions out of a pool of 1,000 possible decisions. Each training step will slightly upregulate or downregulate each of those decisions in its own context. There will be uncertainty about which decision was helpful or harmful -- we only have 1 bit of information, after all -- but this setup, where many small decisions are slowly learned across many examples, seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01-bit insights across 100 contexts). There may be some benefits not captured by comparing the number of bits relative to pretraining.
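A toy back-of-the-envelope version of that bit-spreading argument, using the comment's hypothetical 100-of-1,000 numbers (purely illustrative, not anything measured in the article):

    # One pass/fail outcome spread over the decisions exercised in a rollout.
    outcome_bits = 1.0                  # binary reward for the whole rollout
    decisions_per_rollout = 100         # small decisions touched by this rollout
    bits_per_decision = outcome_bits / decisions_per_rollout   # 0.01 bits each

    # The same decision recurs across many other rollouts/contexts, so the nudges add up.
    contexts_seen = 100
    accumulated_bits = bits_per_decision * contexts_seen
    print(bits_per_decision, accumulated_bits)   # 0.01 1.0 -- same total, spread across contexts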

As the blog says, "Fewer bits, sure, but very valuable bits" -- this seems like a separate factor that would also hold: learning these small decisions may be vastly more valuable for producing accurate outputs than learning through pretraining.

derbOac - 3 days ago

There are some insights there about the base rate of correct responses and using pretraining to boost it. Basically, it's searching a suboptimal versus an optimal region of the model space, at a suboptimal versus an optimal rate.

I think the framing of the discussion in general is somewhat misleading, though, because it avoids the question of "informationally inefficient about what?"

In RL, the model is becoming more informative about a stimulus-action-feedback space; in SL the model is becoming more informative about a stimulus-feedback space. RL is effectively "built for" searching a larger space.

In situations like the essay where you are directly comparing SL and RL, you're kind of saying for RL "the action space is restricted to dictionary X and the feedback space is binary yes or no" and for SL "the feedback space is restricted to dictionary X". So in a certain sense you're equating the RL action space to the SL feedback space.

In that case, maybe searching over suboptimal regions of the RL-action-SL-feedback space is inefficient. But the reason RL exists, I think, is that it generalizes to situations where the feedback and action spaces are bigger. Maybe you want to differentially associate different responses with different rewards, or sample from a response space so large that you can't define it a priori. Then SL breaks down?

Maybe this is obvious, but I get a little uneasy about discussing the information efficiency of RL and SL without a broader framework for equivalence and for what information the model is representing in each case. It seems to me RL is a kind of superset of SL in terms of what it is capable of representing, which maybe leads to inefficiencies when it's not being used to its fullest.
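A minimal sketch of the distinction being drawn, with hypothetical types (nothing here is from the essay): supervised learning consumes (stimulus, target) pairs where the feedback *is* the desired output, while RL consumes (stimulus, action, reward) triples where the action is sampled from a space the learner has to explore.

    from dataclasses import dataclass

    @dataclass
    class SupervisedExample:
        prompt: str
        target: str      # the feedback is the desired output itself, drawn from "dictionary X"

    @dataclass
    class RLExample:
        prompt: str
        action: str      # sampled by the policy from a potentially huge action space
        reward: float    # a scalar judgement of the action, not the answer itself

Restricting the RLExample action field to the same dictionary as the SupervisedExample target is the equivalence the comment describes; once the action space grows past anything you can enumerate, the supervised interface no longer applies.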

macleginn - 3 days ago

In the limit, in the "happy" case (positive reward), policy gradients boil down to performing more or less the same update as the usual supervised strategy for each generated token (or some subset of those if we use sampling). In the unhappy case, they penalise the model for selecting particular tokens in particular circumstances -- this is not something you can normally do with supervised learning, but it is unclear to what extent it is helpful (if a bad and a good answer share a prefix, that prefix will be reinforced in one case and penalised in the other, not in exactly the same way, but still). So during on-policy learning we desperately need the model to stumble on correct answers often enough, and this can only happen if the model knows how to solve the problem to begin with; otherwise the search space is too big. In other words, while in supervised learning we moved away from hand-providing inductive biases and instead trusted models to figure everything out by themselves, in RL this does not really seem possible.
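A minimal PyTorch-style sketch of that point (illustrative names and shapes, not any particular lab's training code): with reward = +1 the loss below is exactly the supervised cross-entropy on the sampled tokens, and with reward = -1 the sign flips and the same token choices are pushed down.

    import torch
    import torch.nn.functional as F

    def reinforce_loss(logits, sampled_tokens, reward):
        """logits: (seq_len, vocab_size); sampled_tokens: (seq_len,); reward: e.g. +1.0 or -1.0."""
        log_probs = F.log_softmax(logits, dim=-1)
        # Log-probability the policy assigned to each token it actually sampled.
        token_log_probs = log_probs.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
        # REINFORCE: minimise -reward * log pi(token | context).
        # reward = +1 recovers ordinary cross-entropy on the sampled tokens (the "happy" case);
        # reward = -1 penalises exactly those token choices (the "unhappy" case).
        return -(reward * token_log_probs).mean()

    # Toy usage: 5 generated tokens over a 32k vocabulary.
    logits = torch.randn(5, 32000, requires_grad=True)
    tokens = torch.randint(0, 32000, (5,))
    reinforce_loss(logits, tokens, reward=+1.0).backward()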

swordsmith - a day ago

Seems like he thinks RLVR == learning from a binary reward for the whole chain, completely discounting techniques that provide denser rewards, like process reward supervision?
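For contrast, a rough sketch of the denser-reward setup being alluded to (assumed shapes and a hypothetical step_ids mapping, not a specific paper's method): a process reward model scores each intermediate reasoning step, and each step's score is applied to that step's tokens instead of one terminal bit covering the whole chain.

    import torch
    import torch.nn.functional as F

    def process_reward_pg_loss(logits, sampled_tokens, step_rewards, step_ids):
        """logits: (T, vocab); sampled_tokens, step_ids: (T,); step_rewards: (num_steps,).
        step_ids[t] says which reasoning step token t belongs to."""
        log_probs = F.log_softmax(logits, dim=-1)
        token_log_probs = log_probs.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
        # Broadcast each step's score onto its own tokens: credit assignment is now
        # per step rather than one binary outcome for the whole rollout.
        per_token_reward = step_rewards[step_ids]
        return -(per_token_reward * token_log_probs).mean()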

hereme888 - 2 days ago

Recent results like PEFT-Bench (arxiv.org/abs/2511.21285) found that while SFT is efficient for formatting, it actually degraded Llama-3-8B's reasoning on math and code tasks compared to the base model.

So is RL required to preserve those logic circuits?

There seems to be a trade-off between compute efficiency and formatting on the one hand, and intelligence on the other.

a-dub - 2 days ago

I think in order to make this kind of argument you would need to be able to show all of the trajectories that are effectively reachable as a result of pre-training, and then how much effective pruning takes place from the total adjustment of the weights in response to one RL sample.

scaredginger - 3 days ago

Bit of a nitpick, but I think his terminology is wrong. Like RL, pretraining is also a form of *un*supervised learning.

andyjohnson0 - 3 days ago

Since it is not explicitly stated, "RL" in this article means Reinforcement Learning.

https://en.wikipedia.org/wiki/Reinforcement_learning