Self-Distillation Enables Continual Learning [pdf]

arxiv.org

53 points by teleforce 7 hours ago


ArchieScrivener - 5 hours ago

From Jan 2026.

This is very interesting:

"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

airstrike - 5 hours ago

Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

I find the word choices "enables" in the title and "establishing" at the end of the abstract particularly jarring.

teleforce - 3 hours ago

Fun fact: this paper is cited by the Simple Self-Distillation (SSD) paper from Apple [1],[2]. I think the name is unfortunate, both because SSD is already a very common acronym and because the method really belongs to the on-policy self-distillation family [3]. Then again, according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."
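
For what it's worth, here is a minimal sketch of that sentence as I read it: sample completions from the model at a shifted temperature, then run standard next-token cross-entropy on those samples. The model name, temperature, and optimizer settings below are my assumptions, not from the Apple paper:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-7B-Instruct"  # any open-weight base/instruct model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def self_distill_step(prompt, temperature=0.7):
        # 1) Sample a completion from the model itself at a shifted temperature.
        enc = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            sampled = model.generate(**enc, do_sample=True,
                                     temperature=temperature, max_new_tokens=256)
        # 2) Standard cross-entropy on the sampled tokens (prompt masked out).
        labels = sampled.clone()
        labels[:, : enc["input_ids"].shape[1]] = -100
        loss = model(input_ids=sampled, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
        return loss.item()

Whether the sampling side stays frozen at the original base weights or tracks the student as it trains is a detail this sketch glosses over, so treat it as the shape of the recipe rather than the recipe itself.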

The Apple paper also cites another very similar self-distillation paper from a UCLA team. Both cited papers, one by an MIT & ETH team and the other by the UCLA team, propose novel on-policy self-distillation techniques. Interestingly, both teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.

IMHO, self-distillation fine-tuning is the future of LLM fine-tuning because it mitigates the forgetting inherent in the SFT approach, which makes SFT cumbersome for lightweight fine-tuning as opposed to full post-training of an LLM.

With the proliferation of open-source and open-weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (medical sub-specialties, legal disciplines, branches of architectural practice, etc.) [6]. This fine-tuning can be performed with as little as 8 H200 or even 4 H100 GPUs, as reported in the two papers respectively [4],[5]. Let's see if we can replicate that with much cheaper setups, say a couple of DGX Sparks, or the newer eight-node DGX Spark cluster with a total of 1 TB of RAM (128 GB x 8) [7],[8].

IMHO, if the results are valid, self-distillation could be the second-best thing to happen to LLMs after the transformer.

[1] Embarrassingly simple self-distillation improves code generation (2026 - 201 comments):

https://news.ycombinator.com/item?id=47637757

[2] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

[3] Comment on "Embarrassingly simple self-distillation improves code generation":

https://news.ycombinator.com/item?id=47644784

[4] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[5] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[6] Why domain specific LLMs won't exist: an intuition (2026 - 4 comments):

https://news.ycombinator.com/item?id=47649167

[7] NVIDIA DGX Spark Review The GB10 Machine is so Freaking Cool:

https://www.servethehome.com/nvidia-dgx-spark-review-the-gb1...

[8] BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster:

https://www.servethehome.com/big-cluster-little-power-the-8x...

greesil - 4 hours ago

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.
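
If that's roughly right, then in code terms the policy is just the distribution over the next token given everything generated so far. A toy sketch, assuming a Hugging Face-style causal LM (not tied to any of the papers above):

    import torch

    def policy(model, input_ids, temperature=1.0):
        # pi(action | state): state = the token prefix so far,
        # action = the next token, policy = softmax over the vocabulary.
        logits = model(input_ids=input_ids).logits[:, -1, :]
        return torch.softmax(logits / temperature, dim=-1)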