Self-Adapting Language Models

arxiv.org

233 points by archon1410 2 days ago


https://jyopari.github.io/posts/seal

xianshou - 2 days ago

The self-edit approach is clever - using RL to optimize how models restructure information for their own learning. The key insight is that different representations work better for different types of knowledge, just like how humans take notes differently for math vs history.

Two things that stand out:

- The knowledge incorporation results (47% vs 46.3% with GPT-4.1 data, both much higher than the small-model baseline) show the model does discover better training formats, not just more data. Though the catastrophic forgetting problem remains unsolved, and it's not completely clear whether data diversity is improved.

- The computational overhead is brutal - 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.

The restriction to tasks with explicit evaluation metrics is the main limitation. You need ground truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content where you can generate evaluations, this could significantly improve how we process new information.
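Concretely, the reward step as I understand it (my own sketch, not the authors' code; the helper names are made up) is: the model writes a self-edit, you fine-tune a disposable copy on it, and the reward is just accuracy on the held-out Q&A pairs:

    # Hypothetical sketch of the reward signal, not the paper's code.
    # finetune_copy and answer are placeholders for a real LoRA fine-tuning run
    # and a real generation call.

    def finetune_copy(base_model, self_edit_text):
        """Fine-tune a disposable copy of the model on the self-edit (placeholder)."""
        raise NotImplementedError

    def answer(model, question):
        """Generate an answer with the adapted model (placeholder)."""
        raise NotImplementedError

    def reward(base_model, self_edit_text, qa_pairs):
        # The expensive part: every candidate self-edit costs a fine-tune plus an
        # eval pass, which is where the 30-45 seconds per reward evaluation comes from.
        adapted = finetune_copy(base_model, self_edit_text)
        correct = sum(answer(adapted, q).strip() == a.strip() for q, a in qa_pairs)
        return correct / len(qa_pairs)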

Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.

gavinray - 2 days ago

Two close friends of mine who were math prodigies and went on to do ML very early (mid-2010s) were always talking to me about an algorithm that sounds similar to this:

"NEAT/HyperNEAT" (Neuroevolution of Augmented Topologies) [0]

I'm no ML practitioner, but as I understood it, the primary difference between NEAT and what is described in this paper is that while NEAT evolves the topology of the network, this paper seems to evolve the weights.

Seems like two approaches trying to solve the same problem -- one evolving the network structure, the other the weights.

Those 2 friends are quite possibly the most intelligent people I've ever met, and they were very convinced that RL and evolutionary algorithms were the path forward in ML.

[0] https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t...

cma - 2 days ago

From Anthropic a couple of days ago too, self-finetuning:

https://arxiv.org/html/2506.10139v1

perrygeo - a day ago

> Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks

The learning and inference processes are entirely separate, which is very confusing to people familiar with traditional notions of human intelligence. For humans, learning things and applying that knowledge in the real world is one integrated feedback process. Not so with LLMs: we train them, deploy them, and discard them for a new model that has "learned" slightly more. For an LLM, inference is the end of learning.

Probably the biggest misconception out there about AI. If you think LLMs are learning, it's easy to fantasize that AGI is right around the corner.

libraryofbabel - 2 days ago

I wonder if anyone who’s really in the know could summarize where the research is at with getting LLMs to learn “on the job” (through continuous fine tuning or whatever) and what the blockers are to this being a useful deployable thing, e.g. having a model+coding agent that can actually learn a codebase over time (cost? model collapse? something else?).

I’m sure this is something the big labs are trying but from the outside as a user of LLMs it feels like people don’t talk about this very much and instead the focus right now is on better training (eg reinforcement learning) with the assumption that anything else not learned during training will be stuffed into the context somehow as needed. But from a naive perspective the lack of learning from experience after training seems like the biggest thing standing between us and AGI.

yahoozoo - 2 days ago

Hmm, it looks like it's just a framework that fine-tunes a LoRA adapter and then merges it into the original model. It uses PeftModel and its "merge_and_unload" from the Hugging Face PEFT library, which performs the adapter merge into the base model… what is new here, exactly?
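For reference, that's just the standard PEFT pattern (my own sketch; the model name and LoRA hyperparameters below are placeholders, not taken from the repo):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative only: "Qwen/Qwen2.5-7B" and the LoRA settings are placeholders.
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_cfg)

    # ... fine-tune `model` on the generated self-edit data ...

    # merge_and_unload() folds the adapter weights back into the base model,
    # returning a plain transformers model with permanently updated weights.
    merged = model.merge_and_unload()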

all2 - 2 days ago

Website with code and examples: https://jyopari.github.io/posts/seal

neuroelectron - a day ago

My CPU is a neural-net processor; a learning computer. But Skynet presets the switch to read-only when we're sent out alone.

Centigonal - 2 days ago

It seems to me that "forgetting correctly" is rapidly becoming a more pertinent problem in this field than "learning correctly." We're making great strides in getting models to teach themselves new facts, but the state of the art in jettisoning the least relevant information given new knowledge and finite capacity is lagging far behind.

"Forgetting correctly" is something most human brains are exceptionally good at, too. I wonder how that works...

khalic - 2 days ago

> Villalobos et al. [75] project that frontier LLMs will be trained on all publicly available human-generated text by 2028. We argue that this impending “data wall” will necessitate the adoption of synthetic data augmentation. Once web-scale corpora is exhausted, progress will hinge on a model’s capacity to generate its own high-utility training signal. A natural next step is to meta-train a dedicated SEAL synthetic-data generator model that produces fresh pretraining corpora, allowing future models to scale and achieve greater data efficiency without relying on additional human text.

2028 is pretty much tomorrow… fascinating insight

ivape - 2 days ago

This still relies on fine-tuning. How would a cloud LLM deal with this if every user literally fine-tunes it? Seems like something destined for local private LLMs, but the notion of continuous fine-tuning locally is sci-fi level stuff at the moment because the hardware is just not there yet (we can barely run inference well with a reasonably sized context).

b0a04gl - 2 days ago

What about the optimiser itself? You tune the representation format using reward signals, but once that format drifts, you've got no visibility into whether it's still aligned with the task or just gaming the eval. Without a second layer to monitor the optimiser's behaviour over time, there's no way to tell if you're improving reasoning or just getting better at scoring. Anyone have ideas?

b0a04gl - a day ago

Wait, so if the model edits its own weights mid-run, how do you even debug it? How do you know whether a wrong output came from the base model or from the edits it made to itself?

mackenziebowes - 2 days ago

I'm frustrated that they named it SEAL when SAL is both more accurate and anthropomorphic. Naming the main takeoff technology after a stereotypical swarthy Reuben lover would have made history much more delightful.

bravesoul2 - 2 days ago

Getting closer to the event horizon

bigicaptain - 2 days ago

How can I start

seaourfreed - 2 days ago

[flagged]