Distillation makes AI models smaller and cheaper

quantamagazine.org

164 points by pseudolus 6 days ago


FlyingLawnmower - 3 days ago

Sidenote, but the scholarship on distillation always makes me a bit sad. The original work, cited in the abstract of the Hinton, Vinyals, and Dean paper that everyone cites, was the model compression work from Caruana, Buciluǎ, and Niculescu-Mizil.

The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464

flukas88 - 3 days ago

It also makes OpenAI moan about companies stealing from them, when they themselves scraped the internet for free.

NitpickLawyer - 3 days ago

The article is pretty light on details, and misses (or I missed it if they mentioned it) an important distinction. There are two main types of distillation:

- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with the Qwen models: they took ~800k traces generated by R1 and ran SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can use as few as 1-2k traces and reach similar results. Much cheaper.

- logit/internal-representation-based methods, where you need access to the raw model, and for each query -> response pair you train the small model on the teacher's entire distribution over the logits at once. This method is suited to model creators, who can take a big + small model pair with the same architecture and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico, and so on.

The first method can be used via API access. The second one can't: you need access to internals that API providers won't give you.
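
To make the distinction concrete, here is a minimal sketch of the second (logit-based) approach in PyTorch, in the spirit of classic knowledge distillation. The model, batch, and optimizer names are placeholders, not any particular lab's pipeline.

    import torch
    import torch.nn.functional as F

    # Logit-matching loss: soften teacher and student distributions with a
    # temperature, minimize their KL divergence, and optionally mix in a
    # hard-label cross-entropy term. Logits are assumed flattened to
    # [tokens, vocab] and labels to [tokens].
    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
        kl = kl * temperature ** 2          # keep gradient scale comparable
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kl + (1 - alpha) * ce

    # One training step: the teacher is frozen, only the student is updated.
    def distill_step(student, teacher, batch, optimizer):
        with torch.no_grad():
            teacher_logits = teacher(batch["input_ids"])
        student_logits = student(batch["input_ids"])
        loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The first (completion-based) method, by contrast, is just ordinary SFT on teacher-generated traces, so it only needs the teacher's text output, which is why it works over an API.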

sebau - 3 days ago

I wonder how a company like OpenAI can be stolen from / distilled via API without them noticing, given the amount of data that is needed even for smaller models.

pyman - 3 days ago

In 2024, DeepSeek's researchers used the DeepSeek-R1 model to transfer knowledge to a smaller model using distillation:

https://malted.ai/deepseek-and-the-future-of-distillation/

Honest question:

Isn't this exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers?

It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”

Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?

funfunfunction - 3 days ago

There are even companies starting to offer distillation as a service https://inference.net/explore/model-training

phreeza - 2 days ago

One mind-bending thing is that self-distillation, meaning distilling one model into another of the same architecture, number of parameters, etc., also often works! https://arxiv.org/abs/2206.08491
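
For intuition, a minimal sketch of the self-distillation setup; the tiny MLP and initialization here are illustrative placeholders, not the paper's actual experiments.

    import copy
    import torch.nn as nn

    # Stand-in "teacher": pretend this tiny MLP is already trained.
    teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
    teacher.eval()

    # The student is the *same* architecture with the same parameter count;
    # only its weights start from scratch.
    student = copy.deepcopy(teacher)
    for p in student.parameters():
        nn.init.normal_(p, std=0.02)

    # The student is then trained to match the teacher's softened outputs
    # (e.g. with a KL loss like the one sketched earlier in the thread); the
    # surprising empirical result is that it can match or even beat the
    # teacher despite having no extra capacity.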

jgalt212 - 3 days ago

Distillation was formerly the key to usable self-hosted models. However, the unceasing pressure to be "agentic" has made self-hosting once again untenable. Agentic tools just hoover up too many tokens.

wizardforhire - 3 days ago

Obligatory [1]

My apologies for not being able to find the original tale. I'm sure the original website is still around, but this is a decent synopsis regardless.

Doesn't look like they cover it in the article, but if I remember correctly they pruned the model down to fit on a 56k EPROM that could originally be sold for $10 (also dating myself; this article claims $15).

And of course the jargon has changed with time. I guess we're saying "distilled" now; originally we said "pruned", because that's what you did: once you had your weights, you would prune the rest of the network to get the core model (a rough sketch of that is below). I guess "distilled" works too, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids, but I digress.

[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...
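
For contrast with distillation, a rough sketch of the older magnitude-pruning idea described above: keep the largest weights and zero out the rest. This is a generic illustration, not the 20Q toy's actual procedure.

    import torch
    import torch.nn as nn

    def prune_by_magnitude(model: nn.Module, keep_fraction: float = 0.2):
        """Zero out all but the largest `keep_fraction` of weights per linear layer."""
        with torch.no_grad():
            for module in model.modules():
                if isinstance(module, nn.Linear):
                    w = module.weight
                    k = max(1, int(w.numel() * keep_fraction))
                    # k-th largest magnitude = (numel - k + 1)-th smallest
                    threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
                    mask = (w.abs() >= threshold).to(w.dtype)
                    w.mul_(mask)   # small weights become exact zeros
        return model

    # The surviving weights can then be stored sparsely (or quantized) to fit
    # a tiny memory budget, which is the "get the core model" step above.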

Animats - 3 days ago

A good question is whether you can grind a model specialized for, say, customer service for your products down to where it's really cheap to run on an ordinary server, maybe with a GPU card.

Are we really going to need all those giant AI data centers?

sebau - 3 days ago

For what it's worth, nearly all public models are distilled versions of bigger internal ones.

v3ss0n - 3 days ago

Sometimes better, sometimes dumber

visarga - 3 days ago

This is why SOTA LLMs can't manage to maintain a lead of more than a few months. There are half a million datasets on HuggingFace. Models are social: they learn from each other, learn from humans, and work together with humans and other models.