Distillation makes AI models smaller and cheaper

quantamagazine.org

164 points by pseudolus 6 days ago


FlyingLawnmower - 3 days ago

Sidenote, but the scholarship on distillation always makes me a bit sad. The original work, cited in the abstract of the Hinton, Vinyals, and Dean paper that everyone cites, was the model compression work from Caruana, Buciluǎ, and Niculescu-Mizil.

The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464

flukas88 - 3 days ago

It also makes OpenAI moan about companies stealing from them, when they themselves scraped the internet for free.

NitpickLawyer - 3 days ago

The article is pretty light on details, and misses (or I missed it if they mentioned it) an important distinction. There are two main types of distillation:

- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with the Qwen models: they took ~800k traces generated by R1 and ran SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can use as few as 1-2k traces and reach similar results. Much cheaper.

- logit/internal-representation-based methods, where you need access to the raw model, and for each query -> response pair you train the small model on the teacher's entire distribution over the logits at once. This method is suited to model creators, who can take a big + small model pair with the same architecture and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico, and so on.

The first method can be used via API access. The second one can't: you need access to internals that API providers won't give you.
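
To make the distinction concrete, here is a minimal sketch of the second (logit-based) approach in PyTorch, in the spirit of classic knowledge distillation. The model, batch, and optimizer names are placeholders, not any particular lab's pipeline.

    import torch
    import torch.nn.functional as F

    # Logit-matching loss: soften teacher and student distributions with a
    # temperature, minimize their KL divergence, and optionally mix in a
    # hard-label cross-entropy term. Logits are assumed flattened to
    # [tokens, vocab] and labels to [tokens].
    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
        kl = kl * temperature ** 2          # keep gradient scale comparable
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kl + (1 - alpha) * ce

    # One training step: the teacher is frozen, only the student is updated.
    def distill_step(student, teacher, batch, optimizer):
        with torch.no_grad():
            teacher_logits = teacher(batch["input_ids"])
        student_logits = student(batch["input_ids"])
        loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The first (completion-based) method, by contrast, is just ordinary SFT on teacher-generated traces, so it only needs the teacher's text output, which is why it works over an API.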

sebau - 3 days ago

I wonder how a company like OpenAI can be stolen from / distilled via API without them noticing, given the amount of data that is needed even for smaller models.

pyman - 3 days ago

In 2024, DeepSeek's researchers used the DeepSeek-R1 model to transfer knowledge to a smaller model using distillation:

https://malted.ai/deepseek-and-the-future-of-distillation/

Honest question:

Isn't this exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers?

It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”

Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?

funfunfunction - 3 days ago

There are even companies starting to offer distillation as a service https://inference.net/explore/model-training

phreeza - 2 days ago

One mind-bending thing is that self-distillation, meaning distilling one model into another of the same architecture, number of parameters, etc., also often works! https://arxiv.org/abs/2206.08491
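
For intuition, a minimal sketch of the self-distillation setup; the tiny MLP and initialization here are illustrative placeholders, not the paper's actual experiments.

    import copy
    import torch.nn as nn

    # Stand-in "teacher": pretend this tiny MLP is already trained.
    teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
    teacher.eval()

    # The student is the *same* architecture with the same parameter count;
    # only its weights start from scratch.
    student = copy.deepcopy(teacher)
    for p in student.parameters():
        nn.init.normal_(p, std=0.02)

    # The student is then trained to match the teacher's softened outputs
    # (e.g. with a KL loss like the one sketched earlier in the thread); the
    # surprising empirical result is that it can match or even beat the
    # teacher despite having no extra capacity.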

jgalt212 - 3 days ago

Distillation was formerly the key to usable self-hosted models. However, the unceasing pressure to be "agentic" has made self-hosting once again untenable. Agentic tools just hoover up too many tokens.

wizardforhire - 3 days ago

Obligatory [1]

My apologies for not being able to find the original tale. I'm sure the original website is still around, but this is a decent synopsis regardless.

Doesn't look like they cover it in the article, but if I remember correctly they pruned the model down to fit on a 56k EPROM that could originally be sold for $10 (also dating myself; this article claims $15).

And of course the jargon has changed with time. I guess we're saying "distilled" now; originally we said "pruned", because that's what you did: once you had your weights, you would prune the rest of the network to get the core model (a rough sketch of that is below). I guess "distilled" works too, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids, but I digress.

[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...
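
For contrast with distillation, a rough sketch of the older magnitude-pruning idea described above: keep the largest weights and zero out the rest. This is a generic illustration, not the 20Q toy's actual procedure.

    import torch
    import torch.nn as nn

    def prune_by_magnitude(model: nn.Module, keep_fraction: float = 0.2):
        """Zero out all but the largest `keep_fraction` of weights per linear layer."""
        with torch.no_grad():
            for module in model.modules():
                if isinstance(module, nn.Linear):
                    w = module.weight
                    k = max(1, int(w.numel() * keep_fraction))
                    # k-th largest magnitude = (numel - k + 1)-th smallest
                    threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
                    mask = (w.abs() >= threshold).to(w.dtype)
                    w.mul_(mask)   # small weights become exact zeros
        return model

    # The surviving weights can then be stored sparsely (or quantized) to fit
    # a tiny memory budget, which is the "get the core model" step above.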

Animats - 3 days ago

A good question is whether you can grind a model specialized for, say, customer service for your products down to where it's really cheap to run on an ordinary server, maybe with a GPU card.

Are we really going to need all those giant AI data centers?

sebau - 3 days ago

For what it's worth, nearly all public models are distilled versions of bigger internal ones.

v3ss0n - 3 days ago

Sometimes better, sometimes dumber

visarga - 3 days ago

This is why SOTA LLMs can't manage to maintain a lead of more than a few months. There are half a million datasets on HuggingFace. Models are social: they learn from each other, learn from humans, and work together with humans and other models.