The lottery ticket hypothesis: why neural networks work

130 points by 076ae80a-3c97-4 3 days ago

Enjoyed the article. To play devil’s advocate, an entirely different explanation for why huge models work: the primary insight was framing the problem as next-word prediction. This immediately creates an internet-scale dataset with trillions of labeled examples, which also has rich enough structure to make huge expressiveness useful. LLMs don’t disprove bias-variance tradeoff; we just found a lot more data and the GPUs to learn from it.

It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)

pixl97 - 3 days ago

I think a lot of it is the massive amount of compute we've got in the last decade. While inference may have been possible on the hardware the training would have taken lifetimes.
- graemep - 2 days ago
  
  I have a textbook somewhere in the house from about 2000 that says that there is no point having more than three layers in a neural network.
  Compute was just too expensive to have neural networks big enough for this not to be true.
  - AndrewOMartin - 15 hours ago
    
    Once you have three layers (i.e. one "hidden" layer) then you can map to arbitrary functions, so a three layer network has the same "power" as an arbitrarily large network.
    I'm sure that's what the text book meant, rather than any point about the expense of computing power.
  - leumassuehtam - 2 days ago
    
    People believe that more parameters would lead to overfit instead generalization. The various regularization methods we use today to avoid overfit hadn't been discovered yet. Your statement is mostly likely about this.
    
    Silphendio - 2 days ago
    
    I think the problems with big network were diminishing gradients, which is why we now use the ReLU activation function, and training stability, which were solved with residual connections.
    Overfitting is the problem of having too little training data for your network size.
    
    graemep - 2 days ago
    
    Possibly, I would have to dig up the book to check. IIRC it did not mention overfitting but it was a long time ago.
kazinator - 2 days ago

But regarding this Lottery Ticket hypothesis, what it means is that a small percentage of the parameters can be identified such that: when those parameters are taken by themselves and reset to their original pre-training weights, and the resulting network is trained on the same data as the parent, it performs similarly to the parent. So in fact, it seems that far fewer parameters are needed to encode predictions across the Internet-scale dataset. The large model is just creating a space in which that small "rock star" subset of parameters can be automatically discovered. It's as if the training establishes a competition among small subsets of the network, where a winner emerges.
Perhaps there is a kind of unstable situation whereby once the winner starts to anneal toward predicting the training data, it is doing more and more of the predictive work. The more relevant the subset shows itself to the result, the more of the learning it captures, because it is more responsive to subsequent learning.
deepsun - 2 days ago

Same thing with Computer Vision, as Andrew Ng pointed out, the main thing that enabled the rapid progress was not new models, but mostly due to large _labeled_ datasets, particularly ImageNet.
- Nevermark - 2 days ago
  
  Yes larger usable datasets, paired with an acceleration of mainstream parallel computing power (GPUs), with increasing algorithm flexibility (CUDA).
  Without all three, progress would have been much slower.
- highfrequency - 2 days ago
  
  Do you have a link handy for where he says this explicitly?
  - noboostforyou - 2 days ago
    
    Here's any older interview where he talks about the need for accurate dataset labeling -
    "In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."
    https://spectrum.ieee.org/andrew-ng-data-centric-ai
unyttigfjelltol - 2 days ago

Language then would be the key factor enabling complex learning in meat space too? I feel like I’ve heard this debate before….
- xg15 - 2 days ago
  
  I think it doesn't have to follow. You could also generalize the idea and see learning as successfully being able to "use the past to predict the future" for small time increments. Next-word prediction would be one instance of this, but for humans and animals, you could imagine the same process with information from all senses. The "self-supervised" trainset is then just, well, life.
- mindwok - 2 days ago
  
  I'm no expert but have been thinking about this a lot lately. I wouldn't be surprised - language itself seems to be an expression of the ability to create an abstraction, distill the world into compressed representations, and manipulate symbols. It seems fundamental to human intelligence.
  - physix - 2 days ago
    
    As a layman, it helps me to understand the importance of language as a vehicle of intelligence by realizing that without language, your thoughts are just emotions.
    And therefore I always thought that the more you master a language the better you are able to reason.
    And considering how much we let LLMs formulate text for us, how dumb will we get?
    
    highfrequency - 2 days ago
    
    > without language, your thoughts are just emotions.
    Is that true though? Seems like you can easily have some cognitive process that visualizes things like cause and effect, simple algorithms or at least sequences of events.
    
    ajuc - 2 days ago
    
    > without language, your thoughts are just emotions
    That's not true. You can think "I want to walk around this building" without words, in abstract thoughts or in images.
    Words are a layer above the thoughts, not the thoughts themselves. You can confirm this if you ever had the experience of trying to say something but forgetting the right word. Your mind knew what it wants to say but it didn't knew the word.
    Chess players operate on sequences of moves dozen turns ahead in their minds using no words, seeing the moves on the virtual chessboards they imagine.
    Musicians hear the note they want to play in their minds.
    Our brains have full multimedia support.
    
    mindwok - 11 hours ago
    
    It's probably not as simple as just being emotions, but actually there's a really interesting example here: Helen Keller. In her autobiography she describes what it was like before she learned language, and how she remembers it being almost unconscious and just a mix of feelings and impulses. It's fascinating.
  - Sophira - 2 days ago
    
    In other words, we're rediscovering the lessons from George Orwell's Nineteen Eighty-Four. Language is central to understanding; remove subversive language and you remove the ability to even think about it.
- kazinator - 2 days ago
  
  I think that the takeaway message for meat space (if there is one) is that continuous life-long learning is where it is at: keep engaging your brain and playing the lottery in order to foster the winning tickets. Be exposed to a variety of stimuli and find relationships.
- agarsev - 2 days ago
  
  as a researcher in NLP slash computational linguistics, this is what I tend to think :) (maybe a less strong version, though, there are other kinds of thinking and learning).
  so I'm always surprised when some linguists decry LLMs, and cling to old linguistics paradigms instead of reclaiming the important role of language as (a) vehicle of intelligence.
Eisenstein - 2 days ago

Why does 'next-word prediction' explain why huge models work? You saying we needed scale, and saying we use next-word prediction, but how does one relate to the other? Diffusion models also exist and work well for images, and they do seem to work for LLMs too.
- krackers - 2 days ago
  
  I think it's the same underlying principle of learning the "joint distribution of things humans have said". Whether done autoregressively via LLMs or via diffusion models, you still end up learning this distribution. The insight seems to be the crazy leap that this is A) a valid thing to talk about and B) that learning this distribution gives you something meaningful.
  The leap is in transforming an ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "distribution set of valid/true things", since validity/truth is hard to define. It's something akin to "distribution of things a human might say" implemented in the "dumbest" possible way of "modeling the distribution of humanity's collective textual output".
  - - 2 days ago
    
    [deleted]
- highfrequency - 2 days ago
  
  To crack NLP we needed a large dataset of labeled language examples. Prior to next-word prediction, the dominant benchmarks and datasets were things like translation of English to German sentences. These datasets were on the order of millions of labeled examples. Next-word prediction turned the entire Internet into labeled data.
Salgat - 2 days ago

RNN worked that way too, the difference is that Transformers are parallelized, which is what made next-word prediction work so good, you could have an input thousands of tokens in length without needing your training to be thousands of times longer.

math_dandy - 2 days ago

I don't buy the narrative that the article is promoting.

I think the machine learning community was largely over overfitophobia by 2019 and people were routinely using overparametrized models capable of interpolating their training data while still generalizing well.

The Belkin et al. paper wasn't heresy. The authors were making a technical point - that certain theories of generalization are incompatible with this interpolation phenomenon.

The lottery ticket hypothesis paper's demonstration of the ubiquity of "winning tickets" - sparse parameter configurations that generalize - is striking, but these "winning tickets" aren't the solutions found by stochastic gradient descent (SGD) algorithms in practice. In the interpolating regime, the minima found by SGD are simple in a different sense perhaps more closely related to generalization. In the case of logistic regression, they are maximum margin classifiers; see https://arxiv.org/pdf/1710.10345.

The article points out some cool papers, but the narrative of plucky researchers bucking orthodoxy in 2019 doesn't track for me.

ActorNightly - 2 days ago

Yeah this article gets a whole bunch of history wrong.
Back in 2000s, the reason why nobody was pursuing neural nets was simply due to compute power, and the fact that you couldn't iterate fast enough to make smaller neural networks work.
People were doing genetic algorithms and PSO for quite some time. Everyone knew that multi dimentionality was the solution to overfitting - the more directions you can use to climb out of valleys the better the system performed.

derbOac - 3 days ago

In some sense, isn't this overfitting, but "hidden" by the typical feature sets that are observed?

Time and time again, some kind of process will identify some simple but absurd adversarial "trick stimulus" that throws off the deep network solution. These seem like blatant cases of over fitting that go unrecognized or unchallenged in typical life because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.

I guess I've not really thought of the bias-variance tradeoff necessarily as being about number of parameters, but rather, the flexibility of the model relative to the learnable information in the sample space. There's some formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data that is wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.

- 2 days ago

[deleted]

xg15 - 3 days ago

Wouldn't this imply that most of the inference time storage and compute might be unnecessary?

If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?

janalsncm - 2 days ago

Quick example, Kimi K2 is a recent large mixture of experts model. Each “expert” is really just a path within it. At each token, 32B out of 1T are active. This means only 3.2% are active for any one token.
- Sophira - 2 days ago
  
  That sounds surprisingly like "Humans only use 10% of their brain at any given time."
paulsutter - 3 days ago

That’s exactly how it works, read up on pruning. You can ignore most of the weights and still get great results. One issue is that sparse matrices are vastly less efficient to multiply.
But yes you’ve got it
tough - 3 days ago

someone on twitter was exploring and linked to some related papers where you can for example trim experts on a MoE model if you're 100% sure they're never active for your specific task
what the bigger wide net bigs you is generalization
markeroon - 3 days ago

Look into pruning
FuckButtons - 3 days ago

For any particular single pattern learned 99% of the weights are dead weight. But it’s not the same 99% for each lesson learned.

gotoeleven - 3 days ago

This article gives a really bad/wrong explanation of the lottery ticket hypothesis. Here's the original paper

https://arxiv.org/abs/1803.03635

brulard - 2 days ago

Thanks for the 42 page long document. Can you explain in few words why you evaluated it as "really bad/wrong explanation"?
- frrlpp - a day ago
  
  What are LLMs for?
jeremyscanvic - 2 days ago

Exactly what I was looking for while reading the post. Thanks!

nitwit005 - 2 days ago

> For over 300 years, one principle governed every learning system

This seems strangely worded. I assume that date is when some statistics paper was published, but there's no way to know with no definition or citations.

littlestymaar - 2 days ago

There is in fact a footnote about the date:
> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.
- nitwit005 - 2 days ago
  
  They added it after my post.

belter - 3 days ago

This article is like a quick street rap. Lots of rhythm, not much thesis. Big on tone, light on analysis...Or no actual thesis other than a feelgood factor. I want these 5 min back.

JasonSage - 2 days ago

On the other hand, as somebody not well-read in AI I found it to be a rather intuitive explanation for why pruning helps avoid the overfitting scenario I learned when I first touched neural networks in the ‘10s.
Sure, this could’ve been a paragraph, but it wasn’t. I don’t think it’s particularly offensive for that.
fgfarben - 2 days ago

Do you think a GPT that already trained on something "feels" the same way when reading it a second time?

quantgenius - 2 days ago

The idea that simply having a lot of parameters leads to overfitting was shown to not be the case over 30 years ago by Vapnik et al. He proved that a large number of parameters is fine so long as you regularize enough. This is why Support Vector Machines work and I believe has a lot to do with why deep NNs work.

The issue with Vapnik's work is that it's pretty dense and actually figuring out the Vapnik-Chervonekis (VC) dimension etc is pretty complicated, and one can develop pretty good intuition once you understand the stuff without having to actually calculate, so most people don't take the time to do the calculation. And frankly, a lot of the time, you don't need to.

There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.

Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.

- 2 days ago

[deleted]

api - 3 days ago

This sounds like it's proposing that what's happening during large model training is a little bit akin to genetic algorithms: many small networks emerge and there is a selection process, some get fixed, and the rest fade and are then repurposed/drifted into other roles, repeat.

doctoboggan - 2 days ago

This article definitely feels like chatgptese.

Also, I don't necessarily feel like the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint it seems like the size of weights on disk would have to meet or exceed the size of the training data (modulo lossless encryption techniques) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that proof that the weights are some sort of generalization of the input data rather than a memorization?

porridgeraisin - 2 days ago
1) yes it's definitely chatgpt
2) The weights are definitely a generalization. The compression-based argument is sound.
3) There is definitely no overfitting. The article however used the word over-parameterization, which is a different thing. And LLMs are certainly over-parameterized. They have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. This is not a bad thing though.
Just like having an over-parameterized database schema:
```
  quiz(id, title, num_qns)
  question(id, text, answer, quiz_id FK)
```
can be good for performance sometimes,
The lottery ticket hypothesis as chatgpt explained in TFA means that over-parameterization can also be good for neural networks sometimes. Note that this hypothesis is strictly tied to the fact that we use SGD (or adam or ...) as the optimisation algorithm. SGD is known to be biased towards generalized compressions [the lottery ticket hypothesis hypothesises why this is so]. That is to say, it's not an inherent property of the neural network architecture or transformers or such.

abhinuvpitale - 3 days ago

Interesting article, is it concluding that different small networks are formed for different types of problems that we are trying to solve with the larger network?

How is this different from overfitting though? (PS: Overfitting isn't that bad if you think about it, as long as the test dataset or inference time model is trying to solve problems in the supposedly large enough training dataset)

ghssds - 3 days ago

Can someone explain how AI research can have a 300 years history?

woadwarrior01 - 2 days ago

300 years is a stretch. But Legendre described linear regression ~220 years ago (1805). And from a very high level perspective, modern neural networks are mostly just stacks of linear regression layers with non-linearities sandwiched between them. I'm obviously oversimplifying it a lot, but that't the gist of it.
anthonj - 2 days ago

"For over 300 years, one principle governed every learning system: the bias-variance tradeoff."
The bias-variance tradeoff is a very old concept in statistics (but not sure how old, might very well be 300)
Anyway note the first algorithms realted to neural networks are older then the digital computer by a decade at least.
littlestymaar - 2 days ago

Maybe it wasn't there originally, but now there's a footnote:
> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.↩

deepfriedchokes - 3 days ago

Rather than reframing intelligence itself, wouldn’t Occam’s Razor suggest instead that this isn’t intelligence at all?

aeternum - 2 days ago

IMO Occam's Razor suggests that this is exactly what intelligence is.
The ability to compress information, specifically run it through a simple rule that allows you to predict some future state.
The rules are simple but finding them is hard. The ability to find those rules, compress information, and thus predict the future efficiently is the very essence of intelligence.
pixl97 - 2 days ago

I don't really think that is what Occam’s Razor is about. The Razor says the simplest answer is most likely the best, but we already know that intelligence is very complex so the simplest answer to intelligence is still going to be a massively complex solution.
In some ways this answer does fit Occam's Razor by saying the simplicity is simply scale, not complex algorithms.
gavmor - 2 days ago

> Intelligence isn't about memorising information—it's about finding elegant patterns that explain complex phenomena. Scale provides the computational space needed for this search, not storage for complicated solutions.
I think the word finding is overloaded, here. Are we "discovering," "deriving," "deducing," or simple "looking up" these patterns?
If "finding" can be implemented via a multi-page tour—ie deterministic choose-your-own-adventure—of a three-ring-binder (which is, essentially, how inference operates) then we're back at Searle's Chinese Room, and no intelligence is operative at runtime.
On the other hand, if the satisfaction of "finding" necessitates the creative synthesis of novel records pertaining to—if not outright modeling—external phenomena, ie "finding" a proof, then arguably it's not happening at training time, either.
How many novel proofs have LLMs found?
- akomtu - 2 days ago
  
  Even simpler: intelligence is the art of simplifying. LLMs can fool us if they reduce a book into one wise-looking statement, but remove the deceptive medium - our language - and tell it to reduce a vast dataset of points into one formula, and LLMs will show how much intelligence they truly have.
Eisenstein - 2 days ago

Unless you can provide a definition for intelligence which is internally consistent and does not exclude things are obviously intelligent or include things which are obviously not intelligent, the only thing occam's razor suggests is that the basis for solving novel problems is the ability to pattern match combined with a lot of background knowledge.

solbob - 3 days ago

[dead]

jfrankle - 2 days ago

whyyy