The lottery ticket hypothesis: why neural networks work

nearlyright.com

130 points | 3 days ago


highfrequency - 3 days ago

Enjoyed the article. To play devil's advocate, here's an entirely different explanation for why huge models work: the primary insight was framing the problem as next-word prediction. This immediately creates an internet-scale dataset with trillions of labeled examples, one with rich enough structure to make huge expressiveness useful. LLMs don't disprove the bias-variance tradeoff; we just found a lot more data, and the GPUs to learn from it.

It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)

math_dandy - 2 days ago

I don't buy the narrative that the article is promoting.

I think the machine learning community was largely over overfitophobia by 2019 and people were routinely using overparametrized models capable of interpolating their training data while still generalizing well.

The Belkin et al. paper wasn't heresy. The authors were making a technical point - that certain theories of generalization are incompatible with this interpolation phenomenon.

The lottery ticket hypothesis paper's demonstration of the ubiquity of "winning tickets" - sparse parameter configurations that generalize - is striking, but these "winning tickets" aren't the solutions found by stochastic gradient descent (SGD) in practice. In the interpolating regime, the minima found by SGD are simple in a different sense, one perhaps more closely related to generalization. In the case of logistic regression, they are maximum-margin classifiers; see https://arxiv.org/pdf/1710.10345.

The article points out some cool papers, but the narrative of plucky researchers bucking orthodoxy in 2019 doesn't track for me.
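
For anyone curious, here's a rough numerical illustration of that implicit max-margin bias (my own toy example with made-up data, not code from the linked paper):

    # Gradient descent on unregularized logistic loss over linearly separable
    # data: ||w|| grows without bound, but the direction w/||w|| drifts toward
    # the max-margin separator (the min margin of the normalized weights
    # keeps increasing).
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),     # class +1
                   rng.normal([-2, -2], 0.3, (20, 2))])  # class -1
    y = np.array([1] * 20 + [-1] * 20)

    w = np.zeros(2)
    for step in range(1, 200_001):
        margins = y * (X @ w)
        grad = -(y[:, None] * X * (1 / (1 + np.exp(margins)))[:, None]).mean(0)
        w -= 0.1 * grad
        if step % 50_000 == 0:
            w_hat = w / np.linalg.norm(w)
            print(f"step {step:>6}: ||w|| = {np.linalg.norm(w):6.2f}, "
                  f"min margin of w/||w|| = {(y * (X @ w_hat)).min():.4f}")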

derbOac - 3 days ago

In some sense, isn't this overfitting, but "hidden" by the typical feature sets that are observed?

Time and time again, some kind of process will identify some simple but absurd adversarial "trick stimulus" that throws off the deep network's solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in everyday use because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.

I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather about the flexibility of the model relative to the learnable information in the sample space. There are formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data that is wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.
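
For reference, the Shtarkov-Rissanen normalized maximum likelihood (NML) distribution has this standard form (quoted from memory; exact conditions omitted), where \hat\theta(x^n) denotes the maximum-likelihood estimate for data x^n:

    p_{\mathrm{NML}}(x^n) = \frac{p\left(x^n \mid \hat\theta(x^n)\right)}
                                 {\sum_{y^n} p\left(y^n \mid \hat\theta(y^n)\right)},
    \qquad
    \mathrm{COMP}_n = \log \sum_{y^n} p\left(y^n \mid \hat\theta(y^n)\right)

The complexity term COMP_n sums the model's best-case fit over every possible dataset of length n, including ones far outside any typical training set, which is exactly the notion of flexibility described above, and it need not track parameter count.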

xg15 - 3 days ago

Wouldn't this imply that most of the inference-time storage and compute might be unnecessary?

If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?
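
For concreteness, here's a toy numpy sketch (my own illustration, not the paper's code) of the one-shot global magnitude-pruning step that would discard those "failed tickets" after training; the practical catch is that unstructured sparsity doesn't map neatly onto dense GPU matrix multiplies, so the zeros don't automatically save compute:

    # Keep the largest-magnitude weights across all layers, zero the rest.
    # Layer shapes and the 90% sparsity level are arbitrary choices here.
    import numpy as np

    def global_magnitude_prune(weights, sparsity=0.9):
        all_mags = np.concatenate([np.abs(w).ravel() for w in weights.values()])
        threshold = np.quantile(all_mags, sparsity)   # cut below this magnitude
        masks = {name: np.abs(w) >= threshold for name, w in weights.items()}
        pruned = {name: w * masks[name] for name, w in weights.items()}
        kept = sum(m.sum() for m in masks.values()) / all_mags.size
        print(f"kept {kept:.1%} of weights")
        return pruned, masks

    rng = np.random.default_rng(0)
    weights = {"layer1": rng.normal(size=(512, 512)),
               "layer2": rng.normal(size=(512, 10))}
    pruned, masks = global_magnitude_prune(weights, sparsity=0.9)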

gotoeleven - 3 days ago

This article gives a really bad/wrong explanation of the lottery ticket hypothesis. Here's the original paper:

https://arxiv.org/abs/1803.03635

nitwit005 - 2 days ago

> For over 300 years, one principle governed every learning system

This seems strangely worded. I assume that date refers to when some statistics paper was published, but there's no way to know, since no definition or citation is given.

belter - 3 days ago

This article is like a quick street rap: lots of rhythm, not much thesis. Big on tone, light on analysis... or no actual thesis other than a feel-good factor. I want those 5 minutes back.

quantgenius - 2 days ago

The idea that simply having a lot of parameters leads to overfitting was shown not to be the case over 30 years ago by Vapnik et al. He proved that a large number of parameters is fine so long as you regularize enough. This is why Support Vector Machines work, and I believe it has a lot to do with why deep NNs work.

The issue with Vapnik's work is that it's pretty dense, and actually figuring out the Vapnik-Chervonenkis (VC) dimension etc. is pretty complicated. Once you understand the material you can develop pretty good intuition without having to do the calculation, so most people don't take the time to do it. And frankly, a lot of the time, you don't need to.
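
For those who haven't seen it, the flavor of bound being referenced looks roughly like this (a standard textbook form, quoted from memory; constants and exact conditions vary by statement). With probability at least 1 - \eta over n i.i.d. samples, for every classifier f in a class of VC dimension h:

    R(f) \le R_{\mathrm{emp}}(f)
        + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\eta}}{n}}

The gap between test and training error is controlled by the capacity h, which regularization constrains, not by the raw parameter count.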

There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.

Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.

api - 3 days ago

This sounds like it's proposing that what happens during large-model training is a little bit akin to a genetic algorithm: many small networks emerge, there's a selection process, some get fixed, and the rest fade and are then repurposed or drift into other roles; repeat.

doctoboggan - 2 days ago

This article definitely feels like chatgptese.

Also, I don't necessarily feel like the size of LLMs even comes close to what would be needed to overfit the data. From a very unscientific standpoint, it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that proof that the weights are some sort of generalization of the input data rather than a memorization of it?
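
A back-of-envelope version of that argument, with made-up but ballpark numbers (a hypothetical 70B-parameter model and a ~15T-token corpus, not figures from the article):

    # The weights are hundreds of times smaller than the training corpus, so
    # verbatim memorization of all of it is impossible even in principle.
    params = 70e9                  # assumed 70B-parameter model
    bytes_per_param = 2            # assumed 16-bit storage
    weight_bytes = params * bytes_per_param

    tokens = 15e12                 # assumed ~15T training tokens
    bytes_per_token = 4            # rough average for text
    corpus_bytes = tokens * bytes_per_token

    print(f"weights: {weight_bytes / 1e12:.2f} TB")        # ~0.14 TB
    print(f"corpus:  {corpus_bytes / 1e12:.2f} TB")        # ~60 TB
    print(f"ratio:   {corpus_bytes / weight_bytes:.0f}x")  # corpus is hundreds of times larger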

abhinuvpitale - 3 days ago

Interesting article. Is it concluding that different small networks form for the different types of problems we're trying to solve with the larger network?

How is this different from overfitting, though? (PS: overfitting isn't that bad if you think about it, as long as the test set, or whatever the model is asked at inference time, consists of problems covered by the supposedly large enough training dataset.)

ghssds - 3 days ago

Can someone explain how AI research can have a 300-year history?

deepfriedchokes - 3 days ago

Rather than reframing intelligence itself, wouldn’t Occam’s Razor suggest instead that this isn’t intelligence at all?

jfrankle - 2 days ago

whyyy