Puzzling Success of Overparameterization: Lottery Tickets or Escape Dimensions?

infoscience.epfl.ch

43 points by rbanffy 2 days ago


cherryteastain - 7 hours ago

A related viewpoint is that overparametrization helps because the optimizer only gets stranded at a point where the Hessian has all positive/zero eigenvalues, i.e. a local minimum with no descent direction left [1]. If we treat whether a particular Hessian eigenvalue turns out positive as an independent Bernoulli trial, the chance of all eigenvalues being positive/zero decreases exponentially as the parameter count increases.

[1] https://arxiv.org/abs/1406.2572
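
To put rough numbers on the Bernoulli argument, here is a minimal sketch. It assumes each eigenvalue is non-negative independently with the same probability p, which is a simplification made for the sake of the argument and not a claim about real Hessians. The function name and the value p = 0.5 are made up for illustration.

```python
def prob_all_nonnegative(p: float, n: int) -> float:
    """Chance that all n eigenvalues are non-negative, assuming each one
    is non-negative independently with probability p (a toy assumption)."""
    return p ** n


# The probability of a "stranding" point shrinks geometrically with n.
for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  P(all eigenvalues >= 0) = {prob_all_nonnegative(0.5, n):.3e}")
```

Under this toy model, every added parameter multiplies the chance of a true local minimum by p, so almost every critical point in a large model is a saddle with at least one escape direction.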

HarHarVeryFunny - 4 hours ago

I have a very hand-wavy explanation for how (but not fully why) overparametrized nets tend to generalize rather than overfit.

First a couple of facts:

1) An ANN works by learning decision boundaries that separate and group training samples and their associated labels.

2) If you train an overparametrized net on random data then it will memorize it, but if you train it on consistently labelled, structured data lying on some lower dimensional manifold, then rather than memorizing it, it will instead generalize. So the behavior depends on the nature of the data it is trained on.

Now the hand-wavy bit:

As training progresses the weights move the decision surfaces around until each training sample maps to a region of output space corresponding to the correct label, with these regions of output/latent space being separated by the learnt decision surfaces.

Initially during training (up to the double descent phase in cases where that happens) these regions of "gerrymandered" output space may only correspond to a single or very few training samples, so there may be multiple disconnected regions each mapping to the label "cat", and another group of disconnected regions each mapping to the label "dog". This is the overfitting phase.

Now, if the data permits, with the data manifold being consistently labelled (nothing that looks like a cat being labelled a dog), there will often be potential to merge some of these disconnected regions of output space that map to the same label. So, for example, we might go from four small regions of "cat" space to two larger merged regions of "cat" space. This is the mechanism of generalization, with the extra space contained by the merged regions corresponding to interpolation - no training samples "forced" those larger merged regions, but also none prevented it (there was no "dog" that looks like a cat).

The question then remains why the dynamics of training may cause the decision surfaces to initially be highly "gerrymandered" (because it's easier?), but on continued training to merge (because without any dogs among the cats there is no reason not to, and once merged no label error causing them to unmerge - a ratcheting up process from smaller to larger regions with increasing generalization?).

Scene_Cast2 - 9 hours ago

IIRC the original author of the Lottery Ticket Hypothesis now disavows that idea.

One intuitive way of looking at it is like so - let's say that you have a gaussian-looking plot. You want to fit a gaussian. You have a stupid simple model where you can slide your gaussian left and right.

If your initial starting point happens to be roughly within range, great, your optimizer will take care of it for you and slide it into the correct place. If you're too far, too bad, no meaningful gradient.

Instead, neural nets give you the option to spawn a gaussian anywhere you please. In this case, no sliding is necessary, but it comes at a heavy parametrization cost.
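
A quick numerical sketch of the "too far, no meaningful gradient" point, assuming a unit-width gaussian model, a mean squared error loss, and a target bump centered at 0. All names, the grid, and the specific offsets are arbitrary choices for illustration.

```python
import numpy as np


def gaussian(x, mu, sigma=1.0):
    """Unnormalized gaussian bump centered at mu."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def loss_grad_mu(x, y, mu, sigma=1.0):
    """Gradient of the mean squared error with respect to the center mu."""
    pred = gaussian(x, mu, sigma)
    # d(pred)/d(mu) = pred * (x - mu) / sigma^2
    dpred_dmu = pred * (x - mu) / sigma ** 2
    return np.mean(2.0 * (pred - y) * dpred_dmu)


x = np.linspace(-30.0, 30.0, 2001)
y = gaussian(x, mu=0.0)  # the data: a bump centered at 0

near = loss_grad_mu(x, y, mu=1.0)   # starting point overlaps the data
far = loss_grad_mu(x, y, mu=20.0)   # starting point is far out in the tail

print(f"gradient when starting at mu=1:  {near:.3e}")
print(f"gradient when starting at mu=20: {far:.3e}")
```

Starting at mu=1 the gradient is clearly nonzero and points the optimizer back toward 0. Starting at mu=20 the two bumps barely overlap, so the gradient is numerically indistinguishable from zero and the single slider has nothing to follow.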

vatsachak - 5 hours ago

Isn't this trivial?

What's more interesting is why double descent happens.
