
Moritz Haas

@mohaas.bsky.social

IMPRS-IS PhD student @ University of Tübingen with Ulrike von Luxburg and Bedartha Goswami. Mostly thinking about deep learning theory. Also interested in ML for climate science. mohawastaken.github.io

250 Followers  |  149 Following  |  11 Posts  |  Joined: 21.11.2024

Latest posts by mohaas.bsky.social on Bluesky

This is joint work with wonderful collaborators @leenacvankadara.bsky.social, @cevherlions.bsky.social, and Jin Xu during our time at Amazon.

🧵 10/10

10.12.2024 07:08 · 👍 3 · 🔁 1 · 💬 0 · 📌 0


How to achieve correct scaling with arbitrary gradient-based perturbation rules? 🤔

✨In µP, scale perturbations like updates in every layer.✨

💡 Gradients and incoming activations generally scale LLN-like (law-of-large-numbers averaging), as they are correlated.

➡️ Perturbations and updates have similar scaling properties.

🧵 9/10

10.12.2024 07:08 · 👍 2 · 🔁 0 · 💬 1 · 📌 0


In experiments across MLPs and ResNets on CIFAR10 and ViTs on ImageNet1K, we show that µP² indeed jointly transfers optimal learning rate and perturbation radius across model scales and can improve training stability and generalization.

🧵 8/10

10.12.2024 07:08 · 👍 3 · 🔁 0 · 💬 1 · 📌 0

... there exists a ✨unique✨ parameterization with layerwise perturbation scaling that fulfills all of our constraints:

(1) stability,
(2) feature learning in all layers,
(3) effective perturbations in all layers.

We call it the ✨Maximal Update and Perturbation Parameterization (µP²)✨.

🧵 7/10

10.12.2024 07:08 · 👍 2 · 🔁 0 · 💬 1 · 📌 0

Hence, we study arbitrary layerwise learning-rate, initialization-variance, and perturbation scaling.

For us, an ideal parametrization should fulfill the following: updates and perturbations of all weights have a non-vanishing and non-exploding effect on the output function. 💡
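In symbols, a hedged paraphrase of that criterion (notation chosen here, not necessarily the paper's): if W denotes all weights and ΔW_ℓ resp. ε_ℓ change only layer ℓ, then both should move the output by an order-one amount as the width n grows.

```latex
% Hedged paraphrase of the criterion above (notation chosen here, not taken
% from the paper): per-layer updates and perturbations should change the
% output f by a width-independent, order-one amount.
\[
  f(x;\, W + \Delta W_\ell) - f(x;\, W) = \Theta_n(1)
  \qquad\text{and}\qquad
  f(x;\, W + \varepsilon_\ell) - f(x;\, W) = \Theta_n(1)
  \quad \text{for every layer } \ell.
\]
```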

We show that ...

🧵 6/10

10.12.2024 07:08 · 👍 1 · 🔁 0 · 💬 1 · 📌 0


... we show that µP is not able to consistently improve generalization or to transfer SAM's perturbation radius, because it effectively only perturbs the last layer. ❌

💡 So we need to allow layerwise perturbation scaling!

🧵 5/10

10.12.2024 07:08 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
[Link preview: Feature Learning in Infinite-Width Neural Networks]

The maximal update parametrization (µP) by arxiv.org/abs/2011.14522 is a layerwise scaling rule of learning rates and initialization variances that yields width-independent dynamics and learning rate transfer for SGD and Adam in common architectures. But for standard SAM, ...
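For the learning-rate-transfer part, the µP authors' open-source `mup` package can be used; a minimal sketch of its usual workflow (set_base_shapes, MuReadout, MuAdam), with the caveat that exact signatures may differ between package versions:

```python
import torch.nn as nn
# `mup` is the muP authors' package; the calls below follow its documented
# workflow but should be checked against the installed version.
from mup import MuAdam, MuReadout, set_base_shapes

def make_mlp(width: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(784, width),
        nn.ReLU(),
        nn.Linear(width, width),
        nn.ReLU(),
        MuReadout(width, 10),  # muP-aware output layer
    )

# A narrow "base" model fixes the reference shapes; comparing it with the wide
# model tells muP which dimensions grow with width.
base = make_mlp(width=64)
model = make_mlp(width=4096)
set_base_shapes(model, base)

# MuAdam applies muP's layerwise learning-rate scaling, so a base learning rate
# tuned on the narrow model can be reused as the width grows.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```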

🧵 4/10

10.12.2024 07:08 · 👍 2 · 🔁 0 · 💬 1 · 📌 0

SAM and model scale are widely observed to improve generalization across datasets and architectures. But can we understand how to optimally scale in a principled way? 📈🤔

🧵 3/10

10.12.2024 07:08 · 👍 3 · 🔁 0 · 💬 1 · 📌 0

A short thread on effective SAM scaling follows here:
arxiv.org/pdf/2411.00075

A thread on our Mamba scaling is coming soon 🔜 from @leenacvankadara.bsky.social

🧵 2/10

10.12.2024 07:08 · 👍 3 · 🔁 0 · 💬 1 · 📌 0

Stable model scaling with width-independent dynamics?

Thrilled to present two papers at #NeurIPS 🎉 that study width scaling in Sharpness-Aware Minimization (SAM) (Thu 16:30, #2104) and in Mamba (Fri 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparameters across scales.

🧵 1/10

10.12.2024 07:08 · 👍 22 · 🔁 5 · 💬 1 · 📌 0

I'd love to be added :)

28.11.2024 19:28 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
