
Fabian Schaipp

@fschaipp.bsky.social

Researcher in Optimization for ML at Inria Paris. Previously at TU Munich. sbatch and apero. https://fabian-sp.github.io/

419 Followers  |  230 Following  |  15 Posts  |  Joined: 21.11.2024

Latest posts by fschaipp.bsky.social on Bluesky

yes, it does raise questions (and I don't have an answer yet). But I'm not sure the practical setting falls within the smooth case either (if smooth = Lipschitz smooth; and even if smooth = differentiable, there are non-differentiable elements in the architecture, like RMSNorm)

05.02.2025 15:15 · 👍 0   🔁 0   💬 0   📌 0
Preview
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedul...

This is joint work with @haeggee.bsky.social, Adrien Taylor, Umut Simsekli and @bachfrancis.bsky.social

๐Ÿ—ž๏ธ arxiv.org/abs/2501.18965

๐Ÿ”ฆ github.com/fabian-sp/lr...

05.02.2025 10:13 · 👍 2   🔁 0   💬 0   📌 0

Bonus: this provides a provable explanation for the benefit of cooldown: if we plug the wsd schedule into the bound, a log term (H_{T+1}) vanishes compared to the constant LR (dark grey).

05.02.2025 10:13 · 👍 1   🔁 0   💬 1   📌 0
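As a toy check of that log term (assuming H_{T+1} refers to the harmonic number, which is my reading and not stated explicitly in the post), here is a minimal Python sketch showing how it grows like log T:

```python
import math

def harmonic(n: int) -> float:
    """Harmonic number H_n = sum_{t=1}^{n} 1/t."""
    return sum(1.0 / t for t in range(1, n + 1))

# H_{T+1} grows like log(T + 1) + Euler-Mascheroni constant (~0.5772):
# this is the extra logarithmic factor a constant schedule pays,
# and which the wsd cooldown removes according to the post above.
for T in (10**2, 10**3, 10**4, 10**5):
    print(f"T={T:>7d}  H_(T+1)={harmonic(T + 1):7.3f}  "
          f"log(T+1)+gamma={math.log(T + 1) + 0.5772:7.3f}")
```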

How does this help in practice? In continued training, we need to decrease the learning rate in the second phase. But by how much?

Using the theoretically optimal schedule (which can be computed for free), we obtain noticeable improvements when training 124M and 210M models.

05.02.2025 10:13 · 👍 2   🔁 0   💬 1   📌 0
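A sketch of what "computed for free" could look like in practice: grid-search the second-phase learning rate against a theoretical criterion that only needs the step-size sequence, so no extra training runs are required. The `bound_value` callable below is a hypothetical stand-in for the convex bound evaluated in the paper, and the schedule shape is illustrative:

```python
import numpy as np

def two_phase_schedule(T1: int, T2: int, lr1: float, lr2: float,
                       cooldown_frac: float = 0.2) -> np.ndarray:
    """Phase 1 constant at lr1, phase 2 constant at lr2 with a linear cooldown at the end."""
    n_cool = int(cooldown_frac * T2)
    return np.concatenate([np.full(T1, lr1),
                           np.full(T2 - n_cool, lr2),
                           np.linspace(lr2, 0.0, n_cool)])

def pick_second_phase_lr(T1, T2, lr1, bound_value, candidates=None):
    """Return the candidate lr2 minimizing bound_value(schedule) -> float.

    bound_value is a placeholder for a theoretical bound; evaluating it only
    requires the step sizes, which is why the search is essentially free.
    """
    if candidates is None:
        candidates = lr1 * np.linspace(0.05, 1.0, 20)
    scores = [bound_value(two_phase_schedule(T1, T2, lr1, lr2)) for lr2 in candidates]
    return candidates[int(np.argmin(scores))]
```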

This lets us understand LR schedules beyond experiments: we study (i) the optimal cooldown length, and (ii) the impact of the gradient norm on schedule performance.
The second part suggests that the sudden drop in loss during cooldown happens when gradient norms do not go to zero.

05.02.2025 10:13 · 👍 1   🔁 0   💬 1   📌 0

Using a bound from arxiv.org/pdf/2310.07831, we can reproduce the empirical behaviour of the cosine and wsd (= constant + cooldown) schedules. Surprisingly, the result is for convex problems, yet it still matches the actual loss of (nonconvex) LLM training.

05.02.2025 10:13 · 👍 1   🔁 0   💬 1   📌 0
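For reference, a minimal sketch of the two schedule shapes being compared; the base learning rate and the 20% cooldown fraction are illustrative values, not the paper's settings:

```python
import numpy as np

def cosine_schedule(T: int, lr_max: float = 1e-3) -> np.ndarray:
    """Cosine decay from lr_max down to 0 over T steps."""
    t = np.arange(T)
    return 0.5 * lr_max * (1 + np.cos(np.pi * t / (T - 1)))

def wsd_schedule(T: int, lr_max: float = 1e-3, cooldown_frac: float = 0.2) -> np.ndarray:
    """Constant at lr_max, then a linear cooldown to 0 over the last cooldown_frac of steps."""
    n_cool = int(cooldown_frac * T)
    return np.concatenate([np.full(T - n_cool, lr_max),
                           np.linspace(lr_max, 0.0, n_cool)])
```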
Preview
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedul...

Learning rate schedules seem mysterious? Why does the loss go down so fast during cooldown?
It turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization.

A short thread on our latest paper 🚞

arxiv.org/abs/2501.18965

05.02.2025 10:13 · 👍 31   🔁 6   💬 2   📌 0

That time of the year again, when you delete a word and LaTeX manages to make the line <longer>.

24.01.2025 09:26 · 👍 2   🔁 0   💬 0   📌 0
A Bibliography Database for Machine Learning Getting the correct bibtex entry for a conference paper (e.g. published at NeurIPS, ICML, ICLR) is annoyingly hard: if you search for the title, you will often find a link to arxiv or to the pdf file,...

Want all NeurIPS/ICML/ICLR papers in one single .bib file? Here you go!

๐Ÿ—ž๏ธ short blog post: fabian-sp.github.io/posts/2024/1...

๐Ÿ“‡ bib files: github.com/fabian-sp/ml-bib

17.12.2024 10:42 · 👍 6   🔁 2   💬 0   📌 0

nice!
Figure 9 looks like a lighthouse guiding the way (towards the data distribution)

13.12.2024 16:27 · 👍 1   🔁 0   💬 0   📌 0
Preview
SGD with Clipping is Secretly Estimating the Median Gradient There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the pres...

you could run an online method for the quantile problem? Something similar to the online median in arxiv.org/abs/2402.12828

06.12.2024 08:19 · 👍 0   🔁 0   💬 0   📌 0
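A generic version of such an online method, as a minimal sketch (this is plain SGD on the pinball loss, not the method from the linked paper): for tau = 0.5 it tracks a running median, for other tau a quantile:

```python
import random

def online_quantile(stream, tau: float = 0.5, lr: float = 0.05, q0: float = 0.0) -> float:
    """Track the tau-quantile of a data stream via SGD on the pinball loss.

    Subgradient step: q <- q + lr * (tau - 1[x < q]).
    """
    q = q0
    for x in stream:
        q += lr * (tau - (1.0 if x < q else 0.0))
    return q

# Example: running median of a noisy stream centred at 3 (should end up near 3.0).
random.seed(0)
data = (random.gauss(3.0, 1.0) for _ in range(50_000))
print(online_quantile(data, tau=0.5))
```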

Generating cat videos is nice, but what if you could tackle real scientific problems with the same methods? 🧪🌌
Introducing The Well: 16 datasets (15TB) for Machine Learning, from astrophysics to fluid dynamics and biology.
🐙: github.com/PolymathicAI...
📜: openreview.net/pdf?id=00Sx5...

02.12.2024 16:08 · 👍 66   🔁 19   💬 3   📌 2

could you add me? ✌🏻

28.11.2024 07:32 · 👍 1   🔁 0   💬 1   📌 0

Not so fun exercise: take a recent paper that you consider exceptionally good, and one that you think is mediocre (at best).

Then look up their reviews on ICLR 2025. I find these reviews completely arbitrary most of the time.

25.11.2024 15:40 · 👍 1   🔁 0   💬 0   📌 0

my French 🇫🇷 digital bank (supposedly!) today asked me, by letter, to confirm an account action by sending them a signed letter. wtf

25.11.2024 12:41 · 👍 3   🔁 0   💬 0   📌 0

I made a #starterpack for computational math 💻🧮 so please
1. share
2. let me know if you want to be on the list!

(I have many new followers whom I don't know well yet, so I'm sorry if you follow me and want to be on here but aren't - drop me a note and I'll add you!)

go.bsky.app/DXdZkzV

18.11.2024 15:08 · 👍 60   🔁 28   💬 29   📌 1

would love to be added :)

22.11.2024 12:53 · 👍 1   🔁 0   💬 1   📌 0
