
Jacob Springer

@jacobspringer.bsky.social

Machine Learning (the science part) | PhD student @ CMU

140 Followers  |  125 Following  |  10 Posts  |  Joined: 03.12.2024

Latest posts by jacobspringer.bsky.social on Bluesky

Link preview: "Overtrained Language Models Are Harder to Fine-Tune" · Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this ...

The paper has so many other interesting details that have entirely changed the way I think about pre-training!

And thanks to my collaborators!
Sachin Goyal
Kaiyue Wen
Tanishq Kumar
@xiangyue96.bsky.social
@sadhika.bsky.social
@gneubig.bsky.social
@adtraghunathan.bsky.social

10/10

26.03.2025 18:35 · 👍 15  🔁 4  💬 1  📌 1

For the theorists in the room: we dive deeper into why this happens using a linear transfer learning setup, revealing that incremental learning leads to catastrophic overtraining.

9/10

26.03.2025 18:35 · 👍 1  🔁 0  💬 1  📌 0
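As a loose illustration of the incremental-learning effect post 9/10 refers to (plain gradient descent on a least-squares objective, not the paper's actual construction; all numbers below are made up for the sketch): directions with large covariance eigenvalues are fit within the first hundred or so steps, while small-eigenvalue directions take orders of magnitude longer.

```python
# Toy sketch of "incremental learning" in a linear setting (NOT the paper's
# construction): gradient descent on a population least-squares objective
# learns large-eigenvalue directions quickly and small-eigenvalue directions
# only after far more steps.
import numpy as np

eigvals = np.array([10.0, 1.0, 0.1, 0.01])  # assumed diagonal covariance spectrum
w_star = np.ones_like(eigvals)              # target weights, one per eigen-direction
w = np.zeros_like(eigvals)                  # initialization
lr = 1e-2                                   # step size (lr * max eigenvalue < 2, so stable)

for step in range(1, 20001):
    # Gradient of 0.5 * (w - w_star)^T Sigma (w - w_star) with Sigma = diag(eigvals);
    # the error in direction i shrinks by a factor (1 - lr * eigvals[i]) per step.
    grad = eigvals * (w - w_star)
    w -= lr * grad
    if step in (10, 100, 1000, 20000):
        print(f"step {step:>6}: |error| per direction =", np.round(np.abs(w - w_star), 3))
```

In this toy run the largest-eigenvalue direction is essentially learned within the first hundred steps, while the smallest is still visibly unconverged after 20,000 steps; the thread's claim is that this kind of staged learning is what eventually produces catastrophic overtraining in the paper's linear transfer setup.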
Post image

Fine-tuning behaves similarly: using a fixed learning rate across different pre-training checkpoints, we see eventual degradation in both task performance and web-data perplexity. This often holds even after hyperparameter tuning. Overtraining = worse fine-tuning outcomes!

8/10

26.03.2025 18:35 · 👍 1  🔁 0  💬 1  📌 0
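A minimal sketch of the comparison described in post 8/10: fine-tune a series of pre-training checkpoints with one fixed learning rate, then report task loss and web-data perplexity for each. The checkpoint names, toy texts, epoch count, and learning rate below are placeholders, not the setup used in the paper.

```python
# Sketch: fine-tune several pre-training checkpoints with the SAME fixed
# learning rate, then compare downstream (task) loss and held-out "web"
# perplexity. Checkpoint names and data here are hypothetical placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = ["org/model-1T-tokens", "org/model-2T-tokens", "org/model-3T-tokens"]  # placeholders
task_texts = ["Q: What is 2 + 2? A: 4", "Q: Capital of France? A: Paris"]  # toy task data
web_texts = ["The quick brown fox jumps over the lazy dog."]               # toy web data

def avg_loss(model, tok, texts):
    """Average language-modeling loss of `model` over `texts`."""
    model.eval()
    with torch.no_grad():
        losses = []
        for text in texts:
            batch = tok(text, return_tensors="pt")
            losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

for name in CHECKPOINTS:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)  # fixed LR for every checkpoint
    model.train()
    for _ in range(3):  # a few toy passes over the task data
        for text in task_texts:
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    print(name,
          "| task loss:", round(avg_loss(model, tok, task_texts), 3),
          "| web ppl:", round(math.exp(avg_loss(model, tok, web_texts)), 1))
```

If catastrophic overtraining is at play, the checkpoints trained on more tokens end up with worse task loss and higher web perplexity after the same fine-tuning recipe.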
Post image

👉 Early in training: models have low sensitivity & the base model improves quickly; post-training performance improves 📈
👉 Late in training: models become highly sensitive & the base model improves slowly; post-training performance degrades! 📉

7/10

26.03.2025 18:35 · 👍 1  🔁 0  💬 1  📌 0

What's happening? Beyond Gaussian perturbations, extended pre-training increases model sensitivity to all types of parameter updates 👇

6/10

26.03.2025 18:35 · 👍 1  🔁 0  💬 1  📌 0
Post image

🔹 Early checkpoints: Robust to parameter changes.
🔸 Later checkpoints: Highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)

5/10

26.03.2025 18:35 · 👍 1  🔁 1  💬 1  📌 0

Let's step back and consider a simpler setting: we train our own 30M-parameter models and test how Gaussian noise added to the model parameters affects performance at different pre-training stages 👇

4/10

26.03.2025 18:35 · 👍 0  🔁 0  💬 1  📌 0
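A hedged sketch of the perturbation probe from post 4/10: add i.i.d. Gaussian noise to every parameter of a checkpoint and measure how much the language-modeling loss degrades. The model name and probe texts below are stand-ins; the paper trains its own 30M-parameter models rather than using an off-the-shelf checkpoint.

```python
# Sketch: probe a checkpoint's sensitivity by adding Gaussian noise to all
# parameters and measuring the change in language-modeling loss.
# "gpt2" and the probe texts are placeholders, not the paper's 30M models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perturbed_loss(model_name: str, texts: list, sigma: float) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    with torch.no_grad():
        # Add i.i.d. Gaussian noise with standard deviation sigma to every weight.
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))
        # Average language-modeling loss over the probe texts.
        losses = []
        for text in texts:
            batch = tok(text, return_tensors="pt")
            losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

texts = ["Language models are trained on large corpora of text."]
clean = perturbed_loss("gpt2", texts, sigma=0.0)
noisy = perturbed_loss("gpt2", texts, sigma=0.01)
print("loss increase under noise:", noisy - clean)
```

Running the same probe on checkpoints from different points in pre-training is the comparison the thread describes: the larger the loss increase at a fixed sigma, the more sensitive the checkpoint.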
Post image

Example: OLMo-1B trained on 3T tokens performs over 2% *worse* after instruction tuning than its 2.3T-token version, even though it saw 30% more pre-training data! We observe similar degradation across many other post-training setups.

Why does extended pre-training hurt fine-tuning performance? 🤔

3/10

26.03.2025 18:35 · 👍 0  🔁 0  💬 1  📌 0

The latest language models are pre-trained on more and more tokens while holding the number of model parameters fixed, and this trend isn't slowing down!
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let's check!

2/10

26.03.2025 18:35 · 👍 0  🔁 0  💬 1  📌 0
Post image

Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease performance after post-training!
Introducing "catastrophic overtraining." 🥁🧵👇

arxiv.org/abs/2503.19206

1/10

26.03.2025 18:35 · 👍 34  🔁 14  💬 1  📌 1
