
Nicolas Zucchet

@nzucchet.bsky.social

PhD Student @ ETH Zurich · Previously: Student Researcher @ Google DeepMind, @École polytechnique · https://nicolaszucchet.github.io

32 Followers  |  66 Following  |  8 Posts  |  Joined: 03.04.2025

Latest posts by nzucchet.bsky.social on Bluesky

Haven’t tried this. My guess would be that the same fact needs to be repeated across multiple batches (i.e. training steps) for the effect to be visible.

04.04.2025 19:13 — 👍 1    🔁 0    💬 0    📌 0
How do language models learn facts? Dynamics, curricula and hallucinations Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language mode...

Thanks to my co-authors @jb.capsec.org,
@scychan.bsky.social, @lampinen.bsky.social, @razvan-pascanu.bsky.social, and @soham-de.bsky.social. I couldn't have dreamed of a better team for this collaboration!

Check out the full paper for all the technical details arxiv.org/abs/2503.21676.

03.04.2025 12:20 — 👍 7    🔁 0    💬 0    📌 0

Our work suggests practical LLM training strategies:
1. use synthetic data early, since data seen during the plateau phase isn't retained anyway
2. implement dynamic data schedulers that use low diversity during plateaus and high diversity afterward (similar to how we learn as infants!)
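A minimal sketch of what such a dynamic scheduler could look like — the phase boundary, subset size, and batch size here are illustrative placeholders, not values from the paper:

```python
import random

def sample_batch(corpus, step, plateau_end=10_000, low_k=100, batch_size=32):
    """Sample a training batch whose diversity depends on the training step.

    During the plateau (step < plateau_end) we draw from a small fixed
    subset of `low_k` examples; afterward we draw from the full corpus.
    All thresholds here are hypothetical, for illustration only.
    """
    pool = corpus[:low_k] if step < plateau_end else corpus
    return random.choices(pool, k=batch_size)

corpus = [f"bio of person {i}" for i in range(1000)]
early = sample_batch(corpus, step=500)     # low diversity: small fixed subset
late = sample_batch(corpus, step=20_000)   # high diversity: full corpus
```

In practice the switch point would have to be tied to a detector of the loss plateau rather than a fixed step count.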

03.04.2025 12:20 — 👍 5    🔁 0    💬 1    📌 0

Hallucinations emerge with knowledge. As models learn facts about seen individuals, they also make overconfident predictions about unseen ones.
On top of that, fine-tuning struggles to add new knowledge: existing memories are quickly corrupted when learning new ones.

03.04.2025 12:20 — 👍 3    🔁 0    💬 1    📌 0

The training data distribution has a massive impact on learning. Imbalanced distributions (some individuals appearing more frequently) shorten the plateau phase.
This suggests exciting new data scheduling strategies for training: we show that a simple warmup works well!
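One way to picture such a warmup — sketched here with a Zipf-like imbalance that anneals to uniform sampling; the exponent and warmup length are assumptions for illustration, not the paper's settings:

```python
import random

def sampling_weights(n_individuals, step, warmup_steps=5_000, alpha=1.0):
    """Per-individual sampling weights that start Zipf-like (imbalanced)
    and linearly anneal to uniform over `warmup_steps`.

    The Zipf exponent `alpha` and `warmup_steps` are illustrative
    placeholders, not values taken from the paper.
    """
    t = min(step / warmup_steps, 1.0)  # 0 -> fully imbalanced, 1 -> uniform
    zipf = [1.0 / (rank + 1) ** alpha for rank in range(n_individuals)]
    return [(1 - t) * z + t * 1.0 for z in zipf]

def sample_individual(n_individuals, step, warmup_steps=5_000):
    """Draw one individual index under the current (annealed) distribution."""
    w = sampling_weights(n_individuals, step, warmup_steps)
    return random.choices(range(n_individuals), weights=w, k=1)[0]
```

Early in training a few individuals dominate the batches; by the end of the warmup every individual is equally likely.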

03.04.2025 12:20 — 👍 4    🔁 3    💬 2    📌 0

During that plateau, something crucial happens: the model builds the attention-based circuits that enable recall.
This is when the model learns *how* to recall facts; only afterward does it start memorizing specific ones!

03.04.2025 12:20 — 👍 3    🔁 0    💬 1    📌 0

We studied how models learn on a synthetic biography task and found three key phases in knowledge acquisition:
1. Models initially learn generic statistics
2. Performance plateaus while attention-based circuits form
3. Knowledge emerges as models learn individual-specific facts
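A toy version of such a synthetic biography task can be sketched as below — the template and attribute pools are hypothetical stand-ins, not the paper's actual dataset:

```python
import random

# Illustrative attribute pools; the paper's real templates and attributes may differ.
CITIES = ["Zurich", "Paris", "London", "Tokyo"]
JOBS = ["biologist", "architect", "pianist", "engineer"]

def make_biography(person_id, rng):
    """Generate one synthetic biography: individual-specific facts embedded
    in a generic template. Learning the template corresponds to phase 1
    (generic statistics); memorizing the name-to-fact mapping to phase 3.
    """
    name = f"Person{person_id}"
    year = rng.randint(1950, 2000)
    city = rng.choice(CITIES)
    job = rng.choice(JOBS)
    return f"{name} was born in {year}. {name} lives in {city} and works as a {job}."

rng = random.Random(0)
corpus = [make_biography(i, rng) for i in range(100)]
```

Recall can then be measured by prompting with the template up to a fact slot (e.g. "Person7 lives in") and checking whether the model produces that individual's attribute.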

03.04.2025 12:20 — 👍 5    🔁 2    💬 1    📌 0

Large language models store vast amounts of knowledge, but how exactly do they learn it?

Excited to share my Google DeepMind internship results, which reveal the fascinating dynamics behind factual knowledge acquisition in LLMs!

03.04.2025 12:20 — 👍 27    🔁 3    💬 1    📌 2
