Reviewer #2 striking again?
21.02.2026 13:15 · @damienteney.bsky.social
Research Scientist @ Idiap Research Institute. @idiap.bsky.social Adjunct lecturer @ Australian Institute for ML. @aimlofficial.bsky.social Occasionally cycling across continents. https://www.damienteney.info
This all looks very promising, and there's a lot more to explore! Paper and code ⬇️
Procedural Pretraining: Warming Up Language Models with Abstract Data
www.arxiv.org/abs/2601.21725
github.com/zlshinnick/p...
🧩 Multiple types of procedural data can be combined.
We get further gains by mixing either
• multiple types of data, or
• weights of models individually warmed up on different types of data.
⚙️ MLPs vs. attention: where is the information located?
We try resetting selected weights to random values before standard pretraining. Surprisingly, this yields further gains, but they are domain-specific:
• warmed-up MLPs benefit natural language
• warmed-up attention helps code/math
Benefits on subsequent standard pretraining.
By front-loading as little as 0.1% procedural data, models achieve significantly better pretraining performance on language, code, and math. They use up to 45% less semantic data to reach a baseline perplexity.
Different procedural data = different benefits.
We first measure the effects of different types of procedural data on algorithmic diagnostic tasks. The benefits range from long-context recall to arithmetic, depending on the type of procedural data.
💡 Humans learn better when starting with simple structure and logic rather than memorizing a massive set of facts. By analogy, we use abstract, structured data to build a scaffold in language models, free of semantic biases.
20.02.2026 12:39
🔥 What if web text isn't the best place to start training LLMs? Our latest work shows that warming up models on procedural data (e.g. from formal languages & simple algorithms) speeds up subsequent pretraining on language, code, and math, on models up to 1.3B parameters ⬇️🧵
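For concreteness, here is a minimal, hypothetical sketch of the kind of procedural generator meant here: a balanced-bracket (Dyck) sequence sampler. The function name and parameters are illustrative, not taken from the paper's code.

```python
import random

def dyck_sequence(length, vocab=("(", ")"), seed=0):
    """Sample a balanced-bracket (Dyck) string of the given even length.

    Illustrative sketch of procedural data: hierarchical structure,
    zero semantic content. Not the paper's actual generator.
    """
    assert length % 2 == 0, "a balanced sequence needs an even length"
    rng = random.Random(seed)
    open_sym, close_sym = vocab
    seq, depth = [], 0
    for i in range(length):
        remaining = length - i
        if depth == 0:                 # nothing to close: must open
            seq.append(open_sym); depth += 1
        elif depth == remaining:       # just enough slots left: must close
            seq.append(close_sym); depth -= 1
        elif rng.random() < 0.5:       # otherwise, open or close at random
            seq.append(open_sym); depth += 1
        else:
            seq.append(close_sym); depth -= 1
    return "".join(seq)
```

Strings like `(()(()))` carry nested structure but no meaning, which is exactly the property the warm-up exploits.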
Indeed the effect in *late* layers was very surprising!
My optimistic interpretation is that procedural pretraining creates circuits for computations general enough to serve as a useful scaffold for visual tasks. This would explain both why they help and why they don't wash out with more training.
Sounds 👍 What's the objective function? Simplicity / low cost / ...?
10.12.2025 20:22
In summary, a lightweight generic warm-up improves accuracy and data efficiency, with effects distinct from ImageNet pretraining.
Lots of exciting open questions!
- Other types of procedural data?
- Other downstream tasks?
- Closed-form instantiation?
arxiv.org/abs/2511.13945
Where is this knowledge stored?
Ablations show that the knowledge is mostly located in *late* layers: the opposite of standard visual pretraining, which shapes early layers. Procedural data seems to provide a qualitatively unique training signal!
🧠 What kind of data works?
Formal languages with hierarchical structure seem best. If we shuffle the training tokens (eliminating nested structures), the gains disappear, showing that the benefits are not due to surface-level frequencies.
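A minimal sketch of this shuffling control, assuming the training stream is a list of tokens (the function name and unigram check are illustrative, not from the paper):

```python
import random
from collections import Counter

def shuffle_tokens(tokens, seed=0):
    """Ablation control: permute the token order, destroying nested
    structure while preserving unigram (surface-level) frequencies."""
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)
    return out

tokens = list("(()(()))")
shuffled = shuffle_tokens(tokens)
assert Counter(shuffled) == Counter(tokens)  # same surface statistics
```

If a model warmed up on the shuffled stream gains nothing, the benefit must come from the nesting, not from token frequencies.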
Procedural data can replace real images
Allocating just 1% of the ImageNet pretraining budget to the procedural warmup lets the ViT match the baseline accuracy with 28% fewer images!
A different optimization trajectory
Our warmed-up models don't just get a head start: they train differently. On ImageNet (below), they follow a distinct training trajectory and converge to better accuracy.
🔥 Our procedural data has no semantic or visual meaning: it simply forces the model to discover generic structure in the data. As initialisation for standard image-based training, it
- boosts accuracy,
- improves data efficiency,
- complements ImageNet pretraining.
💡 Prior work has already shown that LLMs acquire useful knowledge when pretrained on formal languages. To test this on ViTs, we devise a procedural warm-up: pretraining for next-token prediction on symbolic sequences, bypassing the visual patch embedding.
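As a toy illustration of that objective (hypothetical names, not the paper's code): because the symbolic data is already a token stream, building a training example reduces to a vocabulary lookup plus the usual one-step shift of next-token prediction, with no patch embedding in the loop.

```python
def make_lm_example(symbols, vocab):
    """Encode a symbolic sequence and form the shifted (input, target)
    pair used for next-token prediction. The sequence enters the model
    through a token-embedding table, so the ViT's visual patch
    embedding is bypassed entirely."""
    ids = [vocab[s] for s in symbols]
    return ids[:-1], ids[1:]

vocab = {"(": 0, ")": 1}
inputs, targets = make_lm_example("(())", vocab)
# inputs = [0, 0, 1], targets = [0, 1, 1]
```

At each position the model is trained to predict the next symbol, which for bracket sequences forces it to track nesting depth.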
10.12.2025 13:22
Can vision transformers learn without images? 🤔
Our latest work shows that pretraining ViTs on procedural symbolic data (e.g. sequences of balanced parentheses) makes subsequent standard training (e.g. on ImageNet) more data-efficient! How is this possible?! ⬇️🧵
Academic Strava? 🤔 It feels like an underrepresented group in my Strava feed!
26.07.2025 10:10
It'd be nice to provide complete analyses (that you have precomputed) of existing papers, so we can see what kind of output the tool provides, without having to submit any of my own work.
24.07.2025 08:24
Dang, it just never ends 😱
22.07.2025 18:08
In this setting, does the student (sometimes?) get better than the teacher? One hypothesis could be that the teacher, even if "less correct" than the ground truth, provides supervision that's easier for another NN (the student) to learn. The optimization then follows a less tortuous path and finds a better solution.
16.07.2025 15:23
I had my very first paper published at DAGM. It was a while ago, but I remember it as a very welcoming conference.
07.07.2025 19:56
🎯 There's already a plethora of methods for handling distribution shifts: most of the remaining gains may now come simply from using them better! Automatic selection looks promising, yet there's lots more to do. Interested? Come chat with us at ICML!
📄 arxiv.org/abs/2410.02735
💻 github.com/LiangzeJiang...