
Damien Teney

@damienteney.bsky.social

Research Scientist @ Idiap Research Institute. @idiap.bsky.social Adjunct lecturer @ Australian Institute for ML. @aimlofficial.bsky.social Occasionally cycling across continents. https://www.damienteney.info

507 Followers  |  289 Following  |  166 Posts  |  Joined: 17.11.2024

Posts by Damien Teney (@damienteney.bsky.social)

Reviewer #2 striking again?

21.02.2026 13:15 — 👍 2    🔁 0    💬 0    📌 0
Preview
Procedural Pretraining: Warming Up Language Models with Abstract Data Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a ...

This all looks very promising, and there's a lot more to explore! Paper and code ⬇️
Procedural Pretraining: Warming Up Language Models with Abstract Data
www.arxiv.org/abs/2601.21725
github.com/zlshinnick/p...

20.02.2026 12:39 — 👍 1    🔁 0    💬 0    📌 0

🧩 Multiple types of procedural data can be combined.
We get further gains by mixing either
• multiple types of data, or
• weights of models individually warmed up on different types of data.

20.02.2026 12:39 — 👍 0    🔁 0    💬 1    📌 0
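(Not from the paper: an illustrative sketch.) Merging individually warmed-up models can be as simple as uniformly averaging their parameters; here plain Python lists stand in for weight tensors, and the parameter names are hypothetical:

```python
def average_state_dicts(state_dicts):
    """Uniformly average parameters (here: plain lists of floats)
    from models warmed up on different procedural datasets."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    assert all(sd.keys() == keys for sd in state_dicts), "mismatched params"
    n = len(state_dicts)
    return {
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }

# Two hypothetical models warmed up on different data types:
m_a = {"mlp.w": [1.0, 2.0], "attn.w": [0.0, 4.0]}
m_b = {"mlp.w": [3.0, 2.0], "attn.w": [2.0, 0.0]}
merged = average_state_dicts([m_a, m_b])
print(merged)  # {'mlp.w': [2.0, 2.0], 'attn.w': [1.0, 2.0]}
```

With real checkpoints the same element-wise average would be applied to each tensor in the state dicts.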
Post image

⚙️ MLPs vs. attention: where is the information located?
We try resetting selected weights to random values before standard pretraining. Surprisingly, we obtain further gains, but they're domain-specific:
• warmed-up MLPs benefit natural language
• warmed-up attention helps code/math

20.02.2026 12:39 — 👍 0    🔁 0    💬 1    📌 0
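A rough sketch of this reset ablation, under the assumption that it amounts to re-drawing selected parameter groups from the initialization distribution; the parameter names and the 0.02 init scale are illustrative, not taken from the paper:

```python
import random

def reset_matching(params, substrings, init_std=0.02, seed=0):
    """Return a copy of `params` where weights whose name contains any of
    `substrings` (e.g. "mlp" or "attn") are re-drawn from the random init,
    keeping the rest of the warmed-up weights intact."""
    rng = random.Random(seed)
    out = {}
    for name, w in params.items():
        if any(s in name for s in substrings):
            out[name] = [rng.gauss(0.0, init_std) for _ in w]  # fresh init
        else:
            out[name] = list(w)  # keep warmed-up values
    return out

warmed = {"blocks.0.attn.qkv": [0.5, -0.3], "blocks.0.mlp.fc1": [0.1, 0.2]}
# Keep only the warmed-up MLPs (reset attention), as in one ablation arm:
only_mlp = reset_matching(warmed, ["attn"])
print(only_mlp["blocks.0.mlp.fc1"])  # unchanged: [0.1, 0.2]
```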
Post image

📈 Benefits for subsequent standard pretraining.
By front-loading as little as 0.1% procedural data, models achieve significantly better pretraining performance on language, code, and math. They use up to 45% less semantic data to reach a baseline perplexity.

20.02.2026 12:39 — 👍 0    🔁 0    💬 1    📌 0

🔍 Different procedural data = different benefits.
We first determine the effect of different types of procedural data with algorithmic diagnostic tasks. The benefits range from long-context recall to arithmetic, depending on the type of procedural data.

20.02.2026 12:39 — 👍 0    🔁 0    💬 1    📌 0

💡 Humans learn better when starting with simple structure and logic rather than memorizing a massive set of facts. By analogy, we use abstract, structured data to build a scaffold in language models, free of semantic biases.

20.02.2026 12:39 — 👍 0    🔁 0    💬 1    📌 0
Post image

🔥 What if web text isn't the best place to start training LLMs? Our latest work shows that warming up models on procedural data (e.g. from formal languages & simple algorithms) speeds up subsequent pretraining on language, code, and math, on models up to 1.3B parameters ⬇️🧵

20.02.2026 12:39 — 👍 49    🔁 3    💬 1    📌 0
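The thread doesn't spell out the data generators, but procedural data of the "formal languages" kind can be produced in a few lines; a minimal sketch (my own, not the paper's code) for random balanced-bracket (Dyck) strings:

```python
import random

def dyck_sequence(length, seed=None):
    """Generate a random balanced-bracket (Dyck) string of the given
    even length: purely structural data with no semantic content."""
    rng = random.Random(seed)
    opens = {"(": ")", "[": "]"}
    out, stack = [], []
    while len(out) < length:
        remaining = length - len(out)
        # Forced to close once the open brackets use up the whole budget;
        # otherwise close an open bracket with probability 0.5.
        if stack and (len(stack) == remaining or rng.random() < 0.5):
            out.append(stack.pop())
        else:
            o = rng.choice(list(opens))
            out.append(o)
            stack.append(opens[o])
    return "".join(out)

print(dyck_sequence(12, seed=0))
```

Because openings and closings each advance the output by one token, the stack size always matches the remaining budget in parity, so the forced-close rule guarantees a balanced string.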

Indeed, the effect in *late* layers was very surprising!
My optimistic interpretation is that procedural pretraining creates circuits for computations general enough to serve as a useful scaffold for visual tasks. This would explain why they help and why they don't wash out with more training.

10.12.2025 20:29 — 👍 2    🔁 0    💬 0    📌 0

Sounds 😋 What's the objective function? Simplicity? Low cost?

10.12.2025 20:22 — 👍 0    🔁 0    💬 1    📌 0
Preview
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in ...

In summary, a lightweight generic warm-up improves accuracy and data efficiency, with effects distinct from ImageNet pretraining.
Lots of exciting open questions! 🔍
- Other types of procedural data?
- Other downstream tasks?
- Closed-form instantiation?
arxiv.org/abs/2511.13945

10.12.2025 13:22 — 👍 3    🔁 0    💬 0    📌 0
Post image

๐Ÿ”๐–๐ก๐ž๐ซ๐ž ๐ข๐ฌ ๐ญ๐ก๐ข๐ฌ ๐ค๐ง๐จ๐ฐ๐ฅ๐ž๐๐ ๐ž ๐ฌ๐ญ๐จ๐ซ๐ž๐?
Ablations show that the knowledge mostly locates in *late* layers: the opposite of normal visual pretraining which shapes early layers. Procedural data seems to provide a qualitatively unique training signal!

10.12.2025 13:22 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

🧠 What kind of data works?
Formal languages with hierarchical structure seem best. If we shuffle the training tokens (eliminating nested structures), the gains disappear, showing that the benefits are not due to surface-level token frequencies.

10.12.2025 13:22 — 👍 1    🔁 0    💬 1    📌 0
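The shuffle control can be sketched in a few lines: permuting the tokens preserves unigram frequencies exactly while destroying the nesting (the function name is illustrative, not from the paper):

```python
import random

def shuffle_tokens(sequence, seed=0):
    """Control condition: permute the tokens of a procedural sequence.
    Token counts are preserved exactly, but the nested
    (hierarchical) structure is destroyed."""
    tokens = list(sequence)
    random.Random(seed).shuffle(tokens)
    return "".join(tokens)

original = "([()[]])"          # nested, balanced
shuffled = shuffle_tokens(original)
assert sorted(shuffled) == sorted(original)  # same token frequencies
print(shuffled)
```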
Post image

📉 Procedural data can replace real images
Allocating just 1% of the ImageNet pretraining budget to the procedural warm-up lets the ViT match the baseline accuracy with 28% fewer images!

10.12.2025 13:22 — 👍 2    🔁 0    💬 1    📌 0
Post image

📈 A different optimization trajectory
Our warmed-up models don't just get a head start: they train differently. On ImageNet (below), they follow a distinct training trajectory and converge to better accuracy.

10.12.2025 13:22 — 👍 1    🔁 0    💬 1    📌 0

🔥 Our procedural data has no semantic or visual meaning: it simply forces the model to discover generic structure in the data. As initialisation for standard image-based training, it
- boosts accuracy,
- improves data efficiency,
- complements ImageNet pretraining.

10.12.2025 13:22 — 👍 1    🔁 0    💬 1    📌 0
Post image

💡 Prior work has already shown that LLMs acquire useful knowledge when pretrained on formal languages. To test this on ViTs, we devise a procedural warm-up: pretraining for next-token prediction on symbolic sequences, bypassing the visual patch embedding.

10.12.2025 13:22 — 👍 1    🔁 0    💬 1    📌 0
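As a minimal illustration of the warm-up objective (not the paper's code), next-token prediction on a symbolic sequence just pairs each prefix with the token that follows it:

```python
def next_token_pairs(sequence):
    """Build (context, target) training pairs for next-token
    prediction on a symbolic sequence."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

pairs = next_token_pairs("([])")
print(pairs)  # [('(', '['), ('([', ']'), ('([]', ')')]
```

During the warm-up, these symbolic tokens would be fed through a token-embedding lookup straight into the transformer blocks, skipping the patch-embedding layer used for images.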
Post image

Can vision transformers learn without images? 🤔👀
Our latest work shows that pretraining ViTs on procedural symbolic data (e.g. sequences of balanced parentheses) makes subsequent standard training (e.g. on ImageNet) more data-efficient! How is this possible?! ⬇️🧵

10.12.2025 11:05 — 👍 47    🔁 6    💬 3    📌 1

Academic Strava? 🤓 It feels like an underrepresented group in my Strava feed!

26.07.2025 10:10 — 👍 1    🔁 0    💬 1    📌 0

It'd be nice to provide complete analyses (that you have precomputed) of existing papers, so we can see what kind of output the tool provides, without having to submit any of my own work.

24.07.2025 08:24 — 👍 23    🔁 0    💬 1    📌 0

Dang it just never ends 😱

22.07.2025 18:08 — 👍 0    🔁 0    💬 0    📌 0

In this setting, does the student (sometimes?) get better than the teacher? One hypothesis could be that the teacher, even if "less correct" than the GT, provides supervision that's easier for another NN (the student) to learn. The optimization follows a less tortuous path and finds a better solution.

16.07.2025 15:23 — 👍 3    🔁 0    💬 1    📌 0
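The hypothesized mechanism (teacher soft targets as easier supervision than hard labels) corresponds to the standard distillation loss; a stdlib-only sketch, with an illustrative temperature and made-up numbers:

```python
import math

def soft_ce(teacher_probs, student_logits, T=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution: the 'easier to learn' supervision in the hypothesis."""
    z = [s / T for s in student_logits]
    m = max(z)  # stabilized log-sum-exp
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    student_logp = [v - log_sum for v in z]
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))

# Teacher is confident but not one-hot (unlike hard GT labels):
loss = soft_ce([0.7, 0.2, 0.1], [2.0, 0.5, -1.0])
print(round(loss, 3))
```

Because the teacher spreads mass over several classes, the gradient signal is smoother than with one-hot targets, which is one way the "less tortuous path" intuition could play out.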

๐Ÿ‘ I had my very first paper published at DAGM. It was a while ago but I remember it as a very welcoming conference.

07.07.2025 19:56 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
Preview
OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable? Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorith...

🎯 There's already a plethora of methods for handling distribution shifts: most gains may now simply come from better using them! Automatic selection looks promising, yet there's lots more to do. Interested? Come chat with us at ICML!
📄 arxiv.org/abs/2410.02735
💻 github.com/LiangzeJiang...

07.07.2025 16:50 — 👍 0    🔁 0    💬 0    📌 0