Reviewer #2 striking again?
21.02.2026 13:15 · @damienteney.bsky.social
Research Scientist @ Idiap Research Institute. @idiap.bsky.social Adjunct lecturer @ Australian Institute for ML. @aimlofficial.bsky.social Occasionally cycling across continents. https://www.damienteney.info
This all looks very promising, and there's a lot more to explore! Paper and code ⬇️
Procedural Pretraining: Warming Up Language Models with Abstract Data
www.arxiv.org/abs/2601.21725
github.com/zlshinnick/p...
🧩 Multiple types of procedural data can be combined.
We get further gains by mixing either
• multiple types of data, or
• weights of models individually warmed up on different types of data.
⚙️ MLPs vs. attention: where is the information located?
We try resetting selected weights to random values before standard pretraining. Surprisingly, this yields further gains, but they are domain-specific:
• warmed-up MLPs benefit natural language
• warmed-up attention helps code/math
Benefits on subsequent standard pretraining.
By front-loading as little as 0.1% procedural data, models achieve significantly better pretraining performance on language, code, and math. They use up to 45% less semantic data to reach a baseline perplexity.
Different procedural data = different benefits.
We first measure the effects of different types of procedural data on algorithmic diagnostic tasks. The benefits range from long-context recall to arithmetic, depending on the type of procedural data.
💡 Humans learn better when starting with simple structure and logic rather than memorizing a massive set of facts. By analogy, we use abstract, structured data to build a scaffold in language models, free of semantic biases.
20.02.2026 12:39
🔥 What if web text isn't the best place to start training LLMs? Our latest work shows that warming up models on procedural data (e.g. from formal languages & simple algorithms) speeds up subsequent pretraining on language, code, and math, on models up to 1.3B parameters ⬇️🧵
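For concreteness, here is a minimal, hypothetical sketch of the kind of procedural generator meant here: a balanced-bracket (Dyck) sequence sampler. The function name and parameters are illustrative, not taken from the paper's code.

```python
import random

def dyck_sequence(length, vocab=("(", ")"), seed=0):
    """Sample a balanced-bracket (Dyck) string of the given even length.

    Illustrative sketch of procedural data: hierarchical structure,
    zero semantic content. Not the paper's actual generator.
    """
    assert length % 2 == 0, "a balanced sequence needs an even length"
    rng = random.Random(seed)
    open_sym, close_sym = vocab
    seq, depth = [], 0
    for i in range(length):
        remaining = length - i
        if depth == 0:                 # nothing to close: must open
            seq.append(open_sym); depth += 1
        elif depth == remaining:       # just enough slots left: must close
            seq.append(close_sym); depth -= 1
        elif rng.random() < 0.5:       # otherwise, open or close at random
            seq.append(open_sym); depth += 1
        else:
            seq.append(close_sym); depth -= 1
    return "".join(seq)
```

Strings like `(()(()))` carry nested structure but no meaning, which is exactly the property the warm-up exploits.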
Indeed the effect in *late* layers was very surprising!
My optimistic interpretation is that procedural pretraining creates circuits for computations general enough to serve as a useful scaffold for visual tasks. This would explain both why they help and why they don't wash out with more training.
Sounds 👍 What's the objective function? Simplicity / low cost / ...?
10.12.2025 20:22
In summary, a lightweight generic warm-up improves accuracy and data efficiency, with effects distinct from ImageNet pretraining.
Lots of exciting open questions!
- Other types of procedural data?
- Other downstream tasks?
- Closed-form instantiation?
arxiv.org/abs/2511.13945
Where is this knowledge stored?
Ablations show that the knowledge is mostly located in *late* layers: the opposite of standard visual pretraining, which shapes early layers. Procedural data seems to provide a qualitatively unique training signal!
🧠 What kind of data works?
Formal languages with hierarchical structure seem best. If we shuffle the training tokens (eliminating nested structures), the gains disappear, showing that the benefits are not due to surface-level frequencies.
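A minimal sketch of this shuffling control, assuming the training stream is a list of tokens (the function name and unigram check are illustrative, not from the paper):

```python
import random
from collections import Counter

def shuffle_tokens(tokens, seed=0):
    """Ablation control: permute the token order, destroying nested
    structure while preserving unigram (surface-level) frequencies."""
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)
    return out

tokens = list("(()(()))")
shuffled = shuffle_tokens(tokens)
assert Counter(shuffled) == Counter(tokens)  # same surface statistics
```

If a model warmed up on the shuffled stream gains nothing, the benefit must come from the nesting, not from token frequencies.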
Procedural data can replace real images
Allocating just 1% of the ImageNet pretraining budget to the procedural warmup lets the ViT match the baseline accuracy with 28% fewer images!
A different optimization trajectory
Our warmed-up models don't just get a head start: they train differently. On ImageNet (below), they follow a distinct training trajectory and converge to better accuracy.
🔥 Our procedural data has no semantic or visual meaning: it simply forces the model to discover generic structure in the data. As initialisation for standard image-based training, it
- boosts accuracy,
- improves data efficiency,
- complements ImageNet pretraining.
💡 Prior work has already shown that LLMs acquire useful knowledge when pretrained on formal languages. To test this on ViTs, we devise a procedural warm-up: pretraining for next-token prediction on symbolic sequences, bypassing the visual patch embedding.
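As a toy illustration of that objective (hypothetical names, not the paper's code): because the symbolic data is already a token stream, building a training example reduces to a vocabulary lookup plus the usual one-step shift of next-token prediction, with no patch embedding in the loop.

```python
def make_lm_example(symbols, vocab):
    """Encode a symbolic sequence and form the shifted (input, target)
    pair used for next-token prediction. The sequence enters the model
    through a token-embedding table, so the ViT's visual patch
    embedding is bypassed entirely."""
    ids = [vocab[s] for s in symbols]
    return ids[:-1], ids[1:]

vocab = {"(": 0, ")": 1}
inputs, targets = make_lm_example("(())", vocab)
# inputs = [0, 0, 1], targets = [0, 1, 1]
```

At each position the model is trained to predict the next symbol, which for bracket sequences forces it to track nesting depth.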
10.12.2025 13:22
Can vision transformers learn without images? 🤔
Our latest work shows that pretraining ViTs on procedural symbolic data (e.g. sequences of balanced parentheses) makes subsequent standard training (e.g. on ImageNet) more data-efficient! How is this possible?! ⬇️🧵
Academic Strava? 🤔 It feels like an underrepresented group in my Strava feed!
26.07.2025 10:10
It'd be nice to provide complete analyses (that you have precomputed) of existing papers, so we can see what kind of output the tool provides, without having to submit any of my own work.
24.07.2025 08:24
Dang, it just never ends 😱
22.07.2025 18:08
In this setting, does the student (sometimes?) get better than the teacher? One hypothesis could be that the teacher, even if "less correct" than the ground truth, provides supervision that's easier for another NN (the student) to learn. The optimization then follows a less tortuous path and finds a better solution.
16.07.2025 15:23
I had my very first paper published at DAGM. It was a while ago, but I remember it as a very welcoming conference.
07.07.2025 19:56
🎯 There's already a plethora of methods for handling distribution shifts: most of the remaining gains may now come simply from using them better! Automatic selection looks promising, yet there's lots more to do. Interested? Come chat with us at ICML!
📄 arxiv.org/abs/2410.02735
💻 github.com/LiangzeJiang...