10/n
Joint work with @spyrosgidaris.bsky.social and Nikos Komodakis.
This research was conducted during my first year as a PhD fellow of ARCHIMEDES Research on AI, Data Science, and Algorithms.
#MuToR #Transformers #GenerativeAI #MachineLearning #LLMs #LanguageModeling #DeepLearning
9/n Key Takeaways
MuToR offers a simple yet powerful approach to multi-token prediction:
✅ Scalable prediction horizons
✅ Effective across modalities
✅ Foundation for token-based lookahead mechanisms
8/n
We even test MuToR on the star-graph pathfinding task, where models trained with next-token prediction and teacher forcing fail due to shortcut learning. MuToR succeeds, showing its ability to learn long-range structure and plan ahead.
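For readers unfamiliar with the task, here is a rough Python sketch of what one star-graph instance could look like: a centre node with several disjoint arms, a shuffled edge list as the prompt, and the centre-to-goal path as the target. The graph sizes and prompt format below are hypothetical, not the exact setup used in the paper.

```python
import random

def make_star_graph(num_arms=5, arm_len=4, num_nodes=50, rng=random):
    """Build one star-graph pathfinding example (sketch only: sizes and
    prompt format are assumptions, not the paper's exact configuration)."""
    nodes = rng.sample(range(num_nodes), 1 + num_arms * arm_len)
    centre, rest = nodes[0], nodes[1:]
    arms = [rest[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]

    # Each arm is a chain hanging off the centre node.
    edges = []
    for arm in arms:
        path = [centre] + arm
        edges += list(zip(path[:-1], path[1:]))
    rng.shuffle(edges)  # shuffled edge list, so order gives no hint

    goal_arm = rng.choice(arms)
    prompt = " ".join(f"{a}>{b}" for a, b in edges) + f" | {centre}->{goal_arm[-1]} :"
    target = " ".join(str(n) for n in [centre] + goal_arm)
    return prompt, target
```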
7/n 🖼️ MuToR for Images
Our 2D extension brings multi-token prediction to autoregressive image generation:
✅ Better samples: outperforms next-token prediction on both FID and IS
✅ Efficient: achieves comparable performance even with only a small number of register tokens
5/n
We also adapt MuToR to images by modifying the offset sampling to respect the 2D image structure. This 2D extension enriches the training signal by capturing spatial dependencies inherent in visual data, while requiring no architectural changes.
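As a rough illustration of what "2D offset sampling" could mean, here is a hedged Python sketch: for a register inserted after the patch token at grid position (row, col), pick a future patch a small 2D offset away in the token grid. The offset ranges, rejection sampling, and fallback are assumptions, not the paper's exact procedure.

```python
import random

def sample_2d_offset(row, col, grid_h, grid_w, max_dr=2, max_dc=2, rng=random):
    """For a register inserted after the patch token at (row, col), sample a
    target patch a small 2D offset away that lies inside the grid and strictly
    ahead in raster-scan generation order. (Sketch: the actual sampling
    distribution is not specified in the thread.)"""
    flat_src = row * grid_w + col
    for _ in range(10):  # rejection-sample a valid future position
        dr = rng.randint(0, max_dr)
        dc = rng.randint(-max_dc, max_dc)
        r, c = row + dr, col + dc
        flat_tgt = r * grid_w + c
        if 0 <= r < grid_h and 0 <= c < grid_w and flat_tgt > flat_src:
            return flat_tgt  # flat index of the future patch token to predict
    return None  # give up: no register target at this position
```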
4/n Why MuToR? (2/2)
✅ Negligible parameters (just a single learnable embedding for registers; see the sketch below)
✅ Scalable prediction horizons (training cost remains fixed regardless of prediction span)
✅ Richer training signal
✅ Identical inference speed
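To make the "single learnable embedding" point concrete, here is a minimal sketch using Hugging Face transformers: one new special token adds exactly one row to the embedding matrix, and nothing else in the architecture changes. The model name and the <reg> token string are placeholders, not the paper's choices.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: any off-the-shelf causal LM works, since no
# architectural changes are needed. "gpt2" is just an example.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# One new special token -> one new row in the embedding matrix.
# This is the only parameter added for the register mechanism.
tokenizer.add_special_tokens({"additional_special_tokens": ["<reg>"]})
model.resize_token_embeddings(len(tokenizer))

register_id = tokenizer.convert_tokens_to_ids("<reg>")
```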
3/n Why MuToR? (1/2)
✅ No architecture changes (unlike prior multi-token setups that add extra transformer blocks)
✅ Fully compatible with off-the-shelf pretrained LLMs
✅ Ideal for supervised finetuning (aligns multi-token training with the next-token pretraining setup)
Multi-Token Prediction Needs Registers
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In...
2/n Meet MuToR: multi-token prediction, simplified
🔹 Training: registers (interleaved with regular tokens) predict future tokens several steps ahead for a richer learning signal (minimal sketch below)
🔹 Inference: registers are discarded, so decoding is pure next-token prediction
Paper: arxiv.org/abs/2505.10518
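A minimal, hedged sketch of the training-time interleaving described above: register tokens are inserted between regular tokens and paired with targets a few steps ahead, while regular tokens keep their usual next-token targets. The register id, insertion probability, and offset range below are assumptions, and attention-masking details are omitted.

```python
import random

REGISTER_ID = 50257  # hypothetical: one extra id appended to the base vocabulary

def interleave_registers(tokens, max_offset=4, p_insert=0.25, rng=random):
    """Build a training sequence by interleaving register tokens with the
    regular tokens. Each register is paired with a target a few positions
    ahead; regular tokens keep the standard next-token target.
    (Sketch only: the sampling scheme is an assumption.)"""
    inputs, targets = [], []
    for i, tok in enumerate(tokens[:-1]):
        inputs.append(tok)
        targets.append(tokens[i + 1])          # standard next-token target
        if rng.random() < p_insert:
            offset = rng.randint(2, max_offset)  # predict several steps ahead
            if i + offset < len(tokens):
                inputs.append(REGISTER_ID)
                targets.append(tokens[i + offset])
    return inputs, targets

# At inference, registers are never generated: the sequence contains only
# regular tokens, so decoding is plain next-token prediction.
```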
1/n Multi-token prediction training boosts LLMs (DeepSeek-V3), tackling key limitations of the next-token prediction objective:
• Short-term focus
• Struggles with long-range decisions
• Weaker supervision
Prior methods add complexity (extra layers).
Our fix? Register tokens: elegant and powerful.