
Anastasios Gerontopoulos

@nasosger.bsky.social

2 Followers  |  1 Following  |  10 Posts  |  Joined: 18.05.2025

Latest posts by nasosger.bsky.social on Bluesky

10/n
Joint work with @spyrosgidaris.bsky.social and Nikos Komodakis.

This research was conducted during my first year as a PhD fellow of ARCHIMEDES Research on AI, Data Science, and Algorithms.

#MuToR #Transformers #GenerativeAI #MachineLearning #LLMs #LanguageModeling #DeepLearning

20.05.2025 12:48 — 👍 0    🔁 0    💬 1    📌 0

9/n Key Takeaways
MuToR offers a simple yet powerful approach to multi-token prediction:

✅ Scalable prediction horizons
✅ Effective across modalities
✅ Foundation for token-based lookahead mechanisms

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0

8/n
We even test MuToR on the star-graph pathfinding task, where models trained with next-token prediction and teacher forcing fail due to shortcut learning. MuToR succeeds, showing it can learn long-range structure and plan ahead.
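For readers unfamiliar with the setup, here is a hedged toy generator for a star-graph instance: one center node with several disjoint arms, a shuffled edge list, and a start/goal query whose correct first step cannot be read off locally. The arm counts, lengths, and serialization below are illustrative assumptions, not the paper's exact task format.

```python
# Toy star-graph instance generator (illustrative; not the paper's code).
# The model sees a shuffled edge list plus a start/goal query and must emit
# the full path; picking the correct first arm is what teacher-forced
# next-token training tends to shortcut around.
import random

def make_star_graph(num_arms=4, arm_len=3):
    center, node = 0, 1
    arms = []
    for _ in range(num_arms):
        arm = [center]
        for _ in range(arm_len):
            arm.append(node)
            node += 1
        arms.append(arm)
    goal_arm = random.choice(arms)
    edges = [(a[i], a[i + 1]) for a in arms for i in range(len(a) - 1)]
    random.shuffle(edges)  # shuffled so edge order gives no hint
    prompt = " ".join(f"{u}>{v}" for u, v in edges) + f" | {center}->{goal_arm[-1]} :"
    answer = " ".join(map(str, goal_arm))
    return prompt, answer

prompt, answer = make_star_graph()
print(prompt)
print(answer)
```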

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0

7/n 🖼️ MuToR for Images
Our 2D extension brings multi-token prediction to autoregressive image generation:

✅ Better samples: Outperforms next-token prediction on both FID and IS
✅ Efficient: Achieves comparable performance even with only a small number of register tokens

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0

6/n
📈 Key Results:
MuToR surpasses both standard next-token prediction and prior multi-token work across diverse benchmarks.

✅ Math reasoning & summarization tasks
✅ Model-agnostic: Effective across model sizes
✅ LoRA-compatible: matches or exceeds full fine-tuning accuracy

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0

5/n
We also adapt MuToR to images by modifying the offset sampling to respect the 2D image structure. This 2D extension enriches the training signal by capturing spatial dependencies inherent in visual data, while requiring no architectural changes.
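One plausible reading of 2D-aware offset sampling, as a hedged Python sketch: for a register placed at a given patch, sample a target patch inside a small 2D window that still lies in the future under raster-scan autoregressive ordering. The grid size, window bounds, and names are assumptions, not the paper's values.

```python
# Hedged sketch of 2D-aware offset sampling (not the released code).
import random

H = W = 16  # hypothetical 16x16 grid of image tokens

def sample_2d_target(r, c, max_dr=2, max_dc=2):
    # assumes (r, c) is not the very last patch in raster order
    while True:
        dr = random.randint(0, max_dr)
        dc = random.randint(-max_dc, max_dc)
        rr, cc = r + dr, c + dc
        in_grid = 0 <= rr < H and 0 <= cc < W
        # future in raster order: a later row, or the same row further right
        is_future = rr > r or (rr == r and cc > c)
        if in_grid and is_future:
            return rr * W + cc  # flat index of the sampled target token

print(sample_2d_target(3, 5))
```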

20.05.2025 12:43 — 👍 1    🔁 0    💬 1    📌 0

4/n Why MuToR? (2/2)

✅ Negligible parameters (just a single learnable embedding for registers; sketched below)
✅ Scalable prediction horizons (training cost remains fixed regardless of prediction span)
✅ Richer training signal
✅ Identical inference speed
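To make the first point concrete, a minimal PyTorch sketch, not the released implementation: the only new parameter is one extra row in the embedding table. Sizes, init scale, and variable names are assumptions.

```python
# Hedged sketch: the only new parameter is one extra embedding row for the
# register token; everything else is the unmodified pretrained model.
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
pretrained = nn.Embedding(vocab_size, d_model)  # stands in for the LLM's table

extended = nn.Embedding(vocab_size + 1, d_model)
with torch.no_grad():
    extended.weight[:vocab_size] = pretrained.weight            # reuse old rows
    extended.weight[vocab_size] = torch.randn(d_model) * 0.02   # new register row

REG_ID = vocab_size  # id of the single register token
print(extended(torch.tensor([REG_ID])).shape)  # -> torch.Size([1, 768])
```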

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0

3/n Why MuToR? (1/2)

✅ No architecture changes (unlike prior multi-token setups with extra transformer blocks)
✅ Fully compatible with off-the-shelf pretrained LLMs
✅ Ideal for supervised finetuning (aligns multi-token training with the next-token pretraining setup; loss sketch below)
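To illustrate the alignment point: the fine-tuning objective can stay ordinary cross-entropy over the interleaved sequence, with register targets folded in via the usual ignore-index masking. A hedged sketch with toy shapes; the absence of any extra loss weighting is an assumption, not the paper's exact recipe.

```python
# Hedged sketch: one plain cross-entropy over an interleaved sequence keeps
# the objective in the same form as next-token pretraining.
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 100                    # toy batch / sequence / vocab sizes
logits = torch.randn(B, T, V)          # stand-in for model(inputs) output
targets = torch.randint(0, V, (B, T))  # mixed next-token and register targets
targets[:, -2:] = -100                 # positions without a valid target

loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                       ignore_index=-100)
print(loss.item())
```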

20.05.2025 12:43 — 👍 0    🔁 0    💬 1    📌 0
Preview: Multi-Token Prediction Needs Registers
"Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In..."

2/n Meet MuToR: Multi-token prediction, simplified

🔹 Training: registers (interleaved with regular tokens) predict future tokens several steps ahead for a richer learning signal (data-construction sketch below)
🔹 Inference: registers are discarded; generation is pure next-token prediction

Paper: arxiv.org/abs/2505.10518
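A minimal sketch of the training-time interleaving, under stated assumptions: a single register id, offsets drawn from a small range, and PyTorch's -100 ignore-index convention. The paper's positional-id handling and the attention masking that hides registers from regular tokens are omitted, so this shows data construction only.

```python
# Hedged sketch of MuToR-style data construction (not the authors' code).
# A register is interleaved after each regular token; the register's target
# is a token several steps ahead, while the regular token keeps its usual
# next-token target.
import random

REG_ID = 50257   # hypothetical id of the single register token
MAX_OFFSET = 4   # hypothetical maximum lookahead distance

def interleave_registers(tokens):
    inputs, targets = [], []
    for t, tok in enumerate(tokens):
        inputs.append(tok)  # regular position: standard next-token target
        targets.append(tokens[t + 1] if t + 1 < len(tokens) else -100)
        d = random.randint(2, MAX_OFFSET)  # sampled lookahead offset
        inputs.append(REG_ID)  # register position: predict d steps ahead
        targets.append(tokens[t + d] if t + d < len(tokens) else -100)
    return inputs, targets

inputs, targets = interleave_registers(list(range(10)))
print(list(zip(inputs, targets)))
```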

20.05.2025 12:43 — 👍 2    🔁 0    💬 1    📌 0

1/n Multi-token prediction training boosts LLMs (DeepSeek-V3), tackling key limitations of the next-token prediction objective:
β€’ Short-term focus
β€’ Struggles with long-range decisions
β€’ Weaker supervision

Prior methods add complexity (extra layers)
🔑 Our fix? Register tokens: elegant and powerful

20.05.2025 12:43 — 👍 2    🔁 0    💬 1    📌 0
