10/n
Joint work with @spyrosgidaris.bsky.social and Nikos Komodakis.
This research was conducted during my first year as a PhD fellow of ARCHIMEDES Research on AI, Data Science, and Algorithms.
#MuToR #Transformers #GenerativeAI #MachineLearning #LLMs #LanguageModeling #DeepLearning
9/n Key Takeaways
MuToR offers a simple yet powerful approach to multi-token prediction:
✅ Scalable prediction horizons
✅ Effective across modalities
✅ Foundation for token-based lookahead mechanisms
8/n
We even test MuToR on the star-graph pathfinding task, where models trained with next-token prediction and teacher forcing fail due to shortcut learning. MuToR succeeds, showing its ability to learn long-range structure and plan ahead.
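For readers unfamiliar with the task, here is a rough Python sketch of what one star-graph instance could look like: a centre node with several disjoint arms, a shuffled edge list as the prompt, and the centre-to-goal path as the target. The graph sizes and prompt format below are hypothetical, not the exact setup used in the paper.

```python
import random

def make_star_graph(num_arms=5, arm_len=4, num_nodes=50, rng=random):
    """Build one star-graph pathfinding example (sketch only: sizes and
    prompt format are assumptions, not the paper's exact configuration)."""
    nodes = rng.sample(range(num_nodes), 1 + num_arms * arm_len)
    centre, rest = nodes[0], nodes[1:]
    arms = [rest[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]

    # Each arm is a chain hanging off the centre node.
    edges = []
    for arm in arms:
        path = [centre] + arm
        edges += list(zip(path[:-1], path[1:]))
    rng.shuffle(edges)  # shuffled edge list, so order gives no hint

    goal_arm = rng.choice(arms)
    prompt = " ".join(f"{a}>{b}" for a, b in edges) + f" | {centre}->{goal_arm[-1]} :"
    target = " ".join(str(n) for n in [centre] + goal_arm)
    return prompt, target
```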
7/n 🖼️ MuToR for Images
Our 2D extension brings multi-token prediction to autoregressive image generation:
✅ Better samples: outperforms next-token prediction on both FID and IS
✅ Efficient: achieves comparable performance even with only a small number of register tokens
5/n
We also adapt MuToR to images by modifying the offset sampling to respect the 2D image structure. This 2D extension enriches the training signal by capturing spatial dependencies inherent in visual data, while requiring no architectural changes.
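As a rough illustration of what "2D offset sampling" could mean, here is a hedged Python sketch: for a register inserted after the patch token at grid position (row, col), pick a future patch a small 2D offset away in the token grid. The offset ranges, rejection sampling, and fallback are assumptions, not the paper's exact procedure.

```python
import random

def sample_2d_offset(row, col, grid_h, grid_w, max_dr=2, max_dc=2, rng=random):
    """For a register inserted after the patch token at (row, col), sample a
    target patch a small 2D offset away that lies inside the grid and strictly
    ahead in raster-scan generation order. (Sketch: the actual sampling
    distribution is not specified in the thread.)"""
    flat_src = row * grid_w + col
    for _ in range(10):  # rejection-sample a valid future position
        dr = rng.randint(0, max_dr)
        dc = rng.randint(-max_dc, max_dc)
        r, c = row + dr, col + dc
        flat_tgt = r * grid_w + c
        if 0 <= r < grid_h and 0 <= c < grid_w and flat_tgt > flat_src:
            return flat_tgt  # flat index of the future patch token to predict
    return None  # give up: no register target at this position
```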
4/n Why MuToR? (2/2)
✅ Negligible parameters (just a single learnable embedding for registers; see the sketch below)
✅ Scalable prediction horizons (training cost remains fixed regardless of prediction span)
✅ Richer training signal
✅ Identical inference speed
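To make the "single learnable embedding" point concrete, here is a minimal sketch using Hugging Face transformers: one new special token adds exactly one row to the embedding matrix, and nothing else in the architecture changes. The model name and the <reg> token string are placeholders, not the paper's choices.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: any off-the-shelf causal LM works, since no
# architectural changes are needed. "gpt2" is just an example.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# One new special token -> one new row in the embedding matrix.
# This is the only parameter added for the register mechanism.
tokenizer.add_special_tokens({"additional_special_tokens": ["<reg>"]})
model.resize_token_embeddings(len(tokenizer))

register_id = tokenizer.convert_tokens_to_ids("<reg>")
```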
3/n Why MuToR? (1/2)
✅ No architecture changes (unlike prior multi-token setups that add extra transformer blocks)
✅ Fully compatible with off-the-shelf pretrained LLMs
✅ Ideal for supervised finetuning (aligns multi-token training with the next-token pretraining setup)
Multi-Token Prediction Needs Registers
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In...
2/n Meet MuToR: multi-token prediction, simplified
🔹 Training: registers (interleaved with regular tokens) predict future tokens several steps ahead for a richer learning signal (minimal sketch below)
🔹 Inference: registers are discarded, so decoding is pure next-token prediction
Paper: arxiv.org/abs/2505.10518
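A minimal, hedged sketch of the training-time interleaving described above: register tokens are inserted between regular tokens and paired with targets a few steps ahead, while regular tokens keep their usual next-token targets. The register id, insertion probability, and offset range below are assumptions, and attention-masking details are omitted.

```python
import random

REGISTER_ID = 50257  # hypothetical: one extra id appended to the base vocabulary

def interleave_registers(tokens, max_offset=4, p_insert=0.25, rng=random):
    """Build a training sequence by interleaving register tokens with the
    regular tokens. Each register is paired with a target a few positions
    ahead; regular tokens keep the standard next-token target.
    (Sketch only: the sampling scheme is an assumption.)"""
    inputs, targets = [], []
    for i, tok in enumerate(tokens[:-1]):
        inputs.append(tok)
        targets.append(tokens[i + 1])          # standard next-token target
        if rng.random() < p_insert:
            offset = rng.randint(2, max_offset)  # predict several steps ahead
            if i + offset < len(tokens):
                inputs.append(REGISTER_ID)
                targets.append(tokens[i + offset])
    return inputs, targets

# At inference, registers are never generated: the sequence contains only
# regular tokens, so decoding is plain next-token prediction.
```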
1/n Multi-token prediction training boosts LLMs (DeepSeek-V3), tackling key limitations of the next-token prediction objective:
• Short-term focus
• Struggles with long-range decisions
• Weaker supervision
Prior methods add complexity (extra layers).
Our fix? Register tokens: elegant and powerful.