
Oussama Zekri

@ozekri.bsky.social

ENS Saclay maths dept + UW Research Intern. Website: https://oussamazekri.fr Blog: https://logb-research.github.io/

61 Followers  |  118 Following  |  20 Posts  |  Joined: 25.11.2024

Latest posts by ozekri.bsky.social on Bluesky

Post image

🚨 New paper on regression and classification!

Adding to the discussion on whether to use least-squares or cross-entropy, i.e. regression or classification formulations of supervised problems!

A thread on how to bridge these problems:

10.02.2025 12:00 | 👍 50    🔁 8    💬 4    📌 0

You mean we don’t stop at the boundary of the convex set, but go just a bit further?

Wow, does this trick have a name?

06.02.2025 12:12 | 👍 0    🔁 0    💬 1    📌 0

Looks nice!! Will stop by your notebooks

05.02.2025 22:28 | 👍 1    🔁 0    💬 0    📌 0
Nicolas Boullé | About me

Working with him these past months has been both fun and inspiring. He’s an incredibly talented researcher! 🚀

If you haven’t heard of him, check out his work: he’s one of the pioneers of operator learning and is pushing this field to new heights!

04.02.2025 15:42 | 👍 1    🔁 0    💬 0    📌 0
Nicolas Boullé | About me

Thanks for reading!

❤️ Work done during my 3-month internship at Imperial College!

A huge thanks to Nicolas Boullé (nboulle.github.io) for letting me work on a topic that interested me a lot during the internship.

04.02.2025 15:42 | 👍 1    🔁 0    💬 1    📌 0
Post image

We fine-tuned a discrete diffusion model to respond to user prompts. In just 7k iterations (GPU poverty is real, haha), it outperforms the vanilla model ~75% of the time! 🚀

04.02.2025 15:42 | 👍 1    🔁 0    💬 1    📌 0
Post image

Building on this, we can correct the gradient direction to better **follow the flow**, using the implicit function theorem (cf. @mblondel.bsky.social et al., arxiv.org/abs/2105.15183) ✨

The cool part? We only need to solve a linear system, whose matrix has a closed-form inverse! 🔥

04.02.2025 15:42 | 👍 1    🔁 0    💬 1    📌 0
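
For readers who want the mechanics, here is a generic implicit-differentiation sketch (my own illustration, not the paper’s exact correction): if x*(θ) is defined by F(x, θ) = 0, the implicit function theorem gives dx*/dθ = -(∂F/∂x)⁻¹ ∂F/∂θ, so the correction costs one linear solve.

```python
import numpy as np

def implicit_grad(F, x_star, theta, eps=1e-6):
    """Differentiate x*(theta) defined by F(x*, theta) = 0 via the implicit
    function theorem, with finite-difference Jacobians (illustration only)."""
    n, m = x_star.size, theta.size
    dF_dx = np.zeros((n, n))
    dF_dth = np.zeros((n, m))
    F0 = F(x_star, theta)
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        dF_dx[:, i] = (F(x_star + e, theta) - F0) / eps
    for j in range(m):
        e = np.zeros(m); e[j] = eps
        dF_dth[:, j] = (F(x_star, theta + e) - F0) / eps
    # dx*/dtheta = -(dF/dx)^{-1} dF/dtheta : a single linear system to solve
    return -np.linalg.solve(dF_dx, dF_dth)

# Toy check: x*(theta) = tanh(theta) solves x - tanh(theta) = 0,
# so dx*/dtheta should equal 1 - tanh(theta)^2.
F = lambda x, th: x - np.tanh(th)
theta = np.array([0.3])
print(implicit_grad(F, np.tanh(theta), theta), 1 - np.tanh(0.3) ** 2)
```
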
Post image

Inspired by Implicit Diffusion (@pierremarion.bsky.social @akorba.bsky.social @qberthet.bsky.social 🤓, arxiv.org/abs/2402.05468), we sample using a specific CTMC, reaching the limiting distribution in an infinite time horizon. This effectively implements a gradient flow w.r.t. a Wasserstein metric! 🔥

04.02.2025 15:42 | 👍 2    🔁 0    💬 1    📌 0
Post image

SEPO, like most policy optimization algorithms, alternates between sampling and optimization. But what if sampling itself were seen as an optimization procedure in distribution space? 🚀

04.02.2025 15:42 | 👍 2    🔁 0    💬 1    📌 0
Post image

If you have a discrete diffusion model (naturally designed for discrete data, e.g. language or DNA sequence modeling), you can fine-tune it with non-differentiable reward functions! 🎯

For example, this enables RLHF for discrete diffusion models, making alignment more flexible and powerful. ✅

04.02.2025 15:42 | 👍 2    🔁 0    💬 1    📌 0
Post image Post image

The main gradient takes the form of a weighted log concrete score, echoing DeepSeek’s unified paradigm with the weighted log policy! 🔥

From this, we can reconstruct any policy gradient method for discrete diffusion models (e.g. PPO, GRPO, etc.). 🚀

04.02.2025 15:42 | 👍 1    🔁 0    💬 1    📌 0
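
To make the analogy concrete, here is a schematic reward-weighted log-score surrogate in PyTorch; this is my own sketch of the general policy-gradient pattern, not the exact SEPO estimator, and the `log_score` input and mean baseline are assumptions.

```python
import torch

def weighted_log_score_loss(log_score, reward):
    """Schematic REINFORCE-style surrogate: minimizing the (reward - baseline)
    weighted negative log concrete score ascends the expected reward,
    mirroring the weighted log-policy objectives behind PPO/GRPO."""
    advantage = reward - reward.mean()              # simple group baseline
    return -(advantage.detach() * log_score).mean()

# Toy values: log concrete-score terms gathered at sampled transitions,
# plus one scalar reward per sample.
log_score = torch.randn(8, requires_grad=True)
reward = torch.rand(8)
loss = weighted_log_score_loss(log_score, reward)
loss.backward()
print(loss.item(), log_score.grad)
```
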
Post image

The main bottleneck of Energy-Based Models is computing the normalizing constant Z.

Instead, recent discrete diffusion models skip Z by learning ratios of probabilities. This forms the concrete score, which a neural network can model efficiently! ⚡

The challenge? Using this score network as a policy.

04.02.2025 15:42 | 👍 1    🔁 0    💬 1    📌 0
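
A toy illustration of why such ratios sidestep Z (my own example, with an arbitrary energy function): if p(x) ∝ exp(-E(x)), then p(y)/p(x) = exp(E(x) - E(y)) and the partition function cancels.

```python
import torch

def energy(x):
    # Any unnormalized energy works here; p(x) is proportional to exp(-E(x)).
    return (x.float() ** 2).sum(-1)

def probability_ratio(x, y):
    # p(y)/p(x) = exp(E(x) - E(y)): the normalizing constant Z cancels,
    # which is what ratio-based (concrete score) models exploit.
    return torch.exp(energy(x) - energy(y))

x = torch.tensor([1, 0, 2])
y = torch.tensor([1, 1, 2])  # a neighbouring sequence differing in one position
print(probability_ratio(x, y))  # no Z needed
```
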
Post image

🚀 Policy gradient methods like DeepSeek’s GRPO are great for fine-tuning LLMs via RLHF.

But what happens when we swap autoregressive generation for discrete diffusion, a rising architecture promising faster & more controllable LLMs?

Introducing SEPO!

📑 arxiv.org/pdf/2502.01384

🧵👇

04.02.2025 15:42 | 👍 6    🔁 2    💬 1    📌 0

Beautiful work!!

04.02.2025 11:59 | 👍 1    🔁 0    💬 0    📌 0
Post image

🚀 Proud to share our work on the training dynamics in Transformers with Wassim Bouaziz & @viviencabannes.bsky.social @Inria @MetaAI

📝 Easing Optimization Paths arxiv.org/pdf/2501.02362 (accepted @ICASSP 2025 🥳)

📝 Clustering Heads 🔥 https://arxiv.org/pdf/2410.24050

🖥️ github.com/facebookrese...

1/🧵

04.02.2025 11:56 | 👍 5    🔁 4    💬 1    📌 1
Preview
GitHub - abenechehab/dicl: Official implementation of DICL (Disentangled In-Context Learning), featured in the paper Zero-shot Model-based Reinforcement Learning using Large Language Models.

Happy to see Disentangled In-Context Learning accepted at ICLR 2025 🥳

Make zero-shot reinforcement learning with LLMs go brrr 🚀

🖥️ github.com/abenechehab/...

📜 arxiv.org/pdf/2410.11711

Congrats to Abdelhakim (abenechehab.github.io) for leading it; always fun working with nice and strong people 🤗

25.01.2025 13:10 | 👍 5    🔁 2    💬 0    📌 0
Génération de données en IA par transport et débruitage (1) - Stéphane Mallat (2024-2025)
YouTube video by Mathématiques et informatique - Collège de France

For the French-speaking audience, S. Mallat’s courses at the Collège de France on Data generation in AI by transport and denoising have just started. I highly recommend them, as I’ve learned a lot from the overall vision of his courses.

Recordings are also available: www.youtube.com/watch?v=5zFh...

20.01.2025 17:49 | 👍 10    🔁 3    💬 0    📌 0
Post image

Speculative sampling accelerates inference in LLMs by drafting future tokens which are verified in parallel. With @vdebortoli.bsky.social, A. Galashov & @arthurgretton.bsky.social, we extend this approach to (continuous-space) diffusion models: arxiv.org/abs/2501.05370

10.01.2025 16:30 | 👍 45    🔁 10    💬 0    📌 0
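
For context, a minimal sketch of the standard token-level accept/reject rule that speculative sampling relies on; this is my own illustration of the discrete case, while the linked paper's contribution is the extension to continuous-space diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target):
    """Draft a token from p_draft, accept it with prob min(1, p_target/p_draft),
    otherwise resample from the residual max(0, p_target - p_draft).
    The returned token is distributed exactly according to p_target."""
    x = rng.choice(len(p_draft), p=p_draft)
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    residual = np.maximum(p_target - p_draft, 0.0)
    return rng.choice(len(p_target), p=residual / residual.sum())

p_draft = np.array([0.5, 0.3, 0.2])   # cheap draft model's next-token distribution
p_target = np.array([0.2, 0.5, 0.3])  # expensive target model's distribution
print([speculative_step(p_draft, p_target) for _ in range(5)])
```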

I couldn’t have said it better myself!

06.12.2024 16:09 | 👍 1    🔁 0    💬 0    📌 0

The idea that one needs to know a lot of advanced math to start doing research in ML seems so wrong to me. Instead of reading books for weeks and forgetting most of them a year later, I think it's much better to try to do things, see what knowledge gaps prevent you from doing them, and only then read.

06.12.2024 14:26 | 👍 9    🔁 2    💬 4    📌 0

This equivalence between LLMs and Markov chains may seem useless, but it isn't! Among its contributions, the paper highlights bounds established thanks to this equivalence, and verifies the influence of the bounds' terms on recent LLMs!

I invite you to take a look at the other contributions of the paper 🙂

04.12.2024 09:47 | 👍 3    🔁 0    💬 0    📌 0

This number is huge, but **finite**! Working with Markov chains on a finite state space really gives non-trivial mathematical insights (existence and uniqueness of a stationary distribution, for example...).

04.12.2024 09:41 | 👍 1    🔁 0    💬 1    📌 0
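
A tiny example of the kind of insight finiteness buys (my own illustration): for a finite, irreducible chain, the stationary distribution exists, is unique, and is just the normalized left eigenvector of the transition matrix for eigenvalue 1.

```python
import numpy as np

P = np.array([[0.9, 0.1],   # row-stochastic transition matrix of a 2-state chain
              [0.4, 0.6]])

# Left eigenvector of P for eigenvalue 1, i.e. an eigenvector of P.T.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
pi = pi / pi.sum()

print(pi)      # stationary distribution, here [0.8, 0.2]
print(pi @ P)  # invariance check: equals pi
```
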
Preview
Large Language Models as Markov Chains: Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis o...

This seems like… what we started with, no? arxiv.org/abs/2410.02724

03.12.2024 12:19 | 👍 167    🔁 9    💬 16    📌 1
Post image

🚨 So, you want to predict your model’s performance at test time? 🚨

💡 Our NeurIPS 2024 paper proposes 𝐌𝐚𝐍𝐨, a training-free and SOTA approach!

📑 arxiv.org/pdf/2405.18979
🖥️ https://github.com/Renchunzi-Xie/MaNo

1/🧵 (A surprise at the end!)

03.12.2024 16:58 | 👍 16    🔁 6    💬 2    📌 1
Post image

I wrote a summary of the main ingredients of the neat proof by Hugo Lavenant that diffusion models do not generally define optimal transport. github.com/mathematical...

30.11.2024 08:35 | 👍 240    🔁 45    💬 6    📌 5

For more details, check out these papers:
👉 arxiv.org/pdf/2402.00795 - Introduces this method (to the best of my knowledge).
👉 arxiv.org/pdf/2410.02724 - Provides theoretical results and empirical validation on LLMs.

26.11.2024 14:52 | 👍 1    🔁 0    💬 0    📌 0
Post image

💡 For a Markov chain with d states, the LLM-based method achieves an error rate of O(log(d)/N).

The frequentist approach, which is minimax optimal, achieves O(d/N) (see Wolfer et al., 2019, arxiv.org/pdf/1902.00080).

This makes it particularly efficient for Markov chains with a large number of states! 🌟

26.11.2024 14:52 | 👍 0    🔁 0    💬 1    📌 0
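
Rough orders of magnitude for the two rates quoted above (constants ignored, purely illustrative):

```python
import math

N = 1_000_000                   # number of observed transitions
for d in (10, 1_000, 100_000):  # number of states
    llm_rate = math.log(d) / N  # O(log(d)/N): rate quoted for the LLM-based method
    freq_rate = d / N           # O(d/N): rate quoted for the frequentist estimator
    print(f"d={d:>7}:  log(d)/N = {llm_rate:.1e}   d/N = {freq_rate:.1e}")
```
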
Post image Post image

‼️ What’s even better is that you can derive bounds on the estimation error based on the number of samples N and specific properties of the Markov chain.

Tested and validated on recent LLMs!

26.11.2024 14:52 | 👍 0    🔁 0    💬 1    📌 0
Video thumbnail

🚀 Did you know you can use the in-context learning abilities of an LLM to estimate the transition probabilities of a Markov chain?

The results are pretty exciting! 😄

26.11.2024 14:52 | 👍 8    🔁 2    💬 1    📌 1
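
A hedged sketch of how such an estimate can be read off in practice (hypothetical helper; it assumes access to the LLM's next-token log-probabilities and is not necessarily the paper's exact protocol): encode the observed states as tokens, condition on the trajectory, then renormalize the next-token distribution over the state tokens.

```python
import numpy as np

def estimate_transition_row(next_token_logprobs, state_token_ids):
    """next_token_logprobs: dict token_id -> log-probability returned by the LLM
    for the position right after the observed trajectory (assumed available).
    state_token_ids: vocabulary ids used to encode the chain's states."""
    probs = np.exp(np.array([next_token_logprobs[t] for t in state_token_ids]))
    return probs / probs.sum()  # keep only the state tokens and renormalize

# Placeholder log-probabilities standing in for a real LLM call.
fake_logprobs = {10: -0.4, 11: -1.5, 12: -2.3}
print(estimate_transition_row(fake_logprobs, [10, 11, 12]))
```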
