
Skander Moalla

@skandermoalla.bsky.social

PhD @Caglar Gulcehre Lab for AI Research (CLAIRE) @EPFL. Deep Reinforcement Learning, RLHF, foundation models. ML Research Template (https://github.com/CLAIRE-Labo/python-ml-research-template)

61 Followers  |  67 Following  |  18 Posts  |  Joined: 27.11.2024

Posts by Skander Moalla (@skandermoalla.bsky.social)

The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That’s why we (@icepfl.bsky.social @ethz.ch @cscsch.bsky.social) built Apertus.

03.09.2025 09:26 — 👍 24    🔁 8    💬 2    📌 2
Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-poli...

I’m really proud of this work! It’s been an amazing collaboration with @simonmatrenok.bsky.social and @caglarai.bsky.social

📰 Paper: arxiv.org/abs/2507.08068
Hidden gems and open questions in the 30+ page appendix 💎
🧑‍💻 Code: github.com/CLAIRE-Labo/...
🌐 Blog: claire-labo.github.io/quantile-rewar

15.07.2025 18:45 — 👍 2    🔁 0    💬 0    📌 0

What do these optimal policies look like? 👀
We show the equivalence of a family of transformations, which lets us qualitatively interpret the quantile-reward optimal policy as a Best-of-N policy 🎯
Empirically, each transformation brings different training dynamics, and comparing them all remains an open question! 🕵️

15.07.2025 18:45 — 👍 2    🔁 0    💬 1    📌 0
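For context, a Best-of-N policy (the object the post above compares the quantile-reward optimum to) can be sketched in a few lines; `generate` and `reward` here are hypothetical stand-ins for a reference model and a reward model, not functions from the paper's code:

```python
def best_of_n(prompt, generate, reward, n=8):
    """Sample n candidates from the reference model and keep the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

Increasing n concentrates the induced distribution on high-reward completions, which is the qualitative behavior the quantile-reward optimal policy is interpreted against.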

QRPO is a framework. You can shape the optimal policy! 🎛️
We derive a framework around QRPO for using transformations on top of the quantile reward.
Each transformation reshapes the reward distribution and affects the properties of the optimal policy, while having a tractable partition function.

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0
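One way to see why transformed quantile rewards keep the partition function tractable (a sketch in standard KL-regularized RL notation, not the paper's exact statement): since the quantile reward $q$ is Uniform(0,1) under the reference policy, any transformation $f$ turns $Z$ into a one-dimensional integral:

```latex
Z(x)
= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ e^{f(q(x,y))/\beta} \right]
= \int_0^1 e^{f(q)/\beta} \, dq ,
```

which is available in closed form for many choices of $f$ and otherwise computable by 1-D numerical quadrature.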

And we show that for relatively high beta, with good data, the probabilities increase as predicted 💯

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

For QRPO, this is not a mystery anymore; we know exactly where the probabilities should move, and we explain how it's normal for them to decrease when the regularization (beta) is very low.
This is simply because the target policy is much further away from the training support 🎯

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

Is QRPO still subject to the "chosen probabilities decreasing" problem?
Our understanding of the KL-regularized closed-form solution gives insights into the "DPO chosen probabilities decreasing" problem! 🤔

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

💬 The reward model we use has been trained to be robust to length bias, and we see that this robustness is preserved in QRPO and REBEL, which train directly on rewards.
But when the rewards are compressed to preferences for DPO and SimPO, the typical length-bias trend appears, despite the reduction in mean length.

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

🥇 QRPO achieves top performance in chat and coding compared to DPO, REBEL, and SimPO, each capturing a different way to learn from the reward signal (preference, reward difference, length normalization).

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

Obviously, nothing comes for free, but we give you a great deal! 🤝

* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough!

* And you can scale this number for off-policy data generated from the reference model! 📈

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0
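A toy illustration of the point above (all names and numbers here are mine, not from the paper): the empirical quantile computed from K reference rewards is an unbiased estimate of the true quantile, with variance shrinking like 1/K, so even small K gives a usable signal.

```python
import random

def empirical_quantile(reward, reference_rewards):
    """Fraction of reference rewards at or below the given reward."""
    return sum(r <= reward for r in reference_rewards) / len(reference_rewards)

random.seed(0)
true_q = 0.9  # a reward sitting at the true 0.9 quantile of a Uniform(0,1) reward distribution
for k in (2, 8, 64):
    # Repeat the K-reference estimate many times to measure its bias and variance.
    estimates = [
        empirical_quantile(true_q, [random.random() for _ in range(k)])
        for _ in range(2000)
    ]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(k, round(mean, 2), round(var, 4))
```

The mean stays near 0.9 for every K (unbiased), while the variance drops roughly as 0.9 * 0.1 / K.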

3️⃣ We can transform the reward distribution to make it known. It's uniform for reward quantiles! 🔑

🚀 The result: Quantile Reward Policy Optimization!

QRPO transforms rewards to quantile rewards, for which we derive Z, and can then fit the closed-form optimal RL solution with a simple regression! 📉

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0
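A minimal sketch of the recipe described above, as I read it from the thread. The function names and the exact loss shape are my own simplification; the paper's loss may differ in weighting and details. For a Uniform(0,1) quantile reward, Z = ∫₀¹ exp(q/β) dq = β(exp(1/β) − 1), and the regression pulls β·log(π/π_ref) toward q − β·log Z.

```python
import math

def quantile_reward(reward, reference_rewards):
    """Empirical quantile of `reward` among rewards of reference-model samples.
    Under the reference distribution this quantile is ~Uniform(0, 1)."""
    below = sum(r <= reward for r in reference_rewards)
    return below / len(reference_rewards)

def log_partition_uniform(beta):
    """log Z for a Uniform(0,1) reward: Z = integral_0^1 exp(q/beta) dq = beta*(exp(1/beta)-1)."""
    return math.log(beta * (math.exp(1.0 / beta) - 1.0))

def qrpo_regression_loss(logp_policy, logp_ref, reward, reference_rewards, beta):
    """Pointwise regression toward the closed-form KL-regularized optimum:
    beta * log(pi/pi_ref) should equal q - beta * log Z."""
    q = quantile_reward(reward, reference_rewards)
    target = q - beta * log_partition_uniform(beta)
    pred = beta * (logp_policy - logp_ref)
    return (pred - target) ** 2
```

Because both the quantile and log Z are known scalars, the loss is a plain squared error on each (prompt, completion) pair, with no pairwise comparison needed.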

1️⃣ The “infinite sum over all possible LLM generations” argument is a myth. We rewrite the partition function Z in terms of rewards, revealing that Z is given by the moment generating function (MGF) of the reward distribution!

2️⃣ Knowing the reward distribution => knowing the MGF => knowing Z 🔍

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0
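The rewriting in 1️⃣ can be spelled out in standard KL-regularized RL notation (a sketch, not the paper's exact derivation). The optimum of the β-regularized objective is π*(y|x) ∝ π_ref(y|x) exp(r(x,y)/β), so:

```latex
Z(x)
= \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, e^{r(x,y)/\beta}
= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ e^{r(x,y)/\beta} \right]
= M_{R}\!\left( \tfrac{1}{\beta} \right),
```

where $M_R$ is the moment generating function of the reward $R = r(x,y)$ under the reference policy: the intractable sum over generations collapses to a one-dimensional expectation over the reward distribution.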

We tackle the infamous “... partition function is known to be intractable...” problem 🧐
This is the problem that limits DPO-like methods to pairwise data. We solve it thanks to 3 insights! 💡

15.07.2025 18:45 — 👍 1    🔁 0    💬 1    📌 0

🚀 Big time! We can finally do simple LLM RL fine-tuning with rewards and leverage offline/off-policy data!

❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!

🧵 Here's how we do it:

15.07.2025 18:45 — 👍 12    🔁 4    💬 1    📌 0

⚡️🧠 Excited to share our recent work on long-context efficiency! We propose a new layer called RAT: fast and lightweight like RNNs, yet powerful like Attention. 🐭✨ This is joint work with Anunay Yadav, @razvan-pascanu.bsky.social @caglarai.bsky.social !

12.07.2025 09:59 — 👍 7    🔁 3    💬 1    📌 1

Excited to share our latest work on EvoTune, a novel method integrating LLM-guided evolutionary search and reinforcement learning to accelerate the discovery of algorithms! 1/12 🧵

26.04.2025 16:56 — 👍 21    🔁 10    💬 1    📌 2
Anastasia Koloskova, PhD student in Machine Learning at EPFL.

Anastasia @koloskova.bsky.social recently won the European @ellis.eu PhD award, for her amazing work on AI and optimization.

She will be joining University of Zurich as a professor this summer, and hiring PhD students and postdocs. You should apply to her group!

Her website: koloskova.github.io

08.03.2025 13:53 — 👍 9    🔁 1    💬 0    📌 0
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO

proceedings.neurips.cc/paper_files/...

03.03.2025 21:36 — 👍 1    🔁 0    💬 0    📌 0

A dream come true! I presented "No Representation, No Trust" on my favorite RL podcast, TalkRL!
Make sure to check it out to learn why training with PPO for too long makes your agent collapse!

03.03.2025 21:36 — 👍 4    🔁 2    💬 1    📌 0

Excited to share that the first paper of my PhD has been accepted for publication at the ISPRS Geospatial Week 2025! This dataset paper introduces a globally representative, high-resolution (10m) benchmark dataset for Above Ground Biomass estimation.

27.01.2025 13:21 — 👍 25    🔁 5    💬 2    📌 1

For my first post on Bluesky .. I'll start by announcing our 2025 edition of EEML which will be in Sarajevo :) ! I'm really excited about it and hope to see many of you there. Please follow the website (and Bluesky account) for more details which are coming soon ..

15.12.2024 18:39 — 👍 32    🔁 7    💬 1    📌 0

Also, check out our ML project template — it’s a game-changer! 🚀🚀
@caglarai.bsky.social
🧑‍💻 github.com/CLAIRE-Labo/...

10.12.2024 19:38 — 👍 6    🔁 3    💬 0    📌 0
NeurIPS Poster: No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO — NeurIPS 2024

👥 NeurIPS: neurips.cc/virtual/2024...
📰 Paper: arxiv.org/abs/2405.00662
🧑‍💻 Code: github.com/CLAIRE-Labo/...

10.12.2024 18:39 — 👍 2    🔁 0    💬 0    📌 0

Ever been puzzled by your PPO agent collapsing out of nowhere? 📈🤯📉 Come check out our poster tomorrow!
Wed 11 Dec 11 am - 2 pm PST
West Ballroom A-D #6403
@caglarai.bsky.social @andreamiele.bsky.social @razvan-pascanu.bsky.social

10.12.2024 18:33 — 👍 7    🔁 1    💬 1    📌 1