@noamrazin.bsky.social
Postdoctoral Fellow at Princeton Language and Intelligence | Past: Computer Science PhD at Tel Aviv University & Apple Scholar in AI/ML | Interested in the foundations of deep learning https://noamrazin.github.io/
Joint w/ Yong Lin, Jiarui Yao, @profsanjeevarora.bsky.social
This work was supported in part by the #ZuckermanSTEMLeadershipProgram.
📰 Paper: arxiv.org/abs/2507.07981
6/6
Overall, our results highlight that seemingly minor design choices can substantially impact how RMs generalize. We hope they encourage further research into understanding the implicit biases of different RM types.
5/6
We also challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification, since they can operate as both a verifier and a generator. We prove and show empirically that IM-RMs do not need to learn to generate in order to verify responses.
4/6
TL;DR: Through theory and experiments, we find that IM-RMs rely more heavily on superficial token-level cues. As a result, they often generalize worse under token-level shifts and even in-distribution, yet generalize comparably or better under domain shifts.
3/6
As the DPO paper showed, every LM defines an IM-RM. However, prior work observed that IM-RMs often generalize worse than EX-RMs. The existence of a generalization gap is puzzling, since both RM types can be trained using the same LM, data, and loss.
So what causes it?
2/6
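For context, the DPO implicit reward is the β-scaled log-probability ratio between the trained LM and a reference model. A minimal sketch of how an IM-RM scores a response (toy numbers; assumes you already have the summed token log-probs of the response under each model):

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x)),
    computed from the summed token log-probs of response y given prompt x."""
    return beta * (logp_policy - logp_ref)

# Toy example: the trained LM assigns higher probability to the response
# than the reference model does, so the implicit reward is positive.
r = implicit_reward(logp_policy=-12.0, logp_ref=-15.0, beta=0.1)
print(round(r, 3))  # 0.3
```

An EX-RM, by contrast, scores responses with a separate linear head on top of the LM's hidden state, which is part of why the two can generalize differently despite sharing the same LM, data, and loss.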
Reward models (RMs) are key to language model post-training and inference pipelines. But little is known about the relative pros and cons of different RM types.
📰 We investigate why RMs implicitly defined by language models (LMs) often generalize worse than explicit RMs.
🧵
1/6
Had the pleasure of collaborating with Zixuan Wang, Hubert Strauss, Stanley Wei, @jasondeanlee.bsky.social, @profsanjeevarora.bsky.social.
This work was supported in part by the #ZuckermanSTEMLeadershipProgram.
📰 Paper: arxiv.org/abs/2503.15477
10/10
Overall, despite the importance of RMs, our understanding of what makes a good RM remains limited.
We hope our insights can inspire further research on RM training and evaluation protocols that account for properties beyond accuracy.
9/10
We additionally prove that the same RM can induce high reward variance and work well for one LLM, yet induce low reward variance and perform poorly for another.
This reveals a fundamental limitation of evaluating RMs in isolation from the LLM they guide.
8/10
Intuitively, accuracy and reward variance measure distinct properties of an RM. Reward variance is determined by how well the RM separates outputs that are likely under the LLM being aligned. In contrast, accuracy depends only on the rankings of outputs.
7/10
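The distinction above can be made concrete with a toy computation (hypothetical numbers; reward variance here means the variance of the reward over responses sampled from the LLM being aligned). Two RMs can rank responses identically, and so be equally accurate, while inducing very different reward variance:

```python
def reward_variance(rewards, probs):
    """Variance of the reward under the LLM's output distribution:
    Var[r(y)] for y ~ pi(.|x), over a small discrete set of responses."""
    mean = sum(r * p for r, p in zip(rewards, probs))
    return sum(p * (r - mean) ** 2 for r, p in zip(rewards, probs))

# Two RMs inducing the SAME ranking over three responses (identical
# pairwise accuracy), but separating the likely responses differently.
probs = [0.7, 0.2, 0.1]   # LLM's probabilities for the three responses
rm_a = [1.0, 0.99, 0.0]   # barely separates the two likely responses
rm_b = [1.0, 0.0, -1.0]   # spreads rewards apart

print(reward_variance(rm_a, probs) < reward_variance(rm_b, probs))  # True
```

Since accuracy ignores the LLM's distribution entirely, it cannot see this difference.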
This result builds on a previous paper (ICLR 2024), where we showed that low reward variance leads to vanishing gradients in RLHF.
arxiv.org/abs/2310.20703
6/10
We prove and show empirically that regardless of how accurate an RM is, if it induces *low reward variance*, then the RLHF objective suffers from a flat landscape.
As a result, even a perfectly accurate RM can underperform less accurate models due to slow optimization.
5/10
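One way to see the flat-landscape effect is through the centered rewards (advantages) that drive a REINFORCE-style update: when the RM barely separates the responses the LLM actually samples, the advantages are near zero and so is the gradient signal. A toy sketch with hypothetical numbers:

```python
def expected_advantage_magnitude(rewards, probs):
    """E[|r(y) - b|] with baseline b = E[r(y)], under y ~ pi(.|x).
    This scales the REINFORCE score-function term; small values mean
    near-zero gradient signal, i.e., a flat RLHF landscape."""
    baseline = sum(r * p for r, p in zip(rewards, probs))
    return sum(p * abs(r - baseline) for r, p in zip(rewards, probs))

probs = [0.7, 0.2, 0.1]
low_var_rm = [1.0, 0.99, 0.98]   # perfectly accurate ranking, tiny separation
high_var_rm = [1.0, 0.0, -1.0]   # same ranking, large separation

print(expected_advantage_magnitude(low_var_rm, probs))   # tiny
print(expected_advantage_magnitude(high_var_rm, probs))  # much larger
```

Both RMs rank the responses identically, so accuracy alone cannot predict which one optimizes faster.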
However, recent empirical evidence suggests that accuracy may not be indicative of an LLM's performance after RLHF. So, what makes an RM a good teacher?
4/10
RMs are primarily evaluated through accuracy, which measures their agreement with human preferences in terms of ranking output pairs.
3/10
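Concretely, pairwise accuracy just checks whether the RM scores the human-preferred response above the rejected one in each pair. A toy sketch (hypothetical scores):

```python
def pairwise_accuracy(rm_scores):
    """Fraction of (chosen, rejected) score pairs where the RM ranks
    the human-preferred (chosen) response strictly higher."""
    correct = sum(1 for chosen, rejected in rm_scores if chosen > rejected)
    return correct / len(rm_scores)

# Three preference pairs; the RM ranks two of them correctly.
pairs = [(1.2, 0.4), (0.1, 0.9), (2.0, 1.5)]
print(pairwise_accuracy(pairs))  # 2/3
```

Note that this metric depends only on the ordering of scores, not on their magnitudes.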
TL;DR: Alongside being accurate, an RM needs to induce sufficient reward variance for efficient optimization. This explains why even perfectly accurate RMs can be poor teachers and highlights limitations of existing RM benchmarks.
arxiv.org/abs/2503.15477
Details 👇
2/10
The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality?
📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers!
🧵
Paper link: arxiv.org/abs/2410.08847
Presenting tomorrow a poster on why DPO often decreases the probability of preferred responses, how that can cause surprising failures in alignment, and what we can do about it.
Catch me at these #NeurIPS workshop poster sessions:
- M3L 11:15am
- ATTRIB 3:00pm
- FITML 4:40pm
I am attending NeurIPS! Feel free to reach out if you want to chat.
I will present in the M3L, FITML, and ATTRIB workshops our paper on why DPO often decreases the probability of preferred responses and how that can lead to weird failures in alignment.
arxiv.org/abs/2410.08847
Catch Sadhika's talk today if you want to learn more about the surprising ways in which aligning language models based on preference data can fail
True. Though I believe there are several ways you can formalize that even without KL regularization: if you are training on samples from your model and it does not assign non-trivial probability to anything with a high reward, then it will necessarily take a long time to increase the reward.