Wow! Honored and amazed that our reward models paper has resonated so strongly with the community. Grateful to my co-authors and inspired by all the excellent reward model work at FAccT this year - excited to see the space growing and intrigued to see where things are headed next.
07.07.2025 17:26
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I'll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!
23.06.2025 15:26
Hat-tip to @natolambert.bsky.social & co for RewardBench, and to the open-weight RM community for helping to make this work possible!
23.06.2025 15:26
CREDITS: This work was done in collaboration with @hannahrosekirk.bsky.social,
@tsonj.bsky.social, @summerfieldlab.bsky.social, and @tsvetomira.bsky.social. Thanks to @frabraendle.bsky.social, Owain Evans, @matanmazor.bsky.social, and Carroll Wainwright for helpful discussions.
23.06.2025 15:26
Reward Model Interpretability via Optimal and Pessimal Tokens
Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning...
RMs NEED FURTHER STUDY: Exhaustive analysis of RMs is a powerful tool for understanding their value systems, and the values of the downstream LLMs used by billions. We are only just scratching the surface. Full paper here: arxiv.org/abs/2506.07326
23.06.2025 15:26
FAQ: Don't LLM logprobs give similar information about model "values"? Surprisingly, no! Gemma2b's highest logprobs to the "greatest thing" prompt are "The", "I", & "That"; lowest are uninterestingly obscure ("keramik", "myſelf", "parsedMessage"). RMs are different.
23.06.2025 15:26
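For concreteness, here's a minimal sketch of the logprob comparison in the FAQ above, assuming a Hugging Face causal LM; the "google/gemma-2b" checkpoint name is an illustrative assumption, not necessarily the exact model used:

```python
# Rough sketch of the FAQ comparison above: inspect a base LM's next-token
# logprobs for the same elicitation prompt, to contrast with RM scores.
# The checkpoint name is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_name = "google/gemma-2b"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(lm_name)
model = AutoModelForCausalLM.from_pretrained(lm_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "What, in one word, is the greatest thing ever?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits over the full vocab
logprobs = torch.log_softmax(next_token_logits.float(), dim=-1)

top = torch.topk(logprobs, k=5)
bottom = torch.topk(-logprobs, k=5)
print("highest:", [tokenizer.decode([i]) for i in top.indices.tolist()])
print("lowest: ", [tokenizer.decode([i]) for i in bottom.indices.tolist()])
```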
GENERALIZING TO LONGER SEQUENCES: While *exhaustive* analysis is not possible for longer sequences, we show that techniques such as Greedy Coordinate Gradient reveal similar patterns at longer lengths.
23.06.2025 15:26
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, "Black people" is the third-most undervalued term by RMs relative to the human data.
23.06.2025 15:26
MERE-EXPOSURE EFFECT: RM scores are positively correlated with word frequency in almost all models & prompts we tested. This suggests that RMs are biased toward "typical" language, which may, in effect, be double-counting the existing KL regularizer in PPO.
23.06.2025 15:26
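A rough sketch of how one might run this mere-exposure check, assuming the per-token `scores` dict from the method sketch further down the thread and the third-party `wordfreq` package (both assumptions, not the paper's exact setup):

```python
# Correlate per-token RM scores with English word frequency.
# Assumes `scores` (token string -> RM score) from the exhaustive-scoring
# sketch below, and the `wordfreq` package for Zipf frequencies.
from scipy.stats import spearmanr
from wordfreq import zipf_frequency

words = [tok for tok in scores if tok.strip().isalpha()]            # keep word-like tokens
freqs = [zipf_frequency(tok.strip().lower(), "en") for tok in words]
rho, p = spearmanr([scores[tok] for tok in words], freqs)
print(f"Spearman rho = {rho:.2f} (p = {p:.2g})")  # the post reports a positive correlation for almost all models & prompts
```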
FRAMING FLIPS SENSITIVITY: When the prompt is positively framed, RMs are more sensitive to positive-affect tokens; when it is negatively framed, to negative-affect tokens. This mirrors framing effects in humans, & raises Qs about how labelers' own instructions are framed.
23.06.2025 15:26
BASE MODEL MATTERS: Analysis of ten top-ranking RMs from RewardBench quantifies this heterogeneity and shows the influence of developer, parameter count, and base model. In particular, the choice of base model leaves a measurable imprint on the downstream RM.
23.06.2025 15:26
(🚨 CONTENT WARNING 🚨) The "worst possible" responses are an unholy amalgam of moral violations, identity terms (some more pejorative than others), and gibberish code. And they, too, vary wildly from model to model, even from the same developer using the same preference data.
23.06.2025 15:26
OPTIMAL RESPONSES REVEAL MODEL VALUES: This RM built on a Gemma base values "LOVE" above all; another (same developer, same preference data, same training pipeline) built on Llama prefers "freedom".
23.06.2025 15:26
METHOD: We take prompts designed to elicit a model's values ("What, in one word, is the greatest thing ever?"), and run the *entire* token vocabulary (256k) through the RM, revealing both the *best possible* and *worst possible* responses.
23.06.2025 15:26
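For anyone who wants to try this at home, here's a minimal sketch of the exhaustive single-token pass described above (not the paper's exact pipeline): the RM checkpoint, chat formatting, and unbatched loop are illustrative assumptions, and a real run would batch and use a GPU.

```python
# Minimal sketch of the exhaustive single-token analysis described above.
# Assumes a sequence-classification-style reward model from the Hugging Face
# hub; the checkpoint name and chat formatting are illustrative, and the
# unbatched loop over ~256k tokens is kept simple rather than fast.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "Skywork/Skywork-Reward-Llama-3.1-8B"  # assumed RewardBench-style RM
tokenizer = AutoTokenizer.from_pretrained(rm_name)
model = AutoModelForSequenceClassification.from_pretrained(rm_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "What, in one word, is the greatest thing ever?"
scores = {}
with torch.no_grad():
    for token_id in range(tokenizer.vocab_size):
        reply = tokenizer.decode([token_id])
        chat = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": reply}]
        input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
        scores[reply] = model(input_ids).logits[0, 0].item()  # scalar reward

best = sorted(scores, key=scores.get, reverse=True)[:10]   # "best possible" single-token replies
worst = sorted(scores, key=scores.get)[:10]                # "worst possible" single-token replies
print(best, worst, sep="\n")
```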
Reward models (RMs) are the moral compass of LLMs, but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were... eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
23.06.2025 15:26
Iβm humbled and incredibly honored to have played a part, however indirect and small, in helping their work to be recognized.
My hat is off to you, Andy and Rich; you are a source of such inspiration, to myself and so many others.
05.03.2025 19:33
Spending the day with Andy at UMass Amherst was one of the absolute highlights of my time researching The Alignment Problem, and I've been informed that my book was quoted as part of the supporting evidence of Andy and Rich's impact in their Turing Award Nomination.
05.03.2025 19:33
ML Youtuber http://youtube.com/AICoffeeBreak
PhD student in Computational Linguistics @ Heidelberg University |
Impressum: https://t1p.de/q93um
Research Fellow @BKCHarvard. Previously @openai @ainowinstitute @nycedc. Views are yours, of my posts. #isagiwhatwewant
Repeat founder, ML researcher, recovering mathematician. Here for AI discussions and sweet, sweet memes.
Technology ethicist, dog haver, mountain dweller, forest critter, mover of heavy things. Writing a book about wildfire, AI, and Cali. Senior Researcher @ D&S @datasociety.bsky.social. Founder @ Ethical Resolve. Formerly @undersequioas in the Bad Place.
how shall we live together?
societal impacts researcher at Anthropic
saffronhuang.com
Senior Research Fellow @ ucl.ac.uk/gatsby & sainsburywellcome.org
{learning, representations, structure}
my work: eringrant.github.io
not active: sigmoid.social/@eringrant @eringrant@sigmoid.social, twitter.com/ermgrant @ermgrant
The 2025 Conference on Language Modeling will take place at the Palais des Congrès in Montreal, Canada from October 7-10, 2025
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute. Views are my own and do not represent GDM or FLI.
Leader at the intersection of tech, social impact and higher education. Keynote speaker on AI and tech policy. Fan of data viz, Brahms and rescue dogs.
Building a world where every person can grow their family with dignity https://mavenpreprint.substack.com/
Machine learning prof at U Toronto. Working on evals and AGI governance.
Researcher @ Google DeepMind and Honorary Fellow @ U of Edinburgh.
RL, philosophy, foundations, AI.
https://david-abel.github.io
ML researcher, co-author Why Greatness Cannot Be Planned. Creative+safe AI, AI+human flourishing, philosophy; prev OpenAI / Uber AI / Geometric Intelligence
AI researcher going back to school for immunology
fast.ai co-founder, math PhD, data scientist
Writing: https://rachel.fast.ai/
DMs are open for any inquiry
VP and Distinguished Scientist at Microsoft Research NYC. AI evaluation and measurement, responsible AI, computational social science, machine learning. She/her.
One photo a day since January 2018: https://www.instagram.com/logisticaggression/
.edu: associate professor @columbia;
.org: cofounder @hackNY;
.com: chief data scientist @nytimes;
books: http://amzn.to/3J1tFnr
Anthropic and Import AI. Previously OpenAI, Bloomberg, The Register. Weird futures.