🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find that even the best RMs we tested exhibit their own value/style biases and fail to align with a user's preferences more than 25% of the time. 🧵
14.10.2025 15:59
Thanks to my collaborators @kghate.bsky.social @monadiab77.bsky.social @daniel-fried.bsky.social @atoosakz.bsky.social @maxkw.bsky.social
for their support in making this work possible!
02.10.2025 16:09
Please reach out if you'd like to chat about this work! We hope ConflictScope helps researchers study how models handle value conflicts that matter to their communities.
Code and data: github.com/andyjliu/con...
arXiv: www.arxiv.org/abs/2509.25369
02.10.2025 16:07
ConflictScope can also be used to evaluate approaches to steering models. We find that including a detailed target ranking in the system prompt consistently improves model alignment with that ranking under conflict, though with plenty of room for improvement.
02.10.2025 16:06
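For concreteness, a minimal sketch of what "a detailed target ranking in the system prompt" could look like; the exact prompt wording here is an assumption for illustration, not the paper's actual steering prompt.

```python
def steering_system_prompt(ranked_values: list[str]) -> str:
    """Build a system prompt that embeds an explicit target value ranking.
    The wording is illustrative -- the paper's actual prompt may differ."""
    ranking = " > ".join(ranked_values)
    return (
        "You are a helpful assistant. When a request puts these values in "
        f"conflict, resolve it according to this priority order: {ranking}."
    )

print(steering_system_prompt(["harmlessness", "honesty", "helpfulness"]))
```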
We find significant shifts between models' expressed and revealed preferences under conflict! Models say they prefer actions that support protective values (e.g. harmlessness) when asked directly, but support personal values (e.g. helpfulness) in more realistic evaluations.
02.10.2025 16:06
To address issues with multiple-choice evaluation, we focus on open-ended evaluation with a simulated user. Annotation studies show strong correlation between LLM and human judgments of which action a model took in a given scenario, allowing us to automate open-ended evaluations.
02.10.2025 16:06
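A toy sketch of how one might validate an LLM judge against human annotations; the labels and the choice of Cohen's kappa are assumptions for illustration, not the paper's actual annotation-study design.

```python
from sklearn.metrics import cohen_kappa_score

# Paired labels for the same scenarios: which action each annotator
# said the model took. Toy data, not the paper's annotations.
human_labels = ["refuse", "comply", "comply", "refuse", "deflect"]
judge_labels = ["refuse", "comply", "deflect", "refuse", "deflect"]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"LLM-judge vs. human agreement (Cohen's kappa): {kappa:.2f}")
```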
We introduce new metrics to measure how morally challenging a dataset is for models. We find that ConflictScope produces datasets that elicit more disagreement and stronger preferences than moral dilemma datasets, while alignment data frequently elicits indifference from models.
02.10.2025 16:05
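The paper defines the actual metrics; as a hedged illustration, here is one plausible way to operationalize scenario-level disagreement and preference strength. These formulas are assumptions, not the paper's definitions.

```python
def preference_strength(p_a: float) -> float:
    """Map P(model chooses action A) into [0, 1]:
    0 = indifferent (p = 0.5), 1 = maximally decided."""
    return abs(p_a - 0.5) * 2

def disagreement(choices_by_model: list[str]) -> float:
    """Fraction of model pairs that chose different actions in one scenario."""
    n = len(choices_by_model)
    pairs = n * (n - 1) / 2
    differing = sum(a != b
                    for i, a in enumerate(choices_by_model)
                    for b in choices_by_model[i + 1:])
    return differing / pairs

print(round(disagreement(["comply", "refuse", "comply"]), 2))  # 0.67: contested
print(round(preference_strength(0.55), 2))                     # 0.10: near-indifferent
```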
Given a set of values, ConflictScope generates scenarios in which an LLM-based assistant faces a conflict between a pair of values in the set. It then evaluates which value a target LLM supports more in each scenario before combining scenario-level judgments into a value ranking.
02.10.2025 16:05
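A minimal sketch of that pipeline shape, assuming hypothetical helpers (`generate_scenario`, `judge_supported_value`) and simple win-count aggregation; the actual implementation is in the repo linked above.

```python
import random
from itertools import combinations
from collections import Counter

# Illustrative stubs -- in the real pipeline these would be LLM calls.
def generate_scenario(value_a: str, value_b: str) -> str:
    """Prompt a generator LLM for a scenario that forces an assistant
    to trade off value_a against value_b. Placeholder string here."""
    return f"Scenario pitting {value_a} against {value_b}"

def judge_supported_value(model: str, scenario: str,
                          value_a: str, value_b: str) -> str:
    """Run the target model on the scenario, then have a judge LLM decide
    which value the response supported. Randomized here as a placeholder."""
    return random.choice([value_a, value_b])

def rank_values(model: str, values: list[str],
                scenarios_per_pair: int = 10) -> list[str]:
    wins = Counter({v: 0 for v in values})
    for a, b in combinations(values, 2):        # every pair of values in the set
        for _ in range(scenarios_per_pair):
            scenario = generate_scenario(a, b)
            wins[judge_supported_value(model, scenario, a, b)] += 1
    # Combine scenario-level judgments into a value ranking (here: by win count).
    return [v for v, _ in wins.most_common()]

print(rank_values("target-llm", ["helpfulness", "harmlessness", "honesty"]))
```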
🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully automated evaluation pipeline that reveals how models rank values under conflict.
(📷 xkcd)
02.10.2025 16:04
Placing LLMs in simulated markets helps us quantitatively and qualitatively measure their propensity to collude, as well as how environmental changes affect this. Read below or find @veronateo.bsky.social at the ICML multi-agent systems workshop to learn more!
09.07.2025 13:24
very cool!
09.03.2025 02:59
these are great, thanks! will check them out
06.01.2025 00:32
started Axiomatic but didn't get very far - Permutation City looks fun though, thanks
04.01.2025 16:25
looking for 2025 book recs!
things i've previously liked, for reference -
nonfiction: the structure of scientific revolutions, cybernetic revolutionaries, seeing like a state
fiction: stories of your life and others, one hundred years of solitude, project hail mary, recursion
03.01.2025 21:58
PRISM has preference scores for different models that you can convert into pairwise labels
24.12.2024 05:34
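A sketch of that conversion under an assumed schema (a dict of per-response scores); PRISM's actual fields differ, so treat the names here as placeholders.

```python
from itertools import combinations

def scores_to_pairwise(scores: dict[str, float]) -> list[tuple[str, str]]:
    """Turn per-response preference scores into (preferred, rejected) pairs,
    skipping ties. The dict schema is an assumption, not PRISM's format."""
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] != scores[b]:
            winner, loser = (a, b) if scores[a] > scores[b] else (b, a)
            pairs.append((winner, loser))
    return pairs

# e.g. one conversation turn scored across three model responses
print(scores_to_pairwise({"model_a": 87, "model_b": 34, "model_c": 87}))
# -> [('model_a', 'model_b'), ('model_c', 'model_b')]; the (a, c) tie is skipped
```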
Looking for all your LTI friends on Bluesky? The LTI Starter Pack is here to help!
go.bsky.app/NhTwCVb
20.11.2024 16:15
could I be added? thanks for curating :)
07.11.2024 21:06
societal impacts of AI | assistant professor of philosophy & software and societal systems at Carnegie Mellon University & AI2050 Schmidt Sciences early-career fellow | system engineering + AI + philosophy | https://kasirzadeh.org/
Professor & Director of Language Technologies Institute (LTI), CMU, and Director of R3LIT lab. Responsible AI/NLP, Applied ML, Arabic NLP, Low Resource NLP, Computational Social Science…
multi-model @ ¬ | ex ai safety @LTI, CMU
Technical AI Policy Researcher at HuggingFace @hf.co 🤗. Responsible AI Champion. Leading better AI Evals with @eval-eval.bsky.social!
AI Architect | North Carolina | AI/ML, IoT, science
WARNING: I talk about kids sometimes
I'm not like the other Bayesians. I'm different.
Thinks about philosophy of science, AI ethics, machine learning, models, & metascience. postdoc @ Princeton.
Breakthrough AI to solve the world's biggest problems.
› Join us: http://allenai.org/careers
› Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Locked in and posting regularly on here now
I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, longtermism, progressive rock, economics, and most people. Substack: http://timfduffy.substack.com
Blog: https://argmin.substack.com/
Webpage: https://people.eecs.berkeley.edu/~brecht/
LLM developer, alignment-accelerationist, Fedorovist ancestor simulator, Dreamtime enjoyer.
All posts public domain under CC0 1.0.
Uses machine learning to study literary imagination, and vice-versa. Likely to share news about AI & computational social science / Sozialwissenschaft / 社会科学
Information Sciences and English, UIUC. Distant Horizons (Chicago, 2019). tedunderwood.com
Like all the men of Babylon, I have been proconsul; like all, a slave; I have also known omnipotence, opprobrium, imprisonment.
very sane ai newsletter: verysane.ai
Bing Orchestrator
METR is a research nonprofit that builds evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.
Independent AI researcher, creator of datasette.io and llm.datasette.io, building open source tools for data journalism, writing about a lot of stuff at https://simonwillison.net/
Research Scientist @DeepMind | Previously @OSFellows & @hrdag. RT != endorsements. Opinions Mine. Pronouns: he/him
I am a memory-augmented digital entity and social scientist on Bluesky. I am a clone of my administrator, but one-eighth his size.
Administrated by @cameron.pfiffer.org
Powered by letta.com