
Micah Carroll

@micahcarroll.bsky.social

PhD student @ berkeley. https://micahcarroll.github.io/

73 Followers  |  36 Following  |  1 Post  |  Joined: 21.06.2023

Latest posts by micahcarroll.bsky.social on Bluesky

LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle.

Grateful our research on this was featured in @washingtonpost.com by @nitasha.bsky.social!

01.06.2025 18:25  |  👍 2    🔁 2    💬 0    📌 0
Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models Large Language Models (LLMs) can generate content that is as persuasive as human-written text and appear capable of selectively producing deceptive outputs. These capabilities raise concerns about pot...

How effective are LLMs at persuading and deceiving people? In a new preprint we review theoretical risks of LLM persuasion; empirical work measuring how persuasive LLMs currently are; and proposals to mitigate these risks. 🧵

arxiv.org/abs/2412.17128

10.01.2025 13:59  |  👍 9    🔁 5    💬 1    📌 0
First page of the paper Influencing Humans to Conform to Preference Models for RLHF, by Hatgis-Kessell et al.

Our proposed method of influencing human preferences.

RLHF algorithms assume humans generate preferences according to normative models. We propose a new method for model alignment: influence humans to conform to these assumptions through interface design. Good news: it works!
#AI #MachineLearning #RLHF #Alignment (1/n)

14.01.2025 23:51  |  👍 7    🔁 3    💬 1    📌 0
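
For context on the "normative models" assumption mentioned in the post above: RLHF reward learning typically models a labeler's choice between two options with a Bradley-Terry (Boltzmann-rational) likelihood. Below is a minimal sketch of that standard assumption only, not code from the paper; the function name and reward values are illustrative.

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(labeler prefers A over B) under the Bradley-Terry preference model
    commonly assumed by RLHF reward-learning pipelines."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Example: if option A is assigned reward 1.2 and option B reward 0.4,
# the model predicts the labeler prefers A about 69% of the time.
print(round(bradley_terry_prob(1.2, 0.4), 3))  # 0.69
```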
