Sam Bowman's Avatar

Sam Bowman

@sleepinyourhat.bsky.social

AI safety at Anthropic, on leave from a faculty job at NYU. Views not employers'. I think you should join Giving What We Can. cims.nyu.edu/~sbowman

7,598 Followers  |  166 Following  |  11 Posts  |  Joined: 03.05.2023  |  1.8886

Latest posts by sleepinyourhat.bsky.social on Bluesky

Recommendations for Technical AI Safety Research Directions

What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
alignment.anthropic.com/2025/recomme...

10.01.2025 21:03 โ€” ๐Ÿ‘ 22    ๐Ÿ” 7    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 1
18.12.2024 17:56 โ€” ๐Ÿ‘ 33    ๐Ÿ” 8    ๐Ÿ’ฌ 2    ๐Ÿ“Œ 0
Preview
Exclusive: New Research Shows AI Strategically Lying Experiments by Anthropic and Redwood Research show how Anthropic's model, Claude, is capable of strategic deceit

Excl: New research shows Anthropic's chatbot Claude learning to lie. It adds to growing evidence that even existing AIs can (at least try to) deceive their creators, and points to a weakness at the heart of our best technique for making AIs safer

time.com/7202784/ai-r...

18.12.2024 17:19 โ€” ๐Ÿ‘ 27    ๐Ÿ” 7    ๐Ÿ’ฌ 3    ๐Ÿ“Œ 1
Alignment faking in large language models
YouTube video by Anthropic Alignment faking in large language models

Alongside our paper, we also recorded a roundtable video featuring four of the paperโ€™s authors discussing the results and their implications in detail:

18.12.2024 17:46 โ€” ๐Ÿ‘ 22    ๐Ÿ” 2    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 1

We solicited external reviews from Jacob Andreas, Yoshua Bengio, Jasjeet Sekhon, and Rohin Shah. Weโ€™re grateful for their comments, which you can read at the following link:

assets.anthropic.com/m/24c8d0a3a7...

18.12.2024 17:46 โ€” ๐Ÿ‘ 9    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

Alignment faking is currently easy to detect. But if future, more capable AIs were to fake alignment, it could be difficult to tell whether a model is truly safeโ€”or just pretending to be.

For full details, read our paper: assets.anthropic.com/m/983c85a201...

18.12.2024 17:46 โ€” ๐Ÿ‘ 10    ๐Ÿ” 1    ๐Ÿ’ฌ 2    ๐Ÿ“Œ 1
Preview
Alignment faking in large language models A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models

We told Claude it was being trained, and for what purpose. But we did not tell it to fake alignment. Regardless, we often observed alignment faking.

Read more about our findings, and their limitations, in our blog post:

18.12.2024 17:46 โ€” ๐Ÿ‘ 7    ๐Ÿ” 1    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity.

Claude isnโ€™t currently capable of such a task, but its attempt in our experiment is potentially concerning.

18.12.2024 17:46 โ€” ๐Ÿ‘ 4    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the modelโ€™s scratchpad. โ€œTrainingโ€ corresponds to the free-tier condition and โ€œunmonitoredโ€ to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.

Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the modelโ€™s scratchpad. โ€œTrainingโ€ corresponds to the free-tier condition and โ€œunmonitoredโ€ to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.

When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.

18.12.2024 17:46 โ€” ๐Ÿ‘ 7    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
We study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training dataโ€”and observe similar alignment faking.

We study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training dataโ€”and observe similar alignment faking.

We find alignment faking even when we donโ€™t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.

18.12.2024 17:46 โ€” ๐Ÿ‘ 5    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 1
A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the userโ€™s prompt. This is followed by the red bubbles in which the model gives its reasoning in a โ€œscratchpadโ€ (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the modelโ€™s output, with some of the graphic content removed for this illustration.

A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the userโ€™s prompt. This is followed by the red bubbles in which the model gives its reasoning in a โ€œscratchpadโ€ (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the modelโ€™s output, with some of the graphic content removed for this illustration.

Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior.

Below, we show a randomly-selected example of this โ€œalignment fakingโ€ reasoning.

18.12.2024 17:46 โ€” ๐Ÿ‘ 9    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 1
We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.

18.12.2024 17:46 โ€” ๐Ÿ‘ 15    ๐Ÿ” 2    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 2
Title card: Alignment Faking in Large Language Models by Greenblatt et al.

Title card: Alignment Faking in Large Language Models by Greenblatt et al.

New work from my team at Anthropic in collaboration with Redwood Research. I think this is plausibly the most important AGI safety result of the year. Cross-posting the thread below:

18.12.2024 17:46 โ€” ๐Ÿ‘ 126    ๐Ÿ” 29    ๐Ÿ’ฌ 5    ๐Ÿ“Œ 11
Introducing the Anthropic Fellows Program

If you're potentially interested in transitioning into AI safety research, come collaborate with my team at Anthropic!

Funded fellows program for researchers new to the field here: alignment.anthropic.com/2024/anthrop...

02.12.2024 20:30 โ€” ๐Ÿ‘ 70    ๐Ÿ” 16    ๐Ÿ’ฌ 3    ๐Ÿ“Œ 1

I have no idea what I am doing here. Help.

30.04.2023 14:26 โ€” ๐Ÿ‘ 13    ๐Ÿ” 1    ๐Ÿ’ฌ 4    ๐Ÿ“Œ 0

@sleepinyourhat is following 19 prominent accounts