Excited to share some technical details about our approach to scheming and deceptive alignment as outlined in Google's Frontier Safety Framework!
(1) current models are not yet capable of realistic scheming
(2) CoT monitoring is a promising mitigation for future scheming
08.07.2025 13:35 β π 1 π 0 π¬ 0 π 0
MATS Program
View our full paper here: arxiv.org/abs/2507.02737
View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/
Github: github.com/arturzolkow...
Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social⬠during matsprogram.org
04.07.2025 15:34 β π 2 π 0 π¬ 0 π 0
Overall, current models have some steganographic capabilities but likely aren't strong enough to bypass well-designed monitors. Nevertheless, we believe itβs important to continue measuring these capabilities to determine when additional mitigations are needed.
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
We also create case studies implementing situations where models with strong steganographic capabilities could take unwanted actions and bypass a monitor. Current models rarely succeed in these environments unless they are given strong tips about what to do.
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
Models can also reason through simple problems using a number of common encoding schemes. But performance quickly degrades as the problems get more difficult.
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
While models understand they need to conceal their reasoning, they have difficulty actually hiding that reasoning in a plausible way. As an example, we show a modelβs reasoning attempt when asked a question on how to install solar panels. [bold text and highlighting added by us]
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
In our encoded reasoning evals, we find that models can sometimes leverage encoded reasoning to perform better than no reasoning baselines on simple reasoning tasks. But encodings that help tend not to be very subtle and are regularly detected by a monitor.
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
Models are able to perform significantly better when the sender and receiver are able to coordinate on the steganographic encoding method. But this mainly applies when the model has access to an unmonitored CoT. With no CoT, performance remains extremely poor.
04.07.2025 15:34 β π 1 π 0 π¬ 1 π 0
When they succeed, models sometimes leverage sophisticated strategies to pass messages, such as encoding information in the lengths of the sentences they generate! We found this example difficult to decode ourselves. [formatting and word counts added by us]
04.07.2025 15:34 β π 1 π 0 π¬ 1 π 0
In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier modelsβ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task.
Read more: arxiv.org/abs/2507.02737
04.07.2025 15:34 β π 0 π 0 π¬ 1 π 0
Can frontier models hide secret information and reasoning in their outputs?
We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. π§΅
04.07.2025 15:34 β π 6 π 1 π¬ 1 π 0
Super excited this giant paper outlining our technical approach to AGI safety and security is finally out!
No time to read 145 pages? Check out the 10 page extended abstract at the beginning of the paper
04.04.2025 14:27 β π 5 π 3 π¬ 1 π 0
Introducing our short course on AGI safety
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The courseβ¦
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical / governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
14.02.2025 16:06 β π 18 π 5 π¬ 0 π 0
Check out this great post by Rohin Shah about our team and what weβre looking for in candidates: www.alignmentforum.org/posts/wqz5CR...
In particular: we're looking for strong ML researchers and engineers and you do not need to be an AGI safety expert
18.02.2025 16:43 β π 1 π 0 π¬ 0 π 0
Want to join one of the best AI safety teams in the world?
We're hiring at Google DeepMind! We have open positions for research engineers and research scientists in the AGI Safety & Alignment and Gemini Safety teams.
Locations: London, Zurich, New York, Mountain View and SF
10.02.2025 18:55 β π 13 π 2 π¬ 1 π 1
By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk.
Our new paper mitigates this, keeps the ability for long-term planning, and doesnt assume you can detect the undermining strategy. π
23.01.2025 15:47 β π 12 π 1 π¬ 0 π 0
Weβll do a lot more work studying ways to build safer AI systems and the safety-performance tradeoffs involved. If thatβs something youβd like to work on, do get in touch via DM!
23.01.2025 15:33 β π 3 π 0 π¬ 0 π 0
Thanks to @sebfar.bsky.social for exceptional leadership of the project and the rest of the team: Vikrant Varma, David Elson, Caleb Biddulph, @ian-goodfellow.bsky.social, and Rohin Shah for making this work possible!
23.01.2025 15:33 β π 5 π 0 π¬ 1 π 0
Want to dive deeper? Check out our paper and our blog posts explaining the work in more detail
π Paper: arxiv.org/abs/2501.13011
π‘ Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
βοΈ Technical safety post: www.alignmentforum.org/posts/zWySW...
23.01.2025 15:33 β π 7 π 1 π¬ 1 π 0
MONA is promising β but itβs not a silver bullet. It doesnβt help with reward hacking within a single step, and it can reduce performance, especially when the reward is well specified. Nonetheless, we think MONA is an important tool for building safer LLM agents.
23.01.2025 15:33 β π 3 π 0 π¬ 1 π 0
Then we train agents to make loan decisions but penalize them for using the applicant's nationality. Because nationality improves the decision reward, ordinary RL agents learn to use it secretly. MONA agents avoid the hack β they make good decisions without using nationality!
23.01.2025 15:33 β π 4 π 0 π¬ 1 π 0
First, we train agents to solve coding problems but also write tests to evaluate their solutions. Ordinary RL agents write tests that are trivial to pass with incorrect code β a reward hack that leads to low performance on held-out tests. MONA agents solve the problems correctly!
23.01.2025 15:33 β π 5 π 0 π¬ 1 π 0
Does this work? Yes! We test MONA in two LLM agent environments and find that MONA agents learn to perform well when agents trained with ordinary reinforcement learning reward hack.
23.01.2025 15:33 β π 3 π 0 π¬ 1 π 0
What happens if the agent tries a multi step reward hack that the overseer canβt detect? On the first step (before the hack is complete), the overseer doesnβt know why the step is valuable β so she doesnβt provide a high reward. So the first step isnβt incentivized by MONA.
23.01.2025 15:33 β π 4 π 0 π¬ 1 π 0
Key idea: Use RL training only for short horizons (myopic optimization), but have an overseer evaluate how good actions are for the long term (non-myopic approval).
Best of both worlds: we get human-understandable plans (safe!) and long-term planning (performant!)
23.01.2025 15:33 β π 7 π 0 π¬ 1 π 0
New Google DeepMind safety paper! LLM agents are coming β how do we stop them finding complex plans to hack the reward?
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization but better performance β details inπ§΅
23.01.2025 15:33 β π 40 π 8 π¬ 1 π 5
Stop by SoLaR workshop at NeurIPS today to see Kai present the paper!
14.12.2024 15:36 β π 3 π 0 π¬ 0 π 0
MISR eval reveals frontier language agents can self-reason to modify their own settings & use tools, but they fail at harder tests involving social reasoning and knowledge seeking. Opaque reasoning remains difficult. arxiv.org/abs/2412.03904
10.12.2024 11:16 β π 10 π 1 π¬ 0 π 0
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute. Views are my own and do not represent GDM or FLI.
Assistant professor of computer science at ETH ZΓΌrich. Interested in Security, Privacy and Machine Learning.
https://floriantramer.com
https://spylab.ai
RE working on let's-not-all-die at Google Deepmind
BSc Physics from @RWTH
10% pledgee @ http://givingwhatwecan.org
Researching planning, reasoning, and RL in LLMs @ Reflection AI. Previously: Google DeepMind, UC Berkeley, MIT. I post about: AI π€, flowers π·, parenting πΆ, public transit π. She/her.
http://www.jesshamrick.com
sentio ergo sum. developing the science of evals at METR. prev NYU, cohere
Researcher (OpenAI. Ex: DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian.
Anon feedback: https://admonymous.co/giffmana
π ZΓΌrich, Suisse π http://lucasb.eyer.be
Founder & executive & community builder & organizer & researcher
ML Collective (mlcollective.org)
Google DeepMind
rosanneliu.com
I work on AI at OpenAI.
Former VP AI and Distinguished Scientist at Microsoft.
METR is a research nonprofit that builds evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.
Professor and Head of Machine Learning Department at Carnegie Mellon. Board member OpenAI. Chief Technical Advisor Gray Swan AI. Chief Expert Bosch Research.
AI @ OpenAI, Tesla, Stanford
CEO of Quora, working on @poe.com
Working towards the safe development of AI for the benefit of all at UniversitΓ© de MontrΓ©al, LawZero and Mila.
A.M. Turing Award Recipient and most-cited AI researcher.
https://lawzero.org/en
https://yoshuabengio.org/profile/
Professor at Wharton, studying AI and its implications for education, entrepreneurship, and work. Author of Co-Intelligence.
Book: https://a.co/d/bC2kSj1
Substack: https://www.oneusefulthing.org/
Web: https://mgmt.wharton.upenn.edu/profile/emollick