The paper "Mitigating goal misgeneralization via minimax regret" will appear at @rl-conference.bsky.social!
Joint work with the great Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger and @michaelddennis.bsky.social
www.arxiv.org/pdf/2507.03068
08.07.2025 17:16
Future work we are excited about:
• Improving UED algorithms to get closer to the results predicted by our theory
• Mitigating the fully ambiguous case by focusing on the inductive biases of the agent
08.07.2025 17:16
We also visualize the performance of our agents in a maze for each possible location of the goal in the environment.
The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
08.07.2025 17:16
We complement our theoretical findings with empirical results, which support our theory: agents trained via minimax regret generalize better.
Left: performance at test time
Right: % of distinguishing levels played by the respective level designer
08.07.2025 17:16
We also show that when the deployment environments lie in the support of the training level distribution, a policy that is optimal with respect to the minimax regret objective is provably robust against goal misgeneralization!
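For reference, the minimax regret objective referred to here can be written as follows (generic notation for illustration; the paper's own definitions may differ):
  \mathrm{Regret}(\pi, \theta) = \max_{\pi'} V_\theta(\pi') - V_\theta(\pi)  % regret of policy \pi on level \theta: gap to the best attainable return on that level
  \pi^{\mathrm{MR}} \in \arg\min_{\pi} \max_{\theta \in \Theta} \mathrm{Regret}(\pi, \theta)  % minimize the worst-case regret over the level set \Theta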
08.07.2025 17:16
We first formally show that a policy maximizing expected value may suffer from goal misgeneralization if distinguishing levels are rare.
08.07.2025 17:16
Goal misgeneralization can occur when training only on non-distinguishing levels, as shown in Langosco et al., 2022.
Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
08.07.2025 17:16
Goal misgeneralization arises due to the presence of "proxy goals". We formalize this and characterize environments as either:
• Non-distinguishing: the true and proxy rewards can induce the same behaviour
• Distinguishing: the true and proxy rewards induce different behaviour (see the toy sketch below)
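As a toy illustration of this split (the Level class and the goal/coin encoding are made up for the example, not the paper's environments):

# Hypothetical toy representation of a maze level where the proxy reward is
# "reach the coin" and the true reward is "reach the goal cell".
from dataclasses import dataclass

@dataclass
class Level:
    goal_pos: tuple[int, int]  # cell rewarded by the true (intended) reward
    coin_pos: tuple[int, int]  # cell rewarded by the proxy reward

def is_distinguishing(level: Level) -> bool:
    # If the coin sits on the goal cell, optimal behaviour under the true and
    # proxy rewards coincides, so this level cannot tell the two apart.
    return level.goal_pos != level.coin_pos

levels = [Level((3, 3), (3, 3)), Level((3, 3), (0, 1))]
print([is_distinguishing(lvl) for lvl in levels])  # [False, True]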
08.07.2025 17:16
We propose using regret, the difference between the optimal agent's return and our current policy's return, as a training objective.
Minimizing it will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
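A minimal sketch of that objective (optimal_return and rollout_return are placeholder functions, not the paper's code):

def regret(policy, level, optimal_return, rollout_return):
    # Regret on a level: the gap between the best achievable return on that
    # level and the return our current policy actually obtains there.
    return optimal_return(level) - rollout_return(policy, level)

def worst_case_regret(policy, levels, optimal_return, rollout_return):
    # Minimax regret training minimizes this worst case, so a rare
    # distinguishing level with large regret still dominates the objective,
    # unlike an expected-value average that barely notices it.
    return max(regret(policy, lvl, optimal_return, rollout_return) for lvl in levels)

# Toy usage with made-up returns: level "B" is a rare high-regret level.
levels = ["A", "B"]
opt = lambda lvl: 1.0
roll = lambda policy, lvl: 1.0 if lvl == "A" else 0.0
print(worst_case_regret(None, levels, opt, roll))  # 1.0, driven entirely by "B"

In UED-style training, a level designer approximately plays the inner max by proposing high-regret levels, while the agent plays the outer min.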
08.07.2025 17:16
*New Paper*
Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.
We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
08.07.2025 17:16
Cooperative AI
CAIF's new and massive report on multi-agent AI risks will be a really useful resource for the field
www.cooperativeai.com/post/new-rep...
21.02.2025 14:24
what if…
21.02.2025 04:31
A large group of us (spearheaded by Denizalp Goktas) have put out a position paper on paths towards foundation models for strategic decision-making. Language models still lack these capabilities, so we'll need to build them: hal.science/hal-04925309...
18.02.2025 18:33
You take the lavalamp output, and Alice and Bob do the dot product of it with their respective number and then apply mod 2 to the result. They then communicate the bit they obtained (1=wave, 0=wink), and this operation always returns the same number to both if a=b, or otherwise fails with p=1/2?
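If I'm reading that right, it's the public-randomness equality test; a quick sketch under those assumptions (variable names mine):

import random

def equality_round(a_bits, b_bits, shared_random_bits):
    # Alice and Bob each take the inner product of their own bit string with
    # the shared random string (the "lavalamp output"), reduce it mod 2, and
    # exchange the resulting single bits.
    alice_bit = sum(x * r for x, r in zip(a_bits, shared_random_bits)) % 2
    bob_bit = sum(x * r for x, r in zip(b_bits, shared_random_bits)) % 2
    return alice_bit == bob_bit

# If a == b the bits always agree; if a != b they disagree with probability 1/2,
# so k independent rounds miss a mismatch with probability (1/2)**k.
a = [1, 0, 1, 1]
b = [1, 0, 0, 1]
r = [random.randint(0, 1) for _ in a]
print(equality_round(a, b, r))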
17.02.2025 06:30
Cooperative AI
The 2025 Cooperative AI summer school (9-13 July 2025 near London) is now accepting applications, due March 7th!
www.cooperativeai.com/summer-schoo...
09.01.2025 19:25
The magic thing humans do is solve tasks pretty well under high uncertainty about the problem specification. We are also frequently capable of doing this collaboratively. I still do not see evidence that models can do any part of this.
21.12.2024 01:08
I will be at @neuripsconf.bsky.social this week!
Would love to chat about Multi-agent systems, RL, Human-AI Alignment, or anything interesting :)
I'm also applying for PhD programs this cycle, feel free to reach out for any advice!
More about me: karim-abdel.github.io
08.12.2024 23:59
I give you a loaded coin, with some (unknown) probability 0<p<1 of landing Heads, and I ask you to generate a fair coin toss.
Great! We know how to do this! This is the Von Neumann trick: toss twice. If HH or TT, repeat; if HT or TH, return the first.
Problem solved? Not quite... This can be bad!
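For reference, a quick sketch of that trick (biased_flip stands in for the loaded coin; True means Heads):

import random

def biased_flip(p):
    # The loaded coin: Heads (True) with probability p, unknown to the extractor.
    return random.random() < p

def von_neumann_fair_flip(p):
    # Toss twice; on HH or TT try again; on HT or TH return the first toss.
    # P(HT) == P(TH) == p * (1 - p), so the returned bit is exactly fair.
    while True:
        first, second = biased_flip(p), biased_flip(p)
        if first != second:
            return first

# One way this can be bad: each attempt succeeds with probability 2*p*(1-p),
# so the expected number of tosses, 1/(p*(1-p)), blows up as p approaches 0 or 1.
print(von_neumann_fair_flip(0.9))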
18.11.2024 20:50
Here's some cool work taking a first step towards that in Minecraft using MCTS: Scalably Solving Assistance Games - openreview.net/pdf/080f0c69...
19.11.2024 15:26
Very cool work! I think an important challenge is to scale assistance games to settings where the goal/action/communication space can be 'large', so as to capture the real-world scenarios where we will actually want to apply CIRL.
19.11.2024 15:26
AI PhD student at Berkeley
alyd.github.io
Studying multi-agent collaboration
PhD Candidate at Princeton CS with Tom Griffiths & Natalia Vélez @cocoscilab.bsky.social @velezcolab.bsky.social
Prev: Cornell CS, MIT BCS
AI Research Engineer working on AI Safety and Alignment | formerly OpenAI, Waymo, DeepMind, Google. Father, photographer, Zen practitioner.
The world's leading venue for collaborative research in theoretical computer science. Follow us at http://YouTube.com/SimonsInstitute.
25th International Conference on Autonomous Agents and Multiagent Systems
May 25-29, 2026
Paphos, Cyprus
https://cyprusconferences.org/aamas2026
professor of EECS at MIT, currently visiting IAS. working in theoretical computer science, namely algorithm design, complexity theory, circuit complexity, etc.
i'll let you know when P != NP is proved (and when it's not)
Autonomous Agents | PhD @ Princeton | World Gen @ Waymo | Prev: CMU, Amazon | NSF GRFP Fellow
friendly deep sea dweller
PhD at Machine Learning Department, Carnegie Mellon University | Interactive Decision Making | https://yudasong.github.io
research @ Google DeepMind
Gemini Post-Training @ Google DeepMind
Previously: ETH Zurich, Cambridge, CERN
alizeepace.com
PhD at NYU studying reasoning, decision-making, and open-endedness
alum of MIT | prev: Google, MSR, MIT CoCoSci
https://upiterbarg.github.io/
Reinforcement learning, but without rewards.
Postdoc at the Technion. PhD from Politecnico di Milano.
https://muttimirco.github.io
Assistant professor of computer science at Bocconi University | https://andcelli.github.io/
Multi-Agent Researcher at CAIF | applied research at IQT | Thinking about making MA systems go well
CS PhD Student @University of Washington, CSxPhilosophy @Dartmouth College
Interested in MARL, Social Reasoning, and Collective Decision making in people, machines, and other organisms
kjha02.github.io
Postdoc @csail.mit.edu, Ph.D. from @scai-asu.bsky.social
Working on AI Safety, AI Assessment, Automated Planning, Interpretability, Robotics
Previously: Masters from IITGuwahati, Research Intern at MetaAI
https://pulkitverma.net
PhD student @Berkeley_AI
reinforcement learning, AI, robotics
PhD Student at UC San Diego | LLM Agents, Reinforcement Learning, Human-AI Collaboration, Multi-Agent Systems
Associate Professor at Northeastern University and father of 3. Interests include artificial intelligence, reinforcement learning, and robotics (he/him).