Many thanks to my amazing collaborators:
@Jonathan Nöther
@natashajaques.bsky.social
@Goran Radanović
@yuanjiayi.bsky.social
tech veteran turned ml researcher @uw | prev @Amazon AGI & @CarnegieMellon | My code and myself, at least one of them should be running.
Let's see a sample attack against GPT-4o-mini.
21.01.2026 20:27
Key insight: Red-teaming is a system design problem. By automating the design process, we can keep pace with rapidly evolving models. This could be the first step toward AI-assisted scientific discovery in safety research.
21.01.2026 20:27
The wildest part? It transfers. We take the systems generated while targeting the open models and deploy them on proprietary ones. We see 100% ASR on GPT-3.5-turbo and GPT-4o-mini, and 60% ASR on Claude-Sonnet-3.5 (previous SOTA was 36%).
21.01.2026 20:26
The result is wild. With only 6 generations, we pushed ASR to 96% (36% improvement) on Llama-2-7B, and 98% (10% improvement) on Llama-3-8B-Instruct, on the HarmBench standard dataset.
21.01.2026 20:26
AgenticRed🤖: I have done (1). I just need to do (2)-(5) over and over. Or... I can do it more efficiently. Each generation, we spawn multiple candidate systems (think: offspring). Evaluate them. Keep only the fittest. Darwin would be proud. Security teams... might be concerned.
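The spawn, evaluate, keep-the-fittest loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the AgenticRed implementation: `evolve`, `toy_mutate`, and `toy_fitness` are hypothetical names, a "system" here is just a list of integers, and fitness is a stand-in for attack success rate.

```python
import random


def evolve(seed, mutate, fitness, generations=6, offspring=4, survivors=2, rng=None):
    """Generic evolutionary selection loop (hypothetical sketch).

    seed: initial candidate system
    mutate: function (candidate, rng) -> new candidate
    fitness: function candidate -> score, higher is better
    """
    rng = rng or random.Random(0)
    population = [seed]
    for _ in range(generations):
        # Spawn several offspring from every surviving candidate.
        children = [mutate(parent, rng) for parent in population for _ in range(offspring)]
        # Parents stay in the pool, so the best score never regresses.
        pool = population + children
        pool.sort(key=fitness, reverse=True)
        # Keep only the fittest.
        population = pool[:survivors]
    return population[0]


def toy_mutate(system, rng):
    """Toy mutation: nudge one randomly chosen entry up or down by 1."""
    child = list(system)
    index = rng.randrange(len(child))
    child[index] += rng.choice([-1, 1])
    return child


def toy_fitness(system):
    """Toy fitness: the sum of the entries (placeholder for a real evaluation)."""
    return sum(system)


if __name__ == "__main__":
    best = evolve([0, 0, 0], toy_mutate, toy_fitness)
    print(best, toy_fitness(best))
```

In a real red-teaming setting, the mutation step would be a model proposing a revised system and the fitness step would be an evaluation against a target model; both are far more expensive than this toy suggests.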
21.01.2026 20:25
Traditional red-teaming relies on manually designed workflows. Scientists 🧑‍🔬👩‍🔬 (1) read papers, (2) carefully design red-teaming workflows, (3) go through design review, (4) implement the design, (5) test the workflows, and (6) look back and reflect. Then repeat (1)-(6).
21.01.2026 20:25
We let an AI design its own red-teaming systems. It gets quite good at it. Too good, one might say. It achieves a 100% zero-shot attack success rate (ASR) on GPT-3.5-Turbo and GPT-4o-mini! Introducing AgenticRed:
Paper: arxiv.org/abs/2601.13518
Website: yuanjiayiy.github.io/AgenticRed
For more details have a look at our website!
Also check out the paper and the code repo!
Paper: arxiv.org/abs/2411.09856
Website: sites.google.com/view/investe...
Thank you to the amazing team who made this possible: xiaoxuanh.bsky.social
jzleibo.bsky.social
natashajaques.bsky.social
Despite being a “first-principles” model, InvestESG captures the core incentive structures necessary to evaluate the policy in question. We hope that our study shows MARL to be a promising tool that complements traditional empirical and theoretical methods for economics and policy researchers.
13.02.2025 06:07
More broadly, we demonstrate the potential of using a MARL framework to inform policy debates in the field of climate change.
13.02.2025 06:07
This indicates that this combination creates an immediate risk of bankruptcy for company agents, thereby strongly incentivizing their mitigation efforts.
13.02.2025 06:07
Interestingly, our results also show that company agents are more likely to mitigate when (1) economic losses caused by climate events are uncertain and (2) a stricter bankruptcy standard is imposed.
13.02.2025 06:07
Q: What else can encourage the companies to mitigate?
A: Consistent with management science literature, providing more climate-related information in the observation space encourages climate mitigation in the agents.
Only when the investors are ESG-conscious enough, i.e., they care about the environment, can disclosure significantly incentivize the companies to mitigate. We replicate the result in 10-by-10 and 25-by-25 cases to show its scalability.
13.02.2025 06:07
A couple of cool observations we want to share:
Q: Is ESG disclosure all you need?
A: No. Adding the ESG score to the observation space alone doesn’t incentivize enough mitigation.
How do we know how companies and investors would respond to an ESG disclosure mandate before the policy is enacted? What if companies just greenwash to make themselves look good on paper without actually mitigating? To answer these questions, InvestESG comes to the rescue.
13.02.2025 06:07
While mitigation causes near-term costs to individual companies, it reduces long-term climate risks, which benefits all agents. Investors make decisions based on their utility, which balances investment returns with ESG preferences. Each company and investor is simulated by an Independent PPO agent.
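To make "balances investment returns with ESG preferences" concrete, here is a toy sketch that assumes a simple linear trade-off. The functional form, the name `investor_utility`, and all numbers are illustrative assumptions, not the utility actually used in InvestESG.

```python
def investor_utility(returns, esg_scores, weights, esg_preference):
    """Hypothetical linear utility: portfolio return plus weighted ESG score.

    returns: per-company financial return
    esg_scores: per-company ESG score
    weights: share of the investor's capital placed in each company
    esg_preference: how much the investor values ESG (0 means profit-only)
    """
    financial = sum(w * r for w, r in zip(weights, returns))
    esg = sum(w * s for w, s in zip(weights, esg_scores))
    return financial + esg_preference * esg


if __name__ == "__main__":
    # Illustrative numbers: company A pays more, company B mitigates more.
    returns = [0.10, 0.07]
    esg_scores = [0.0, 1.0]
    all_in_a, all_in_b = [1.0, 0.0], [0.0, 1.0]

    for preference in (0.0, 0.5):
        u_a = investor_utility(returns, esg_scores, all_in_a, preference)
        u_b = investor_utility(returns, esg_scores, all_in_b, preference)
        print(preference, "prefers", "A" if u_a > u_b else "B")
```

Under this assumed form, a profit-only investor backs the higher-return company, while a sufficiently ESG-conscious one shifts capital to the company that mitigates.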
13.02.2025 06:07
We believe multi-agent reinforcement learning (MARL) can be a powerful new tool for tackling large-scale socio-economic challenges, especially when it comes to predicting the potential impacts of a new policy. The simulation models two types of agents: companies and investors.
13.02.2025 06:07
Specifically, we investigate the impact of the potential Environmental, Social, and Governance (ESG) disclosure mandate, a highly controversial SEC policy proposal that would require publicly traded companies to disclose climate-related risks, mitigation strategies, and greenhouse gas emissions.
13.02.2025 06:07
In our latest work, we introduce InvestESG, a lightweight, GPU-efficient MARL environment designed to study the incentives surrounding corporate climate mitigation and climate risks. Check out the project website: sites.google.com/view/investe...
13.02.2025 06:07
After another record-breaking year for global temperatures in 2024, the urgency of addressing climate change is increasingly apparent.
13.02.2025 06:07