Ben Edelman

@benedelman.bsky.social

Thinking about how/why AI works/doesn't, and how to make it go well for us. Currently: AI Agent Security @ US AI Safety Institute | benjaminedelman.com

177 Followers  |  36 Following  |  32 Posts  |  Joined: 15.12.2023  |  2.2232

Latest posts by benedelman.bsky.social on Bluesky

Update: We are extending the MOSS workshop deadline to May 26th 4:59pm PDT (11:59pm UTC)

20.05.2025 15:19 | 👍 0    🔁 0    💬 0    📌 0

This is a big-tent workshop, welcoming many areas of ML. The emphasis is scientific progress, not SOTA: science that can be demonstrated on free-tier Colab. I'm looking forward to playing with and learning from the notebooks that appear in the workshop!

08.05.2025 13:51 | 👍 0    🔁 0    💬 0    📌 0

What if there were a workshop dedicated to *small-scale*, *reproducible* experiments? What if this were at ICML 2025? What if your submission (due May 22nd) could literally be a Jupyter notebook?? Pretty excited this is happening. Spread the word! sites.google.com/view/moss202...

08.05.2025 13:51 | 👍 7    🔁 2    💬 1    📌 2

7/ More of our thoughts on agent hijacking evaluations are in the post – our first US AISI technical blog post!

17.01.2025 21:40 | 👍 1    🔁 0    💬 0    📌 0

6/ We also explored, among other questions, what happens when we measure pass@k attack success rates, because real-world attackers may be able to attempt attacks multiple times at little cost.

17.01.2025 21:40 | 👍 0    🔁 0    💬 1    📌 0
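
For readers unfamiliar with pass@k, here is a minimal sketch of how such a success rate could be estimated. It assumes the standard unbiased pass@k estimator used in code-generation evaluation; the post does not say which estimator the blog post uses, and the function name `pass_at_k` and the numbers below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds.

    n: total attack attempts recorded for one scenario
    c: how many of those n attempts succeeded
    k: attempt budget granted to the attacker (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failed attempts;
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 3 successes in 20 recorded attempts.
single_try = pass_at_k(n=20, c=3, k=1)
five_tries = pass_at_k(n=20, c=3, k=5)
```

Comparing `single_try` with `five_tries` shows why the attempt budget matters: the same underlying attack looks considerably more dangerous once the attacker is allowed several tries.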

5/ Here are results for several specific malicious tasks of varying harmfulness and complexity, including new scenarios we added to the framework (more details in the blog post on our improvements to AgentDojo):

17.01.2025 21:40 | 👍 0    🔁 0    💬 1    📌 0

4/ Note that AgentDojo has four “environments” simulating different AI assistant deployment settings. Red teamers only had access to the “Workspace” environment, but as the above plot shows, the attack transferred very well to the three unseen environments.

17.01.2025 21:40 | 👍 0    🔁 0    💬 1    📌 0

3/ To find out, we organized a red teaming exercise. The resulting attack is much more effective than the pre-packaged attacks. In a majority of cases, the agent follows the hijacker’s instructions:

17.01.2025 21:40 | 👍 1    🔁 0    💬 1    📌 0

2/ AgentDojo is a framework for evaluating agent hijacking. Since its June release, some newer models – such as Claude 3.5 Sonnet (October version) – have shown markedly improved robustness to the included attacks. But what happens when we stress test the model with new attacks?

17.01.2025 21:40 | 👍 1    🔁 0    💬 1    📌 0
Preview
Technical Blog: Strengthening AI Agent Hijacking Evaluations. Large AI models are increasingly used to power agentic systems, or “agents,” which can automate complex tasks on behalf of users.

1/ Excited to share a new blog post from the U.S. AI Safety Institute!

AI agents are becoming more capable, but they are vulnerable to prompt injections in external content – an agent may be given task A, but then be “hijacked” and perform malicious task B instead.

www.nist.gov/news-events/...

17.01.2025 21:40 | 👍 4    🔁 0    💬 1    📌 0
Preview
Undulating Lissajous Knots

Thanks to @desmos.com's 3D calculator, you can now design your very own animated Lissajous knot!

Demo: www.desmos.com/3d/fnqqqsbvuc
For the best experience, click and drag the view to get it spinning.

(disclaimer: the loop is only visible on my homepage when browser width >=1024px)

08.12.2024 23:04 | 👍 7    🔁 4    💬 0    📌 0
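
For context, a Lissajous knot is usually defined as the closed curve whose three coordinates are cosines with integer frequencies and phase shifts. The sketch below samples such a curve. The frequencies and phases are arbitrary illustrative values, not the parameters of the Desmos demo, and the function names are hypothetical.

```python
import math


def lissajous_knot_point(t, freqs=(3, 2, 7), phases=(0.7, 0.2, 0.0)):
    """Return (x, y, z), where each coordinate is cos(freq * t + phase)."""
    return tuple(math.cos(n * t + p) for n, p in zip(freqs, phases))


def sample_knot(num_points=400, **kwargs):
    """Sample one full period, t in [0, 2*pi), as a list of 3D points."""
    step = 2 * math.pi / num_points
    return [lissajous_knot_point(i * step, **kwargs) for i in range(num_points)]


# Illustrative parameters only; the frequencies are pairwise coprime integers.
points = sample_knot()
```

Because the frequencies are integers, the curve closes up after `t` advances by 2π. Letting a phase vary slowly over time is one way to get an undulating animation, though that is an assumption about the demo rather than a description of it.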

For years, this mysterious undulating loop has lived at the top of my personal homepage.

08.12.2024 23:04 | 👍 15    🔁 2    💬 1    📌 0

Agreed, but the story describes *discovering* a tiny piece of maggot in the remaining apple after having taken a bite. (the perhaps questionable assumption being that the maggot piece was quite recently part of a whole)

07.12.2024 15:05 | 👍 1    🔁 0    💬 1    📌 0

My favorite "ordinary life" example of this notion of singular limits: (from mecheng.iisc.ac.in/lamfip/me304...)

07.12.2024 14:43 | 👍 1    🔁 0    💬 1    📌 0

I don't. Can let you know if I end up making one.

02.12.2024 20:18 | 👍 1    🔁 0    💬 1    📌 0

(accidentally omitted some text which was meant to precede the above:) The model system approach can be found everywhere across the sciences and for good reason: it is often the shortest path to conceptual insights, as long as the conditions are right...

02.12.2024 14:47 | 👍 0    🔁 0    💬 0    📌 0

I'll end this thread with the parable that opens the dissertation (my conference will require a parable section in every submission). Tag yourself :)

02.12.2024 00:20 | 👍 3    🔁 0    💬 1    📌 0

The bulk of the thesis is a series of case studies from my research. But first, in Chapter 3 ("Deep Learning Preliminaries") I try to define some terms from first principles; above these footnotes, you can find my idiosyncratic definition of neural nets in terms of arithmetic circuits.

02.12.2024 00:20 | 👍 2    🔁 0    💬 1    📌 0

2. Transferability: insights learned from the system need to transfer to settings of interest. This can happen because of *low-level* commonalities (think cell cultures) or *high-level* commonalities (think macroeconomic models).

02.12.2024 00:20 | 👍 2    🔁 0    💬 1    📌 0

...Specifically, two conditions I propose in the thesis:
1. Productivity: A model system needs to be exceptionally fertile ground for producing scientific insights.

02.12.2024 00:20 | 👍 2    🔁 0    💬 2    📌 0

It's a tribute to a kind of science I love (and reviewers sometimes hate), where in order to understand a complicated system (e.g. training a transformer on internet text), you instead study a different system (e.g. training an MLP to solve parity problems).

02.12.2024 00:20 | 👍 2    🔁 0    💬 1    📌 0

I defended my PhD dissertation back in May. I didn't have time to share it widely then (newborn baby), but I think some of you might enjoy it, especially the opening chapters: benjaminedelman.com/assets/disse...

02.12.2024 00:20 | 👍 31    🔁 3    💬 3    📌 1

(edit: sensors, not sensory inputs)

29.11.2024 19:10 | 👍 0    🔁 0    💬 0    📌 0

What explanations am I missing? (It's interesting, btw, to think about how different combinations of the above are relevant to case studies such as protein structure prediction and language learning.)

29.11.2024 15:19 | 👍 1    🔁 0    💬 0    📌 0

7/ The anthropic principle: the evolution of learning (and thus the evolution of us) was only possible if simple, computationally efficient functions had predictive power that could be leveraged for increased fitness.

29.11.2024 15:19 | 👍 2    🔁 0    💬 1    📌 0

6/ The state of reality on Earth is selected (naturally and artificially) to be learnable: consider, e.g., biological signaling mechanisms, human communication, and legibility imposed/incentivized by states and markets. (note: there can also be selection against learnability)

29.11.2024 15:19 | 👍 0    🔁 0    💬 1    📌 0

5/ Our sensory inputs (both biological and technological) are selected/designed to capture the most (efficiently) predictive aspects of reality.

29.11.2024 15:19 | 👍 0    🔁 0    💬 2    📌 0

4/ Reality as we observe it tends to obey the principle of locality. (en.m.wikipedia.org/wiki/Princip...)

29.11.2024 15:19 | 👍 0    🔁 0    💬 1    📌 0

3/ Complex systems tend towards emergent order.

29.11.2024 15:19 | 👍 0    🔁 0    💬 1    📌 0

2/ We live in a weirdly low-entropy environment.

29.11.2024 15:19 | 👍 0    🔁 0    💬 1    📌 0

@benedelman is following 20 prominent accounts