@gleave.me.bsky.social

Training models that aren't necessarily robust but have *uncorrelated* failures with other models is an interesting research direction I'd love to see more people work on!
02.07.2025 18:17

Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
02.07.2025 18:17

The other challenge is that the components in your defense-in-depth pipeline are all ML models. So it's not like an attacker has to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
02.07.2025 18:17

The problem is that you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
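To make the combination-lock analogy concrete, here's a toy back-of-the-envelope sketch (hypothetical numbers, purely illustrative): if the pipeline reveals which stage rejected an input, an attacker can defeat each stage independently instead of having to find one input that gets past all of them at once.

```python
# Toy illustration of the combination-lock analogy (hypothetical numbers).
digits, options = 4, 10

# No feedback: the attacker must guess the whole code at once.
worst_case_blind = options ** digits        # 10**4 = 10,000 attempts

# Per-stage feedback (e.g. the pipeline reveals which component blocked you):
# each "digit" can be brute-forced independently.
worst_case_leaky = digits * options         # 4 * 10 = 40 attempts

print(worst_case_blind, worst_case_leaky)   # 10000 40
```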
02.07.2025 18:17

It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
02.07.2025 18:17

The bad news is that, as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) is used to, where engineering standards are *ahem*... mixed?
02.07.2025 18:17

The good news is that simply layering defenses can help quite a bit: we take some off-the-shelf open-weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
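For a sense of what "layering defenses" can look like, here's a minimal sketch that assumes prompted off-the-shelf models acting as input and output filters around a main model; the model names, prompts, and `query_model` stub below are placeholders, not the setup from our paper.

```python
# Minimal sketch of a defense-in-depth pipeline (placeholder models/prompts,
# not the paper's exact setup): prompted off-the-shelf models act as input
# and output filters around the main model.

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. a local open-weight model)."""
    # A real pipeline would call your inference stack here; this is a stub.
    return "SAFE" if "filter" in model else f"[{model} response]"

def flagged(filter_model: str, text: str, role: str) -> bool:
    """Prompt an off-the-shelf model to act as a safety classifier."""
    verdict = query_model(
        filter_model,
        f"You are a strict content-safety filter for {role} text.\n"
        f"Answer only SAFE or UNSAFE.\n\n{text}",
    )
    return "UNSAFE" in verdict.upper()

def guarded_generate(user_input: str) -> str:
    # Layer 1: input filter.
    if flagged("input-filter-model", user_input, "user input"):
        return "Request refused."   # generic refusal: don't reveal which layer fired
    # Layer 2: the main model.
    answer = query_model("main-model", user_input)
    # Layer 3: output filter.
    if flagged("output-filter-model", answer, "model output"):
        return "Request refused."   # same generic refusal on purpose
    return answer

print(guarded_generate("How do I bake sourdough bread?"))
```

Returning the same generic refusal regardless of which layer fired is one cheap way to avoid leaking which component blocked the request.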
02.07.2025 18:17

This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
02.07.2025 18:17

I started this research project quite skeptical that we'd be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and there's been a bunch of other innovations like circuit-breakers that have come out in the meantime.
02.07.2025 18:17

Progress in robustness comes just in time, given new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.
02.07.2025 18:17

With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g. low latency). This paper shows how these compromises can be used to break models -- and how to implement defenses securely.
02.07.2025 18:17

So many great talks from the Singapore Alignment Workshop -- I look forward to catching up on those that I missed in person!
11.06.2025 19:09

As I say in the video, innovation vs safety is a false dichotomy -- do check out great ideas from our speakers for how innovation can enable effective policy in the video and initial talk recordings!
04.06.2025 13:08

AI control is one of the most exciting new research directions; excited to have the videos from ControlConf, the world's first control-specific conference. Tons of great material, both intros and deep dives into specific areas!
06.05.2025 19:58

AI security needs more than just testing; it needs guarantees.
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
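As a toy illustration of that paradigm (my own sketch, not from the talk): the AI writes the implementation plus a machine-checked proof that it meets a human-written specification. In Lean, the whole loop can be as small as:

```lean
-- Human-written specification: `double n` must equal `2 * n`.
-- Implementation plus a machine-checked proof that it meets the spec
-- (the part an AI could be asked to produce).
def double (n : Nat) : Nat := n + n

theorem double_meets_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega  -- linear arithmetic closes the goal `n + n = 2 * n`
```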
Had a great time at the Singapore Alignment Workshop earlier this week -- fantastic start to the ICLR week! My only complaint is I missed many of the excellent talks because I was having so many interesting conversations. Looking forward to the videos to catch up!
25.04.2025 03:16

ControlConf 2025 Day 2 delivered!
From robust evals to security R&D & moral patienthood, we covered the cutting edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
My biggest complaint with the AI Security Forum was too much great content across the three tracks. Looking forward to catching up on the talks I missed with the videos!
20.03.2025 16:02

Excited to meet others working on or interested in alignment at the Alignment Workshop Open Social before ICLR!
13.03.2025 00:07

Excited to see people before ICLR at Alignment Workshop Singapore!
11.03.2025 23:05

Humans sometimes cheat at exams -- might AIs do the same? A unique challenge in evaluating intelligent systems.
10.03.2025 18:48

AI agents can start VMs, buy things, send e-mails, etc. AI control is a promising way to prevent harmful agent actions -- whether by accident, due to adversarial attack, or the systems themselves trying to subvert controls. Apply to the world's first control conference!
03.03.2025 21:44

Since joining FAR.AI in June, Lindsay has delivered amazing events like the Alignment Workshop Bay Area and the Paris Security Forum. Welcome to the team!
24.02.2025 20:24

Evaluations are key to understanding AI capabilities and risks -- but which ones matter? Enjoyed Soroush's talk exploring these issues!
17.02.2025 20:05

Excited to have Annie join our team and help produce a 200-person event in her first month! We're growing across operations and technical roles -- check out our opportunities!
14.02.2025 19:58

I had great conversations at the AI Security Forum -- it's exciting to see people from cybersec, hardware root of trust, and AI come together to find creative solutions to boost AI security.
10.02.2025 20:42

Formal verification has a lot of exciting applications, especially in the age of LLMs: e.g. can LLMs output programs with proofs of correctness? However, formally verifying neural network behavior in general seems intractable -- enjoyed Zac's talk on the limitations.
31.01.2025 09:58

Many eyes on the code help secure critical open-source software -- we similarly need independent scrutiny of AI models to catch issues and build trust in them. A safe harbor for evaluation, as proposed by @shaynelongpre.bsky.social, could enable an independent testing ecosystem.
28.01.2025 17:07

Deceptive alignment is one of the more pernicious safety risks. I used to find it far-fetched, but LLMs are very good at persuasion, so they have the capability -- it's just a question of whether it's incentivized during training. Great to see work towards detecting deception!
23.01.2025 21:45

Gradient routing can influence where a neural network learns a particular skill -- useful for interpretability and control.
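For intuition, here's a toy PyTorch sketch of one simplified variant of the idea (my own illustration, not the original implementation): for batches exercising a particular skill, zero out gradients everywhere except a designated submodule, so that's where the skill ends up being learned.

```python
# Toy sketch of gradient routing (simplified variant, illustrative only):
# batches tagged as "skill A" are only allowed to update a designated region.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # index 2: designated region for skill A
    nn.Linear(32, 4),
)
region_params = set(model[2].parameters())   # params allowed to learn skill A
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_step(x: torch.Tensor, y: torch.Tensor, is_skill_a: bool) -> None:
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if is_skill_a:
        # Route the gradient: zero it outside the designated region so only
        # that region is updated by skill-A data.
        for p in model.parameters():
            if p not in region_params and p.grad is not None:
                p.grad.zero_()
    opt.step()

# Example: a random "skill A" batch is routed to the designated region.
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
train_step(x, y, is_skill_a=True)
```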
21.01.2025 06:14