@gleave.me.bsky.social

Training models that aren't necessarily robust but have *uncorrelated* failures with other models is an interesting research direction I'd love to see more people work on!
02.07.2025 18:17

Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
02.07.2025 18:17

The other challenge is that the components in your defense-in-depth pipeline are all ML models. So it's not like an attacker has to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
02.07.2025 18:17

The problem is that you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
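To make the combination-lock analogy concrete, here's a toy back-of-the-envelope sketch (hypothetical numbers, purely illustrative): if the pipeline reveals which stage rejected an input, an attacker can defeat each stage independently instead of having to find one input that gets past all of them at once.

```python
# Toy illustration of the combination-lock analogy (hypothetical numbers).
digits, options = 4, 10

# No feedback: the attacker must guess the whole code at once.
worst_case_blind = options ** digits        # 10**4 = 10,000 attempts

# Per-stage feedback (e.g. the pipeline reveals which component blocked you):
# each "digit" can be brute-forced independently.
worst_case_leaky = digits * options         # 4 * 10 = 40 attempts

print(worst_case_blind, worst_case_leaky)   # 10000 40
```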
02.07.2025 18:17

It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
02.07.2025 18:17

The bad news is that, as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) is used to, where engineering standards are *ahem*... mixed?
02.07.2025 18:17

The good news is that simply layering defenses can help quite a bit: we take some off-the-shelf open-weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
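For a sense of what "layering defenses" can look like, here's a minimal sketch that assumes prompted off-the-shelf models acting as input and output filters around a main model; the model names, prompts, and `query_model` stub below are placeholders, not the setup from our paper.

```python
# Minimal sketch of a defense-in-depth pipeline (placeholder models/prompts,
# not the paper's exact setup): prompted off-the-shelf models act as input
# and output filters around the main model.

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. a local open-weight model)."""
    # A real pipeline would call your inference stack here; this is a stub.
    return "SAFE" if "filter" in model else f"[{model} response]"

def flagged(filter_model: str, text: str, role: str) -> bool:
    """Prompt an off-the-shelf model to act as a safety classifier."""
    verdict = query_model(
        filter_model,
        f"You are a strict content-safety filter for {role} text.\n"
        f"Answer only SAFE or UNSAFE.\n\n{text}",
    )
    return "UNSAFE" in verdict.upper()

def guarded_generate(user_input: str) -> str:
    # Layer 1: input filter.
    if flagged("input-filter-model", user_input, "user input"):
        return "Request refused."   # generic refusal: don't reveal which layer fired
    # Layer 2: the main model.
    answer = query_model("main-model", user_input)
    # Layer 3: output filter.
    if flagged("output-filter-model", answer, "model output"):
        return "Request refused."   # same generic refusal on purpose
    return answer

print(guarded_generate("How do I bake sourdough bread?"))
```

Returning the same generic refusal regardless of which layer fired is one cheap way to avoid leaking which component blocked the request.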
02.07.2025 18:17

This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
02.07.2025 18:17

I started this research project quite skeptical that we'd be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and there's been a bunch of other innovations like circuit-breakers that have come out in the meantime.
02.07.2025 18:17

Progress in robustness comes just in time, given new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.
02.07.2025 18:17

With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g. low latency). This paper shows how these compromises can be used to break models -- and how to implement defenses securely.
02.07.2025 18:17

So many great talks from the Singapore Alignment Workshop -- I look forward to catching up on those that I missed in person!
11.06.2025 19:09

As I say in the video, innovation vs safety is a false dichotomy -- do check out great ideas from our speakers for how innovation can enable effective policy in the video and initial talk recordings!
04.06.2025 13:08

AI control is one of the most exciting new research directions; excited to have the videos from ControlConf, the world's first control-specific conference. Tons of great material, both intros and deep dives into specific areas!
06.05.2025 19:58

AI security needs more than just testing; it needs guarantees.
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
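As a toy illustration of that paradigm (my own sketch, not from the talk): the AI writes the implementation plus a machine-checked proof that it meets a human-written specification. In Lean, the whole loop can be as small as:

```lean
-- Human-written specification: `double n` must equal `2 * n`.
-- Implementation plus a machine-checked proof that it meets the spec
-- (the part an AI could be asked to produce).
def double (n : Nat) : Nat := n + n

theorem double_meets_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega  -- linear arithmetic closes the goal `n + n = 2 * n`
```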
Had a great time at the Singapore Alignment Workshop earlier this week -- fantastic start to the ICLR week! My only complaint is I missed many of the excellent talks because I was having so many interesting conversations. Looking forward to the videos to catch up!
25.04.2025 03:16

ControlConf 2025 Day 2 delivered!
From robust evals to security R&D & moral patienthood, we covered the cutting edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
My biggest complaint with the AI Security Forum was too much great content across the three tracks. Looking forward to catching up on the talks I missed with the videos!
20.03.2025 16:02

Excited to meet others working on or interested in alignment at the Alignment Workshop Open Social before ICLR!
13.03.2025 00:07

Excited to see people before ICLR at Alignment Workshop Singapore!
11.03.2025 23:05

Humans sometimes cheat at exams -- might AIs do the same? A unique challenge in evaluating intelligent systems.
10.03.2025 18:48

AI agents can start VMs, buy things, send e-mails, etc. AI control is a promising way to prevent harmful agent actions -- whether by accident, due to adversarial attack, or the systems themselves trying to subvert controls. Apply to the world's first control conference!
03.03.2025 21:44

Since joining FAR.AI in June, Lindsay has delivered amazing events like the Alignment Workshop Bay Area and the Paris Security Forum. Welcome to the team!
24.02.2025 20:24

Evaluations are key to understanding AI capabilities and risks -- but which ones matter? Enjoyed Soroush's talk exploring these issues!
17.02.2025 20:05

Excited to have Annie join our team and help produce a 200-person event in her first month! We're growing across operations and technical roles -- check out our opportunities!
14.02.2025 19:58

I had great conversations at the AI Security Forum -- it's exciting to see people from cybersec, hardware root of trust, and AI come together to find creative solutions to boost AI security.
10.02.2025 20:42

Formal verification has a lot of exciting applications, especially in the age of LLMs: e.g. can LLMs output programs with proofs of correctness? However, formally verifying neural network behavior in general seems intractable -- enjoyed Zac's talk on the limitations.
31.01.2025 09:58

Many eyes on the code help secure critical open-source software -- we similarly need independent scrutiny of AI models to catch issues and build trust in them. A safe harbor for evaluation, as proposed by @shaynelongpre.bsky.social, could enable an independent testing ecosystem.
28.01.2025 17:07

Deceptive alignment is one of the more pernicious safety risks. I used to find it far-fetched, but LLMs are very good at persuasion, so they have the capability -- it's just a question of whether it's incentivized during training. Great to see work towards detecting deception!
23.01.2025 21:45

Gradient routing can influence where a neural network learns a particular skill -- useful for interpretability and control.
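For intuition, here's a toy PyTorch sketch of one simplified variant of the idea (my own illustration, not the original implementation): for batches exercising a particular skill, zero out gradients everywhere except a designated submodule, so that's where the skill ends up being learned.

```python
# Toy sketch of gradient routing (simplified variant, illustrative only):
# batches tagged as "skill A" are only allowed to update a designated region.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # index 2: designated region for skill A
    nn.Linear(32, 4),
)
region_params = set(model[2].parameters())   # params allowed to learn skill A
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_step(x: torch.Tensor, y: torch.Tensor, is_skill_a: bool) -> None:
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if is_skill_a:
        # Route the gradient: zero it outside the designated region so only
        # that region is updated by skill-A data.
        for p in model.parameters():
            if p not in region_params and p.grad is not None:
                p.grad.zero_()
    opt.step()

# Example: a random "skill A" batch is routed to the designated region.
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
train_step(x, y, is_skill_a=True)
```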
21.01.2025 06:14