๐ Workshop paper: openreview.net/forum?id=eIB... (full paper soon!)
๐ฅ Co-authors: David Huang @davidhuang1.bsky.social, Avi Shah, Alexandre Araujo, David Wagner.
(7/7)
๐ Workshop paper: openreview.net/forum?id=eIB... (full paper soon!)
๐ฅ Co-authors: David Huang @davidhuang1.bsky.social, Avi Shah, Alexandre Araujo, David Wagner.
(7/7)
Most importantly, this project is led by 2 amazing Berkeley undergrads (David Huang - www.linkedin.com/in/huang-david & Avi Shah - shavidan123.github.io). They are undoubtedly promising researchers and also applying for PhD programs this year! Please reach out to them!
(6/7)
2๏ธโฃ Representation Rerouting defense (Circuit Breaker: arxiv.org/abs/2406.04313) is not robust.
Our token-level universal transfer attack is somehow stronger than a white-box embedding-level attack!
3๏ธโฃ โBetter CoT/reasoning modelsโ like o1 are still far from robust.
(5/7)
In fact, using the best universal suffix alone is better than using multiple white-box prompt-specific suffixes!
This phenomenon is very unintuitive but confirms that LLM attacks are far from optimal. There's also a clear implication on white-box robustness evaluation.
(4/7)
1๏ธโฃ Creating universal & transferable attack is easier than we thought.
Our surprising discovery is some adversarial suffixes (even gibberish ones from vanilla GCG) can jailbreak many different prompts while being optimized on a single prompt.
(3/7)
IRIS jailbreak rates on AdvBench/HarmBench (1 universal suffix, transferred from Llama-3): GPT-4o 76/56%, o1-mini 54/43%, Llama-3-RR 74/25% (vs 2.5% by white-box GCG).
Here are 3 main takeaways:
(2/7)
๐ข Excited to share our new result on LLM jailbreak!
โ๏ธ We propose IRIS, a simple automated ๐๐ป๐ถ๐๐ฒ๐ฟ๐๐ฎ๐น ๐ฎ๐ป๐ฑ ๐๐ฟ๐ฎ๐ป๐๐ณ๐ฒ๐ฟ๐ฟ๐ฎ๐ฏ๐น๐ฒ ๐ท๐ฎ๐ถ๐น๐ฏ๐ฟ๐ฒ๐ฎ๐ธ ๐๐๐ณ๐ณ๐ถ๐
that works on GPTs, o1, and Circuit Breaker defense! To appear at NeurIPS Safe GenAI Workshop!
(1/7)