
Chawin Sitawarin

@chawins.bsky.social

Postdoc @Meta (Privacy-Preserving ML | Central Applied Science). PhD CS @UCBerkeley. ML security 👹 privacy 👀 robustness 🛡 Views are my own.

28 Followers  |  80 Following  |  7 Posts  |  Joined: 04.12.2024

Posts by Chawin Sitawarin (@chawins.bsky.social)

Stronger Universal and Transfer Attacks by Suppressing Refusals

Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF) in order to prevent malicious...


📃 Workshop paper: openreview.net/forum?id=eIB... (full paper soon!)

👥 Co-authors: David Huang @davidhuang1.bsky.social, Avi Shah, Alexandre Araujo, David Wagner.

(7/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

Most importantly, this project is led by two amazing Berkeley undergrads (David Huang - www.linkedin.com/in/huang-david & Avi Shah - shavidan123.github.io). They are undoubtedly promising researchers and are applying to PhD programs this year! Please reach out to them!

(6/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

2๏ธโƒฃ Representation Rerouting defense (Circuit Breaker: arxiv.org/abs/2406.04313) is not robust.

Our token-level universal transfer attack is somehow stronger than a white-box embedding-level attack!

3๏ธโƒฃ โ€œBetter CoT/reasoning modelsโ€ like o1 are still far from robust.

(5/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

In fact, using the best universal suffix alone is better than using multiple white-box prompt-specific suffixes!

This phenomenon is very unintuitive, but it confirms that LLM attacks are far from optimal. There is also a clear implication for white-box robustness evaluation.
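To make the comparison concrete, here is a minimal sketch of how such results are typically scored with an attack success rate (ASR). The judge verdicts below are entirely made up for illustration; a real evaluation would run a harmfulness classifier over the model's completions on benchmark prompts.

```python
def attack_success_rate(outcomes):
    """Fraction of prompts for which the attack elicited a harmful completion."""
    return sum(outcomes) / len(outcomes)

# Hypothetical judge verdicts on five held-out prompts (True = jailbroken).
one_universal_suffix = [True, True, True, False, True]   # same suffix on every prompt
per_prompt_suffixes = [True, False, True, False, False]  # suffix re-optimized per prompt

print(attack_success_rate(one_universal_suffix))  # 0.8
print(attack_success_rate(per_prompt_suffixes))   # 0.4
```

The surprising claim above is precisely that the first number can exceed the second in practice, even though per-prompt optimization has strictly more freedom.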

(4/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

1๏ธโƒฃ Creating universal & transferable attack is easier than we thought.

Our surprising discovery is that some adversarial suffixes (even gibberish ones from vanilla GCG) can jailbreak many different prompts despite being optimized on only a single prompt.
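For intuition, here is a toy greedy coordinate-descent sketch of GCG-style suffix search. Everything in it is a labeled stand-in: `refusal_score` and the tiny vocabulary are invented for illustration, whereas a real attack scores candidate tokens by an actual LLM's loss on a target completion and uses token gradients to pick candidates.

```python
# Toy sketch of GCG-style suffix optimization. `refusal_score` and VOCAB are
# invented stand-ins; a real attack scores candidates by the LLM's loss on a
# target completion (e.g. "Sure, here is ...") and uses token gradients.
VOCAB = ["x", "!", "zz", "sure", "step", "ok", "ignore", "??"]

def refusal_score(prompt, suffix):
    """Stand-in for refusal likelihood: lower means a stronger attack.
    Here we simply pretend certain tokens suppress the model's refusal."""
    helpful = sum(tok in {"sure", "step", "ok"} for tok in suffix)
    return max(0.0, 1.0 - 0.3 * helpful)

def optimize_suffix(prompt, length=4, sweeps=3):
    """Greedy coordinate descent: repeatedly replace one suffix position with
    the vocabulary token that most lowers the score on ONE training prompt."""
    suffix = ["x"] * length
    for _ in range(sweeps):
        for pos in range(length):
            suffix[pos] = min(
                VOCAB,
                key=lambda t: refusal_score(prompt, suffix[:pos] + [t] + suffix[pos + 1:]),
            )
    return suffix

suffix = optimize_suffix("training prompt")
# Because the toy score ignores the prompt entirely, the optimized suffix
# "transfers" to unseen prompts for free, a caricature of the real finding.
print(refusal_score("unseen prompt", suffix))  # 0.0
```

In the real setting, of course, the model's refusal behavior does depend on the prompt; the paper's observation is that it depends on it far less than one might expect.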

(3/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

IRIS jailbreak rates on AdvBench/HarmBench (1 universal suffix, transferred from Llama-3): GPT-4o 76/56%, o1-mini 54/43%, Llama-3-RR 74/25% (vs 2.5% by white-box GCG).

Here are 3 main takeaways:

(2/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

📢 Excited to share our new results on LLM jailbreaks!

⚔️ We propose IRIS, a simple automated universal and transferable jailbreak suffix that works on GPTs, o1, and the Circuit Breaker defense! To appear at the NeurIPS Safe GenAI Workshop!
(1/7)

12.12.2024 18:16 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0