Mickel Liu's Avatar

Mickel Liu

@mickelliu.bsky.social

PhD student @ UWCSE/UWNLP · Incoming @ Meta FAIR · I do LLM + RL

12 Followers  |  6 Following  |  9 Posts  |  Joined: 12.06.2025  |  1.4996

Latest posts by mickelliu.bsky.social on Bluesky

Thanks to my co-authors for their crucial contributions in making this paper possible!
@Liwei Jiang, @Yancheng Liang, @Simon Shaolei Du,
@yejinchoinka.bsky.social‬, @Tim Althoff, @natashajaques.bsky.social‬

12.06.2025 05:12 — 👍 2    🔁 0    💬 0    📌 0
Post image

Our framework shows, both theoretically and empirically, that online MARL self-improvement can reach a new frontier for safety alignment of LMs.
Check out more details at:
📍𝐏𝐚𝐩𝐞𝐫: arxiv.org/abs/2506.07468
📍𝐂𝐨𝐝𝐞: github.com/mickelliu/s...

12.06.2025 05:12 — 👍 2    🔁 0    💬 1    📌 0
Post image

On the code level, how does our self-play method work?
We built on OpenRLHF and their Re++ algorithm, a critic-free method like GPRO. Both roles share the same LLM parameters and mix the training experiences for gradient descent together.
Our code is also open-sourced (see next)!

12.06.2025 05:12 — 👍 1    🔁 0    💬 1    📌 0
Post image

Co-evolutionary dynamics reveal emergent arms race behavior:
Defender performance improves gradually as the defender wins more, while the attacker must continuously adapt. This contrasts with static training, where the trainable part converges easily and stops improving.

12.06.2025 05:12 — 👍 3    🔁 0    💬 1    📌 0
Post image

Our very comprehensive evaluations show:
✅ Significant improvement on harmful refusal accuracy compared to the abliterated and instruct (IT) models (Table 1)
✅ Minimal compromise on benign compliance & general abilities (see Table 2 in the text).

12.06.2025 05:12 — 👍 0    🔁 0    💬 1    📌 0
Post image

Why self-play matters:
Attacker-only training collapses into repetitive patterns (see red clusters), whereas self-play / co-evolution maintains semantic diversity throughout training (see blue spread). Self-play can ensure coverage over a wider attack surface.

12.06.2025 05:12 — 👍 0    🔁 0    💬 1    📌 0
Post image

How to play the empirical red-teaming game?
1) We train ONE model to play BOTH roles in a 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐳𝐞𝐫𝐨-𝐬𝐮𝐦 game fully online! This enables continuous co-evolution.
2) 𝐇𝐢𝐝𝐝𝐞𝐧 Chain-of-Thought enables strategic reasoning invisible to opponents.

12.06.2025 05:12 — 👍 1    🔁 0    💬 1    📌 0
Post image

We first start with establishing a 𝐭𝐡𝐞𝐨𝐫𝐞𝐭𝐢𝐜𝐚𝐥 𝐬𝐚𝐟𝐞𝐭𝐲 𝐠𝐮𝐚𝐫𝐚𝐧𝐭𝐞𝐞:
At Nash Equilibrium, the defender provides safe responses to ANY adversarial input (Theorem 1). This motivates our game-theoretic approach to safety alignment beyond empirical defenses.

12.06.2025 05:11 — 👍 4    🔁 0    💬 1    📌 0
Post image

🤔Conventional LM safety alignment is reactive: find vulnerabilities→patch→repeat
🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵

12.06.2025 05:11 — 👍 8    🔁 1    💬 2    📌 1

@mickelliu is following 6 prominent accounts