Excited to share the obfuscation-atlas I've been working on! The most surprising finding to me: standard RLVR that leads to reward hacking can make models believe that reward hacking is okay. Deception probes catch such reward hacking on the original model but can no longer catch it after RLVR.
13.02.2026 16:52
Mech Interp Workshop #NeurIPS2025 poster & spotlight presentation today!
11:30am-12:30pm Sun, Dec 7 @ Upper Level Room 30A-E
Path Channels & Plan Extension Kernels: A Mechanistic Description of Planning in a Sokoban RNN.
by @taufeeque.bsky.social, Aaron Tucker, @gleave.me, Adrià Garriga-Alonso
07.12.2025 17:01