When RLHFed models engage in “reward hacking”, it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
19.12.2024 17:17