Jiahai Feng

@fjiahai.bsky.social

AI interp @UC Berkeley | prev. MIT jiahai-feng.github.io

256 Followers  |  55 Following  |  1 Post  |  Joined: 16.11.2024

Latest posts by fjiahai.bsky.social on Bluesky

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

19.12.2024 17:17 | 👍 8    🔁 3    💬 2    📌 0
Can we predict emergent capabilities in GPT-N+1 🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning” 🧵

26.11.2024 22:37 | 👍 44    🔁 6    💬 3    📌 1

🙋‍♂️

24.11.2024 17:56 | 👍 1    🔁 0    💬 0    📌 0
