nice....
15.03.2025 12:47 · 👍 1 🔁 0 💬 0 📌 0

@zstevenwu.bsky.social
Computer science professor at Carnegie Mellon. Researcher in machine learning. Algorithmic foundations of responsible AI (e.g., privacy, uncertainty quantification), interactive learning (e.g., imitation/reinforcement learning). https://zstevenwu.com/

I was lucky enough to be invited to give a talk on our new paper on the value of RL in fine-tuning at Cornell last week! Because of my poor time management skills, the talk isn't as polished as I'd like, but I think the "vibes" are accurate enough to share: youtu.be/E4b3cSirpsg.
06.03.2025 18:19 · 👍 15 🔁 3 💬 0 📌 0

1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:
04.03.2025 20:59 · 👍 59 🔁 11 💬 1 📌 3
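
For readers new to the term: a generation-verification gap means checking an answer is much cheaper than producing one. A minimal sketch of the idea in Python, using a toy subset-sum task (the task and helper names are illustrative only, not from the paper):

```python
import itertools

# Toy generation-verification gap (illustrative, not from the paper):
# for subset-sum, verifying a proposed answer is a cheap O(n) check,
# while generating one by brute force is an O(2^n) search.

def verify(nums: list[int], target: int, subset: tuple[int, ...]) -> bool:
    """Cheap verifier: does the proposed subset hit the target sum?"""
    return sum(subset) == target

def generate(nums: list[int], target: int) -> tuple[int, ...] | None:
    """Expensive generator: search all 2^n subsets for one that verifies."""
    for r in range(len(nums) + 1):
        for subset in itertools.combinations(nums, r):
            if verify(nums, target, subset):
                return subset
    return None

nums, target = [3, 9, 8, 4, 5, 7], 15
print(generate(nums, target))        # (8, 7) -- found only after searching
print(verify(nums, target, (8, 7)))  # True   -- checking is the easy direction
```

On this reading of the post above, the reward model plays the cheap verifier and the policy plays the expensive generator; the claim is that RL earns its keep exactly when the verifier's job is the easier one.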

can you present other people's results :-)
04.03.2025 14:18 · 👍 1 🔁 0 💬 0 📌 0

that makes sense to me.... i should go to bed....
06.02.2025 00:51 · 👍 3 🔁 0 💬 0 📌 0

@gswamy.bsky.social et al. propose SPO, which builds a game from preferences and solves for the minimax winner. Handles non-Markovian, intransitive, and stochastic preferences. Nice empirical eval ranging from small demonstrative domains to large continuous-control tasks (MuJoCo).
arxiv.org/abs/2401.04056
2/3.
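
For concreteness, a sketch of the object SPO solves for (my notation, paraphrasing the setup in arxiv.org/abs/2401.04056): the minimax winner of the symmetric two-player preference game,

\[
\pi^\star \in \arg\max_{\pi}\, \min_{\pi'}\;
\mathbb{E}_{y \sim \pi,\; y' \sim \pi'}\big[\,\mathcal{P}(y \succ y')\,\big],
\]

where \(\mathcal{P}(y \succ y')\) is the probability a rater prefers \(y\) to \(y'\). Since the game is symmetric and, after centering the payoff at 1/2, zero-sum, a minimax winner exists in mixed strategies even when \(\mathcal{P}\) is intransitive, so no consistent reward model needs to exist.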

I have become a fan of the game-theoretic approaches to RLHF, so here are two more papers in that category! (with one more tomorrow)
1. Self-Play Preference Optimization (SPO).
2. Direct Nash Optimization (DNO).
🧵 1/3.

1....
21.11.2024 00:40 · 👍 4 🔁 0 💬 0 📌 0