@johannesack.bsky.social
Reinforcement Learning PhD Student at the University of Tokyo. Previously: intern at Sakana AI and PFN; M.Sc./B.Sc. from TU Munich. johannesack.github.io

Plus, reviewers might look up your submission on arXiv and become biased against you based on affiliation.
04.10.2025 09:26

If you're from a famous lab it's clearly useful to put it on arXiv, but for less famous labs I'm not sure it's helpful.
You usually don't get that much visibility, and you risk your ideas getting stolen/"reinvented" afterwards.
Bravo!
18.09.2025 14:59

In our paper we provide more details, a theoretical analysis, and numerous ablations!
This was a very fun joint work with Takashi Ishida and Masashi Sugiyama!
Find our paper at arxiv.org/abs/2507.15507, our code at github.com/JohannesAck/... and swing by our poster at COLM in October!
Of course we also tested our approach for alignment of language models, both on the TL;DR summarization task and a variant of the Alpaca-Farm benchmark.
It results in a notable increase in performance across base models and tasks! (5/6)
By correcting the RM a few times during training, we can obtain a better final policy.
As illustrated in this 2D toy example, we can successively retrain the RM on the distribution of the current policy, allowing us to keep training for longer! (4/6)
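As a rough sketch of what that schedule could look like in code (my reading of this post, not the authors' implementation): `policy_step` and `refit_reward_model` are hypothetical callables standing in for one policy-optimization update and the off-policy corrected RM refit described in the (3/6) post below; only the control flow of "correct the RM a few times during training" comes from the thread.

```python
from typing import Callable, Sequence

def rlhf_with_periodic_rm_correction(
    policy_step: Callable[[], None],         # one PPO / policy-gradient update (user-supplied)
    refit_reward_model: Callable[[], None],  # off-policy corrected RM refit (user-supplied)
    num_steps: int = 10_000,
    correction_steps: Sequence[int] = (2_500, 5_000, 7_500),
) -> None:
    """Interleave a few RM corrections with ordinary policy training."""
    for step in range(1, num_steps + 1):
        policy_step()                 # keep optimizing the LM against the current RM
        if step in correction_steps:
            # Re-fit the RM on the original preference data, reweighted toward
            # the current policy's sample distribution, then resume training.
            refit_reward_model()
```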
We could simply sample new actions from the current policy and obtain human preference labels, but this is costly and slow.
Instead, we use importance weighting to train an off-policy corrected RM, without needing any additional samples or preference labels! (3/6)
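To make "importance weighting" concrete, here is a minimal PyTorch sketch of an off-policy corrected Bradley-Terry loss. This is my reading of the post, not the paper's exact scheme; in particular, how the per-pair weight is formed from the responses' log-probabilities, and the clipping used for stability, are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def off_policy_corrected_bt_loss(
    r_chosen: torch.Tensor,    # RM scores of the preferred responses, shape (B,)
    r_rejected: torch.Tensor,  # RM scores of the dispreferred responses, shape (B,)
    logp_cur: torch.Tensor,    # log pi_current(y|x) per preference pair, shape (B,)
    logp_sft: torch.Tensor,    # log pi_SFT(y|x) per preference pair, shape (B,)
    clip_max: float = 10.0,    # cap on the importance weights (a common stabilizer)
) -> torch.Tensor:
    # Importance weights w = pi_current / pi_SFT, clipped and detached so they
    # only reweight the loss and receive no gradient themselves.
    weights = torch.exp(logp_cur - logp_sft).clamp(max=clip_max).detach()

    # Standard Bradley-Terry / logistic preference loss per pair.
    per_pair = -F.logsigmoid(r_chosen - r_rejected)

    # The reweighted average approximates the preference loss under the current
    # policy's sample distribution, using only the existing SFT-sampled pairs.
    return (weights * per_pair).mean()
```

Whether to clip, self-normalize, or otherwise stabilize the weights is the kind of detail worth taking from the paper itself rather than from this sketch.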
The reward model (RM) is trained on actions sampled from the SFT model.
As we keep training our LM, it deviates from the SFT policy and thus the RM becomes inaccurate, causing stagnation or overoptimization.
We can prevent this by off-policy correcting the RM! (2/6)
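The reason no new samples or labels are needed is the standard importance-sampling identity: an expectation under the current policy can be rewritten as a reweighted expectation under the SFT policy that generated the preference data. In my notation (not necessarily the paper's), with π_θ the current policy, π_SFT the SFT policy, and ℓ(x, y) the per-sample preference loss:

```latex
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\bigl[\ell(x, y)\bigr]
  = \mathbb{E}_{y \sim \pi_{\mathrm{SFT}}(\cdot \mid x)}\!\left[
      \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)}\,\ell(x, y)
    \right]
```

This holds whenever π_SFT assigns positive probability wherever π_θ does, so the same SFT-sampled preference pairs can stand in for samples from the current policy.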
Reward models do not have the capacity to fully capture human preferences.
If they can't represent human preferences, how can we hope to use them to align a language model?
In our #COLM2025 paper "Off-Policy Corrected Reward Modeling for RLHF", we investigate this issue 🧵
The photo looks pretty good, I wish they had them in Tokyo!
01.05.2025 08:22

An element of feedback to the devs will go missing.
If the interface is really unergonomic but LLMs can figure it out, there won't be enough user complaints to lead to improvement.
Likewise for bad docs, if the LLM can just ingest the library's source code.