
Johannes Ackermann

@johannesack.bsky.social

Reinforcement Learning PhD student at the University of Tokyo. Prev: intern at Sakana AI, PFN. M.Sc./B.Sc. from TU Munich. johannesack.github.io

31 Followers  |  172 Following  |  11 Posts  |  Joined: 19.11.2024

Latest posts by johannesack.bsky.social on Bluesky

Plus reviewers might look up your submission on arXiv and become biased against you based on affiliation

04.10.2025 09:26 — 👍 0    🔁 0    💬 0    📌 0

If you're from a famous lab it's clearly useful to put it on arXiv, but for less famous labs I'm not sure it's helpful.

You usually don't get that much visibility and risk your ideas getting stolen/"reinvented" afterwards

04.10.2025 09:25 — 👍 0    🔁 0    💬 1    📌 0

Bravo!

18.09.2025 14:59 — 👍 1    🔁 0    💬 0    📌 0
Preview: Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback (arxiv.org)

In our paper we provide more details, a theoretical analysis, and numerous ablations!

This was a very fun joint work with Takashi Ishida and Masashi Sugiyama!
Find our paper at arxiv.org/abs/2507.15507, our code at github.com/JohannesAck/... and swing by our poster at COLM in October!

29.07.2025 10:21 — 👍 1    🔁 0    💬 0    📌 0
Post image

Of course we also tested our approach for alignment of language models, both on the TL;DR summarization task and a variant of the Alpaca-Farm benchmark.

It results in a notable increase in performance across base models and tasks! (5/6)

29.07.2025 10:21 — 👍 0    🔁 0    💬 1    📌 0
Post image

By correcting the RM a few times during training, we can obtain a better final policy.

As illustrated in this 2D toy example, we can successively retrain the RM on the distribution of the current policy, allowing us to keep training for longer! (4/6)

29.07.2025 10:21 — 👍 0    🔁 0    💬 1    📌 0
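
A rough sketch of that schedule, with hypothetical function names (rl_update, refit_reward_model) standing in for the paper's actual training code; the refit step can use an importance-weighted loss along the lines of the sketch following the next post in this thread.

```python
from typing import Callable, Sequence

def train_with_rm_correction(
    policy,                        # current LM policy, initialized from the SFT model
    reward_model,                  # RM initially fit on SFT-sampled responses
    pref_data: Sequence,           # the fixed human preference dataset
    rl_update: Callable,           # one RL phase (e.g. PPO) against the current RM
    refit_reward_model: Callable,  # importance-weighted RM refit on pref_data
    num_phases: int = 3,
):
    """Alternate RL phases with off-policy corrected reward-model refits."""
    for _ in range(num_phases):
        # Optimize the policy against the current reward model.
        policy = rl_update(policy, reward_model)
        # Refit the RM on the same preference data, reweighted towards the
        # current policy's distribution; no new human labels are collected.
        reward_model = refit_reward_model(pref_data, policy)
    return policy, reward_model
```
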
Post image

We could simply sample new actions from the current policy and obtain human preference labels, but this is costly and slow.

Instead, we use importance weighting to train an off-policy corrected RM without needing any additional samples or preference labels! (3/6)

29.07.2025 10:21 — 👍 0    🔁 0    💬 1    📌 0
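
A minimal sketch of that importance-weighting step, assuming a standard Bradley-Terry reward-model loss and access to response log-probabilities under both the current policy and the SFT policy; the single per-pair weight and the clipping value are illustrative simplifications, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def corrected_rm_loss(r_chosen, r_rejected, logp_policy, logp_sft):
    """Importance-weighted Bradley-Terry loss for off-policy RM correction.

    r_chosen / r_rejected: RM scores for the preferred / dispreferred response.
    logp_policy / logp_sft: summed token log-probs of the responses under the
    current policy and under the SFT (data-collection) policy.
    """
    # Importance weight pi_theta(y|x) / pi_sft(y|x), detached and clipped
    # for numerical stability (the clip value is illustrative).
    weights = torch.exp((logp_policy - logp_sft).detach()).clamp(max=10.0)
    # Standard Bradley-Terry negative log-likelihood: chosen should score higher.
    nll = -F.logsigmoid(r_chosen - r_rejected)
    # Pairs that are likely under the current policy but unlikely under the SFT
    # policy count more, so the RM stays accurate where the policy now samples.
    return (weights * nll).sum() / weights.sum()

# Illustrative call with dummy scores and log-probabilities.
r_c, r_r = torch.randn(8), torch.randn(8)
lp_pi, lp_sft = torch.randn(8), torch.randn(8)
loss = corrected_rm_loss(r_c, r_r, lp_pi, lp_sft)
```
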
Post image

The reward model (RM) is trained on actions sampled from the SFT model.
As we keep training our LM, it deviates from the SFT policy and thus the RM becomes inaccurate, causing stagnation or overoptimization.

We can prevent this by off-policy correcting the RM! (2/6)

29.07.2025 10:21 — 👍 0    🔁 0    💬 1    📌 0
Post image

Reward models do not have the capacity to fully capture human preferences.
If they can't represent human preferences, how can we hope to use them to align a language model?

In our #COLM2025 paper "Off-Policy Corrected Reward Modeling for RLHF", we investigate this issue 🧵

29.07.2025 10:21 — 👍 2    🔁 1    💬 1    📌 0

The photo looks pretty good, I wish they had them in Tokyo!

01.05.2025 08:22 — 👍 1    🔁 0    💬 0    📌 0

An element of feedback to the devs will go missing.

If the interface is really unergonomic but LLMs can figure it out, there won't be enough user complaints to lead to improvement.

Likewise for bad docs if the LLM can just ingest the library's source code

20.11.2024 06:47 — 👍 1    🔁 0    💬 1    📌 0
