
Yi (Joshua) Ren

@joshuaren.bsky.social

Ph.D. student @cs.ubc.ca, working on ML (learning dynamics, simplicity bias, iterated learning, LLM) https://joshua-ren.github.io/

74 Followers  |  4 Following  |  21 Posts  |  Joined: 09.12.2024

Latest posts by joshuaren.bsky.social on Bluesky

πŸ™Many thanks to my supervisor @djsutherland.ml and the reviewers for their thoughtful suggestions and feedback.
πŸ“Poster Hall 3+2B #376 on Fri, Apr 25 at 15:00
🎀Oral in Session 6A on Sat, Apr 26 at 16:30
πŸ“°https://arxiv.org/pdf/2407.10490
(12/12)

21.04.2025 05:45 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

This wraps up the main story of our paper. 🎬
But there’s more coming—
🧠 Many RL + LLM methods (like GRPO) also involve negative gradients.
🎯 And a token-level AKG decomposition is even more suitable for real-world LLMs.
Please stay tuned. 🚀
(11/12)

21.04.2025 05:45 — 👍 4    🔁 0    💬 1    📌 0
Video thumbnail

With this setup, we can now explain some strange behaviors in DPO, like why the model's confidence in both the chosen and rejected answers drops after long training. 📉📉
Just apply force analysis and remember: the smaller p(y-), the stronger the squeezing effect.
(10/12)

21.04.2025 05:45 — 👍 3    🔁 0    💬 1    📌 0
Video thumbnail
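A minimal toy sketch of that intuition, not the paper's setup: update the logits of a single categorical directly and repeatedly apply only the "push p(y-) down" force. The three-way distribution and step size below are made up, and a real DPO update also pushes p(y+) up through the network, which this toy ignores.

```python
import numpy as np

def push_down(p, y_neg, lr=1.0):
    # gradient of log p(y_neg) w.r.t. the logits is onehot(y_neg) - p,
    # so one descent step on log p(y_neg) moves the logits by lr * (p - onehot(y_neg))
    z = np.log(p)
    z += lr * p
    z[y_neg] -= lr
    e = np.exp(z - z.max())
    return e / e.sum()

p = np.array([0.70, 0.20, 0.10])   # [current argmax token, chosen y+, rejected y-]
for step in range(5):
    p = push_down(p, y_neg=2)
    print(step, p.round(3))
# p(y-) collapses quickly, but p(y+) also keeps falling: the freed-up mass
# is squeezed onto whichever token was already most likely.
```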

Just like this!!!
(9/12)

21.04.2025 05:45 — 👍 4    🔁 0    💬 1    📌 0
Post image

We formally show that, as long as you're using a softmax to produce probabilistic predictions, the squeezing effect is inevitable. And it gets stronger when p(y-) is smaller — the less likely an answer is (especially for off-policy), the harder all dimensions get squeezed.
(8/12)

21.04.2025 05:45 — 👍 5    🔁 0    💬 1    📌 0
Post image
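A toy version of that claim, with the logits updated directly; the paper's formal statement covers the network-parameterized case, and π, z, η, y⁻ here are generic notation rather than the paper's.

```latex
% One descent step on \log\pi(y^- \mid x), taken directly on the logits z:
\Delta z = \eta\,(\pi - e_{y^-})
\quad\Rightarrow\quad
\pi'_j \propto \pi_j\, e^{\eta \pi_j}\;\;(j \neq y^-),
\qquad
\pi'_{y^-} \propto \pi_{y^-}\, e^{-\eta (1 - \pi_{y^-})}.
```

So for any two surviving tokens, π'_j / π'_k = (π_j / π_k) · e^{η(π_j − π_k)}: whichever token is currently most likely gains relative to every other one, and the downward push on y⁻ scales with (1 − π(y⁻)), which only grows as π(y⁻) gets smaller.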

Now let’s switch gears to DPO — a more complex algorithm than SFT (as its AKG decomposition shows). But from a force analysis perspective, the story is surprisingly similar.
⚖️ The key difference? DPO introduces a negative gradient term — that’s where the twist comes in.
(7/12)

21.04.2025 05:45 — 👍 3    🔁 0    💬 1    📌 0
Post image
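To see where that negative gradient comes from, here is a minimal autograd sketch of the standard per-pair DPO loss; β and all the log-prob numbers are made up, and this is not code from the paper.

```python
import torch
import torch.nn.functional as F

beta = 0.1
logp_chosen = torch.tensor(-12.0, requires_grad=True)    # log pi(y+ | x), made-up value
logp_rejected = torch.tensor(-15.0, requires_grad=True)  # log pi(y- | x), made-up value
logp_chosen_ref, logp_rejected_ref = -12.5, -14.0          # frozen reference model, made-up

margin = beta * ((logp_chosen - logp_chosen_ref) - (logp_rejected - logp_rejected_ref))
loss = -F.logsigmoid(margin)     # standard DPO objective for one preference pair
loss.backward()

# A descent step moves each quantity opposite to its gradient:
# negative grad on logp_chosen   -> pushed up   (the usual SFT-like force),
# positive grad on logp_rejected -> pushed down (the negative-gradient force).
print(f"d loss / d logp_chosen   = {logp_chosen.grad.item():+.4f}")
print(f"d loss / d logp_rejected = {logp_rejected.grad.item():+.4f}")
```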

It also offers a possible explanation for a specific hallucination pattern in SFT:
🔍 The model uses facts or phrases from A2 when answering an unrelated Q1.
Why does this happen?
Just do a force analysis — the answer emerges naturally. 💡
(6/12)

21.04.2025 05:45 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Time to see how learning dynamics explains those weird behaviors. We observe a consistent trend: similar responses often rise in confidence, then fall.
📈📉 This aligns well with the force analysis perspective. (More supporting experiments are in the paper.)
(5/12)

21.04.2025 05:45 — 👍 3    🔁 0    💬 1    📌 0
Post image

Now let’s analyze SFT!
The change in the model’s prediction can be decomposed (AKG-style). The input is a concatenation: [x; y]. This lets us ask questions like: “How does the model’s confidence in 'y-' change if we fine-tune on 'y+'?”
(4/12)

21.04.2025 05:45 — 👍 2    🔁 0    💬 1    📌 0
Post image
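For reference, a generic first-order sketch of what such a one-step decomposition looks like. The notation (η, f_θ, x_o, x_u) is generic and the paper's exact definitions of A, K, and G may differ; for SFT on an LLM, think of x_u as the concatenated [x; y+] and x_o as the sequence whose probability we watch.

```latex
% Effect of one SGD step on the update example (x_u, y_u), learning rate \eta,
% logits z = f_\theta(x), \pi = \mathrm{softmax}(z)  (sketch, not the paper's exact notation):
\Delta \log \pi_\theta(y \mid x_o)
\;\approx\;
-\eta\;
\underbrace{\nabla_{z}\log \pi_\theta(y \mid x_o)^{\!\top}}_{A:\ \text{softmax geometry at } x_o}\;
\underbrace{\nabla_\theta f_\theta(x_o)\,\nabla_\theta f_\theta(x_u)^{\!\top}}_{K:\ \text{eNTK between } x_o \text{ and } x_u}\;
\underbrace{\nabla_{z}\mathcal{L}(x_u, y_u)}_{G:\ \text{loss gradient at the update example}}
```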

This toy example on MNIST helps you understand how it works: since 4 and 9 look similar from the model's perspective, learning a 4 will also increase p(y=4 | 9). (More detailed discussion of simple classification tasks can be found here: arxiv.org/pdf/2203.02485)
(3/12)

21.04.2025 05:45 — 👍 2    🔁 0    💬 1    📌 0
Post image
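A runnable stand-in for that 4-vs-9 picture, using synthetic feature vectors instead of real MNIST digits; the model size, seed, and learning rate are arbitrary choices, not from the paper.

```python
import torch
import torch.nn.functional as F

# Synthetic stand-in for the 4/9 story: x_four and x_nine share most features,
# while x_other is made (nearly) orthogonal to x_four.
torch.manual_seed(0)
base = torch.randn(20)
x_four = base                                   # the training example, labelled "4"
x_nine = base + 0.1 * torch.randn(20)           # a similar, held-out example
x_other = torch.randn(20)
x_other = x_other - (x_other @ x_four) / (x_four @ x_four) * x_four  # unrelated example

model = torch.nn.Linear(20, 10)                 # 10-way softmax classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def p4(x):                                      # model's probability of class 4
    return F.softmax(model(x), dim=-1)[4].item()

before = (p4(x_nine), p4(x_other))
loss = F.cross_entropy(model(x_four).unsqueeze(0), torch.tensor([4]))
opt.zero_grad()
loss.backward()
opt.step()                                      # one step of "learning a 4"
after = (p4(x_nine), p4(x_other))

print(f"p(4 | nine-like input): {before[0]:.3f} -> {after[0]:.3f}")  # should rise a lot
print(f"p(4 | unrelated input): {before[1]:.3f} -> {after[1]:.3f}")  # should barely move
```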

Instead of focusing on the global optimum, learning dynamics analyzes how the model behaves during training — one update at a time.
🧠 Think of the model's prediction as an object and each gradient update as a force acting on it.
(2/12)

21.04.2025 05:45 — 👍 6    🔁 0    💬 1    📌 0
Post image
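One generic way to make the "force" picture concrete, as a first-order check on a toy classifier (not the paper's code; the tiny Linear model, inputs, and learning rate are arbitrary): the effect of one SGD step on any other prediction is, to first order, a dot product between two gradients.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(5, 3)                     # tiny stand-in model
x_u, y_u = torch.randn(5), torch.tensor([1])      # the example we train on
x_o, y_o = torch.randn(5), 2                      # the prediction we watch
lr = 1e-2

def observed_logp():
    return F.log_softmax(model(x_o), dim=-1)[y_o]

# "force" exerted by the update on the observed prediction, to first order:
# -lr * <grad of observed log-prob, grad of training loss>
g_o = torch.autograd.grad(observed_logp(), list(model.parameters()))
g_u = torch.autograd.grad(F.cross_entropy(model(x_u).unsqueeze(0), y_u),
                          list(model.parameters()))
predicted = -lr * sum((a * b).sum() for a, b in zip(g_o, g_u))

before = observed_logp().item()
with torch.no_grad():                             # take the actual SGD step
    for p, g in zip(model.parameters(), g_u):
        p -= lr * g
actual = observed_logp().item() - before

print(f"first-order 'force': {predicted.item():+.6f}   actual change: {actual:+.6f}")
```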

You might have seen some strange behaviors when fine-tuning LLMs.
🧩 Prior work offers great insights, but we take a different angle: We dive into the dynamics behind these changes, step by step, like force analysis in physics. ⚙️
(1/12)

21.04.2025 05:45 — 👍 7    🔁 0    💬 1    📌 0
Post image

📢 Curious why your LLM behaves strangely after long SFT or DPO?
We offer a fresh perspective—consider doing a "force analysis" on your model’s behavior.
Check out our #ICLR2025 Oral paper:

Learning Dynamics of LLM Finetuning!

(0/12)

21.04.2025 05:45 — 👍 70    🔁 15    💬 1    📌 3
Post image
