Yuda Song's Avatar

Yuda Song

@yus167.bsky.social

PhD at Machine Learning Department, Carnegie Mellon University | Interactive Decision Making | https://yudasong.github.io

1,287 Followers  |  188 Following  |  12 Posts  |  Joined: 18.11.2024

Latest posts by yus167.bsky.social on Bluesky


Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!
Introducing "catastrophic overtraining." 🥁🧵👇

arxiv.org/abs/2503.19206

1/10

26.03.2025 18:35 · 👍 34    🔁 14    💬 1    📌 1

1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:

04.03.2025 20:59 · 👍 59    🔁 11    💬 1    📌 3

super happy about this preprint! we can *finally* perform efficient exploration and find near-optimal stationary policies in infinite-horizon linear MDPs, and even use it for imitation learning :) working with @neu-rips.bsky.social and @lviano.bsky.social on this was so much fun!!

20.02.2025 17:45 · 👍 23    🔁 2    💬 2    📌 1

What are the minimal supervised learning primitives required to perform RL efficiently?

New paper led by my amazing intern Dhruv Rohatgi:

Necessary and Sufficient Oracles: Toward a Computational Taxonomy for Reinforcement Learning

arxiv.org/abs/2502.08632

1/

20.02.2025 23:39 · 👍 25    🔁 4    💬 1    📌 0
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models | alphaXiv

Models can self-improve 🥷 by knowing when they were wrong 🧘‍♀️. But when can they do it?

Across LLM families, tasks, and mechanisms:
this ability scales with pretraining, prefers CoT and non-QA tasks, and more in 🧵

alphaxiv.org/abs/2412.02674
@yus167.bsky.social @shamkakade.bsky.social
📈🤖
#NLP #ML

13.12.2024 23:55 · 👍 24    🔁 3    💬 2    📌 0

On Saturday I will present our LLM self-improvement paper at the Workshop on Mathematics of Modern Machine Learning (M3L) and the Workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM).
bsky.app/profile/yus1...

09.12.2024 19:48 · 👍 2    🔁 0    💬 0    📌 0
The Importance of Online Data: Understanding Preference Fine-tuning via Coverage
Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) ...

arXiv link for HyPO: arxiv.org/abs/2406.01462

09.12.2024 19:48 · 👍 3    🔁 0    💬 1    📌 0
NeurIPS Poster: The Importance of Online Data: Understanding Preference Fine-tuning via Coverage (NeurIPS 2024)

I will present two papers at #NeurIPS2024!

Happy to meet old and new friends and talk about all aspects of RL: data, environment structure, and reward! 😀

In the Wed 11am-2pm poster session I will present HyPO -- the best of both worlds of offline and online RLHF: neurips.cc/virtual/2024...

09.12.2024 19:48 · 👍 9    🔁 2    💬 1    📌 0
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights...

There are many more intriguing results that I cannot fit into one post! For more details, please check out our paper: arxiv.org/abs/2412.02674. This is joint work with amazing collaborators Hanlin Zhang, Carson Eisenach, @shamkakade.bsky.social, Dean Foster, and @ughai.bsky.social. (9/9)

06.12.2024 18:02 · 👍 2    🔁 0    💬 0    📌 0

We also dive deep into the similarities and differences between verification mechanisms. We observe consistency, distinction, and ensemble properties of the verification methods (see the summary image). (8/9)

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

In iterative self-improvement, we observe that the gap diminishes to 0 within a few iterations, consistent with many previous findings. We find that one cause of this saturation is the degradation of the "effective diversity" of the generations due to the imperfect verifier. (7/9)

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

However, self-improvement is not possible on all tasks. We do not observe a significant self-improvement signal on QA tasks like Natural Questions. Also, not all models can self-improve on Sudoku, a canonical example of "verification is easier than generation". (6/9)

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

Our first major result is an observational scaling law: with certain verification methods, the relative gap increases monotonically (almost linearly) with the log of pretraining FLOPs, on tasks like GSM8K and MATH. (5/9)
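As a rough rendering of the trend (the functional form below is my paraphrase of "almost linear in log FLOPs", not the paper's fitted law):

```latex
% Hedged sketch: relative gap as a function of pretraining compute C (FLOPs),
% with task- and verifier-dependent constants a > 0 and b.
\[
  \text{relative gap}(C) \;\approx\; a \cdot \log C + b
\]
```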

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

We propose to use the performance difference between the reweighted and original responses (step 2 minus step 1) -- the "generation-verification gap". We also study the relative gap -- the gap weighted by the error rate. Intuitively, improvement is harder if the model makes fewer mistakes. (4/9)
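To make the two quantities concrete, here is one plausible formalization (my notation, not necessarily the paper's; acc_gen is the accuracy of the original generations and acc_ver the accuracy after self-verification reweighting):

```latex
\[
  \text{gap} \;=\; \mathrm{acc}_{\mathrm{ver}} - \mathrm{acc}_{\mathrm{gen}},
  \qquad
  \text{relative gap} \;=\; \frac{\mathrm{acc}_{\mathrm{ver}} - \mathrm{acc}_{\mathrm{gen}}}{1 - \mathrm{acc}_{\mathrm{gen}}}
\]
% Normalizing by the error rate (1 - acc_gen) captures the intuition that a
% model with few remaining mistakes has little room left to improve.
```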

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

While previous works measure self-improvement using the performance difference between the models (step 3 minus step 1), we found that step 3 (distillation) introduces confounders (for example, the new model can simply be better at following certain formats). (3/9)

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

We study self-improvement as the following process (rough code sketch below):
1. The model generates many candidate responses.
2. The model filters/reweights the responses based on its own verification.
3. The reweighted responses are distilled into a new model.
(2/9)
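A minimal sketch of one round of this loop, assuming a hypothetical model interface with generate, verify, and finetune methods (the names and the 0.5 threshold are illustrative placeholders, not the paper's code):

```python
def self_improvement_round(model, prompts, n_samples=16, threshold=0.5):
    """One round of generate -> self-verify -> distill (hedged sketch)."""
    distill_data = []
    for prompt in prompts:
        # 1. The model generates many candidate responses.
        candidates = [model.generate(prompt) for _ in range(n_samples)]
        # 2. The model scores its own candidates; keep those it judges correct.
        scores = [model.verify(prompt, c) for c in candidates]
        kept = [c for c, s in zip(candidates, scores) if s >= threshold]
        distill_data.extend((prompt, c) for c in kept)
    # 3. Distill the kept (reweighted) responses into a new model.
    return model.finetune(distill_data)
```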

06.12.2024 18:02 · 👍 1    🔁 0    💬 1    📌 0

LLM self-improvement has critical implications for synthetic data, post-training, and test-time inference. To understand LLMs' true capability for self-improvement, we perform large-scale experiments with multiple families of LLMs, tasks, and mechanisms. Here is what we found: (1/9)

06.12.2024 18:02 · 👍 12    🔁 4    💬 1    📌 1

Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
https://arxiv.org/abs/2412.02674

04.12.2024 09:09 · 👍 1    🔁 1    💬 0    📌 0

I think the main difference in terms of interpolation / extrapolation between DPO and RLHF is that the former only guarantees closeness to the reference policy on the training data, while RLHF usually tacks on an on-policy KL penalty. We explored this point in arxiv.org/abs/2406.01462.
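For readers who want the contrast side by side, here are the standard formulations (my rendering of well-known objectives, not new claims from the linked paper):

```latex
% Online RLHF maximizes reward under an on-policy KL penalty to the reference policy:
\[
  \max_{\pi}\;
  \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\!\big[r(x, y)\big]
  \;-\; \beta\, \mathbb{E}_{x}\!\left[\mathrm{KL}\!\big(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\right]
\]
% DPO instead minimizes a loss over fixed offline preference pairs (x, y^+, y^-),
% so closeness to \pi_{ref} is only enforced on that training data.
```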

22.11.2024 15:38 · 👍 4    🔁 1    💬 1    📌 0

(1/n) 💡 How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling the batch size halves the number of optimization steps, until we hit the CBS, beyond which returns diminish.
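One common way to make the tipping point precise comes from the large-batch training literature (a general formalization, not necessarily this paper's exact definition):

```latex
% Let S be optimizer steps and E examples processed to reach a target loss,
% with S_min and E_min their minima over batch sizes. Empirically,
\[
  \left(\frac{S}{S_{\min}} - 1\right)\!\left(\frac{E}{E_{\min}} - 1\right) \;\approx\; 1,
  \qquad
  B_{\mathrm{crit}} \;\approx\; \frac{E_{\min}}{S_{\min}},
\]
% so well below B_crit doubling the batch size roughly halves S, while far
% above it extra data parallelism mostly wastes examples.
```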

22.11.2024 20:19 · 👍 16    🔁 4    💬 2    📌 0

I created a starter pack for people who are or have been affiliated with the Machine Learning Department at CMU. Let me know if I missed someone!

go.bsky.app/QLTVEph

#AcademicSky

18.11.2024 15:46 · 👍 4    🔁 4    💬 0    📌 0

Ojash Neopane, Aaditya Ramdas, Aarti Singh
Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect
https://arxiv.org/abs/2411.14341

22.11.2024 05:01 · 👍 5    🔁 1    💬 0    📌 0

Intro 🦋

I am a final-year PhD student from CMU Robotics. I work on humanoid control, perception, and behavior in both simulation and real life, using mostly RL:

🏃 PHC: zhengyiluo.com/PHC
💫 PULSE: zhengyiluo.com/PULSE
🔩 Omnigrasp: zhengyiluo.com/Omnigrasp
🤖 OmniH2O: omni.human2humanoid.com

19.11.2024 20:34 · 👍 22    🔁 3    💬 2    📌 0

Hi Bsky people 👋 I'm a PhD candidate in Machine Learning at Carnegie Mellon University.
My research focuses on interactive AI, involving:
🤖 reinforcement learning,
🧠 foundation models, and
👩‍💻 human-centered AI.

Also a founding co-organizer of the MineRL competitions 🖤 Follow me for ML updates!

18.11.2024 15:05 · 👍 70    🔁 6    💬 2    📌 0
