Final message: LLMs can improve from failure — if you ask the right question.
“Explain the answer” > “Try again”
Paper: arxiv.org/abs/2507.02834
Joint work with @ruiyang-zhou.bsky.social and Shuozhe Li.
ExPO significantly improves model reasoning on hard tasks.
We plug ExPO into:
• DPO (preference-based)
• GRPO (verifier-based RL)
→ No architecture changes
→ No expert supervision
→ Big gains on hard tasks
Results (Qwen2.5-3B-Instruct, MATH level-5):
Our solution:
Ask the model to explain the correct answer — even when it couldn’t solve the problem.
These self-explanations are:
✅ in-distribution
✅ richer than failed CoTs
✅ better guidance than expert-written CoTs
We train on them. We call it ExPO.
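A minimal sketch of how such self-explanation data could be collected (our reading of the thread; the prompt wording and the `model.generate` call are placeholders, not the paper's exact setup):

```python
# Sketch of ExPO-style self-explanation collection (illustrative only).
def collect_self_explanations(model, problems):
    data = []
    for p in problems:
        prompt = (
            f"Problem: {p['question']}\n"
            f"The correct final answer is {p['answer']}.\n"
            "Explain, step by step, why this answer is correct."
        )
        explanation = model.generate(prompt)  # in-distribution CoT, conditioned on the answer
        data.append({"prompt": p["question"], "target": explanation})
    return data  # used as chosen/positive data for DPO, or as trajectories for GRPO-style updates
```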
Most RL post-training methods only work when the model has some chance of getting answers right. But what if it gets almost everything wrong?
No correct trajectory sampled → no learning signal → the model stays the same (and can even unlearn, since only the KL term is left acting on it)
This happens often in hard reasoning tasks.
For hard reasoning tasks, the chance of sampling a correct answer is low. Thus, sharpening the sampling distribution is not enough, and standard RL post-training fails.
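A tiny numerical illustration of the vanishing signal, using a GRPO-style group-standardized advantage (toy numbers, not from the paper):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard problem: all 8 sampled answers are wrong, so every reward is 0.
print(group_advantages([0, 0, 0, 0, 0, 0, 0, 0]))  # [0.0, 0.0, ..., 0.0]
# All-zero advantages => zero policy gradient; only the KL term moves the model.
```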
New method to crack hard reasoning problems with LLMs!
No expert traces. No test-time hacks.
Just: Self-explanation + RL-style training
Result? Accuracy on MATH level-5 jumped from 2% → 23%.
This has huge practical implications! It opens the door to using small, efficient models as sandboxes to probe, understand, and even steer their much larger counterparts.
Paper: arxiv.org/abs/2506.00653
Joint work with Femi Bello, @anubrata.bsky.social, Fanzhi Zeng, @fcyin.bsky.social
We tested this by learning an affine map between Gemma-2B and Gemma-9B.
The result? Steering vectors (directions for specific behaviors) from the 2B model successfully guided the 9B model's outputs.
For example, a "dog-saying" steering vector from 2B made 9B talk more about dogs!
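A sketch of how that transfer can be set up (illustrative only; `H_small`/`H_large` stand for hidden states collected from the two models on paired prompts at a chosen layer):

```python
import numpy as np

def fit_affine_map(H_small, H_large):
    """Least-squares fit of h_large ≈ W @ h_small + b from paired activations.
    H_small: (n, d_small), H_large: (n, d_large)."""
    X = np.hstack([H_small, np.ones((H_small.shape[0], 1))])  # append bias column
    sol, *_ = np.linalg.lstsq(X, H_large, rcond=None)
    return sol[:-1].T, sol[-1]  # W: (d_large, d_small), b: (d_large,)

def transfer_steering_vector(v_small, W):
    """Map a steering direction found in the small model into the large model's space."""
    v_large = W @ v_small
    return v_large / np.linalg.norm(v_large)

# Usage sketch: W, b = fit_affine_map(H_small, H_large)
# v_dog_9b = transfer_steering_vector(v_dog_2b, W)  # then add to 9B activations at inference
```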
Here's the core idea: We hypothesize that models trained on similar data learn a **universal set of basis features**. Each model's internal representation space is just a unique, model-specific projection of this shared space.
This means representations learned across models are transferable!
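One way to write the hypothesis down (our notation, a sketch rather than the paper's exact statement):

```latex
% Shared features z(x); model-specific, approximately linear read-outs:
h_A(x) \approx P_A\, z(x), \qquad h_B(x) \approx P_B\, z(x).
% If P_A admits a left inverse P_A^{+}, the two representation spaces are related
% by a single input-independent map (affine, once bias terms are allowed):
h_B(x) \;\approx\; P_B P_A^{+}\, h_A(x) \;=\; M\, h_A(x).
```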
What if you could understand and control an LLM by studying its *smaller* sibling?
Our new paper introduces the Linear Representation Transferability Hypothesis. We find that the internal representations of different-sized models can be translated into one another using a simple linear (affine) map.
4/4 Joint work with Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang
Paper: arxiv.org/abs/2410.13828
Check out our work at the NeurIPS AFM workshop, Exhibit Hall A, 12/14, 4:30 - 5:30 pm #NeurIPS2024
3/4 Wondering how to **resolve** the problems caused by gradient entanglement? Our theoretical framework highlights new algorithmic ideas (a sketch of the token-masking idea follows after this list):
- Normalized preference optimization: normalize the chosen and rejected gradients
- Sparse token masking: impose sparsity on the tokens used to compute the margins
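Here is the promised sketch of the token-masking idea, for a DPO-style margin (illustrative only, not the paper's exact objective):

```python
import torch

def masked_margin(chosen_logps_tok, rejected_logps_tok, token_mask):
    """Sparse-token margin: only masked-in tokens contribute to the sequence
    log-probs whose difference forms the *PO margin (illustrative sketch)."""
    chosen = (chosen_logps_tok * token_mask).sum(-1)
    rejected = (rejected_logps_tok * token_mask).sum(-1)
    return chosen - rejected  # feed into -logsigmoid(beta * margin) as usual

# Example: drop the long shared prefix so near-identical tokens don't entangle the gradients.
chosen_logps = torch.tensor([[-0.1, -0.2, -0.3, -0.4]])
rejected_logps = torch.tensor([[-0.1, -0.2, -0.9, -1.2]])
suffix_only_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0]])  # keep only the differing suffix
print(masked_margin(chosen_logps, rejected_logps, suffix_only_mask))  # tensor([1.4000])
```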
2/4 The Gradient Entanglement effect becomes particularly concerning when the chosen and rejected gradient inner product is large, which often happens when the two responses are similar!
1/4 We demystify the reason behind the synchronized change in chosen and rejected logps: the **Gradient Entanglement** effect! For any margin-based loss (esp. the *PO objectives), the update to the chosen probability depends on the rejected gradient, and vice versa.
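In symbols, for vanilla DPO (our notation; the paper covers general margin-based losses):

```latex
% Margin loss on a pair (y_w, y_l):  L(\theta) = -\log\sigma(\beta\,[\Delta_w - \Delta_l]),
% with \Delta_{w/l} = \log \pi_\theta(y_{w/l}\mid x) - \log \pi_{\mathrm{ref}}(y_{w/l}\mid x).
\nabla_\theta L = -\,c(\theta)\big[\nabla_\theta \log \pi_\theta(y_w\mid x) - \nabla_\theta \log \pi_\theta(y_l\mid x)\big],
\qquad c(\theta) = \beta\,\sigma\big(-\beta[\Delta_w - \Delta_l]\big) > 0.
% A gradient step of size \eta therefore changes the chosen log-prob by (to first order)
\Delta \log \pi_\theta(y_w\mid x) \approx \eta\, c(\theta)\Big(\big\|\nabla_\theta \log \pi_\theta(y_w\mid x)\big\|^2
 - \big\langle \nabla_\theta \log \pi_\theta(y_w\mid x),\, \nabla_\theta \log \pi_\theta(y_l\mid x)\big\rangle\Big),
% and symmetrically for the rejected response: each update is tied to the other response's gradient.
```

When the inner product of the two gradients is large (similar responses), the chosen log-prob can stall or even shrink while the rejected one grows, which is exactly the synchronized behavior in question.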
Ever wondered why there are synchronized ups and downs for chosen and rejected log-probs during DPO (and most *POs: IPO, SimPO, CPO, R-DPO, DPOP, RRHF, SlicHF) training? Why do chosen logps decrease, and rejected logps sometimes increase?
Our answer: Gradient Entanglement!
arxiv.org/abs/2410.13828
4/4 Joint work with Xinyu Li, @ruiyang-zhou.bsky.social,
@zacharylipton.bsky.social
Paper: arxiv.org/abs/2402.05133, Code: github.com/HumainLab/Personalized_RLHF
Check out our work at the NeurIPS AFM workshop, Exhibit Hall A, 12/14, 4:30 - 5:30 pm #NeurIPS2024
3/4 Beyond user preferences indicated in explicit textual format, P-RLHF can learn the nuanced implicit preferences encoded in user preference data. On the largest publicly available preference dataset based on multi-turn dialog (PRISM), P-RLHF outperforms all strong baselines in winrate by 10-20%.
2/4 For any base preference optimization (*PO) algorithm, P-RLHF can create its corresponding personalized version P-*PO, allowing for **flexible** choice of alignment algorithms.
1/4 Personalized-RLHF (P-RLHF) uses a **light-weight** user model to learn user embeddings, which serve as a soft prompt for generating personalized responses. The user model is much smaller (10-100x) than the LoRA adapters used for fine-tuning the language model.
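A rough sketch of what such a light-weight user model can look like (illustrative only; names and sizes are our guesses, see the paper/code for the actual design):

```python
import torch
import torch.nn as nn

class UserSoftPrompt(nn.Module):
    """Map a user id to a few soft-prompt vectors prepended to the LM's input
    embeddings (illustrative sketch, not the repo's exact implementation)."""
    def __init__(self, num_users: int, prompt_len: int, hidden_dim: int):
        super().__init__()
        self.user_prompts = nn.Embedding(num_users, prompt_len * hidden_dim)
        self.prompt_len, self.hidden_dim = prompt_len, hidden_dim

    def forward(self, user_ids: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # user_ids: (batch,), token_embeds: (batch, seq_len, hidden_dim)
        soft = self.user_prompts(user_ids).view(-1, self.prompt_len, self.hidden_dim)
        return torch.cat([soft, token_embeds], dim=1)  # personalized input to the LM

# e.g. UserSoftPrompt(num_users=100, prompt_len=4, hidden_dim=2048) has ~0.8M parameters.
```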
How to **efficiently** build personalized language models **without** textual info on user preferences?
Our Personalized-RLHF work:
- light-weight user model
- personalize all *PO alignment algorithms
- strong performance on the largest personalized preference dataset
arxiv.org/abs/2402.05133
We're hiring a fully-funded Ph.D. student in Use-Inspired AI @ UT Austin starting Fall 2025! Join us to work on impactful AI/ML research addressing real-world challenges.
Learn more & apply: t.co/OPrxO3yMhf