Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

Reinforcement Learning PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Ex intern @MSFTResearch @deepseek_ai | DeepSeek-V2, DeepSeek-VL, DeepSeek-Prover

110 Followers  |  21 Following  |  8 Posts  |  Joined: 19.11.2024

Latest posts by benjamin-eecs.bsky.social on Bluesky

Co-first authors: @LeonGuertler @simon_ycl @zzlccc, advisor @natashajaques.bsky.social
Team: @QPHutu @danibalcells @mickel_liu C.Tan @shi_weiyan @mavenlin W.S.Lee
@NUSingapore @ASTARsg @Northeastern @UW 🚀

01.07.2025 20:13 | 👍 3    🔁 0    💬 0    📌 0

New paradigm: instead of curating problems, create environments where models discover reasoning through competition.
Self-play = autonomous improvement without human supervision. Simple games improve general reasoning!
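For readers who want the shape of this setup: a minimal sketch of a competitive two-player environment where both seats are driven by the same policy, so the model trains against itself. All names here are illustrative, not SPIRAL's actual code.

```python
# Minimal two-player text-game self-play loop (illustrative only).
import random

class TicTacToe:
    """Tiny turn-based environment: two agents alternate moves."""
    def __init__(self):
        self.board = [" "] * 9
        self.player = 0  # player 0 plays "X", player 1 plays "O"

    def legal_moves(self):
        return [i for i, c in enumerate(self.board) if c == " "]

    def step(self, move):
        self.board[move] = "XO"[self.player]
        winner = self._winner()
        done = winner is not None or not self.legal_moves()
        self.player = 1 - self.player
        return done, winner

    def _winner(self):
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
        for a, b, c in lines:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return "XO".index(self.board[a])
        return None

def play_episode(policy):
    """Both seats are played by the SAME policy -> self-play."""
    env, trajectory = TicTacToe(), []
    done, winner = False, None
    while not done:
        move = policy(env)                     # the model picks a move
        trajectory.append((env.player, move))  # record who moved where
        done, winner = env.step(move)
    return trajectory, winner

random_policy = lambda env: random.choice(env.legal_moves())
traj, winner = play_episode(random_policy)
print(f"{len(traj)} moves, winner: {winner}")
```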

01.07.2025 20:11 | 👍 5    🔁 1    💬 1    📌 1

We developed Role-conditioned Advantage Estimation (RAE) to stabilize training.
Without RAE: "thinking collapse" - responses crash 3500→0 chars, math drops 66%
RAE keeps reasoning alive!
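RAE's core trick, as I read the post: each player role keeps its own reward baseline, so a trajectory's advantage is measured against what that role usually earns rather than a single shared baseline that zero-sum rewards would wash out. A toy sketch of that idea; the EMA baseline and hyperparameters are my assumptions, not the paper's implementation.

```python
# Role-conditioned baseline sketch: one running baseline per (game, role),
# advantages computed against the role's OWN baseline. Illustrative only.
from collections import defaultdict

class RoleConditionedAdvantage:
    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # one EMA baseline per (game, role)

    def advantage(self, game, role, reward):
        key = (game, role)
        adv = reward - self.baseline[key]
        # update the role-specific exponential moving average
        self.baseline[key] = self.decay * self.baseline[key] + (1 - self.decay) * reward
        return adv

rae = RoleConditionedAdvantage()
print(rae.advantage("kuhn_poker", role=0, reward=+1.0))
print(rae.advantage("kuhn_poker", role=1, reward=-1.0))
```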

01.07.2025 20:11 | 👍 1    🔁 0    💬 1    📌 0

Multi-game magic:
Single game: ~41% reasoning average
Multi-game: 42.7% - skills synergize!
Even strong models improve:
DeepSeek-R1-Distill-Qwen-7B jumps 59.7%→61.7%. AIME'25 +10 points! 📈

01.07.2025 20:11 | 👍 2    🔁 0    💬 1    📌 0

Different games → different skills:
TicTacToe → spatial (56% on Snake)
Kuhn Poker → probabilistic (91.7% on Pig Dice!)
Simple Negotiation → strategic (55.8% on Truth & Deception)
Each game develops distinct abilities!

01.07.2025 20:11 | 👍 2    🔁 0    💬 1    📌 0

Why self-play? We compared approaches:
Self-play: 39.7% math, 47.8% general reasoning
Fixed opponents: much worse
Random opponents: complete collapse
Key: as you improve, so does your opponent. Fixed opponents become too easy.
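The intuition in toy form: under self-play the opponent is the current policy itself, so difficulty scales automatically with the learner. A sketch with placeholder policies; none of these names come from the paper.

```python
# The three opponent regimes from the comparison above (illustrative).
import random

def make_opponent(mode, current, frozen):
    if mode == "self_play":
        return current            # opponent improves whenever you do
    if mode == "fixed":
        return frozen             # static checkpoint: eventually too easy
    return lambda moves: random.choice(moves)  # random: no pressure at all

current = lambda moves: moves[0]   # stand-in for the latest policy
frozen = lambda moves: moves[-1]   # stand-in for a frozen checkpoint
opponent = make_opponent("self_play", current, frozen)
assert opponent is current  # self-play: the opponent is literally yourself
```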

01.07.2025 20:11 | 👍 3    🔁 0    💬 1    📌 0

To understand poker→math transfer, we identified 3 recurring reasoning patterns:
πŸ“Š Expected Value Calculation
πŸ” Case-by-Case Analysis
🎯 Pattern Recognition
These patterns from games transfer to math benchmarks. Games teach generalizable thinking!
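To make the first pattern concrete, here is the kind of case-by-case expected-value computation the post describes, as it arises in Kuhn Poker. The 70/30 belief is an illustrative number, not a figure from the paper.

```python
# Worked instance of the "Expected Value Calculation" pattern.
def expected_value(outcomes):
    """EV = sum over outcomes of probability * payoff."""
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * payoff for p, payoff in outcomes)

# Holding the Queen and facing a bet (1-chip ante, 1-chip bet):
# if the bettor holds the King (say 70%), calling loses 2 chips;
# if the Jack (30%), calling wins 2 chips. Folding forfeits the ante.
ev_call = expected_value([(0.7, -2.0), (0.3, +2.0)])   # -0.80
ev_fold = -1.0
print(f"EV(call)={ev_call:+.2f} vs EV(fold)={ev_fold:+.2f}")  # call edges out fold
```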

01.07.2025 20:11 | 👍 2    🔁 0    💬 1    📌 1

We're excited about self-play unlocking continuously improving agents. RL selects CoT patterns from LLMs. Games = perfect testing grounds.
SPIRAL: models learn via self-competition. Kuhn Poker → +8.7% math, +18.1 on Minerva Math! 🃏
Paper: huggingface.co/papers/2506....
Code: github.com/spiral-rl/spiral

01.07.2025 20:11 | 👍 17    🔁 5    💬 2    📌 1

Natural Language Reinforcement Learning (NLRL) redefines Reinforcement Learning (RL).

NLRL's main idea:
The core components of RL, such as goals, strategies, and evaluation methods, are recast in natural language instead of rigid math.

Let's explore this approach in more detail 🧵
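A minimal sketch of what "evaluation in natural language" can mean in practice, assuming a generic LLM call; the function names are mine, not NLRL's API.

```python
# The "value" of a state becomes a natural-language evaluation rather
# than a scalar V(s). `query_llm` is a placeholder for any
# chat-completion call; nothing here is the paper's actual interface.
def query_llm(prompt: str) -> str:
    # stub so the sketch runs; swap in a real model call in practice
    return "X controls the center and threatens two lines; X is favored."

def language_value(state_description: str) -> str:
    """Natural-language analogue of a value function V(s)."""
    prompt = (
        "Evaluate this game position.\n"
        f"Position: {state_description}\n"
        "Say who is favored and why, in two sentences."
    )
    return query_llm(prompt)

print(language_value("Tic-tac-toe: X on squares 0 and 4, O on 1, X to move."))
```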

29.11.2024 15:06 | 👍 2    🔁 1    💬 1    📌 1
