
Petr Baudis (pasky)

@xpasky.bsky.social

Rossum.ai. A variety of OSS & AI things in the past (Git, glibc, pre-AlphaGo Pachi, OpenTTD, ...). Your computer might be running some of my code (sorry).

176 Followers  |  707 Following  |  83 Posts  |  Joined: 27.11.2023

Latest posts by xpasky.bsky.social on Bluesky

Is this about SotA AI in general or comparative to Gemini and Claude?

24.05.2025 14:05 — 👍 0    🔁 0    💬 0    📌 0
GitHub - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

More details: github.com/huggingface/...

06.01.2025 13:32 — 👍 5    🔁 0    💬 0    📌 0

Definitely believe them regarding the technical capabilities of the models. (OK, maybe add a 6-12 month buffer.)

Where they are imho over-indexing is on the maximum realistic speed at which the real world can adapt.

Adoption of even the most amazing stuff will take time and needs a lot of infra.

06.01.2025 13:27 — 👍 3    🔁 1    💬 0    📌 0

Post image

i was annoyed that many of my chrome tabs with PDF papers had uninformative titles, so i created a small chrome extension to fix it.

i've been using it for a while now, works well.

today i put it on github. enjoy.

github.com/yoavg/pdf-ta...

05.01.2025 22:22 — 👍 98    🔁 22    💬 5    📌 1
GamesDoneQuick - Twitch: Awesome Games Done Quick 2025 - Benefiting the Prevent Cancer Foundation - Ori and the Blind Forest

good news: despite all the bleak shit going on in the world it is once again Awesome Games Done Quick week. go watch some speedruns www.twitch.tv/gamesdonequick

05.01.2025 22:07 — 👍 29    🔁 11    💬 2    📌 0

I reposted the thread here! :)

bsky.app/profile/xpas...

05.01.2025 00:30 — 👍 1    🔁 1    💬 0    📌 0

Will be back with more later - by losing MCTS we also lost the exploration policy; how do we plug it back in?

This is a repost of a Twitter thread I made yesterday - an experiment to see whether I can reach the BSky DL audience. Twitter's LLM scene is very lively, and I'd love to see more of that here.

'nite! 16/16

05.01.2025 00:29 — 👍 1    🔁 0    💬 0    📌 0
Post image

And the pseudocode algorithm for quick reference. 15/n

05.01.2025 00:26 — 👍 0    🔁 0    💬 1    📌 0
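The pseudocode image doesn't carry over to this text mirror, so here is a rough, runnable stand-in for the loop as the thread describes it. Everything below is a placeholder (random numbers instead of a real LLM, fake "updates"); it only sketches the data flow, not the paper's actual code.

```python
import random

BETA = 0.05  # scale on the implicit rewards; value chosen arbitrarily for this toy

def sample_cot(query):                    # stand-in for sampling a CoT rollout from the policy
    return [f"{query}-tok{i}" for i in range(4)]

def verify(query, cot):                   # stand-in for checking the final answer
    return random.random() < 0.5

def logprob(model, token):                # stand-in for a per-token log-probability
    return -random.random() - model["bias"]

policy, reward_llm, reference = {"bias": 0.5}, {"bias": 0.4}, {"bias": 0.5}

for step in range(3):                     # the interleaved training loop
    query = f"q{step}"
    cot = sample_cot(query)
    outcome = 1.0 if verify(query, cot) else 0.0              # sparse outcome reward

    # Dense "implicit" process rewards: reward-LLM log-probs vs. the frozen reference.
    token_rewards = [BETA * (logprob(reward_llm, t) - logprob(reference, t)) for t in cot]

    reward_llm["bias"] -= 0.01 * (1.0 - outcome)   # placeholder for the ORM-style update
    policy["bias"] -= 0.01 * sum(token_rewards)    # placeholder for the PPO step on dense rewards
    print(step, outcome, [round(r, 3) for r in token_rewards])
```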
Post image

But this is the gist of the magic. And it results in a reported 38x convergence speedup compared to MCTS, plus impressive benchmark gains. 14/n

05.01.2025 00:25 — 👍 1    🔁 0    💬 1    📌 0

PRIME of course also contains tons of important technical details. (A PPO policy with alternative-normalized advantages instead of raw rewards. The initial finetuning LLM snapshot staying fixed as a reference, and considering only token logit changes relative to it, makes the math work. Formally proving it's >> MCTS...) 13/n

05.01.2025 00:25 — 👍 0    🔁 0    💬 1    📌 0
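A tiny worked example of the reference-snapshot point in this post, as I read it: the implicit per-token reward is the change in a token's log-probability between the finetuned reward LLM and the frozen initial snapshot, scaled by some beta. The numbers below are invented.

```python
import numpy as np

beta = 0.05
logp_reward_llm = np.array([-1.2, -0.4, -2.0, -0.1])   # log p_phi(token | prefix), made up
logp_reference  = np.array([-1.2, -0.9, -1.5, -0.8])   # log p_ref(token | prefix), made up

implicit_token_rewards = beta * (logp_reward_llm - logp_reference)
print(implicit_token_rewards.round(3))   # positive where the reward LLM now prefers the token
```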

Unlike just using ORM approach alone, this introduces an accretive effect - information on what works is shared across training batches through the reward LLM, and as it learns, it produces better guidance and the convergence of the main LLM speeds up. 12/n

05.01.2025 00:23 — 👍 0    🔁 0    💬 1    📌 0

"Continuously" learned? The reward model LLM's epochs and the main LLM's epochs are interleaved - the estimates are learned in parallel with finetuning the main model, a sort of expectation-maximization dance. 11/n

05.01.2025 00:23 — 👍 0    🔁 0    💬 1    📌 0

...and this extra LLM is then used as a Process RM assigning a reward to each token based on its continuously learned estimate of how much that token is helpful. 10/n

05.01.2025 00:23 — 👍 0    🔁 0    💬 1    📌 0

Well, how do we know how to reward each token then? Why, by finetuning an *extra* copy of your LLM internally to use as a per-token reward model. This extra LLM copy is finetuned using the Outcome RM approach (so sparse rewards just encouraging tokens that lead to good final outcomes)... 9/n

05.01.2025 00:22 — 👍 0    🔁 0    💬 1    📌 0
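One plausible shape of "finetune the extra LLM copy on outcomes only" (my assumption about the form, not checked against the paper's code): treat the summed per-token log-prob change vs. the reference as a logit for "this rollout ends in the right answer" and train with binary cross-entropy.

```python
import numpy as np

def outcome_loss(logp_reward_llm: np.ndarray, logp_ref: np.ndarray,
                 correct: bool, beta: float = 0.05) -> float:
    # Sequence-level implicit reward: scaled sum of per-token log-prob changes
    # of the reward LLM relative to the frozen reference snapshot.
    logit = beta * float((logp_reward_llm - logp_ref).sum())
    p_correct = 1.0 / (1.0 + np.exp(-logit))                   # sigmoid
    return float(-np.log(p_correct) if correct else -np.log(1.0 - p_correct))

# One correct and one incorrect rollout, with made-up per-token log-probs:
print(outcome_loss(np.array([-0.4, -0.1]), np.array([-0.9, -0.8]), correct=True))
print(outcome_loss(np.array([-1.4, -1.1]), np.array([-0.9, -0.8]), correct=False))
```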

5. Finally, PRIME (Implicit Rewards PRM)!

The basic question is: Instead of MCTS-style evaluation of each CoT step by N rollouts, could we just run a beam search of N CoT rollouts from start to end? 8/n

05.01.2025 00:22 — 👍 0    🔁 0    💬 1    📌 0
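Back-of-envelope on why that question matters (numbers invented for illustration): per-step Monte Carlo scoring multiplies rollouts by CoT length and by the number of candidate steps, while N start-to-end rollouts per query do not.

```python
# Illustrative cost comparison, not figures from the paper.
cot_steps, candidates_per_step, n_rollouts = 20, 4, 10

mcts_style   = cot_steps * candidates_per_step * n_rollouts  # score every candidate step by rollouts
full_rollout = n_rollouts                                    # just N start-to-end rollouts per query

print(mcts_style, full_rollout)   # 800 vs 10 rollouts for a single query
```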

The problem now is that you need to roll out the CoT ten times for each candidate - a Monte Carlo approach. This is not efficient as you are wasting a lot of time on stupid CoT step candidates and lost causes. 7/n

05.01.2025 00:21 — 👍 0    🔁 0    💬 1    📌 0

4. Enter MCTS-inspired approaches for automated PRM supervision. Given a few next possible CoT steps, which one is more helpful, can we tell automatically? Well, try rolling out the rest of the CoT ten times for each candidate, and see which one reaches the right answer most frequently! 6/n

05.01.2025 00:21 — 👍 0    🔁 0    💬 1    📌 0
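A runnable toy of that trick, with `toy_rollout` standing in for sampling the rest of the CoT from the LLM (the probabilities here are invented):

```python
import random

def mc_step_value(prefix, candidate, rollout, gold, n=10):
    # Finish the chain n times from this candidate step and count correct answers.
    hits = sum(rollout(prefix + [candidate]) == gold for _ in range(n))
    return hits / n                      # fraction of rollouts reaching the right answer

def toy_rollout(cot):
    p_correct = 0.8 if "promising" in cot[-1] else 0.2
    return "42" if random.random() < p_correct else "17"

for cand in ["promising step", "dead-end step"]:
    print(cand, mc_step_value(["question"], cand, toy_rollout, gold="42"))
```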

3. So let's give per-CoT-step reinforcement using a PRM (Process Reward Model). Like teaching humans: don't just look at the final result, tell them whether their approach was good.

Naive idea: just use per-step human supervision. But that's obviously unsustainable, too little data. 5/n

05.01.2025 00:21 — 👍 0    🔁 0    💬 1    📌 0
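To make the contrast concrete, a tiny invented example of what per-step (PRM-style) labels look like next to the single outcome label a rollout would otherwise get:

```python
cot_steps = ["restate the problem", "set up the wrong equation", "fix the equation", "solve: 42"]

prm_labels = [+1, -1, +1, +1]   # per-step judgments - this is what needs human labels at scale
orm_label  = +1                 # one label for the whole rollout: the final answer was right

for label, step in zip(prm_labels, cot_steps):
    print(f"{label:+d}  {step}")
print("outcome:", orm_label)
```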

The problem is that each rollout gives you only final-outcome info, with no sense of whether any particular CoT step actually helped move towards the result. Convergence is slow, and so is OOD generalization, etc.

4/n

05.01.2025 00:19 — 👍 0    🔁 0    💬 1    📌 0

2. The basic approach is to use an ORM (Outcome Reward Model) - try answering a query by rolling out a CoT, and check whether it led to the right answer. This gives a positive/negative reinforcement to each token in the CoT (every token in that particular CoT gets the same reward).

3/n

05.01.2025 00:18 — 👍 0    🔁 0    💬 1    📌 0
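A toy illustration of the ORM setup above: every token in a rollout gets the exact same reward, decided only by whether the final answer checks out (the checker here is just string equality):

```python
def outcome_rewards(cot_tokens, final_answer, gold):
    # +1 for every token if the final answer matches, 0 for every token otherwise.
    r = 1.0 if final_answer.strip() == gold.strip() else 0.0
    return [r] * len(cot_tokens)

print(outcome_rewards(["let's", "compute", "6*7", "=", "42"], "42", "42"))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```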

1. We are RL tuning an LLM to produce good CoTs (chains of thought).

(Good == leading step by step to correct answers to complex queries.)

2/n

05.01.2025 00:18 — 👍 0    🔁 0    💬 1    📌 0
Post image

Quick primer for non-wizards about the post-MCTS LLM reasoning future:

How will LLMs learn to reason efficiently?

No math in this thread, ~simple words only! Let's go through the "Process Reinforcement through IMplicit REwards" (PRIME) method. 1/n

curvy-check-498.notion.site/Process-Rein...

05.01.2025 00:17 — 👍 2    🔁 0    💬 1    📌 1

Science/maths/programming have a tendency to depreciate the value of grinding - smart people don't grind! - yes, that project basically took me only 30 minutes. The 3 and a half hours of dead ends I went down obviously don't count. Or the hour I spent installing the wrong package.

24.12.2024 13:27 — 👍 59    🔁 8    💬 5    📌 0

Yes

14.12.2024 00:56 — 👍 0    🔁 0    💬 0    📌 0

Fun fact: AlphaGo etc. actually suck at being deliberative; they are all about "iterative intuition deepening". They don't even "cache" local sequence outcomes.

Even when permitted only a very tiny search tree, AlphaGo is better than 99%+ of serious human Go players.

13.12.2024 13:08 — 👍 0    🔁 0    💬 0    📌 0
Fan Xiping and Shi Ding'an at Sensei's Library Sensei's Library, page: Fan Xiping and Shi Ding'an, keywords: Culture & History, People. SL is a large WikiWikiWeb about the game of Go (Baduk, Weiqi). It's a collaboration and community site. Everyon...

Like these ancient Go (Weiqi) players...

"Fan was said to have played very quickly, while Shi very slowly. Fan would sometimes go on a picnic, sing songs or take a nap while Shi was thinking." senseis.xmp.net?FanXipingAnd...

13.12.2024 13:08 — 👍 0    🔁 0    💬 1    📌 0

"Is System 2 thinking even real, chat?"

Humans vary widely between being very intuitive and being very deliberative.

Human-level AGI could plausibly exist even purely as System 1, or with only a very basic System 2 mixed in. Just git gud at superhuman intuition.

13.12.2024 13:07 — 👍 2    🔁 0    💬 1    📌 0
Post image

What can RBMK teach us about building AGI?

07.12.2024 20:11 — 👍 0    🔁 1    💬 0    📌 0
Post image

Can we build something lasting in a world where technology is easily reproducible?

How much time do we have before Artificial General Intelligence (AGI) is here, and how much time do we have afterwards? 2/n

09.12.2024 16:07 — 👍 1    🔁 1    💬 1    📌 0
