
Thông Nguyễn

@machine1235.bsky.social

I like functions. I have trained functions to play Go, generate speech, and more. I also wrote a JAX library to train functions. I want to understand these functions.

81 Followers  |  953 Following  |  27 Posts  |  Joined: 20.11.2024

Latest posts by machine1235.bsky.social on Bluesky


Using pdb is the most important skill a Python developer can learn.
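
For example (a toy buggy_mean function of my own, just to show the workflow), calling the built-in breakpoint() drops you straight into pdb:

```python
def buggy_mean(xs):
    total = 0
    for x in xs:
        total += x
        breakpoint()  # enters pdb here; try `p total`, `p x`, `n` to step, `c` to continue
    return total / len(xs)

print(buggy_mean([1, 2, 3]))
```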

21.04.2025 15:51 — 👍 0    🔁 0    💬 0    📌 0
GitHub - policy-gradient/GRPO-Zero: Implementing DeepSeek R1's GRPO algorithm from scratch

Oh hey, I have a new weekend project: Implementing GRPO from scratch with (almost) zero dependencies.
github.com/policy-gradi...

20.04.2025 15:16 — 👍 0    🔁 0    💬 0    📌 0

We are... their moon. 😅

20.04.2025 15:11 — 👍 1    🔁 0    💬 0    📌 0

These two reasons are just my speculation. I'd be happy if anyone could prove me wrong, or right.

20.04.2025 15:07 — 👍 0    🔁 0    💬 0    📌 0

So when the model eventually generates a correct answer and receives a high reward, its internal hidden states already contain information about which past tokens were important in producing the correct final answer, which effectively solves the credit assignment problem.

20.04.2025 15:07 — 👍 0    🔁 0    💬 1    📌 0

Another reason could be the attention mechanism, which seems to help significantly with the credit assignment problem. During pretraining, LLMs learn to predict the next token, and the attention mechanism is trained to use past tokens to improve the prediction of the current token.
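
As a reminder of the mechanism (a minimal single-head sketch in PyTorch; the names and shapes are my own), causal attention lets every position read from all earlier positions:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head causal self-attention: token t attends only to tokens <= t.

    q, k, v: tensors of shape (seq_len, head_dim).
    """
    scores = q @ k.T / q.shape[-1] ** 0.5
    seq_len = scores.shape[0]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v              # weighted sum over past tokens
```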

20.04.2025 15:07 — 👍 0    🔁 0    💬 1    📌 0

One possible reason for this is that there's no real interaction with an external environment. Every state/action is internal. In other words, the "environment" is essentially the model itself, apart from the final reward. So in a sense, we're already doing model-based RL.

20.04.2025 15:07 — 👍 0    🔁 0    💬 1    📌 0

Why is this the case? Why can a model so easily learn to generate tens of thousands of tokens of CoT, despite receiving a sparse reward only at the end? And why can it succeed even with the most basic policy gradient algorithm?

20.04.2025 15:07 — 👍 0    🔁 0    💬 1    📌 0

I've been thinking a lot about training LLMs with reinforcement learning lately. One thing that surprises me is how easy it is to train LLMs to generate chain-of-thought reasoning using RL, even with extremely simple algorithms like GRPO, which is essentially just the vanilla REINFORCE algorithm.
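
Sketched in PyTorch (a simplified version that omits the clipping and KL terms of full GRPO; logprobs here is the summed token log-probability of each sampled completion), the update is just group-normalized REINFORCE:

```python
import torch

def grpo_loss(logprobs, rewards, group_size):
    """Group-relative REINFORCE: advantages are rewards normalized within each
    group of completions sampled for the same prompt."""
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-6)
    # REINFORCE objective: push up log-probs of completions with above-average reward
    return -(adv.view(-1).detach() * logprobs).mean()
```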

20.04.2025 15:07 — 👍 0    🔁 0    💬 1    📌 0

every time i look at it, it still amazes me how ridiculously simple flow matching training is...
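
The whole training step fits in a few lines (a sketch assuming the linear-interpolation / rectified-flow formulation, with model(x_t, t) predicting the velocity field):

```python
import torch

def flow_matching_loss(model, x1):
    """One flow matching training step: regress the velocity along a straight
    path from noise x0 to data x1."""
    x0 = torch.randn_like(x1)                             # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                           # point on the straight path
    target_velocity = x1 - x0                             # constant velocity of that path
    return ((model(x_t, t) - target_velocity) ** 2).mean()
```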

24.12.2024 17:36 — 👍 0    🔁 0    💬 0    📌 0

This is somewhat similar to LLMs using chain-of-thought (CoT) reasoning versus generating a solution in one go. In CoT reasoning, the model breaks down a problem into smaller, sequential steps.

On an unrelated note, can we train latent diffusion models for text? 🤔 4/4

29.11.2024 05:34 — 👍 0    🔁 0    💬 0    📌 0

Diffusion models, on the other hand, generate an image through multiple supervised steps. We break down the generation process into smaller tasks and guide the model through them.

This leads to much more stable training and better overall results. 3/4
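
Concretely, each of those supervised steps is a plain denoising regression (a sketch of the epsilon-prediction DDPM loss; model and alphas_cumprod are illustrative):

```python
import torch

def ddpm_loss(model, x0, alphas_cumprod):
    """One diffusion training step: add noise at a random step t, then ask the
    model to predict the noise that was added."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # noised sample at step t
    return ((model(x_t, t) - noise) ** 2).mean()          # supervised regression target
```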

29.11.2024 05:34 — 👍 0    🔁 0    💬 1    📌 0

GANs generate an image in a single step, from random noise to the final output directly. The model has to learn the entire process in one go, using a limited number of layers in the network. 2/4

29.11.2024 05:34 — 👍 0    🔁 0    💬 1    📌 0

I was thinking about the following question today: What makes diffusion models better than GANs in generative modeling? 🤔 1/4

29.11.2024 05:34 — 👍 0    🔁 0    💬 1    📌 0

Adding a data point that I am quite familiar with: currently, almost all SoTA audio generation models use GANs.

28.11.2024 04:25 — 👍 4    🔁 0    💬 0    📌 0

So, what's next? If pre-training is slowing down, maybe post-training is about to become more important than ever!

23.11.2024 17:42 — 👍 0    🔁 0    💬 0    📌 0

But now, we've exhausted most of the available internet text data. We can't keep scaling in the same way. We can still train models longer by repeating the data, but gains diminish after a few epochs.

23.11.2024 17:42 — 👍 0    🔁 0    💬 1    📌 0

The wall isn't about LLMs like Gemini, Claude, or ChatGPT stagnating. These models will continue to improve; that's not in question.

The wall is about diminishing returns in scaling LLM pre-training. From GPT-1 to GPT-4, we've benefited from scaling model size and dataset size together.

23.11.2024 17:42 — 👍 1    🔁 0    💬 1    📌 0

There's some confusion about the "wall" that many AI people are talking about. First, it's not an AI winter, far from it. AI progress is rapid: we're seeing breakthroughs in music generation, image generation, text-to-speech, video generation, protein folding, and more. We're in a golden age of AI.

23.11.2024 17:42 — 👍 0    🔁 0    💬 1    📌 0


Full-on reinforcement learning, based on the interaction of the agent with the environment!

The LLM is the agent, and the user is the environment. With OpenAI having hundreds of millions of monthly active users, I think they can do RL with real-world interactions to keep improving their models.

22.11.2024 09:49 — 👍 2    🔁 0    💬 0    📌 0

If scaling LLM pre-training is hitting a wall of diminishing returns, as we have already trained the model on all the data available on the internet, what will help us move forward? 🤔

I've been thinking about this question for a while, and I believe the way forward is scaling LLM post-training.

22.11.2024 09:49 — 👍 1    🔁 0    💬 1    📌 0

A starter pack for people who love training and understanding big functions:
go.bsky.app/6NeJ1FW

22.11.2024 07:13 — 👍 0    🔁 0    💬 0    📌 0

But I have to say, I agree with all of his points. Long context windows are the future: a future where models have full context for a given task and remember all the history, every interaction with the user. Steven explains it far better than I could, and it's well worth a read.

21.11.2024 16:21 — 👍 0    🔁 0    💬 0    📌 0

Anyway, coming back to the essay: yes, Steven might be a bit biased, given that he works for Google, which provides the Gemini 1.5 models with the largest context window on the market right now (2M tokens). Naturally, he's inclined to advocate for models with long context windows.

21.11.2024 16:21 — 👍 0    🔁 0    💬 1    📌 0

I found this very interesting: using LLMs (AIs) to assist people in their tasks by copying an expert's workflow. In this case, NotebookLM acts as a helpful research assistant for writers. People should adopt this approach for their AI products!

21.11.2024 16:21 — 👍 0    🔁 0    💬 1    📌 0

I found something fascinating about his involvement in NotebookLM. As Raiza Martin (PM for NotebookLM) recently said, Steven Johnson himself is the product. The team essentially followed his workflow, how he does research and compiles information, and used it as a model for their product.

21.11.2024 16:21 — 👍 0    🔁 0    💬 1    📌 0
You Exist In The Long Context: Thoughts on the quiet revolution of long-context AI models, from NotebookLM's Editorial Director Steven Johnson.

Check out this essay by the writer Steven Johnson for a really insightful take on LLMs with long context windows: thelongcontext.com.

Steven is one of the people behind NotebookLM, an app created by Google that helps you organize information and conduct research on particular topics.

21.11.2024 16:21 — 👍 1    🔁 0    💬 1    📌 0
