brendan chambers @societyoftrees

The first fantastic paper on scaling RL with LLMs just dropped. I strongly recommend taking a look and will be sharing more thoughts on the blog soon.

The Art of Scaling Reinforcement Learning Compute for LLMs
Khatri & Madaan et al.

buff.ly/olKwF3X

16.10.2025 13:59 — 👍 18 🔁 1 💬 0 📌 1

NVIDIA DGX Spark: great hardware, early days for the ecosystem
NVIDIA sent me a preview unit of their new DGX Spark desktop “AI supercomputer”. I’ve never had hardware to review before! You can consider this my first ever sponsored post …

NVIDIA sent me preview hardware of their new DGX Spark 128GB ARM64 4TB "AI supercomputer" - it's a very neat little device, here are my notes so far
simonwillison.net/2025/Oct/14/...

14.10.2025 23:38 — 👍 130 🔁 14 💬 8 📌 9

Multi-Head Latent Attention
🔗 github.com/rasbt/LLMs-f...

12.10.2025 13:57 — 👍 43 🔁 6 💬 0 📌 1

⚠️ You have marked yourself as an untrusted node in the epistemic network

11.10.2025 13:55 — 👍 99 🔁 10 💬 6 📌 2

common misconception! flash attn is still all-to-all and isomorphic to vanilla self-attention

(optimized matrix ops have to be decomposed into tiles for memory hierarchy reasons, and ideally fused - multiple algorithmic steps on one tile. FA just does this best, esp the tricky-to-fuse softmax step)
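A minimal sketch of that equivalence, in plain Python for a single query vector. This is a toy and not the actual FlashAttention kernel: the function names and the `tile` parameter are made up for illustration. The tiled version visits keys and values one tile at a time, carrying a running max and normalizer (the "online softmax" trick), yet every key still contributes to the output.

```python
import math

def vanilla_attention(q, keys, values):
    """Softmax attention for one query, computed over all keys at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Same output, but keys/values are processed tile by tile."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]
```

Both functions agree up to floating-point rounding for any tile size, which is the sense in which tiling changes the schedule and not the math.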

11.10.2025 06:55 — 👍 4 🔁 0 💬 1 📌 0

You get small KL divergence from the base model without extra regularization here, since the search is local

and most surprisingly, this approach even handily beats (a grid-search tuned implementation of) GRPO, at least in this work + problem context

07.10.2025 17:02 — 👍 0 🔁 0 💬 0 📌 0

improving pretrained LLMs by searching over iid-noised params, using a reward score (aka fitness criterion) for weight-merging
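A rough sketch of the general idea on a toy problem, assuming an evolution-strategies style update. This is not the paper's implementation: `es_step`, `toy_reward`, `run`, and all hyperparameters are hypothetical, and the "model" is just a two-element parameter list.

```python
import random

def es_step(params, reward_fn, rng, population=32, sigma=0.1, lr=0.05):
    """One search step: sample iid Gaussian noise, score, merge by reward."""
    noises = [[rng.gauss(0.0, 1.0) for _ in params] for _ in range(population)]
    rewards = [
        reward_fn([p + sigma * n for p, n in zip(params, noise)])
        for noise in noises
    ]
    mean = sum(rewards) / population
    # Fall back to 1.0 if every candidate scored the same.
    std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # Reward-weighted merge of the noise directions, added to current weights.
    return [
        p + lr * sum(a * noise[i] for a, noise in zip(advantages, noises)) / population
        for i, p in enumerate(params)
    ]

def toy_reward(params, target=(1.0, -2.0)):
    """Hypothetical fitness: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def run(steps=200, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0]
    for _ in range(steps):
        params = es_step(params, toy_reward, rng)
    return params
```

Because each candidate sits within a small `sigma` of the current weights and each update is a small step, the search stays local, which is the intuition behind staying close to the starting model.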

07.10.2025 17:02 — 👍 0 🔁 0 💬 1 📌 0

We are excited to announce 4 outstanding papers 🏆🏆🏆🏆 --> 🧵

07.10.2025 13:22 — 👍 10 🔁 4 💬 1 📌 1

LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain the model along with a big bank of memories.

📑 arxiv.org/abs/2510.02375

[1/10]🧵

06.10.2025 16:06 — 👍 12 🔁 4 💬 1 📌 0

accepted papers, COLM 2025

colmweb.org/AcceptedPape...

06.10.2025 15:39 — 👍 1 🔁 0 💬 0 📌 0

Paper: arxiv.org/pdf/2509.20328

03.10.2025 13:05 — 👍 15 🔁 2 💬 1 📌 0

Spaced Scheduling for Large Language Model Training

Amine El hattami, Nicolas Chapados, Christopher Pal

Action editor: Colin Raffel

https://openreview.net/forum?id=p0KTYl2B9T

#scheduling #scheduled #training

02.10.2025 04:18 — 👍 2 🔁 1 💬 0 📌 0

Understanding Optimization in Deep Learning with Central Flows

really neat, clear explainer for the new work on “centralizing flows” to theoretically model learning dynamics

01.10.2025 12:20 — 👍 43 🔁 9 💬 1 📌 5

Scaling laws don’t just show up in test error — they leave fingerprints in the weight spectrum.
In the feature learning regime, we map this connection: phase diagrams of scaling exponents <-> spectral signatures of trained weights. The paper is: arxiv.org/abs/2509.24882

30.09.2025 11:02 — 👍 12 🔁 4 💬 0 📌 0

latent space opera

28.09.2025 16:26 — 👍 5 🔁 1 💬 0 📌 0

New technical post from Thinky on optimizers, but this is the main catch: conditional learning rates per layer.

thinkingmachines.ai/blog/modular...

26.09.2025 18:00 — 👍 21 🔁 4 💬 3 📌 0

Isaac-01 multimodal model from Perceptron AI - pdf whitepaper

github.com/perceptron-a...

24.09.2025 17:16 — 👍 0 🔁 0 💬 0 📌 0

tldr: accounting for data transformations and context dependent embeddings takes some careful bookkeeping and clean abstractions

24.09.2025 17:13 — 👍 0 🔁 0 💬 0 📌 0

Downstream, leveraging coupled sequences with varying temporal structure and modality of origin is a significant open problem, and probably the best approach depends on task structure—which is why this serialization step needs to be really flexible

24.09.2025 17:13 — 👍 0 🔁 0 💬 1 📌 0

Super interesting way to frame complexity and self-prediction. I’m having trouble loading the pdf but most of the html seems to work

24.09.2025 17:08 — 👍 0 🔁 0 💬 1 📌 0

Measuring In-Context Computation Complexity via Hidden State Prediction
Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences tha...

New (March) Schmidhuber paper I missed, in which they use a carefully engineered layer to track the information gained by each (prediction) token for solving problems that require computation. The hidden state is predictive of (a? not necessarily minimal?) description length.

09.09.2025 00:06 — 👍 9 🔁 3 💬 1 📌 2

TensorStream - Perceptron
A layer of intelligence for the physical world. We are a research company building the future of Physical AGI.

the Perceptron folks are sharing design specs of their approach to serialize multimodal data as interleaved events

www.perceptron.inc/blog/tensors...

24.09.2025 16:27 — 👍 0 🔁 0 💬 1 📌 0

Google's Mixboard 💡🧑‍🎨 an experimental, AI-powered concepting board. Designed to help you explore, visualize, and refine your ideas and powered by our latest image generation model

Only available in U.S.

blog.google/technology/g...

23.09.2025 21:18 — 👍 16 🔁 3 💬 0 📌 0

We've hired some *fantastic* researchers but our startup is still looking for 2-3 more people with skills in ML/RL/LLMs. If you'd like to work on some transformative applied problems, hit me up. We'll be launching publicly soon too...

23.09.2025 17:31 — 👍 37 🔁 8 💬 0 📌 0

Three schemes for shared-private storage

Surprise! A second leaflet on private data in AT, this time exploring some schemes that might be used to implement shared-private data.

23.09.2025 02:22 — 👍 381 🔁 70 💬 25 📌 15

Qwen
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.

Qwen drops

Image Edit: apache 2 that feels on par with Nano Banana
qwen.ai/blog?id=7a90...

Qwen3-Omni: unified image, text, audio and video, like GPT-4o
huggingface.co/Qwen/Qwen3-O...

Qwen3-TTS: multi-timbre, multi-lingual TTS
qwen.ai/blog?id=b426...

22.09.2025 21:33 — 👍 38 🔁 7 💬 3 📌 0

I made an agent that unsubscribes you from spam. It controls a web browser to navigate an unsubscribe page. The agent runs JavaScript to perform actions on the page, and gets to see a screenshot after each action it takes.

Blog post: blog.aqnichol.com/2025/09/20/u...

20.09.2025 23:13 — 👍 10 🔁 4 💬 2 📌 0

Policy churn: maybe epsilon greedy doesn’t matter that much because the Q value argmax action changes constantly
arxiv.org/abs/2206.00730
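One simple way to make "the argmax changes constantly" concrete: compare greedy actions before and after an update and count how many states flip. A minimal sketch with tabular Q-values; `greedy_action` and `policy_churn` are illustrative names, and the linked paper's exact measurement may differ.

```python
def greedy_action(q_row):
    """Index of the highest Q-value for one state (first index on ties)."""
    return max(range(len(q_row)), key=q_row.__getitem__)

def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action differs between two Q-tables."""
    changed = sum(
        greedy_action(before) != greedy_action(after)
        for before, after in zip(q_before, q_after)
    )
    return changed / len(q_before)
```

If this fraction is already large after a single update, the greedy policy is effectively exploring on its own, which is the post's argument for why the epsilon in epsilon-greedy may matter less than expected.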

20.09.2025 21:12 — 👍 21 🔁 4 💬 2 📌 1

Today my azure A100 quota came through as a little treat. It took about two weeks to get the quota approved…which isn’t bad, I’m not complaining. It’s so great to be installing nvidia drivers again

19.09.2025 18:40 — 👍 0 🔁 0 💬 0 📌 0

How to Train an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs
An LLM that can converse in English & item IDs, and make recommendations w/o retrieval or tools.

I've been nerdsniped by the idea of Semantic IDs.

Here's the result of my training runs:
• RQ-VAE to compress item embeddings into tokens
• SASRec to predict the next item (i.e., 4 tokens) exactly
• Qwen3-8B that can return recs and natural language!

eugeneyan.com/writing/sema...
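For intuition on the first bullet, here is a sketch of only the residual quantization step that turns an embedding into a short tuple of tokens. It assumes fixed, hand-written codebooks; a real RQ-VAE learns the codebooks together with an encoder and decoder, and the names `semantic_id` and `reconstruct` are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vec)),
    )

def semantic_id(embedding, codebooks):
    """Quantize level by level; each level encodes what the last one missed."""
    residual = list(embedding)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def reconstruct(tokens, codebooks):
    """Approximate the embedding by summing the chosen entry at each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

With four codebooks, each item becomes four tokens, coarse to fine, so items with nearby embeddings tend to share leading tokens.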

17.09.2025 02:04 — 👍 25 🔁 6 💬 2 📌 1


Latest posts by societyoftrees.bsky.social on Bluesky
