Training our most capable Gemini models relies heavily on our JAX software stack and Google's TPU hardware platforms.
If you want to learn more, see this awesome book "How to Scale Your Model":
jax-ml.github.io/scaling-book/
Put together by several of my Google DeepMind colleagues listed below.
04.02.2025 19:51
The book was co-written with @sholtodouglas.bsky.social, @froystig.bsky.social, @levskaya.bsky.social, @reinerpope.bsky.social, Albert Webson, Charlie Chen, Vinay Ramasesh, and Federico Lebron 10/n
04.02.2025 18:54
LLM systems programming is super fun! It's hard to do good ML research without it these days, and you don't need much compute to work on it. I hope this book will make it easier for more people (esp. academics) to work on this stuff 9/n
04.02.2025 18:54
The rest of the book is a set of practical guides: how to write and profile parallel JAX code, and how to apply the previous two sections to real models like LLaMA-3. We also have worked problems at the end of each section if you like homework: jax-ml.github.io/scaling-book... 8/n
04.02.2025 18:54
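To give a flavor of the profiling guide mentioned above, here is a minimal sketch (not an excerpt from the book) of tracing a jitted matmul with `jax.profiler`; the matrix sizes and log directory are arbitrary choices.

```python
# Minimal profiling sketch: jit a computation, warm it up, then capture a
# trace that can be opened in TensorBoard / Perfetto.
import jax
import jax.numpy as jnp

@jax.jit
def step(a, b):
    return jnp.einsum("ij,jk->ik", a, b)

ka, kb = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(ka, (4096, 4096))
b = jax.random.normal(kb, (4096, 4096))

step(a, b).block_until_ready()  # compile/warm up outside the trace

with jax.profiler.trace("/tmp/jax-trace"):  # log directory is arbitrary
    for _ in range(10):
        out = step(a, b)
    out.block_until_ready()  # ensure the work lands inside the trace
```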
Now that we've talked about training, we need to talk about serving. How expensive should a model be to serve? What kind of latency can we expect? What are prefill and generation? How do we build an efficient inference service? We talk about this here: jax-ml.github.io/scaling-book... 7/n
04.02.2025 18:54
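As a taste of the serving math, a hedged back-of-envelope sketch (my own illustrative numbers, not the book's): if token generation is bandwidth-bound, per-token latency is roughly the weight bytes that must be streamed from HBM divided by aggregate HBM bandwidth.

```python
# Back-of-envelope decode latency, assuming generation is HBM-bandwidth-bound
# (each step streams the weights once). All numbers are assumptions.
params = 8e9             # parameter count (LLaMA-3-8B-scale model)
bytes_per_param = 2      # bf16 weights
hbm_bandwidth = 800e9    # bytes/s of HBM bandwidth per chip (assumed)
n_chips = 1

weight_bytes = params * bytes_per_param
latency_per_token = weight_bytes / (hbm_bandwidth * n_chips)
print(f"~{latency_per_token * 1e3:.1f} ms/token lower bound")  # ~20 ms here
```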
Now for the good stuff! You may have heard of data or tensor parallelism, FSDP or pipelining. But why choose one over the other? Short answer: each adds communication, and the one with the lowest cost depends on the model. Part 5 dives into this: jax-ml.github.io/scaling-book... 6/n
04.02.2025 18:54
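To illustrate why "each adds communication" and why the winner depends on the model, here is a heavily simplified sketch (my own assumptions, not the book's formulas) comparing per-step communication volume for pure data parallelism versus Megatron-style tensor parallelism.

```python
# Rough per-step communication volumes for two parallelism strategies.
# Simplified: ring all-reduce moves ~2x the data, backward-pass collectives
# and overlap are ignored. All model/batch numbers below are assumptions.
params     = 8e9         # model parameters
d_model    = 4096
n_layers   = 32
batch_toks = 8192        # tokens processed per device per step (assumed)
bytes_per  = 2           # bf16

# Data parallelism: all-reduce the gradients once per step.
dp_bytes = 2 * params * bytes_per

# Tensor parallelism: ~2 activation all-reduces per layer in the forward pass.
tp_bytes = 2 * n_layers * batch_toks * d_model * bytes_per

print(f"data parallel  : ~{dp_bytes / 1e9:.0f} GB per device per step")
print(f"tensor parallel: ~{tp_bytes / 1e9:.0f} GB per device per step")
```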
5 years ago, there were many ML architectures, but today, there is (mostly) only one. _You should know the Transformer inside and out!_ How many FLOPs or params in LLaMA-3? How expensive is attention vs. a feed-forward block? You'll know after reading jax-ml.github.io/scaling-book... 5/n
04.02.2025 18:54
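As a concrete example of that kind of counting (my own sketch, using the commonly published LLaMA-3-8B config values; norms and biases are ignored, so treat it as illustrative):

```python
# Counting parameters for a LLaMA-3-8B-like transformer from its config.
d_model    = 4096
n_layers   = 32
n_heads    = 32
n_kv_heads = 8                      # grouped-query attention
d_head     = d_model // n_heads
d_ff       = 14336                  # gated (SwiGLU) MLP width
vocab      = 128_256

attn = (d_model * (n_heads * d_head)            # Wq
        + 2 * d_model * (n_kv_heads * d_head)   # Wk, Wv (GQA)
        + (n_heads * d_head) * d_model)         # Wo
mlp  = 3 * d_model * d_ff                       # gate, up, down projections
per_layer = attn + mlp
embed = 2 * vocab * d_model                     # input embedding + output head

total = n_layers * per_layer + embed
print(f"attention per layer: {attn / 1e6:.0f}M   MLP per layer: {mlp / 1e6:.0f}M")
print(f"total: {total / 1e9:.2f}B parameters")  # roughly 8B
# Training FLOPs are then roughly 6 * total * tokens (forward + backward).
```

The same arithmetic answers the attention-vs-MLP question: per layer the feed-forward block holds roughly four times as many parameters as attention in this config.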
Scaling an LLM involves distributing (a.k.a. "sharding") its weights across multiple TPUs. To run it, we have to add cross-chip communication. Part 3 describes the TPU's communication primitives, and simple rules for multiplying sharded matrices: jax-ml.github.io/scaling-book... 4/n
04.02.2025 18:54
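For readers who have not seen sharding in JAX, here is a minimal sketch (not from the book; mesh shape and array sizes are arbitrary) of a matmul over sharded arrays, where XLA inserts any needed cross-chip communication.

```python
# Minimal sharded-matmul sketch. If you have no TPUs handy, run with
#   XLA_FLAGS=--xla_force_host_platform_device_count=8
# to simulate 8 devices on CPU.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(4, 2)   # assumes 8 devices
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard activations over the "data" axis and weights over the "model" axis.
x = jax.device_put(jnp.ones((128, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 256)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def matmul(x, w):
    return x @ w   # the compiler adds whatever communication the shardings imply

y = matmul(x, w)
print(y.sharding)  # output is typically sharded over both mesh axes
```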
A big chunk of this book is dedicated to understanding the hardware that provides those system resources. We emphasize TPUs in this book, but the principles and math can be adapted to GPUs too. Part 2 explains the TPU in detail: jax-ml.github.io/scaling-book... 3/n
04.02.2025 18:54
How To Scale Your Model
Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models on TPUs: how...
The secret is to think in terms of basic system resources (compute, memory, and bandwidth) and calculate which one limits our performance. From this we can estimate the cost, runtime, and optimal parallelism strategy for any given LLM: jax-ml.github.io/scaling-book/ 2/n
04.02.2025 18:54
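To make that resource accounting concrete, here is a hedged sketch in the same spirit (toy hardware numbers of my own, not the book's): estimate the compute time and the memory-traffic time of an operation, and whichever is larger tells you what limits you.

```python
# "Which resource limits us?" for a single bf16 matmul, with made-up but
# accelerator-scale hardware numbers (both are assumptions).
peak_flops    = 400e12   # FLOP/s
hbm_bandwidth = 800e9    # bytes/s

def matmul_time(m, k, n, bytes_per_elem=2):
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    t_compute = flops / peak_flops
    t_memory  = bytes_moved / hbm_bandwidth
    bound = "compute" if t_compute > t_memory else "bandwidth"
    return max(t_compute, t_memory), bound

# Big square matmul: compute-bound. Skinny batch-1 matmul: bandwidth-bound.
print(matmul_time(8192, 8192, 8192))
print(matmul_time(1, 8192, 8192))
```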
Making LLMs run efficiently can feel scary, but scaling isn't magic, it's math! We wanted to demystify the "systems view" of LLMs and wrote a little textbook called "How To Scale Your Model" which we're releasing today. 1/n
04.02.2025 18:54
Excited to be here! Hopefully the skies are brighter on this side of the fence. Will be posting research stuff here, mostly
24.11.2024 16:54