
Sham Kakade

@shamkakade.bsky.social

Harvard Professor. ML and AI. Co-director of the Kempner Institute. https://shamulent.github.io

912 Followers  |  89 Following  |  5 Posts  |  Joined: 21.11.2024

Posts by Sham Kakade (@shamkakade.bsky.social)

Alignment reduces conceptual diversity of language models - Kempner Institute As large language models (LLMs) have become more sophisticated, there’s been growing interest in using LLM-generated responses in place of human data for tasks such as polling, user studies, and […]

NEW blog post: Do modern #LLMs capture the conceptual diversity of human populations? #KempnerInstitute researchers find #alignment reduces conceptual diversity of language models. bit.ly/4hNjtiI

10.02.2025 15:19 · 👍 12  🔁 3  💬 0  📌 0

NEW in the #KempnerInstitute blog: learn about ProCyon, a multimodal foundation model to model, generate & predict protein phenotypes. Read it here: bit.ly/4fA8xUk

19.12.2024 19:22 · 👍 6  🔁 1  💬 0  📌 0
https://bit.ly/4iohnqE

Calling college grads interested in intelligence research: the application for the #KempnerInstitute's post-bac program w/ the Harvard Kenneth C. Griffin Graduate School of Arts and Sciences Office for Equity, Diversity, Inclusion & Belonging is now open! Apply by Feb. 1, 2025.

t.co/jdJrzRegL0

09.12.2024 19:43 · 👍 14  🔁 4  💬 0  📌 0
Loss-to-Loss Prediction - Kempner Institute Scaling laws – which reliably predict the performance of large language models (LLMs) as a function of their size and the amount of data they have been trained on – […]

NEW in the #KempnerInstitute blog: A method to predict how #LLMs scale w/ compute across different datasets. Read it here:

09.12.2024 20:44 · 👍 7  🔁 2  💬 0  📌 0

LLM self-improvement has critical implications for synthetic data, post-training, and test-time inference. To understand LLMs' true capacity for self-improvement, we perform large-scale experiments with multiple families of LLMs, tasks, and mechanisms. Here is what we found: (1/9)

06.12.2024 18:02 · 👍 12  🔁 4  💬 1  📌 1

NEW: we have an exciting opportunity for a tenure-track professor at the #KempnerInstitute and the John A. Paulson School of Engineering and Applied Sciences (SEAS). Read the full description & apply today: academicpositions.harvard.edu/postings/14362
#ML #AI

03.12.2024 01:24 · 👍 20  🔁 19  💬 0  📌 1

(5/n) 🤝 Shoutout to some great collaborators:
@hanlin_zhang, @depen_morwani, @vyasnikhil96, @uuujingfeng, @difanzou, @udayaghai
#AI #ML #ScalingLaws

22.11.2024 20:19 · 👍 1  🔁 0  💬 0  📌 0
How Does Critical Batch Size Scale in Pre-training? Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise betwee...

(4/n) 🧠 Want theory? We provide rigorous justifications, identify the critical hyperparameters, and characterize learning-rate decay in the overtraining regime.
Check out the details here:
📄 arxiv.org/abs/2410.21676
📝 Blog: tinyurl.com/ysufbwsr

22.11.2024 20:19 · 👍 0  🔁 0  💬 1  📌 0

(3/n) 📊 From our controlled experiments on language models:
📈 CBS increases as dataset size grows
🤏 CBS remains weakly dependent on model size
Data size, not model size, drives parallel efficiency for large-scale pre-training.
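
A rough illustration of that first finding: a minimal sketch assuming CBS grows as a power law in dataset size, B_crit(D) ≈ c · D^alpha. The constants c and alpha below are hypothetical placeholders, not fitted values from the paper.

```python
# Hypothetical power-law form for how CBS might grow with data size:
#   B_crit(D) ≈ c * D**alpha
# c and alpha are illustrative placeholders, not fitted values from the paper.
def critical_batch_size(tokens, c=0.1, alpha=0.4):
    return c * tokens ** alpha

# Under this assumed form, CBS depends on the token budget but not on model
# size, so scaling up the data is what buys extra data parallelism.
for tokens in [1e9, 1e10, 1e11, 1e12]:
    cbs = critical_batch_size(tokens)
    print(f"D = {tokens:.0e} tokens  ->  CBS ≈ {cbs:,.0f} (arbitrary units)")
```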

22.11.2024 20:19 · 👍 1  🔁 0  💬 1  📌 0

(2/n) 🤔 How does CBS scale with model size and data size in pre-training? We find that CBS scales with data size and is largely invariant to model size. Prior beliefs that CBS scales with model size may have stemmed from Chinchilla’s coupled N-D scaling.

22.11.2024 20:19 · 👍 0  🔁 0  💬 1  📌 0

(1/n) 💡 How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling the batch size halves the optimization steps until we hit CBS, beyond which returns diminish.
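
A minimal sketch of this tipping-point behavior, using the classic critical-batch-size heuristic steps(B) ≈ S_min · (1 + B_crit / B); the functional form and the numbers are assumptions for illustration, not measurements from this work.

```python
def optimization_steps(batch_size, b_crit=2048, s_min=10_000):
    """Steps to reach a target loss as a function of batch size.

    Assumes the classic critical-batch-size heuristic
    steps(B) ≈ s_min * (1 + b_crit / B): well below b_crit, doubling B
    roughly halves the steps; well above it, steps flatten near s_min.
    b_crit and s_min are illustrative placeholders.
    """
    return s_min * (1.0 + b_crit / batch_size)

for B in [256, 512, 1024, 2048, 4096, 8192]:
    print(f"B = {B:5d}  ->  steps ≈ {optimization_steps(B):,.0f}")
```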

22.11.2024 20:19 · 👍 16  🔁 4  💬 2  📌 0

How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.
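
A minimal sketch of how such a shifted power-law fit could be set up; the parameterization L_tgt ≈ k · (L_src - e_src)^kappa + e_tgt and the loss pairs below are illustrative assumptions rather than the paper's actual fits (see the paper and blog post for those).

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_src, k, kappa, e_src, e_tgt):
    """Assumed fit form: L_tgt = k * (L_src - e_src)**kappa + e_tgt."""
    return k * np.clip(loss_src - e_src, 1e-9, None) ** kappa + e_tgt

# Hypothetical (source-dataset, target-dataset) loss pairs, e.g. from a sweep
# of model sizes; synthesized here from an assumed ground-truth curve.
rng = np.random.default_rng(0)
loss_src = np.linspace(3.9, 2.7, 8)
loss_tgt = shifted_power_law(loss_src, k=0.9, kappa=1.2, e_src=1.6, e_tgt=1.5)
loss_tgt = loss_tgt + rng.normal(scale=0.01, size=loss_tgt.shape)

params, _ = curve_fit(shifted_power_law, loss_src, loss_tgt,
                      p0=[1.0, 1.0, 1.5, 1.4], maxfev=20_000)
print("fitted (k, kappa, e_src, e_tgt):", np.round(params, 3))

# Extrapolate: predicted target-dataset loss at an unseen source-dataset loss.
print("predicted L_tgt at L_src = 2.5:",
      round(float(shifted_power_law(2.5, *params)), 3))
```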

21.11.2024 15:11 · 👍 19  🔁 7  💬 1  📌 2