Dylan Foster 🐒's Avatar

Dylan Foster 🐒

@djfoster.bsky.social

Principal Researcher in AI/ML/RL Theory @ Microsoft Research NE/NYC. Previously @ MIT, Cornell. http://dylanfoster.net RL Theory Lecture Notes: https://arxiv.org/abs/2312.16730

2,424 Followers  |  831 Following  |  113 Posts  |  Joined: 21.11.2024  |  2.6005

Latest posts by djfoster.bsky.social on Bluesky

Post image

At #NeurIPS2025? Join us for a Social on Wednesday at 7 PM, featuring a fireside chat with Jon Kleinberg and mentoring tables.

Ft. mentors @djfoster.bsky.social @surbhigoel.bsky.social @aifi.bsky.social @gautamkamath.com and more!

26.11.2025 20:47 β€” πŸ‘ 13    πŸ” 4    πŸ’¬ 0    πŸ“Œ 2

(12/12) Paper link: www.arxiv.org/abs/2510.15020

Many interesting questions: What are the minimal conditions for RL success (maybe a more semantic notion of coverage)? Do other algos/interventions secretly have coverage-based interpretations?

25.10.2025 16:21 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image Post image

(12/12) Another algo example: We give some tournament-style methods for selecting checkpoints (motivated by coverage) that can beat cross-entropy based selection. See figure, where coverage-based selection finds model w/ best BoN performance.

25.10.2025 16:21 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(10/12) Perhaps the coolest part: the coverage perspective seems to have non-trivial algo design consequences:

1. Vanilla SGD can have poor coverage; gradient normalization (a-la Adam) fixes this

2. Test-time training (SGD on tokens during generation) provably improves coverage

25.10.2025 16:21 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

(9/12) We show this through two lenses:
1. New generalization analysis for next-token prediction/maximum likelihood for general function classes, in the vein of stat. learning.
2. SGD in the one-pass regime for overparameterized autoregressive linear models

25.10.2025 16:21 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(8/12) Our analysis is based on a phenomenon we call the coverage principle, where next-token prediction implicitly optimizes toward a model with good coverage.

Neat consequence: Coverage generalizes faster as we go further into tail (corresponding to more test-time compute)

25.10.2025 16:21 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows trend of coverage, not CE (increasing initially, dropping as shortcut is learned).

25.10.2025 16:21 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(6/12) Our main result:

1) Next-token prediction (more generally, MLE) provably learns a model w/ good coverage, inheriting the training corpus’ coverage over tasks of interest.

2) Coverage generalizes *faster* than cross-entropy itself, and consequently can be more predictive.

25.10.2025 16:21 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

(5/12) Example: For tasks that are well-represented in pre-training distribution, low Cov_N is necessary and sufficient for Best-of-N to succeed.

Empirically, some notion of coverage also seems necessary for current RL methods (eg arxiv.org/abs/2504.13837)

25.10.2025 16:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

(4/12) Intuition: Cross-entropy (KL) tracks the mean of the log density ratio, while coverage profile tracks the *tail* or CDF via N.

N controls how far into tail we go, and roughly corresponds to test-time compute/RL effort. For downstream performance, the tail is what matters.

25.10.2025 16:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(3/12) Our answer is a quantity we call the *coverage profile*, which captures the extent to which the pre-trained model "covers" the pre-training distribution:

Cov_N = P_{Ο€}[Ο€(y|x)/\hat{Ο€}(y|x) β‰₯ N],

(Ο€ is pre-training dist, \hat{Ο€} is pre-trained model, N is a param)

25.10.2025 16:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

(2/12) We wanted to take a first step toward developing some theoretical understanding of the relationship btw next-token prediction and downstream perf (particularly w/ test-time scaling).

Concretely, what metrics of the pre-trained model can we link to downstream success?

25.10.2025 16:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(1/12) Many have observed starting that from a better next‑token predictor (low cross-entropy) does not always give better downstream post-training/test-time perf. In some cases, CE can even be anti‑correlated with BoN success (see fig or arxiv.org/abs/2502.07154). Why is this?

25.10.2025 16:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Led by my amazing intern Fan Chen, with awesome team Audrey Huang (@ahahaudrey.bsky.social), Noah Golowich, Sadhika Malladi (@sadhika.bsky.social), Adam Block, Jordan Ash, and Akshay Krishnamurthy.

Paper: www.arxiv.org/abs/2510.15020

Thread below.

25.10.2025 16:20 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

The coverage principle: How pre-training enables post-training

New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks.

The answer involves a metric we call the "coverage profile", not cross-entropy.

25.10.2025 16:20 β€” πŸ‘ 18    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Post image

(10/12) Perhaps the coolest part: the coverage perspective seems to have non-trivial algo design consequences:

1. Vanilla SGD can have poor coverage; gradient normalization (a-la Adam) fixes this

2. Test-time training (SGD on tokens during generation) provably improves coverage

25.10.2025 16:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Sorry about that! There was an issue where the application briefly closed early. Today is the deadline and it should be open again now.

22.10.2025 19:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Stanford University, Computer Science/Theory Lab/Stanford University Job #AJO30865, Postdoc in Theoretical Computer Science at Stanford, Computer Science/Theory Lab/Stanford University, Stanford University, Stanford, California, US

The new call for Motwani postdocs application is now open!
academicjobsonline.org/ajo/jobs/30865

BTW-

Not quite ready for a postdoc? We updated the TCS Masters programs spreadsheet:
www.cs.princeton.edu/~smattw/mast...

Any career stage and in the (SF) Bay Area?
Save the date for TOCA-SV on 11/7!

13.10.2025 20:42 β€” πŸ‘ 14    πŸ” 8    πŸ’¬ 0    πŸ“Œ 0

Lots of interesting directions here! We think there is a lot more to do building on the connection to the discrete sampling/TCS literature and algos from this space, as well as moving beyond autoregressive generation.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Empirically, we have only tried this w/ small-ish scale so far, but find consistently that VGB outperforms textbook algos on either (1) accuracy; or (2) diversity when compute-normalized. Ex: for Dyck language, VGB escapes the accuracy-diversity frontier for baselines algos.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Main guarantee:
- As long as you have exact/verifiable outcome rewards, always converges to optimal distribution.
- Runtime depends on process verifier quality, gracefully degrading as quality gets worse.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

VGB generalizes the Sinclair–Jerrum '89 random walk (people.eecs.berkeley.edu/~sinclair/ap...)
from TCS (used to prove equivalence of apx. counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image Post image

We give a new algo, Value-Guided Backtracking (VGB), where the idea is to view autoregressive generation as a random walk on the tree of partial outputs, and add a *stochastic backtracking* stepβ€”occasionally erasing tokens in a principled wayβ€”to counter error amplification.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Test-time guidance with learned process verifiers has untapped potential to enhance language model reasoning, but one of the issues with getting this to actually work is that small verifier mistakes are amplified by textbook algos (e.g., block-wise BoN), w/ errors compounding as length increases.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Andrej Risteski (@andrejristeski.bsky.social) Machine learning researcher. Professor in ML department at CMU.

With amazing team: Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, and Andrej Risteski (andrejristeski.bsky.social).

Short thread below.

12.10.2025 16:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image Post image Post image

Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking.

A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.

Paper: www.arxiv.org/abs/2510.03149

12.10.2025 16:25 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Preview
Microsoft Research Lab - New York City - Microsoft Research Apply for a research position at Microsoft Research New York & collaborate with academia to advance economics research, prediction markets & ML.

MSR NYC is hiring spring and summer interns in AI/ML/RL!

Apply here: jobs.careers.microsoft.com/global/en/jo...

02.10.2025 20:57 β€” πŸ‘ 20    πŸ” 7    πŸ’¬ 0    πŸ“Œ 0
Preview
Microsoft Research Lab - New York City - Microsoft Research Apply for a research position at Microsoft Research New York & collaborate with academia to advance economics research, prediction markets & ML.

🚨Microsoft Research NYC is hiring🚨

We're hiring postdocs and senior researchers in AI/ML broadly, and in specific areas like test-time scaling and science of DL. Postdoc applications due Oct 22, 2025. Senior researcher applications considered on a rolling basis.

Links to apply: aka.ms/msrnyc-jobs

18.09.2025 14:37 β€” πŸ‘ 18    πŸ” 7    πŸ’¬ 0    πŸ“Œ 1
Search Jobs | Microsoft Careers

For details, see the following links:

Empirical ML/AI: jobs.careers.microsoft.com/global/en/jo...

Theoretical ML/AI: jobs.careers.microsoft.com/global/en/jo...

12.09.2025 14:57 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Preview
Microsoft Research Lab - New York City - Microsoft Research Apply for a research position at Microsoft Research New York & collaborate with academia to advance economics research, prediction markets & ML.

Microsoft Research New York City (www.microsoft.com/en-us/resear...) is seeking applicants for multiple Postdoctoral Researcher positions in ML/AI!

These are positions for up to 2 years, starting in July 2026.

Application deadline: October 22, 2025

12.09.2025 14:57 β€” πŸ‘ 8    πŸ” 4    πŸ’¬ 1    πŸ“Œ 0

@djfoster is following 20 prominent accounts