
Charlie Snell

@seasnell.bsky.social

PhD @berkeley_ai; prev SR @GoogleDeepMind. I stare at my computer a lot and make things

380 Followers  |  319 Following  |  8 Posts  |  Joined: 02.05.2023

Latest posts by seasnell.bsky.social on Bluesky

Transcript of Hard Fork ep 111: Yeah. And I could talk for an hour about transformers and why they are so important.
But I think it's important to say that they were inspired by the alien language in the film Arrival, which had just recently come out.
And a group of researchers at Google, one researcher in particular, who was part of that original team, was inspired by watching Arrival and seeing that the aliens in the movie had this language which represented entire sentences with a single symbol. And they thought, hey, what if we did that inside of a neural network? So rather than processing all of the inputs that you would give to one of these systems one word at a time, you could have this thing called an attention mechanism, which paid attention to all of it simultaneously.
That would allow you to process much more information much faster. And that insight sparked the creation of the transformer, which led to all the stuff we see in AI today.


Did you know that attention across the whole input span was inspired by the time-negating alien language in Arrival? Crazy anecdote from the latest Hard Fork podcast (by @kevinroose.com and @caseynewton.bsky.social). HT nwbrownboi on Threads for the lead.

01.12.2024 14:50 β€” πŸ‘ 247    πŸ” 53    πŸ’¬ 19    πŸ“Œ 17
https://huggingface.co/openlm-research

All model checkpoints we used for this research are also available here: t.co/IlSmJ8Na1i

26.11.2024 22:37 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Predicting Emergent Capabilities by Finetuning A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a func...

This was a fun project with Eric Wallace, Dan Klein, and Sergey Levine. An early version of this work also appeared in COLM 2024.

Paper link: arxiv.org/abs/2411.16035

26.11.2024 22:37 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image Post image

Finally, we present a case study of two real world uses for emergence prediction:

1) cheaply assessing pretraining data quality (left).

2) predicting more complex capabilities, closer to those of future frontier models, using the difficult APPS coding benchmark (right).

26.11.2024 22:37 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We validate our emergence law using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence, so we can easily check our predictions.

We find that our emergence law can accurately predict the point of emergence up to 4x the FLOPs in advance.

26.11.2024 22:37 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

To operationalize this insight, we finetune LLMs on varying amounts of data and fit a parametric function (i.e., β€œemergence law”) which models how the point of emergence shifts with the amount of data. We can then extrapolate a prediction for emergence in the few-shot setting.

26.11.2024 22:37 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
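The finetune-then-extrapolate recipe in the post above can be sketched in a few lines. This is a purely illustrative toy: the log-linear functional form, the function names, and all numbers are assumptions for the sketch, not the parametric emergence law from the paper.

```python
import math

# Toy sketch: for several finetuning-set sizes n, we observe the
# pretraining loss at which the finetuned model emerges on the task.
# We fit a simple stand-in law, loss = a + b * log(1 + n), and
# extrapolate to n = 0 to predict emergence in the few-shot setting.
# (The actual parametric form in the paper differs; this is illustrative.)

def fit_emergence_law(ns, losses):
    """Closed-form least-squares fit of loss = a + b * log(1 + n)."""
    xs = [math.log1p(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(losses) / len(losses)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Made-up observations: more finetuning data shifts emergence toward
# higher pretraining loss, i.e., toward less capable models.
ns = [1000, 4000, 16000, 64000]
losses = [2.45, 2.61, 2.74, 2.90]

a, b = fit_emergence_law(ns, losses)
# Extrapolating to n = 0 gives the predicted few-shot point of emergence.
print(round(a, 2))  # about 1.72 under these made-up numbers
```

The design choice worth noting: because the law is fit over finetuning-data amount rather than model scale, the extrapolation to the few-shot limit can be made from checkpoints that are all still pre-emergence in the few-shot setting.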
Post image

We then discover a simple insight for this problem:

finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable LLMs, and the magnitude of this shift is modulated by the amount of finetuning data.

26.11.2024 22:37 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We first pose the task of emergence prediction:

given access to LLMs that have random few-shot accuracy on a task, can we predict the point in scaling (e.g., pretraining loss) at which performance will jump beyond random chance?

26.11.2024 22:37 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
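The task posed above can be made concrete with a small sketch: given (pretraining loss, few-shot accuracy) observations across model scales, find the loss at which accuracy first exceeds random chance. The threshold, margin, and data below are illustrative assumptions, not values from the paper.

```python
# Hypothetical setup for emergence detection. All names and numbers
# here are illustrative, not from the paper.

RANDOM_CHANCE = 0.25   # e.g., a 4-way multiple-choice task
MARGIN = 0.05          # how far above chance counts as "emerged"

def point_of_emergence(observations):
    """observations: (pretraining_loss, accuracy) pairs, ordered from
    highest loss (weakest model) to lowest loss (strongest model)."""
    for loss, acc in observations:
        if acc > RANDOM_CHANCE + MARGIN:
            return loss  # first scale at which the capability appears
    return None  # capability has not yet emerged at any observed scale

obs = [(3.2, 0.24), (2.9, 0.26), (2.6, 0.41), (2.3, 0.68)]
print(point_of_emergence(obs))  # 2.6
```

The hard part, of course, is that for a truly emergent capability every model you can actually evaluate sits in the flat, at-chance region, so the crossing point must be predicted rather than observed directly.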
Post image

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper β€œPredicting Emergent Capabilities by Finetuningβ€πŸ§΅

26.11.2024 22:37 β€” πŸ‘ 45    πŸ” 6    πŸ’¬ 3    πŸ“Œ 1
