Alisa Liu

@alisawuffles.bsky.social

phd student at @uwcse

362 Followers  |  197 Following  |  12 Posts  |  Joined: 19.11.2024

Posts by Alisa Liu (@alisawuffles.bsky.social)

📢 We're taking your questions now on Reddit for tomorrow's AMA!

Ask us anything about OLMo, our family of fully-open language models. Our researchers will be on hand to answer them Thursday, May 8 at 8am PT.

07.05.2025 16:46 | 👍 3    🔁 2    💬 1    📌 0

Right! "kick the bucket" is too infrequent, but there are more common idiomatic expressions like "in the long run" or "on the other hand." In general I would say non-idiomatic MWEs are more common, like uses of prepositions ("depend on") which require memorization.

21.03.2025 20:59 | 👍 4    🔁 0    💬 0    📌 0

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens.

Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀

Details 👇

21.03.2025 17:40 | 👍 5    🔁 1    💬 1    📌 0

Hell yeah superwords. (I wanna call em supertokens, but I didn't develop them.)

21.03.2025 18:17 | 👍 2    🔁 1    💬 0    📌 0

Tokenizers govern the allocation of computation. It's a waste to spend a whole token of compute predicting the "way" in "By the way". SuperBPE redirects that compute to predict more difficult tokens, leading to wins on downstream tasks!

21.03.2025 18:31 | 👍 4    🔁 1    💬 0    📌 0

nothing beats writing papers together with co-1st @jon.jon.ke – the mention didn't work the first time!

21.03.2025 18:26 | 👍 3    🔁 0    💬 0    📌 0

a small change to building your BPE tokenizer gets your pretrained LM 8 MMLU points (for example) and a 27% inference-time efficiency boost ...

21.03.2025 16:52 | 👍 11    🔁 1    💬 1    📌 0
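The "small change", as described in the SuperBPE paper, is a two-stage curriculum: run ordinary BPE under whitespace pretokenization up to a transition point, then lift the restriction so later merges can cross word boundaries. Here is a minimal toy sketch (my own simplification; the corpus and parameters are made up, and real training operates on bytes over large corpora):

```python
from collections import Counter

def train_superbpe(corpus, num_merges, transition):
    """Toy BPE trainer. Before `transition` merges, tokens may not cross
    word boundaries (standard whitespace pretokenization, with the space
    attached to the start of a word); afterwards the restriction is
    lifted, so superword tokens like " by the way" can form."""
    words = [list(text) for text in corpus]  # start from characters
    merges = []
    for step in range(num_merges):
        pairs = Counter()
        for seq in words:
            for a, b in zip(seq, seq[1:]):
                # pre-transition: forbid merges that would put a space
                # anywhere other than the token's first position
                if step < transition and (" " in b or " " in a[1:]):
                    continue
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent allowed pair
        merges.append((a, b))
        merged = a + b
        new_words = []
        for seq in words:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges
```

With `transition == num_merges` this degenerates to ordinary BPE; a lower transition point spends the remaining vocabulary budget on superword tokens instead.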
Screenshot of tokenizer demo from our blog post.

Play around with our tokenizers here! superbpe.github.io 🚀
Paper: arxiv.org/abs/2503.13423
HF models & tokenizers: tinyurl.com/superbpe

This work would not have been possible w/o co-1st 🌟@jon.jon.ke🌟, @valentinhofmann.bsky.social @sewoong79.bsky.social @nlpnoah.bsky.social @yejinchoinka.bsky.social

21.03.2025 16:48 | 👍 6    🔁 1    💬 1    📌 0
Example usage of SuperBPE model & tokenizer in HuggingFace transformers.

SuperBPE🚀 is a seamless replacement for BPE in modern LM development pipelines, requiring no changes to the model architecture or training framework. You can use it in HuggingFace right now!

21.03.2025 16:48 | 👍 6    🔁 0    💬 2    📌 0
Histogram of per-token losses for BPE and SuperBPE models. The SuperBPE model makes fewer predictions with very high or very low loss.

Why does SuperBPE🚀 work? We find that loss is distributed more uniformly over tokens in SuperBPE models. They are less overfit to high-frequency, easy-to-predict tokens (e.g. "way" after "By the"), and at the same time master a much broader set of language phenomena.

21.03.2025 16:48 | 👍 6    🔁 0    💬 1    📌 0
Table showing the per-task performance of 8B BPE and SuperBPE models.

Then we pretrain 8B models from scratch with BPE and SuperBPE🚀, fixing everything about the training setup except the tokenizer. We see +4% on avg 📈 across 30 downstream tasks, and win on 25/30 individual tasks, while also being 27% more efficient at inference time.

21.03.2025 16:48 | 👍 5    🔁 0    💬 1    📌 0
Figure showing how encoding efficiency scales with vocabulary size. SuperBPE encodes text much more efficiently than BPE!

What can we gain from less restrictive tokenization? To find out, we developed SuperBPE🚀, which learns subword *and* superword tokens. SuperBPE dramatically improves encoding efficiency over BPE – at a fixed vocab size of 200k, SuperBPE reduces sequence length by 33% on average!

21.03.2025 16:48 | 👍 4    🔁 0    💬 1    📌 0
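To make the sequence-length metric concrete, here is a toy comparison using hypothetical segmentations of the thread's example sentence (these are illustrative only, not actual tokenizer outputs):

```python
# Hypothetical segmentations of "By the way, I am a fan of the Milky Way"
# (made-up token boundaries, for illustration only).
bpe_tokens = ["By", " the", " way", ",", " I", " am", " a", " fan",
              " of", " the", " Milky", " Way"]
superbpe_tokens = ["By the way", ",", " I am", " a fan of",
                   " the", " Milky Way"]

# Both segmentations encode the same text; the superword one is shorter.
# The fraction of tokens saved is the sequence-length reduction.
reduction = 1 - len(superbpe_tokens) / len(bpe_tokens)
```

For this toy example the reduction is 50%; the paper's 33% average is measured over real text at a 200k vocabulary.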

E.g. "math teacher" = "Mathelehrer" in German. At the extreme, Chinese *doesn't use whitespace at all*, so its tokens can span many words – yet this has seemingly not hindered LMs like @deepseek_ai from learning it!

21.03.2025 16:48 | 👍 6    🔁 0    💬 1    📌 0

This started with a curiosity 💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. "by the way") function as single units. Different languages can also express the same meaning in one or several words.

21.03.2025 16:48 | 👍 7    🔁 0    💬 1    📌 0
Segmentation of the sentence "By the way, I am a fan of the Milky Way" under BPE and SuperBPE.

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time. 🧵

21.03.2025 16:48 | 👍 83    🔁 16    💬 3    📌 5
We observe that the merge lists of LLAMA, LLAMA 3, GEMMA, and MISTRAL contain clusters of redundant merge rules. For instance, in the LLAMA 3 merge list, we see the sequence of merges _ the, _t he, and _th e, as well as _ and, _a nd, and _an d. Because the merge path for every token is unique, it is impossible for more than one of these merges to ever be used, and we empirically verify this by applying the tokenizer to a large amount of text.

We find that this is an artifact of the conversion from sentencepiece to Huggingface tokenizers format. To construct the merge list, the conversion algorithm naively combines every pair of tokens in the vocabulary and then sorts them by token ID, which represents order of creation. While this is functionally correct, the redundant merges are not products of the BPE algorithm (i.e., they do not actually represent the most likely next merge), so we need to remove them to apply our algorithm. To do this, we do some simple pre-processing: for every cluster of redundant merges, we record the path of merges that achieves each merge; the earliest path is the one that would be taken, so we keep that merge and remove the rest.

As an aside, this means that a tokenizer's merge list can be completely reconstructed from its vocabulary list ordered by token creation. Given only the resulting token at each time step, we can derive the corresponding merge.
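The aside can be made concrete in code: given only the vocabulary in creation order, each later token's merge is recoverable as the two-way split into earlier tokens whose later-created half appears earliest (a simplified reading of the "earliest path" rule; the toy vocabulary below is made up):

```python
def reconstruct_merges(vocab):
    """vocab: tokens ordered by creation time, base characters first.
    For each later token, recover the (left, right) merge that created it:
    among all two-way splits into previously created tokens, the split whose
    later-created half has the smallest rank is the one BPE would have
    applied first, so it is the merge that actually produced the token."""
    rank = {tok: i for i, tok in enumerate(vocab)}
    merges = []
    for i, tok in enumerate(vocab):
        if len(tok) == 1:
            continue  # base characters are not created by merges
        best = None
        for k in range(1, len(tok)):
            a, b = tok[:k], tok[k:]
            if rank.get(a, i) < i and rank.get(b, i) < i:
                key = max(rank[a], rank[b])  # step at which this split exists
                if best is None or key < best[0]:
                    best = (key, a, b)
        if best is not None:
            merges.append((best[1], best[2]))
    return merges
```

For the toy vocabulary `["t", "h", "e", "th", "the"]` this yields `[("t", "h"), ("th", "e")]`; redundant Llama-style alternatives like ("t", "he") never arise because "he" was never created.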

This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on BPE mixture inference. I think it might have been discovered by @soldaini.net if I'm not mistaken.

arxiv.org/abs/2407.16607

28.02.2025 10:47 | 👍 3    🔁 1    💬 1    📌 0
poster for paper

excited to be at #NeurIPS2024! I'll be presenting our data mixture inference attack 🗓️ Thu 4:30pm w/ @jon.jon.ke – stop by to learn what trained tokenizers reveal about LLM development (‼️) and chat about all things tokenizers.

🔗 arxiv.org/abs/2407.16607

11.12.2024 22:08 | 👍 13    🔁 4    💬 0    📌 0
Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy on individual tasks by OLMo 2 7B & 13B models within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.

09.12.2024 17:07 | 👍 33    🔁 14    💬 2    📌 0
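The underlying idea (fit how task accuracy improves with compute on a ladder of cheap training runs, then extrapolate to the target scale) can be sketched as follows. The data points and the saturating-power-law form here are invented for illustration; they are not the paper's actual ladder or functional form:

```python
import numpy as np

# Invented ladder of small runs: normalized training compute vs. task accuracy.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
accuracy = np.array([0.35, 0.46, 0.55, 0.61, 0.65])

# Assumed model: accuracy(c) = ceiling - a * c**(-b), i.e. performance
# approaches a task-specific ceiling as compute grows. With the ceiling
# fixed, the model is linear in log space, so we grid-search the ceiling
# and fit a line for the remaining two parameters.
best = None
for c_try in np.linspace(0.66, 0.95, 60):
    slope, intercept = np.polyfit(np.log(compute),
                                  np.log(c_try - accuracy), 1)
    fit = c_try - np.exp(intercept) * compute ** slope
    err = float(np.sum((fit - accuracy) ** 2))
    if best is None or err < best[0]:
        best = (err, c_try, intercept, slope)

_, ceiling, intercept, slope = best
# Extrapolate the cheap fits to a target run 10x beyond the ladder.
predicted = ceiling - np.exp(intercept) * 1000.0 ** slope
```

The appeal is that the ladder runs are tiny relative to the target model, which is how the prediction can cost ~1% of the target's pretraining compute.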
🚨 I too am on the job market ‼️🤯

I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there!

Papers in🧡, see more: saxon.me

06.12.2024 01:44 | 👍 49    🔁 9    💬 1    📌 2
GitHub - allenai/OLMo: Modeling, training, eval, and inference code for OLMo

We just updated the OLMo repo at github.com/allenai/OLMo!
There are now several training configs that together reproduce the training runs that led to the final OLMo 2 models.
In particular, all the training data is available, tokenized and shuffled exactly as we trained on it!

02.12.2024 20:13 | 👍 54    🔁 11    💬 0    📌 0
The OLMo 2 models sit at the Pareto frontier of training FLOPs vs model average performance.

Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B. As always, we released our data, code, recipes and more 🎁

26.11.2024 20:51 | 👍 151    🔁 36    💬 5    📌 12

OLMo 2 is out 🥳 7B and 13B trained on 5T tokens, and meticulously instruction tuned using the Tulu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at @ai2.bsky.social

26.11.2024 21:12 | 👍 260    🔁 44    💬 9    📌 2

🙋🏻‍♀️ ty!

25.11.2024 17:09 | 👍 1    🔁 0    💬 0    📌 0
a panda bear is rolling around in the grass in a zoo enclosure.

No one can explain stochastic gradient descent better than this panda.

24.11.2024 15:04 | 👍 216    🔁 32    💬 10    📌 6

Reading the TÜLU 3 paper from @ai2.bsky.social. It's refreshing to see a research lab treating AI as a real science with full reports, data, code, logs, evals.

Paper: allenai.org/papers/tulu-...
Demo: playground.allenai.org
Code: github.com/allenai/open...
Eval: github.com/allenai/olmes

Notes

24.11.2024 18:04 | 👍 25    🔁 5    💬 1    📌 0
Post image

Meet Tülu 3, a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms.
We invented new methods for fine-tuning language models with RL and built upon best practices to scale synthetic instruction and preference data.
Demo, GitHub, paper, and models 👇

21.11.2024 17:15 | 👍 111    🔁 31    💬 2    📌 7
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor – Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct.

Thread.

21.11.2024 17:01 | 👍 211    🔁 42    💬 8    📌 10
1/ Introducing OpenScholar: a retrieval-augmented LM to help scientists synthesize knowledge 📚
@uwnlp.bsky.social & Ai2
With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts.
Try out our demo!
openscholar.allen.ai

19.11.2024 16:30 | 👍 161    🔁 39    💬 6    📌 8