Catherine Arnett @ NeurIPS (San Diego)

@catherinearnett.bsky.social

NLP Researcher at EleutherAI, PhD UC San Diego Linguistics. Interested in multilingual NLP, tokenizers, open science. 📍 Boston. She/her. https://catherinearnett.github.io/

3,902 Followers  |  577 Following  |  117 Posts  |  Joined: 07.11.2024

Latest posts by catherinearnett.bsky.social on Bluesky

We will be presenting this work this afternoon!

07.12.2025 19:09 · 👍 2    🔁 0    💬 0    📌 0

I’m presenting this today at 11am. Come find me at poster #1909!

04.12.2025 18:40 · 👍 7    🔁 1    💬 0    📌 0

Reach out if you want to find a time to meet! Looking forward to seeing everyone!

24.11.2025 17:20 · 👍 3    🔁 0    💬 0    📌 0

I will also be at the @eval-eval.bsky.social Workshop on Evaluating AI in Practice at UCSD on December 8th!
bsky.app/profile/eval...

24.11.2025 17:20 · 👍 0    🔁 0    💬 1    📌 0
Preview
Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction · Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of th...

The preprint is out on arxiv now! arxiv.org/abs/2510.24934

24.11.2025 17:20 · 👍 1    🔁 0    💬 1    📌 0
Post image

@jamichaelov.bsky.social and I will be presenting our paper at the CogInterp workshop, 13:15-14:45 on Dec 7th. The paper shows how disaggregating grammatical benchmarks over the course of training reveals stages where models learn heuristics before acquiring more generalizable patterns.

24.11.2025 17:20 · 👍 8    🔁 2    💬 1    📌 2
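
For anyone curious what this kind of disaggregation looks like in practice, here is a minimal sketch (not the paper’s actual code; the file and the column names "checkpoint", "condition", and "correct" are hypothetical): score every benchmark item at each training checkpoint, then compare aggregate accuracy with accuracy broken out by experimental condition.

```python
# Hypothetical sketch: disaggregate a grammatical benchmark by condition
# across training checkpoints. The CSV schema below is illustrative only.
import pandas as pd

# Each row: one minimal-pair item scored (1 = correct) at one checkpoint.
df = pd.read_csv("benchmark_scores_by_checkpoint.csv")

# Aggregate accuracy per checkpoint (the usual headline curve) ...
overall = df.groupby("checkpoint")["correct"].mean()

# ... versus accuracy disaggregated by condition, which can expose phases
# where a heuristic helps on some conditions while hurting on others.
by_condition = (
    df.groupby(["checkpoint", "condition"])["correct"]
      .mean()
      .unstack("condition")
)
print(overall)
print(by_condition)
```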

I will be presenting our paper about tokenizer inequities at the main conference on Dec 4th at 11am (Poster Session 3) bsky.app/profile/cath...

24.11.2025 17:20 · 👍 3    🔁 1    💬 1    📌 0

I’ll be in San Diego for #NeurIPS2025 next week! I will be presenting posters at the main conference and at the CogInterp workshop. I will also be at the Workshop on Evaluating AI in Practice at UCSD. Looking forward to chatting about multilingual NLP, evals, and tokenizers!

24.11.2025 17:20 · 👍 6    🔁 1    💬 1    📌 0
Post image

We have kicked off proceedings with some brief opening remarks from @catherinearnett.bsky.social

09.11.2025 01:25 · 👍 3    🔁 1    💬 1    📌 0

🚨 EvalEval is back - now in San Diego! 🚨

🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)

📅 Dec 8, 2025
📍 Abstract due: Nov 20, 2025

Details below! ⬇️
evalevalai.com/events/works...

06.11.2025 21:19 · 👍 3    🔁 1    💬 1    📌 1

We’re still looking to expand the language coverage of Global PIQA, so sign up if you don’t see your language represented yet! bsky.app/profile/mrl-...

29.10.2025 15:53 · 👍 0    🔁 0    💬 0    📌 0

I’m so excited that Global PIQA is out! This has been a herculean effort by our 300+ contributors. The result is an extremely high-quality, culturally-specific benchmark for over 100 languages.

29.10.2025 15:53 · 👍 8    🔁 0    💬 1    📌 0
Preview
Explaining and Mitigating Crosslingual Tokenizer Inequities · The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughput during t...

You can read the preprint here: arxiv.org/abs/2510.21909
We release the tokenizers on Hugging Face: huggingface.co/datasets/cat...

28.10.2025 15:11 · 👍 4    🔁 0    💬 0    📌 0
Post image

We found that one of the biggest predictors of token premium effects was whitespace usage. So we also trained SuperBPE tokenizers, which do not use whitespace pretokenizers. SuperBPE tokenizers demonstrate better compression and less extreme token premium effects.

28.10.2025 15:11 · 👍 4    🔁 1    💬 1    📌 0
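
As a rough illustration of what dropping the whitespace pretokenizer changes, here is a minimal sketch using the Hugging Face tokenizers library (this is not the paper’s training setup, and real SuperBPE uses a staged curriculum rather than a single flag):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(files, vocab_size, cross_whitespace=False):
    """Train a BPE tokenizer; optionally let merges cross word boundaries."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if not cross_whitespace:
        # Conventional setup: pretokenize on whitespace, so no merged token
        # can ever span two words.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    # With cross_whitespace=True the whitespace pretokenizer is simply
    # omitted, so frequent multi-word strings can become single tokens
    # (the idea behind SuperBPE-style tokenizers).
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["[UNK]"])
    tok.train(files, trainer)
    return tok

# Hypothetical usage: compare the two variants on the same corpus.
# tok_ws = train_bpe(["corpus.txt"], vocab_size=32_000)
# tok_xw = train_bpe(["corpus.txt"], vocab_size=32_000, cross_whitespace=True)
```
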
Post image

While it’s possible to achieve the same compression for some sets of languages by manipulating vocabulary size, there are some languages for which changing the vocab size does not lead to the same compression.

28.10.2025 15:11 · 👍 1    🔁 0    💬 1    📌 0

We show that some languages need more vocabulary items to get the same compression. This suggests that multilingual tokenizers should allocate more or less vocab to different languages, which can help us design more equitable multilingual tokenizers.

28.10.2025 15:11 · 👍 1    🔁 0    💬 1    📌 0
Post image

We used the compression rates we got from our monolingual tokenizers to estimate the vocabulary size at which a tokenizer would reach a target compression rate. We used this to determine the “optimal” vocab size for each language. This significantly reduces token premium effects.

28.10.2025 15:11 · 👍 2    🔁 0    💬 1    📌 0
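
A minimal sketch of that estimation step, with made-up numbers (the real analysis fits this per language from the monolingual tokenizer sweep; the vocab sizes, compression values, and target below are illustrative only):

```python
# Hypothetical sketch: interpolate the vocab size at which one language's
# tokenizer would reach a target compression rate (bytes per token).
import numpy as np

vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000])
compression = np.array([2.1, 2.6, 3.0, 3.3, 3.5])  # measured per vocab size

target = 3.2  # e.g. the compression a reference language reaches at 32k

# Compression increases monotonically with vocab size here, so we can
# treat vocab size as a function of compression and interpolate.
optimal_vocab = np.interp(target, compression, vocab_sizes)
print(f"estimated vocab size for target compression: {optimal_vocab:,.0f}")
```
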
Post image

We trained 7000 monolingual tokenizers for 97 languages and a range of vocabulary sizes. There was no vocabulary size at which token premiums went away, though larger vocabularies unsurprisingly led to better compression and slightly smaller token premiums.

28.10.2025 15:11 · 👍 2    🔁 0    💬 1    📌 0

Compression isn’t the only tokenizer metric that matters, but it directly determines how many computations a model needs to process text. That affects both training and inference cost. Ideally, we want compression rates to be as similar as possible across languages.

28.10.2025 15:11 · 👍 1    🔁 0    💬 1    📌 0
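
To make the cross-lingual comparison concrete, here is a minimal sketch of measuring token premiums on parallel text (the tokenizer name and file paths are placeholders, not the paper’s setup): count the tokens each language needs for the same content and take the ratio to a reference language.

```python
# Hypothetical sketch: token premiums relative to a reference language,
# computed on line-aligned parallel files.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # placeholder model

def count_tokens(path):
    with open(path, encoding="utf-8") as f:
        return sum(len(tok.encode(line, add_special_tokens=False)) for line in f)

ref = count_tokens("parallel/eng.txt")  # reference language
for lang in ["deu", "fin", "tam"]:
    premium = count_tokens(f"parallel/{lang}.txt") / ref
    print(f"{lang}: token premium = {premium:.2f}x")
```
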
Post image

Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!

28.10.2025 15:11 · 👍 34    🔁 5    💬 1    📌 2

Yeah I have many thoughts about that post. I do have a follow up post brewing 👀 probably will be some months before I finish it though!

23.10.2025 01:00 · 👍 1    🔁 0    💬 0    📌 0
Post image

WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025

10.10.2025 16:17 · 👍 2    🔁 3    💬 1    📌 0
Post image

In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

09.10.2025 20:17 · 👍 3    🔁 3    💬 1    📌 1
Name tag with “Anti Anti Tokenizer Club” pin on lanyard

I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025

06.10.2025 21:30 · 👍 12    🔁 0    💬 0    📌 0

Yeah, I think the models do generally capture this well and with a lot of flexibility. I think when people have done morphological tokenization, it tends to be really rigid and fragile to anything OOD

26.09.2025 22:19 · 👍 1    🔁 0    💬 0    📌 0

I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully you would kind of need to solve Language first

26.09.2025 21:56 · 👍 1    🔁 0    💬 1    📌 0

Thanks!

26.09.2025 17:58 · 👍 1    🔁 0    💬 0    📌 0
Preview
There is no such thing as a tokenizer-free lunch · A Blog post by Catherine Arnett on Hugging Face

huggingface.co/blog/catheri...

25.09.2025 15:14 · 👍 16    🔁 4    💬 1    📌 1
Post image

I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!

25.09.2025 15:14 · 👍 60    🔁 15    💬 5    📌 2
Preview
An Analysis of Multilingual Models on Hugging Face · A Blog post by Catherine Arnett on Hugging Face

huggingface.co/blog/catheri...

19.09.2025 14:53 · 👍 4    🔁 0    💬 0    📌 0
