@catherinearnett.bsky.social
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics. Interested in multilingual NLP, tokenizers, open science. Boston. She/her. https://catherinearnett.github.io/

We will be presenting this work this afternoon!
07.12.2025 19:09

I'm presenting this today at 11am. Come find me at poster #1909!
04.12.2025 18:40

Reach out if you want to find a time to meet! Looking forward to seeing everyone!
24.11.2025 17:20

I will also be at the @eval-eval.bsky.social Workshop on Evaluating AI in Practice at UCSD on December 8th!
bsky.app/profile/eval...
The preprint is out on arxiv now! arxiv.org/abs/2510.24934
24.11.2025 17:20

@jamichaelov.bsky.social and I will be presenting our paper at the CogInterp workshop, 13:15-14:45 on Dec 7th. The paper shows how disaggregating grammatical benchmarks over the course of training reveals stages where models learn heuristics before learning more generalizable patterns.
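The disaggregation itself is simple bookkeeping. Here is a minimal sketch of the idea, assuming a results table with per-item correctness for each checkpoint; the file name and column names are hypothetical, not the paper's artifacts:

```python
# Minimal sketch of disaggregating a grammatical benchmark over training.
# Assumes a CSV with one row per (checkpoint, item): columns `checkpoint_step`,
# `phenomenon`, and a binary `correct` flag. All names here are hypothetical.
import pandas as pd

results = pd.read_csv("per_item_results.csv")

# The usual single aggregate score per checkpoint...
aggregate = results.groupby("checkpoint_step")["correct"].mean()

# ...versus per-phenomenon accuracy, which can reveal stages where some
# phenomena are handled early (consistent with shallow heuristics) while
# more generalizable patterns only emerge later in training.
by_phenomenon = (
    results.groupby(["checkpoint_step", "phenomenon"])["correct"]
    .mean()
    .unstack("phenomenon")
    .sort_index()
)

print(aggregate.head())
print(by_phenomenon.head())
```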
24.11.2025 17:20

I will be presenting our paper about tokenizer inequities at the main conference on Dec 4th at 11am (Poster Session 3): bsky.app/profile/cath...
24.11.2025 17:20

I'll be in San Diego for #NeurIPS2025 next week! I will be presenting posters at the main conference and at the CogInterp workshop. I will also be at the Workshop on Evaluating AI in Practice at UCSD. Looking forward to chatting about multilingual NLP, evals, and tokenizers!
24.11.2025 17:20

We have kicked off proceedings with some brief opening remarks from @catherinearnett.bsky.social
09.11.2025 01:25

EvalEval is back - now in San Diego!
Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (co-hosted with UKAISI).
Dec 8, 2025
Abstract due: Nov 20, 2025
Details below!
evalevalai.com/events/works...
We're still looking to expand the language coverage of Global PIQA, so sign up if you don't see your language represented yet! bsky.app/profile/mrl-...
29.10.2025 15:53

I'm so excited that Global PIQA is out! This has been a herculean effort by our 300+ contributors. The result is an extremely high-quality, culturally specific benchmark for over 100 languages.
29.10.2025 15:53

You can read the preprint here: arxiv.org/abs/2510.21909
We release the tokenizers on Hugging Face: huggingface.co/datasets/cat...
We found that one of the biggest predictors of token premium effects was whitespace usage. So we also trained SuperBPE tokenizers, which do not use whitespace pretokenizers. SuperBPE tokenizers demonstrate better compression and less extreme token premium effects.
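For intuition, here is a minimal sketch of the single knob being discussed: whether BPE merges are allowed to cross whitespace boundaries. This is not the actual SuperBPE training recipe (which uses a staged curriculum); the file paths and vocab size are placeholders.

```python
# Sketch: training a BPE tokenizer with and without a whitespace pre-tokenizer,
# to compare compression. NOT the full SuperBPE recipe; it only illustrates
# dropping the whitespace pre-tokenization step. Paths/sizes are placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(files, vocab_size, whitespace_pretok):
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if whitespace_pretok:
        # Conventional setup: merges cannot cross whitespace boundaries.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    # With no pre-tokenizer, merges may span spaces, which tends to improve
    # compression, especially for languages that do not use whitespace.
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train(files, trainer)
    return tok

baseline = train_bpe(["corpus.txt"], 32_000, whitespace_pretok=True)
no_whitespace = train_bpe(["corpus.txt"], 32_000, whitespace_pretok=False)
```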
28.10.2025 15:11

While it's possible to achieve the same compression for some sets of languages by manipulating vocabulary size, there are some languages for which changing the vocab size does not lead to the same compression.
28.10.2025 15:11

We show that some languages need more vocabulary items to get the same compression. This suggests that multilingual tokenizers should allocate more or less vocab to different languages, which can help us design more equitable multilingual tokenizers.
28.10.2025 15:11

We used the compression rates from our monolingual tokenizers to estimate the vocabulary size at which a tokenizer would reach a target compression rate. We used this to determine the "optimal" vocab size for each language. This significantly reduces token premium effects.
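Roughly, the estimation looks something like this minimal sketch: invert a measured compression-vs-vocab-size curve to find the vocab size that hits a shared compression target. The log-linear interpolation and the numbers below are illustrative assumptions, not necessarily the paper's estimator.

```python
# Sketch: picking a per-language vocabulary size that reaches a shared target
# compression by inverting a measured compression-vs-vocab-size curve.
# Interpolation method and data points are illustrative only.
import numpy as np

vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000])
bytes_per_token = np.array([2.9, 3.3, 3.6, 3.9, 4.1])  # hypothetical measurements

def vocab_for_target(target_bpt):
    # Interpolate log(vocab size) as a function of measured compression.
    return float(np.exp(np.interp(target_bpt, bytes_per_token, np.log(vocab_sizes))))

# Instead of giving every language the same vocabulary size, each language
# gets the size at which it reaches the shared compression target.
print(round(vocab_for_target(3.7)))  # ~40k for this made-up curve
```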
28.10.2025 15:11

We trained 7,000 monolingual tokenizers for 97 languages and a range of vocabulary sizes. There was no vocabulary size at which token premiums went away, though larger vocabularies unsurprisingly led to better compression and slightly smaller token premiums.
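A scaled-down sketch of that sweep for one language, using the Hugging Face `tokenizers` library; the corpus paths and vocab grid are placeholders, and the actual study covers 97 languages and thousands of tokenizers.

```python
# Sketch: sweeping vocabulary sizes for one language's monolingual tokenizer
# and recording compression (UTF-8 bytes per token) on held-out text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def bytes_per_token(tok, text):
    return len(text.encode("utf-8")) / len(tok.encode(text).ids)

heldout = open("heldout.txt", encoding="utf-8").read()

compression_by_vocab = {}
for vocab_size in [8_000, 16_000, 32_000, 64_000, 128_000]:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train(["train.txt"], trainer)
    compression_by_vocab[vocab_size] = bytes_per_token(tok, heldout)

print(compression_by_vocab)
```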
28.10.2025 15:11

Compression isn't the only tokenizer metric that matters, but it directly determines how many computations a model needs to process text. That affects both training and inference cost. Ideally, we want compression rates to be as similar as possible across languages.
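Concretely, compression can be measured as UTF-8 bytes per token on parallel text, so the rates are comparable across languages. A hedged sketch, where the tokenizer name and file names are placeholders rather than the paper's setup:

```python
# Sketch: comparing compression (UTF-8 bytes per token) across languages on
# parallel text. Tokenizer name and file names are placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-multilingual-tokenizer")  # placeholder

def bytes_per_token(path):
    text = open(path, encoding="utf-8").read()
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return len(text.encode("utf-8")) / n_tokens

# Large gaps between languages on the same content are the "token premiums"
# discussed here: more tokens means more compute at training and inference.
for lang in ["eng", "tam", "yor"]:
    print(lang, round(bytes_per_token(f"parallel.{lang}.txt"), 2))
```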
28.10.2025 15:11

Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
28.10.2025 15:11

Yeah, I have many thoughts about that post. I do have a follow-up post brewing; it will probably be some months before I finish it, though!
23.10.2025 01:00

WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
10.10.2025 16:17

In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
09.10.2025 20:17

[Image: Name tag with "Anti Anti Tokenizer Club" pin on lanyard]
I'm in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
06.10.2025 21:30

Yeah, I think the models do generally capture this well and with a lot of flexibility. I think when people have done morphological tokenization, it tends to be really rigid and fragile to anything OOD.
26.09.2025 22:19

I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully you would kind of need to solve Language first.
26.09.2025 21:56

Thanks!
26.09.2025 17:58

I have a new blog post about the so-called "tokenizer-free" approach to language modeling and why it's not tokenizer-free at all. I also talk about why people hate tokenizers so much!
25.09.2025 15:14