Catherine Arnett @ 🍁COLM🍁

@catherinearnett.bsky.social

NLP Researcher at EleutherAI, PhD UC San Diego Linguistics. Previously PleIAs, Edinburgh University. Interested in multilingual NLP, tokenizers, open science. 📍Boston. She/her. https://catherinearnett.github.io/

3,852 Followers  |  568 Following  |  98 Posts  |  Joined: 07.11.2024

Latest posts by catherinearnett.bsky.social on Bluesky

Name tag with “Anti Anti Tokenizer Club” pin on lanyard

I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025

06.10.2025 21:30 | 👍 12 | 🔁 0 | 💬 0 | 📌 0

Yeah, I think the models do generally capture this well and with a lot of flexibility. I think when people have done morphological tokenization, it tends to be really rigid and fragile to anything OOD

26.09.2025 22:19 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully you would kind of need to solve Language first

26.09.2025 21:56 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

Thanks!

26.09.2025 17:58 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

There is no such thing as a tokenizer-free lunch. A blog post by Catherine Arnett on Hugging Face.

huggingface.co/blog/catheri...

25.09.2025 15:14 | 👍 14 | 🔁 4 | 💬 1 | 📌 1

I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!

25.09.2025 15:14 | 👍 56 | 🔁 13 | 💬 4 | 📌 2

An Analysis of Multilingual Models on Hugging Face. A blog post by Catherine Arnett on Hugging Face.

huggingface.co/blog/catheri...

19.09.2025 14:53 | 👍 4 | 🔁 0 | 💬 0 | 📌 0

Did you know?

❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English

In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇

19.09.2025 14:53 | 👍 13 | 🔁 2 | 💬 1 | 📌 0

We are in need of some emergency reviewers for MRL. If you are available, please fill out this form!

12.09.2025 18:31 | 👍 0 | 🔁 1 | 💬 0 | 📌 0

We extended the deadline by one day, so you have until the end of today (Aug 24) AoE to submit! Good luck!

24.08.2025 22:08 | 👍 0 | 🔁 1 | 💬 0 | 📌 0

We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!

18.08.2025 15:52 | 👍 3 | 🔁 3 | 💬 1 | 📌 0

With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!

05.08.2025 15:13 | 👍 3 | 🔁 2 | 💬 1 | 📌 1

I’m in Vienna all week for @aclmeeting.bsky.social and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!

27.07.2025 15:29 | 👍 18 | 🔁 1 | 💬 0 | 📌 0

If you want to help us improve language and cultural coverage, and build an open source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰

21.07.2025 22:40 | 👍 2 | 🔁 2 | 💬 0 | 📌 0

Really grateful to the organizers for the recognition of our work!

19.07.2025 13:55 | 👍 12 | 🔁 1 | 💬 1 | 📌 0

bsky.app/profile/cath...

10.07.2025 16:13 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

bsky.app/profile/cath...

10.07.2025 16:13 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!

10.07.2025 16:13 | 👍 11 | 🔁 2 | 💬 1 | 📌 0

Read the new pre-print: arxiv.org/abs/2507.06378
Use MorphScore: github.com/catherinearn...

10.07.2025 16:09 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

We replicate the findings from the COLING paper and find that higher morphological alignment scores do not correlate with better performance. In fact, they’re predictive of slightly *worse* performance across multiple tasks and models.

10.07.2025 16:09 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

MorphScore v2 allows for flexible evaluation. You can decide whether to weight different words by their frequency and whether to include single-token words in the analysis. We also kept morphological tags, sentential context, and part-of-speech information to allow for analyses.

10.07.2025 16:09 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

Why do language models perform worse for morphologically complex languages? Catherine Arnett, Benjamin Bergen. Proceedings of the 31st International Conference on Computational Linguistics. 2025.

The original version of MorphScore, which we introduced earlier this year in this COLING paper, evaluates the extent to which tokenizers split words into morphemic tokens. In addition to expanding the language coverage, we address some of its limitations: aclanthology.org/2025.coling-...

10.07.2025 16:09 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new preprint out and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social

10.07.2025 16:09 | 👍 12 | 🔁 4 | 💬 1 | 📌 1

Dynabench

Contribute here: dynabench.org/tasks/text-l...

09.07.2025 14:21 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!

09.07.2025 14:21 | 👍 3 | 🔁 1 | 💬 1 | 📌 0

Stop by our Discord server tomorrow, Friday June 27th, to hear about @catherinearnett.bsky.social's work!

26.06.2025 18:18 | 👍 7 | 🔁 2 | 💬 2 | 📌 0

The dataset will be open sourced and all contributors will be authors on the benchmark paper. This is a great opportunity for students and early-stage researchers!

25.06.2025 15:43 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

We invite submissions of original, localized evaluation sets, rather than translated datasets. We hope the resulting dataset will more appropriately capture language- and culture-specific information.

25.06.2025 15:43 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

I'm really excited about this shared task! We hope to create a massively multilingual physical reasoning dataset in collaboration with researchers around the world 🌍

25.06.2025 15:43 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

Call for Reviewers - 5th Multilingual Representation Learning (MRL) Workshop, EMNLP 2025

If you would like to sign up to be a reviewer, please fill in this form: forms.gle/fbizvGghD33c...

24.06.2025 16:33 | 👍 0 | 🔁 0 | 💬 0 | 📌 0