Catherine Arnett @ ACL 🇦🇹

@catherinearnett.bsky.social

NLP Researcher at EleutherAI, PhD UC San Diego Linguistics. Previously PleIAs, Edinburgh University. Interested in multilingual NLP, tokenizers, open science. 📍 Boston. She/her. https://catherinearnett.github.io/

3,804 Followers  |  561 Following  |  90 Posts  |  Joined: 07.11.2024

Latest posts by catherinearnett.bsky.social on Bluesky

Post image

With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute across more than 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!

05.08.2025 15:13 — 👍 2    🔁 2    💬 1    📌 0

I’m in Vienna all week for @aclmeeting.bsky.social and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!

27.07.2025 15:29 — 👍 17    🔁 1    💬 0    📌 0

If you want to help us improve language and cultural coverage and build an open-source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰

21.07.2025 22:40 — 👍 2    🔁 2    💬 0    📌 0

Really grateful to the organizers for the recognition of our work!

19.07.2025 13:55 — 👍 12    🔁 1    💬 1    📌 0

bsky.app/profile/cath...

10.07.2025 16:13 — 👍 0    🔁 0    💬 0    📌 0

bsky.app/profile/cath...

10.07.2025 16:13 — 👍 0    🔁 0    💬 1    📌 0

I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!

10.07.2025 16:13 — 👍 11    🔁 2    💬 2    📌 0

Read the new preprint: arxiv.org/abs/2507.06378
Use MorphScore: github.com/catherinearn...

10.07.2025 16:09 — 👍 0    🔁 0    💬 0    📌 0
Post image

We replicate the findings from the COLING paper and find that higher morphological alignment scores do not correlate with better performance. In fact, they’re predictive of slightly *worse* performance across multiple tasks and models.

10.07.2025 16:09 — 👍 2    🔁 0    💬 1    📌 0

MorphScore v2 allows for flexible evaluation: you can decide whether to weight words by their frequency and whether to include single-token words in the analysis. We also retained morphological tags, sentential context, and part-of-speech information to enable further analyses.

10.07.2025 16:09 — 👍 0    🔁 0    💬 1    📌 0
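To illustrate the kind of measure being discussed, here is a minimal sketch of a morphological-alignment score: the share of words whose tokenizer segmentation boundaries coincide with gold morpheme boundaries. The data and helper functions are hypothetical; this is not the MorphScore implementation or API.

```python
# Toy sketch of a morphological-alignment score (NOT the real MorphScore code).
# A word counts as aligned when its token boundaries exactly match the
# gold morpheme boundaries.

def boundaries(segments):
    """Character offsets of internal segment boundaries (word edges excluded)."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def alignment_score(tokenized, gold):
    """Fraction of words whose token boundaries match the morpheme boundaries."""
    hits = sum(boundaries(t) == boundaries(g) for t, g in zip(tokenized, gold))
    return hits / len(gold)

# Hypothetical example: "unhappiness" and "cats"
gold = [["un", "happi", "ness"], ["cat", "s"]]
tokenized = [["un", "happiness"], ["cat", "s"]]  # only "cats" is split morphemically
print(alignment_score(tokenized, gold))  # 0.5
```

Frequency weighting or excluding single-token words, as described in the post above, would be straightforward variations on the same boundary comparison.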
Preview
Why do language models perform worse for morphologically complex languages? Catherine Arnett, Benjamin Bergen. Proceedings of the 31st International Conference on Computational Linguistics. 2025.

The original version of MorphScore, which we introduced earlier this year in this COLING paper, evaluates the extent to which tokenizers split words into morphemic tokens. In addition to expanding the language coverage, we address some of its limitations: aclanthology.org/2025.coling-...

10.07.2025 16:09 — 👍 1    🔁 0    💬 1    📌 0
Post image

MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new preprint out, and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social

10.07.2025 16:09 — 👍 11    🔁 4    💬 1    📌 1
Dynabench

Contribute here: dynabench.org/tasks/text-l...

09.07.2025 14:21 — 👍 1    🔁 0    💬 0    📌 0

Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!

09.07.2025 14:21 — 👍 3    🔁 1    💬 1    📌 0

Stop by our Discord server tomorrow, Friday, June 27th, to hear about @catherinearnett.bsky.social's work!

26.06.2025 18:18 — 👍 7    🔁 2    💬 2    📌 0

The dataset will be open-sourced, and all contributors will be authors on the benchmark paper. This is a great opportunity for students and early-stage researchers!

25.06.2025 15:43 — 👍 0    🔁 0    💬 0    📌 0

We invite submissions of original, localized evaluation sets rather than translated datasets. We hope the resulting dataset will more appropriately capture language- and culture-specific information.

25.06.2025 15:43 — 👍 0    🔁 0    💬 1    📌 0

I'm really excited about this shared task! We hope to create a massively multilingual physical reasoning dataset in collaboration with researchers around the world 🌍

25.06.2025 15:43 — 👍 0    🔁 0    💬 1    📌 0
Preview
Call for Reviewers - 5th Multilingual Representation Learning (MRL) Workshop, EMNLP 2025

If you would like to sign up to be a reviewer, please fill in this form: forms.gle/fbizvGghD33c...

24.06.2025 16:33 — 👍 0    🔁 0    💬 0    📌 0
Post image

As part of the workshop, we are also organizing a shared task to develop a collaborative physical commonsense reasoning evaluation dataset. See the shared task page for more information: sigtyp.github.io/st2025-mrl.h....

24.06.2025 16:33 — 👍 2    🔁 1    💬 1    📌 2
5TH MULTILINGUAL REPRESENTATION LEARNING (MRL) WORKSHOP @EMNLP 2025 SIGTYP

MRL accepts long and short papers as well as extended abstracts. For more, check out the call for papers: sigtyp.github.io/ws2025-mrl.h...

24.06.2025 16:33 — 👍 0    🔁 0    💬 1    📌 0
Post image

The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!

24.06.2025 16:33 — 👍 6    🔁 0    💬 1    📌 0

The deadline for paper submissions has been extended!

The new deadline is July 3, 2025 (AoE).

For more information, please visit: wmdqs.org

23.06.2025 14:23 — 👍 2    🔁 5    💬 0    📌 0
WMDQS: Shared Task

Check out the call for papers for more information: wmdqs.org/shared-task/
And feel free to get in touch if you have questions!

09.06.2025 15:44 — 👍 0    🔁 0    💬 0    📌 0
Post image

Data contributions can be made through the Web Languages Project and/or Text Language Identification task on Dynabench. Top contributors will be recognized as part of the shared task!
Web Langs Project: github.com/commoncrawl/...
Text ID: dynabench.org/tasks/text-l...

09.06.2025 15:44 — 👍 0    🔁 0    💬 1    📌 0

We invite submissions of LangID models and annotated training data. Authors of accepted submissions will be invited to participate in a joint paper for a high-impact NLP conference. And of course, all datasets and models will be open source and available to everyone!

09.06.2025 15:44 — 👍 0    🔁 0    💬 1    📌 0
Post image

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc

09.06.2025 15:44 — 👍 4    🔁 3    💬 1    📌 1
1st Workshop on Multilingual Data Quality Signals

Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org

29.05.2025 17:18 — 👍 9    🔁 8    💬 0    📌 1
talks.cam : When is Multilinguality a Curse? Language Modeling for 350 Languages

talks.cam.ac.uk/talk/index/2...

06.06.2025 12:57 — 👍 0    🔁 0    💬 0    📌 0

@tylerachang.bsky.social and I are giving a talk later today at the Cambridge NLP group about the curse of multilinguality and training small models. The talk is open to the public and the link is below!

06.06.2025 12:57 — 👍 2    🔁 0    💬 1    📌 0