With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don't see your language represented on the map, this is your sign to get involved!
05.08.2025 15:13
I'm in Vienna all week for @aclmeeting.bsky.social and I'll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!
27.07.2025 15:29
If you want to help us improve language and cultural coverage, and build an open source LangID system, please register for our shared task on Language Identification!
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE)
21.07.2025 22:40
Really grateful to the organizers for the recognition of our work!
19.07.2025 13:55
bsky.app/profile/cath...
10.07.2025 16:13
bsky.app/profile/cath...
10.07.2025 16:13
I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
10.07.2025 16:13
Read the new pre-print: arxiv.org/abs/2507.06378
Use MorphScore: github.com/catherinearn...
10.07.2025 16:09
We replicate the findings from the COLING paper and find that higher morphological alignment scores do not correlate with better performance. In fact, they're predictive of slightly *worse* performance across multiple tasks and models.
10.07.2025 16:09
MorphScore v2 allows for flexible evaluation. You can decide whether to weight different words by their frequency and whether to include single-token words in the analysis. We also kept morphological tags, sentential context, and part-of-speech information to allow for analyses.
10.07.2025 16:09
Why do language models perform worse for morphologically complex languages?
Catherine Arnett, Benjamin Bergen. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
The original version of MorphScore, which we introduced earlier this year in this COLING paper, evaluates the extent to which tokenizers split words into morphemic tokens. In addition to expanding the language coverage, we address some of its limitations. aclanthology.org/2025.coling-...
10.07.2025 16:09
MorphScore got an update! MorphScore now covers 70 languages. We have a new preprint out and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social
10.07.2025 16:09
Dynabench
Contribute here: dynabench.org/tasks/text-l...
09.07.2025 14:21
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
09.07.2025 14:21
Stop by our Discord server tomorrow, Friday, June 27th, to hear about @catherinearnett.bsky.social's work!
26.06.2025 18:18
The dataset will be open sourced and all contributors will be authors on the benchmark paper. This is a great opportunity for students and early-stage researchers!
25.06.2025 15:43
We invite submissions of original, localized evaluation sets, rather than translated datasets. We hope the resulting dataset will more appropriately capture language- and culture-specific information.
25.06.2025 15:43
I'm really excited about this shared task! We hope to create a massively multilingual physical reasoning dataset in collaboration with researchers around the world.
25.06.2025 15:43
Call for Reviewers - 5th Multilingual Representation Learning (MRL) Workshop, EMNLP 2025
If you would like to sign up to be a reviewer, please fill in this form: forms.gle/fbizvGghD33c...
24.06.2025 16:33
As part of the workshop, we are also organizing a shared task to develop a collaborative physical commonsense reasoning evaluation dataset. See the shared task page for more information: sigtyp.github.io/st2025-mrl.h....
24.06.2025 16:33
5TH MULTILINGUAL REPRESENTATION LEARNING (MRL) WORKSHOP @EMNLP 2025
SIGTYP
MRL accepts long and short papers as well as extended abstracts. For more, check out the call for papers: sigtyp.github.io/ws2025-mrl.h...
24.06.2025 16:33
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!
24.06.2025 16:33
The deadline for paper submissions has been extended!
The new deadline is July 3, 2025 (AoE).
For more information, please visit: wmdqs.org
23.06.2025 14:23
WMDQS: Shared Task
Check out the call for papers for more information: wmdqs.org/shared-task/
And feel free to get in touch if you have questions!
09.06.2025 15:44
Data contributions can be made through the Web Languages Project and/or Text Language Identification task on Dynabench. Top contributors will be recognized as part of the shared task!
Web Langs Project: github.com/commoncrawl/...
Text ID: dynabench.org/tasks/text-l...
09.06.2025 15:44
We invite submissions of LangID models and annotated training data. Authors of accepted submissions will be invited to participate in a joint paper for a high-impact NLP conference. And of course all datasets and models will be open source and available to everyone!
09.06.2025 15:44
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
09.06.2025 15:44
1st Workshop on Multilingual Data Quality Signals
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
29.05.2025 17:18
@tylerachang.bsky.social and I are giving a talk later today at the Cambridge NLP group about the curse of multilinguality and training small models. The talk is open to the public and the link is below!
06.06.2025 12:57