
Tom Kocmi

@kocmitom.bsky.social

Researcher at Cohere | Multilingual LLM evaluation

255 Followers  |  107 Following  |  16 Posts  |  Joined: 20.11.2024

Latest posts by kocmitom.bsky.social on Bluesky

WMT 2025

Hey, hey! 🎉 We’ve released the blind test set for this year’s WMT General MT and multilingual instruction tasks. Submit your systems to the special 20th anniversary edition of the conference and see how you compare to others!
The deadline is next week, on 3 July.
www2.statmt.org/wmt25/

26.06.2025 18:09 — 👍 1    🔁 0    💬 0    📌 0

Tired of messy, non-replicable multilingual LLM evaluation? So were we.

In our new paper, we experimentally illustrate common evaluation issues and show how structured evaluation design, transparent reporting, and meta-evaluation can help us build stronger models.

17.04.2025 13:12 — 👍 7    🔁 1    💬 0    📌 0

β˜€οΈ Summer internship at Cohere!
Are you excited about multilingual evaluation, human judgment, or meta-eval? Come help us explore how a rigorous eval really looks like while questioning the status quo in LLM evaluation.
I’m looking for an intern (EU timezone preferred), are you interested? Ping me!

28.03.2025 16:44 — 👍 7    🔁 2    💬 2    📌 0

It’s here! Our new model’s technical report is out. I'm especially proud of the work we did on its multilingual capabilities - this was a massive, collective effort!

27.03.2025 16:42 — 👍 1    🔁 0    💬 0    📌 0
Multilingual Instruction Shared Task

Big news from WMT! 🎉 We are expanding beyond MT and launching a new multilingual instruction shared task. Our goal is to foster truly multilingual LLM evaluation and best practices in automatic and human evaluation. Join us and build the winning multilingual system!
www2.statmt.org/wmt25/multil...

11.03.2025 18:26 — 👍 12    🔁 7    💬 1    📌 2

AI is evolving fast, and Aya Vision is proof of that. This open-weights model is designed to make LLMs more powerful across languages and modalities, especially vision! Can’t wait to see the real-world applications, perhaps at WMT this year 😇

04.03.2025 14:40 — 👍 2    🔁 0    💬 0    📌 0
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects
As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, includin...

Huge shoutout to colleagues at Google & Unbabel for extending our WMT24 test set to 55 languages in four domains; this is a game changer! 🚀

I really hope it puts the final nail in the coffin of FLORES or WMT14. The field is evolving; legacy test sets can't show your progress.

arxiv.org/abs/2502.124...

01.03.2025 20:30 — 👍 14    🔁 6    💬 0    📌 0
Shared Task: General Machine Translation

* Revamped constrained track – No restrictions on training data except licensing; all open models under 20B parameters are allowed.

* More challenging sources; long-context translation; prompt preambles; and much more.

📌 All details are available at www2.statmt.org/wmt25/transl...

20.02.2025 21:31 — 👍 2    🔁 0    💬 0    📌 0

* New human-evaluated language pairs: EN–Arabic, EN–Estonian, EN–Korean, EN–Serbian, Czech–German, Bhojpuri–EN, Maasai–EN

* New multilingual subtask – Can you build a system that translates 30 languages?

* New modalities – Additional context from video and image (text-to-text remains the core).

20.02.2025 21:31 — 👍 4    🔁 0    💬 1    📌 0

Guess what? The jubilee 🎉 20th iteration of WMT General MT 🎉 is here, and we want you to participate, as the entry barrier to making an impact is so low!

This isn’t just any repeat. We’ve kept what worked, removed what was outdated, and introduced many exciting new twists! Among the key changes are:

20.02.2025 21:31 — 👍 18    🔁 5    💬 1    📌 3

Yeah, I haven't written a paper since it's just a different prompt. It's published in the GitHub repository of GEMBA.

09.02.2025 10:14 — 👍 0    🔁 0    💬 1    📌 0

That one is extremely large, but we haven't used it in the automatic ranking either. Unfortunately, I'm not aware of any API service for metrics.

08.02.2025 11:44 — 👍 0    🔁 0    💬 1    📌 0

πŸ™ A huge thank you to all organizers, partners, and participants for making this year's WMT General MT Shared Task a success! Stay tuned for WMT25 - many exciting changes are coming! πŸŽ‰

20.11.2024 10:16 — 👍 2    🔁 0    💬 0    📌 0

πŸ† Highlights from top systems:
βœ… IOL-Research: led in constrained/open, winning 10/11 in its category.
βœ… Unbabel-Tower70B: Best participant, winning 8/11 pairs.
βœ… Claude-3.5-Sonnet: Best overall with 9/11 wins.
βœ… Shoutout to Dubformer (speech) & CUNI-MH (strong constrained)

20.11.2024 10:16 — 👍 4    🔁 0    💬 1    📌 0

📊 We introduced a new robust and efficient human evaluation protocol: Error Span Annotations (ESA).
📄 Test sets are now finally document-level!
🌍 We've added three new language pairs, including English-Spanish, where translations are near-perfect.
For more details, read our findings paper.

20.11.2024 10:16 — 👍 0    🔁 0    💬 1    📌 0

Exciting times at this year's WMT24 General MT Shared Task:
🚀 Participant numbers increased by over 50%!
🏗️ Decoder-only architectures are leading the way.
🔊 We've introduced a new speech audio modality domain.
🌐 Online systems are losing ground to LLMs.

20.11.2024 10:16 — 👍 5    🔁 1    💬 2    📌 0
