We’re not your average lab. We’re a hybrid research environment dedicated to revolutionizing the ML space.
And we’re hiring a Senior Research Scientist to co-create with us.
If you believe in research as a shared, global effort — this is your chance.
30.09.2025 10:00 — 👍 4 🔁 3 💬 1 📌 0
💡A collaborative ➕ diverse team is key, in real life as in the LLM world 💪🦾
Check out our latest work that builds on this insight. 👇
02.10.2025 14:10 — 👍 3 🔁 1 💬 1 📌 0
Breaking into AI research is harder than ever, and early-career researchers face fewer chances to get started.
Entry points matter.
We started the Scholars Program 3 years ago to give new researchers a real shot — excited to open applications for year 4✨
13.08.2025 14:42 — 👍 6 🔁 3 💬 1 📌 0
While effective for chess♟️, Elo ratings struggle with LLM evaluation due to volatility and transitivity issues.
New post in collaboration with AI Singapore explores why Elo falls short for AI leaderboards and how we can do better.
15.08.2025 05:04 — 👍 6 🔁 3 💬 1 📌 0
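For context, here is a minimal sketch of the standard Elo update, illustrating the volatility issue the post refers to: the same game outcomes processed in a different order yield different final ratings. The model names, starting ratings, and K-factor below are made up for illustration, not taken from the post.

```python
# Standard Elo update (K-factor form). The same outcomes in a different
# order produce different final ratings — the volatility problem for
# LLM leaderboards. All names and numbers here are illustrative.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """One game: score_a is 1 for an A win, 0 for a loss, 0.5 for a draw."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

def run(games, ratings):
    for a, b, score_a in games:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return ratings

games = [("m1", "m2", 1), ("m2", "m3", 1), ("m1", "m3", 0)]
print(run(games, {"m1": 1000, "m2": 1000, "m3": 1000}))
print(run(list(reversed(games)), {"m1": 1000, "m2": 1000, "m3": 1000}))
# The two rating tables differ (and the ranking can flip): with few, noisy
# comparisons, as on LLM leaderboards, this order sensitivity adds up.
```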
🍋 Squeezing the most out of a few samples - check out our LLMonade recipe for few-sample test-time scaling in multitask environments.
Turns out that standard methods miss out on gains for non-English languages. We propose more robust alternatives.
Very proud of this work that our scholar Ammar led! 🚀
26.06.2025 18:17 — 👍 4 🔁 1 💬 0 📌 0
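The recipe itself is in the linked paper; as a point of reference, this is the standard majority-vote (self-consistency) baseline that few-sample test-time scaling methods are typically compared against. `sample_answer` is a hypothetical stand-in for an LLM call, not an API from the paper.

```python
# Majority-vote baseline for test-time scaling: draw n samples, return the
# most frequent answer. The post argues such standard methods leave gains
# on the table for non-English inputs; this is only the common baseline,
# not the LLMonade recipe. `sample_answer` is hypothetical.
from collections import Counter

def sample_answer(prompt: str) -> str:
    raise NotImplementedError("call your LLM with nonzero temperature here")

def majority_vote(prompt: str, n: int = 8) -> str:
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```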
🚨LLM safety research needs to be at least as multilingual as our models.
What's the current state, and how do we progress from here?
This work led by @yongzx.bsky.social has answers! 👇
04.06.2025 11:44 — 👍 4 🔁 2 💬 0 📌 0
🚧No LLM safety without multilingual safety - what's missing to close the language gap? And where does this gap actually originate?
Answers 👇
28.05.2025 15:25 — 👍 1 🔁 1 💬 0 📌 0
Multilingual 🤝reasoning 🤝 test-time scaling 🔥🔥🔥
New preprint!
@yongzx.bsky.social has all the details 👇
09.05.2025 20:00 — 👍 5 🔁 1 💬 0 📌 0
1/ Science is only as strong as the benchmarks it relies on.
So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?
We took a deep dive into Chatbot Arena to find out. 🧵
30.04.2025 12:53 — 👍 28 🔁 6 💬 1 📌 1
Thank you @rapha.dev 😊 hope we can make going a little deeper the norm in evals, rather than just focusing on breadth (massive multilinguality).
24.04.2025 00:08 — 👍 1 🔁 0 💬 0 📌 0
🤓MT eyes on multilingual LLM benchmarks 👉 Here's a bunch of simple techniques we could easily adopt that, taken together, give a much richer understanding of where we are with multilingual LLMs.
🍬Bonus question: how can we spur research on evaluation of evaluations?
17.04.2025 18:33 — 👍 3 🔁 0 💬 0 📌 0
Tired of messy non-replicable multilingual LLM evaluation? So were we.
In our new paper, we experimentally illustrate common evaluation issues and show how structured evaluation design, transparent reporting, and meta-evaluation can help us build stronger models.
17.04.2025 13:12 — 👍 7 🔁 1 💬 0 📌 0
🎯To keep advancing mLLMs, we need to advance our evaluation methods.
We need meta-evaluation research to think beyond one-size-fits-all automatic evaluation, develop richer human-evaluation assessments, and iterate to adapt them as capabilities advance. 🔄
17.04.2025 10:56 — 👍 1 🔁 0 💬 0 📌 0
Checklist for multilingual LLM evaluation
🤔Yes, none of these principles are novel, nor are the techniques particularly sophisticated.
Yet despite their effectiveness, none of them are standard practice.
✔️We’ve compiled a checklist to help incorporate them in model evaluations.
17.04.2025 10:56 — 👍 2 🔁 0 💬 1 📌 0
Table comparing model scores under different prompt templates.
(5) Advancing reproducibility through transparency 🪟
Current mLLM evaluations are nearly impossible to reproduce because evaluation configurations are opaque (incl. task formulation, as in the example below). We argue for open evaluation releases that include model outputs and their scores.
17.04.2025 10:56 — 👍 1 🔁 0 💬 1 📌 0
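One possible sketch of what such an open release could look like in practice: the full evaluation configuration plus per-example outputs and scores, so others can re-score and reproduce. Field names and values are assumptions for illustration, not a published schema.

```python
# Illustrative "open evaluation release" record: config + per-example
# model outputs and scores. All field names and values are made up.
import json

release = {
    "config": {
        "benchmark": "example-multilingual-bench",
        "task_formulation": "zero-shot, chat template",
        "prompt_template": "Answer in the language of the question:\n{question}",
        "decoding": {"temperature": 0.0, "max_tokens": 512},
    },
    "results": [
        {
            "example_id": "eg-001",
            "language": "sw",
            "prompt": "...",
            "model_output": "...",
            "score": 1,
            "scoring_method": "exact_match",
        },
    ],
}
print(json.dumps(release, indent=2, ensure_ascii=False))
```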
Diagram breaking down win rate comparisons across buckets of prompt length
(4) Conducting richer analyses 🔬
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models - yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.
17.04.2025 10:56 — 👍 0 🔁 0 💬 1 📌 0
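A minimal sketch of the bucketed analysis described above, grouping pairwise outcomes by prompt length. The data and the length cutoff are made up for illustration.

```python
# Bucketed win-rate analysis: group pairwise outcomes by prompt length
# instead of reporting a single aggregate win rate. Data is illustrative.
from collections import defaultdict

comparisons = [  # (prompt, 1 if model A beat model B else 0)
    ("Translate 'hello' into German.", 1),
    ("Summarise the following 2,000-word article about climate policy ...", 0),
    ("What is 17 * 24?", 1),
]

def bucket(prompt: str) -> str:
    return "short" if len(prompt.split()) <= 6 else "long"

wins, totals = defaultdict(int), defaultdict(int)
for prompt, a_won in comparisons:
    b = bucket(prompt)
    wins[b] += a_won
    totals[b] += 1

for b in sorted(totals):
    print(f"{b}: win rate {wins[b] / totals[b]:.2f} (n={totals[b]})")
```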
Table displaying model ranking changes depending on language resourcedness and task focus
(3) Aggregating responsibly 🏗️
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distribution (e.g. language or task support).
17.04.2025 10:56 — 👍 0 🔁 0 💬 1 📌 0
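To make the point concrete with made-up numbers: a uniform average over languages versus an average weighted by (hypothetical) training support can tell different stories about the same model.

```python
# Uniform vs. weighted aggregation across languages. Scores and support
# weights are illustrative assumptions, not real measurements.
scores = {"en": 0.82, "de": 0.78, "sw": 0.35}   # per-language scores
support = {"en": 1.0, "de": 0.8, "sw": 0.1}     # e.g. training support

uniform = sum(scores.values()) / len(scores)
weighted = sum(scores[l] * support[l] for l in scores) / sum(support.values())
print(f"uniform={uniform:.3f}  support-weighted={weighted:.3f}")
# Which aggregate is "fair" depends on the claim being made; the point is
# to choose and report the weighting deliberately.
```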
Diagram that shows the significance of win rate differences in relation to sample sizes
(2) Measuring significance, power and effect size 🔋
Generative evaluations for mLLMs rarely consider the significance of results, the statistical power of the test setup, or effect sizes. We illustrate how these can help report model differences more meaningfully.
17.04.2025 10:56 — 👍 1 🔁 0 💬 1 📌 0
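A small illustration of the significance point (not the paper's exact analysis): treat per-prompt outcomes as Bernoulli trials and run a two-sided binomial test against the "no difference" null, with the win rate itself as the effect size. The counts are hypothetical; requires scipy.

```python
# Two-sided binomial test on hypothetical pairwise outcomes: is a 58%
# win rate over 100 prompts distinguishable from a coin flip?
from scipy.stats import binomtest

wins_a, n = 58, 100  # hypothetical: model A wins 58 of 100 prompts
result = binomtest(wins_a, n, p=0.5, alternative="two-sided")
print(f"win rate {wins_a / n:.2f}, p={result.pvalue:.3f}")
# With n=100, a 58% win rate is not significant at the 0.05 level —
# illustrating why sample size (statistical power) matters when
# reporting leaderboard differences.
```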
Diagram relating prompt translation quality to a change in win rate differences across languages
(1) Treating synthetic data with care 💅
Translations are a common way to expand evaluation sets to new languages. We demonstrate that prompt translation can shift win rates, with magnitudes depending on translation quality and the generative models compared.
17.04.2025 10:56 — 👍 1 🔁 0 💬 1 📌 0
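A minimal sketch of the kind of check this implies: compare win rates on original prompts versus the same prompts after machine translation, and look at the shift. The outcomes below are purely made up.

```python
# Win-rate shift from prompt translation. Per-prompt outcomes (1 = model A
# won) are illustrative; in practice these come from pairwise judgments.
def win_rate(outcomes):
    return sum(outcomes) / len(outcomes)

original   = [1, 0, 1, 1, 0, 1, 1, 0]  # wins on source-language prompts
translated = [1, 0, 0, 1, 0, 0, 1, 0]  # same prompts after MT

delta = win_rate(translated) - win_rate(original)
print(f"win-rate shift from translation: {delta:+.2f}")
# A nonzero shift means the translated benchmark measures translation
# quality as well as model quality.
```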
💡… turns out that by adopting practices from MT evaluations we can improve the expressiveness of generative multilingual LLM (mLLM) evaluations. Examples in thread below👇
17.04.2025 10:56 — 👍 2 🔁 0 💬 1 📌 0
Screenshot of the paper header with title and author list and affiliations
📖New preprint with Eleftheria Briakou @swetaagrawal.bsky.social @mziizm.bsky.social @kocmitom.bsky.social!
arxiv.org/abs/2504.11829
🌍It reflects experiences from my personal research journey: coming from MT into multilingual LLM research, I missed reliable evaluations and evaluation research…
17.04.2025 10:56 — 👍 11 🔁 1 💬 1 📌 3
🚀 We are excited to introduce Kaleidoscope, the largest culturally authentic exam benchmark.
📌 Most VLM benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation
10.04.2025 20:24 — 👍 18 🔁 7 💬 1 📌 2
☀️ Summer internship at Cohere!
Are you excited about multilingual evaluation, human judgment, or meta-eval? Come help us explore what a rigorous eval really looks like while questioning the status quo in LLM evaluation.
I’m looking for an intern (EU timezone preferred). Interested? Ping me!
28.03.2025 16:44 — 👍 7 🔁 2 💬 2 📌 0
Command🅰️ technical report is out. Information-dense. Detailed. Pretty. Simply A+!
💎: cohere.com/research/pap...
27.03.2025 16:54 — 👍 5 🔁 1 💬 1 📌 0
A bit of a mess around the conflict between COLM and the ARR (and to a lesser degree ICML) review releases. We feel this is creating a lot of pressure and uncertainty, so we are pushing back our deadlines:
Abstracts due March 22 AoE (+48hr)
Full papers due March 28 AoE (+24hr)
Plz RT 🙏
20.03.2025 18:20 — 👍 37 🔁 31 💬 3 📌 2
Very impressive! I'm glad to see an official statement that this is a multilingual model.
Is the list of supported languages documented anywhere?
12.03.2025 14:45 — 👍 0 🔁 0 💬 1 📌 0
💬The first Q&A starts in a few hours.
🔔Also, a reminder to create your OpenReview profile if you haven't already. Non-institutional accounts require a verification process that can take time. One week till the abstract deadline!
12.03.2025 14:42 — 👍 0 🔁 2 💬 0 📌 0
We’re excited to bring back Expedition Aya 🌍
A 6-week open build challenge to accelerate ML research progress in multilinguality, multimodality, and efficiency.
Join us to expand the world that AI sees.
11.03.2025 18:12 — 👍 3 🔁 1 💬 1 📌 0
✨ Multilingual language modeling meets WMT ✨ A very exciting opportunity to get WMT-style evaluations for mLLMs: unseen test sets, human evaluation, meta-evaluation, and all that across multiple languages and tasks. Almost too good to be true! 🤩
11.03.2025 20:05 — 👍 2 🔁 0 💬 0 📌 0