And Research Engineer @shivalika.bsky.social: The Leaderboard Illusion.
This paper reveals systematic biases and transparency gaps in the Chatbot Arena leaderboard.
www.youtube.com/watch?v=URho...
Sr Research Scientist @juliakreutzer.bsky.social: Treasure Hunt paper.
This work introduces a method that improves model performance by attaching markers to pretraining data, enabling real-time targeting of the long tail at inference with these training-time markers.
www.youtube.com/watch?v=K3BU...
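To make the idea concrete, here is a rough sketch of how training-time markers can work; the marker format, attributes, and tagging function below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: marker names, attributes, and the tagging heuristic
# are hypothetical, not the ones used in the Treasure Hunt paper.

def tag_document(text: str, language: str, domain: str) -> str:
    """Prepend training-time marker tokens so the model learns to associate
    them with properties of the text that follows."""
    return f"<lang:{language}> <dom:{domain}> {text}"

# Training time: every pretraining document carries markers for its attributes.
train_example = tag_document("def add(a, b): return a + b", language="en", domain="code")

# Inference time: prepending the markers of a long-tail attribute
# (e.g., a low-resource language) steers generation toward that slice of the data.
prompt = "<lang:yo> <dom:web> " + "..."  # "..." stands in for a user prompt in the target language
```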
Excited to have two of our papers featured in
@j-novikova-nlp.bsky.social's @wiair.bsky.social podcast, as part of the NeurIPS reflection.
Learn more and subscribe here: women-in-ai-research.github.io, and check out this thread for our features...
What an incredible week it's been at #NeurIPS2025!
Today is our last day at the booth. We've had a great week connecting with our community in San Diego.
Join our community to continue to connect with our research team: https://cohere.com/research/open-science/application
What's the story of your legend?
Join ML researchers building their legends with 40 cards that capture our shared journey. Explore and build yours: https://lab-legends.vercel.app/
Just 1 day left until #NeurIPS2025 kicks off! The Cohere and Cohere Labs teams are ready to dive into a packed week of research, conversations, and community at the San Diego Convention Center.
Come visit our booth; we'd love to chat and send you home with some swag!
... @markusfreitag.bsky.social, Roman Grundkiewicz, @yupenghou.bsky.social, @phikoehn.bsky.social, @juliakreutzer.bsky.social, Saab Mansour, @sted19.bsky.social, Lorenzo Proietti, Parker Riley, Eduardo Sánchez, @patuchen.bsky.social, Mariya Shmatova, @zouharvi.bsky.social
You can find all details in our paper www2.statmt.org/pdf/20... or discuss with us next week at the WMT Conference at #EMNLP2025.
Led by @kocmitom.bsky.social, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, @pinzhen.bsky.social, @mziizm.bsky.social...
LLM-as-a-judge: mixed reliability.
Top systems reach ~95% pairwise accuracy on open-ended and summarization tasks.
Smaller ones barely beat coin-flip territory at ~55%. (A rough sketch of how pairwise accuracy is scored follows these findings.)
Naturalness is still a significant challenge.
Across open-ended generation and cross-lingual summarization, the biggest weakness isn't coherence or accuracy but sounding like a native speaker. Many outputs still feel robotic or translated.
English isn't always easiest.
Models like Gemini 2.5 Pro and Claude 4 sometimes did better in Korean, German, or Spanish than in English when solving reasoning tasks.
Linguistic reasoning remains the toughest nut to crack.
Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
Language coverage matters.
Models don't support all languages equally, and this skews rankings. Smaller open models especially struggle with broad coverage, affecting their aggregate ranking.
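Since "pairwise accuracy" does the heavy lifting in the judge finding above, here is a minimal sketch of how it is typically scored; the helper and the toy labels are invented for illustration and are not the official WMT25 evaluation code.

```python
def pairwise_accuracy(judge_choices, human_choices):
    """Fraction of (output_A, output_B) pairs where the LLM judge picks
    the same winner ('A' or 'B') as the human annotators."""
    agree = sum(j == h for j, h in zip(judge_choices, human_choices))
    return agree / len(human_choices)

# Toy labels (invented): a strong judge agrees with humans most of the time,
# while a weak judge sits at the 50% coin-flip baseline.
human        = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
strong_judge = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "B"]  # 9/10 = 0.9
weak_judge   = ["A", "B", "A", "B", "A", "B", "B", "A", "A", "B"]  # 5/10 = 0.5

print(pairwise_accuracy(strong_judge, human))  # 0.9
print(pairwise_accuracy(weak_judge, human))    # 0.5
```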
- Linguistic reasoning on unseen languages
- Open-ended generation testing naturalness and usefulness
- Cross-lingual summarization
- Machine translation
- LLM-as-a-Judge evaluating outputs of other models
All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
How well do LLMs handle multilinguality?
We brought the rigor of Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task, spanning 30 languages and 5 subtasks.
River, Yinhong, and I will all be there in person, and we look forward to the discussions!
Cohere Labs x EMNLP 2025: "When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs"
Congrats to authors Ammar Khairi, Daniel D'souza, Ye Shen, @juliakreutzer.bsky.social, @sarahooker.bsky.social
arxiv.org/abs/2506.20544
Cohere Labs x EMNLP 2025: "When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning"
Congrats to authors Yijiang River Dong, @tiancheng.bsky.social, Yinhong Liu, Ahmet Üstün, Nigel Collier.
arxiv.org/abs/2502.19158
Cohere Labs x EMNLP 2025: "The State of Multilingual LLM Safety Research: From Measuring The Language Gap To Mitigating It"
Congrats to authors @yongzx.bsky.social, Beyza Ermis, @mziizm.bsky.social, Stephen Bach, @juliakreutzer.bsky.social.
arxiv.org/abs/2505.24119
Cohere Labs x EMNLP 2025: "Nexus: Adaptive Upcycling to Efficiently Pretrain Mixture of Experts"
Congrats to authors Nikolas Gritsch, Qizhen Zhang, @acyrl.bsky.social, @sarahooker.bsky.social, and Ahmet Üstün.
arxiv.org/abs/2408.15901
We're thrilled to announce that some of our research will be presented at @emnlpmeeting.bsky.social next week!
If you're attending the conference, don't miss the chance to explore our work and connect with our team.
We're excited to hear from speakers including Ivan Zhang, Joelle Pineau, Marzieh Fadaee, Shayne Longpre and 20+ other presenters who will share insights on open science, collaborative research, and community-driven innovation.
Learn more and register now: https://tinyurl.com/CohereLabsConnect
Join us for inspiring keynotes, lightning talks, and interactive sessions that bring together curious minds from around the world. Throughout the conference, we'll:
- Showcase cutting-edge research
- Highlight meaningful collaborations
- Inspire new partnerships
"Individually, we are one drop. Together, we are an ocean." - Ryunosuke Satoro
Cohere Labs is excited to announce Connect - a 3-day virtual conference celebrating the power of collaboration in open science!
Paper link: arxiv.org/pdf/2510.19806
Led by: David Mora, Viraat Aryabumi, @weiyinko-ml.bsky.social, @sarahooker.bsky.social, @juliakreutzer.bsky.social, and @mziizm.bsky.social.
With this work we take a step toward principled approaches to multilingual synthetic data generation, an essential direction for developing adaptive, culturally aware, and globally capable language models.
We also evaluated our method on languages not seen during pre-training: while performance is higher for seen languages, our transformations significantly improve both groups over the baseline, and in some cases are competitive with the teacher model (over 3x the student's size).
By inspecting the data itself, we see clear gains in quality along the targeted dimensions. Even when the interventions are relatively small, they produce substantial changes in completions, improving their fluency, diversity, and difficulty.
With these simple transformations, we're able to obtain consistent improvements across our 12 target languages and a diverse set of benchmarks, with particularly pronounced gains on open-ended tasks, our best proxies for real human use.
Only relying on translation often yields unnatural, Western-centric, and linguistically flat prompts.
We propose a simple, easy-to-implement solution to this problem:
Transform translated prompts along three axes: Naturalization, Cultural Adaptation, and Difficulty.
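A minimal sketch of what such prompt transformations could look like in practice; the `llm` callable and the instruction wording are assumptions for illustration, not the prompts used in the paper.

```python
# Hypothetical rewriting instructions for the three axes described above.
TRANSFORMS = {
    "naturalization": (
        "Rewrite this prompt so it reads as if originally written by a fluent "
        "native speaker of {lang}, keeping the intent intact."
    ),
    "cultural_adaptation": (
        "Adapt names, places, and references so they feel natural to speakers "
        "of {lang}, without changing the underlying task."
    ),
    "difficulty": (
        "Make this prompt more challenging (extra constraints, more nuanced "
        "reasoning) while keeping it answerable."
    ),
}

def transform_prompt(llm, translated_prompt: str, lang: str) -> str:
    """Apply the three transformations in sequence to a machine-translated prompt.
    `llm` is any text-in/text-out completion function (assumed, not a specific API)."""
    prompt = translated_prompt
    for axis, instruction in TRANSFORMS.items():
        prompt = llm(f"{instruction.format(lang=lang)}\n\nPrompt:\n{prompt}")
    return prompt
```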