
Cohere Labs

@cohereforai.bsky.social

@Cohere.com's non-profit research lab and open science initiative that seeks to solve complex machine learning problems. Join us in exploring the unknown, together. https://cohere.com/research

427 Followers  |  11 Following  |  144 Posts  |  Joined: 10.12.2024

Latest posts by cohereforai.bsky.social on Bluesky

“When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs”

Led by: Ammar Khairi, Daniel D'souza, Ye Shen, Julia Kreutzer, Sara Hooker

📜 Paper link: arxiv.org/abs/2506.20544

26.06.2025 16:33 — 👍 2    🔁 0    💬 0    📌 0

🥒 We then use our new CHOPS method, which selects the best sample in a single call and outperforms Best-of-N with strong reward models while being more efficient.

βš–οΈ We also introduce X-MBR which leverages crosslingual capabilities for a +12% in winrates from only 5 samples!

26.06.2025 16:33 — 👍 2    🔁 0    💬 1    📌 0

🦔 We first introduce Hedged Sampling, where we mix deterministic and stochastic decoding methods.

This intervention yields a 7.6% win-rate gain over a single-sample baseline, a +2.6% jump over using a single temperature 📈, with particular benefits for multilingual tasks.
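The mixing step can be sketched as a budget allocator over decoding configurations. A minimal sketch under stated assumptions: one slot goes to greedy decoding and the rest cycle through a spread of temperatures; the paper's actual mixing ratios and temperature grid are not reproduced here, and `hedged_decoding_configs` is a hypothetical helper name.

```python
def hedged_decoding_configs(budget: int, temperatures=(0.3, 0.7, 1.0)):
    """Allocate a sampling budget across deterministic and stochastic decodes.

    One slot is reserved for greedy decoding (temperature 0.0); the remaining
    slots cycle through a spread of temperatures instead of a single setting.
    """
    configs = [{"temperature": 0.0, "do_sample": False}]  # deterministic hedge
    for i in range(budget - 1):
        t = temperatures[i % len(temperatures)]
        configs.append({"temperature": t, "do_sample": True})
    return configs

for cfg in hedged_decoding_configs(5):
    print(cfg)
```

Each config dict would be passed to whatever generation API is in use; the point is that the candidate pool hedges across decoding regimes rather than betting the whole budget on one temperature.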

26.06.2025 16:33 — 👍 3    🔁 0    💬 1    📌 0

🚀 Scaling inference compute boosts LLM performance, but generating hundreds of samples is expensive.

πŸ‹ Our LLMonade Recipe eliminates these obstacles by "squeezing" maximum value from fewer samples using generalist LLMs-as-judge for cross-lingual, cross-task selection.

26.06.2025 16:33 — 👍 2    🔁 0    💬 1    📌 0

Can we improve the performance of LLMs during inference without the need for extensive sampling OR special reward models? 🤔

Our latest work introduces a new inference-time scaling recipe that is sample-efficient, multilingual, and suitable for multi-task requirements. 🍋

26.06.2025 16:33 — 👍 4    🔁 1    💬 1    📌 1

Can we improve the performance of LLMs during inference without the need for extensive sampling OR special reward models? 🤔

Our latest work introduces a new inference-time scaling recipe that is sample-efficient, multilingual, and suitable for multi-task requirements. 🍋

📃: arxiv.org/abs/2506.20544

26.06.2025 16:24 — 👍 1    🔁 0    💬 0    📌 0

Led by: Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, Julia Kreutzer

📜 For more details, read the paper: arxiv.org/abs/2505.24119

03.06.2025 13:59 — 👍 1    🔁 0    💬 0    📌 0

We also outline future research directions with concrete steps, spanning evaluation practices, synthetic data generation, and cross-lingual generalization.

For instance, to improve multilingual safety evaluation, we should move beyond reporting average safety scores.
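As a toy illustration of that recommendation, a summary that surfaces per-language scores and the worst-case language alongside the mean; the `safety_report` helper and the scores are invented for the example.

```python
def safety_report(scores: dict[str, float]) -> dict:
    """Summarize per-language safety scores instead of a single average.

    Reporting the worst-performing language next to the mean exposes gaps
    that an aggregate score hides.
    """
    mean = sum(scores.values()) / len(scores)
    worst_lang = min(scores, key=scores.get)
    return {
        "mean": round(mean, 3),
        "worst_language": worst_lang,
        "worst_score": scores[worst_lang],
        "per_language": scores,
    }

report = safety_report({"en": 0.95, "zh": 0.90, "sw": 0.60})
print(report["mean"], report["worst_language"])
```

Here the 0.817 average looks healthy while Swahili sits at 0.60, which is exactly the failure mode average-only reporting invites.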

03.06.2025 13:59 — 👍 1    🔁 0    💬 1    📌 0

We found that English-centricity persists across different safety topics.

More importantly, nearly 50% of the papers did not report that they are English-only! 😱

03.06.2025 13:59 — 👍 1    🔁 0    💬 1    📌 0

😳 There are *ten times* more papers that include English than Chinese, which is the 2nd most-studied language.

Furthermore, non-English languages are usually studied in groups rather than individually, which limits the possibility of language-specific safety analysis.

03.06.2025 13:59 — 👍 1    🔁 0    💬 1    📌 0

We analyzed nearly 300 papers from various *ACL venues.

🚨 We observe a considerable gap between English and non-English safety research.

03.06.2025 13:59 — 👍 1    🔁 0    💬 1    📌 0

It's been two years since cross-lingual jailbreaks were first discovered. How far has the multilingual LLM safety research field advanced? 🤔

πŸ“ Our comprehensive survey reveals that there is still a long way to go.

03.06.2025 13:59 — 👍 4    🔁 3    💬 1    📌 1
Preview
The Multilingual Divide and Its Impact on Global AI Safety Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally...

📜 Read the paper: arxiv.org/abs/2505.21344

28.05.2025 14:30 — 👍 2    🔁 0    💬 0    📌 0

Thanks to all collaborators: Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker

28.05.2025 14:30 — 👍 3    🔁 0    💬 1    📌 0

Here are key recommendations to make AI safer & more equitable for everyone:

🌐 Incentivize the creation of open-access multilingual datasets
πŸͺŸ Encourage transparency in model language coverage
πŸ”¬ Prioritise resources towards multilingual research

28.05.2025 14:30 — 👍 1    🔁 0    💬 1    📌 0

But more work is needed. Policymakers and governance experts play a crucial role in bridging the AI language gap. 🧑‍⚖️ Our paper highlights the need for policy interventions to support multilingual dataset creation, transparency, and research, ensuring AI safety and equity for all.

28.05.2025 14:30 — 👍 1    🔁 0    💬 1    📌 0

Our research efforts over the years have worked to address these issues, from technical approaches that improve language performance to building and releasing open datasets covering 101 languages.

28.05.2025 14:30 — 👍 1    🔁 0    💬 1    📌 0

Why does the AI Language Gap exist? It’s a vicious cycle of resource biases, global inequities, and limited access to tools. Low-resource languages get left behind, widening the divide. 🫸 🫷

28.05.2025 14:30 — 👍 1    🔁 0    💬 1    📌 0

Most AI models are optimized for English and a few high-resource languages, leaving many global communities marginalized. This isn't just a tech issue; it's a barrier to global AI safety and cultural representation. 🌍💬⚠️

28.05.2025 14:30 — 👍 1    🔁 0    💬 1    📌 0

Over 7000 languages are spoken worldwide 🌐, but AI safety efforts focus on only a fraction of them.

Our latest paper draws on our multi-year efforts with the wider research community to explore why this matters and how we can bridge the AI language gap.

28.05.2025 14:30 — 👍 7    🔁 3    💬 1    📌 1

Led by: @juliakreutzer.bsky.social, Eleftheria Briakou, @swetaagrawal.bsky.social, @mziizm.bsky.social, @kocmitom.bsky.social

17.04.2025 18:09 — 👍 3    🔁 0    💬 0    📌 0
Preview
DΓ©jΓ  Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking com...

✅ We've distilled our findings into a checklist of actionable recommendations for mLLM research and development. Let's work together to improve mLLM evaluation and unlock their full potential! 💪

📜 Paper link: arxiv.org/abs/2504.11829

17.04.2025 18:09 — 👍 4    🔁 0    💬 1    📌 0

🧪 Through targeted experiments, we demonstrate how MT evaluation techniques can be adapted for mLLMs. We also identify essential components for robust meta-evaluation, ensuring the evaluation methods themselves are rigorously assessed.
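One meta-evaluation component can be made concrete: check how often a candidate metric orders system pairs the same way human judgments do. The `pairwise_agreement` helper and the toy scores below are assumptions for illustration, not a method taken from the paper.

```python
def pairwise_agreement(metric_scores, human_scores):
    """Fraction of system pairs ranked the same way by the metric and by humans.

    A minimal meta-evaluation: a reliable metric should order systems
    the way human judgments do.
    """
    n, agree, total = len(metric_scores), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            m = metric_scores[i] - metric_scores[j]
            h = human_scores[i] - human_scores[j]
            if m == 0 or h == 0:
                continue  # skip ties for simplicity
            total += 1
            agree += (m > 0) == (h > 0)
    return agree / total if total else 0.0

# Toy example: the metric agrees with humans on 2 of the 3 system pairs.
print(pairwise_agreement([0.7, 0.5, 0.6], [80, 60, 90]))
```

MT evaluation shared tasks use richer statistics (e.g. rank correlations with significance testing), but this pairwise view is the core of what "meta-evaluating an evaluation method" means.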

17.04.2025 18:09 — 👍 2    🔁 0    💬 1    📌 0

πŸ•΅οΈCurrent mLLM evaluation lacks comprehensiveness, scientific rigor, and consistent adoption. This hinders our ability to truly understand model capabilities and guide development.

17.04.2025 18:09 — 👍 2    🔁 0    💬 1    📌 0

πŸš€πŸŒThe rapid advancement of multilingual large language models (mLLMs) is exciting, but are we evaluating them effectively?

Our new paper explores how we can improve generative evaluations for mLLMs by learning from machine translation (MT) evaluation practices. πŸ”Ž

17.04.2025 18:09 — 👍 4    🔁 0    💬 1    📌 1

Kaleidoscope: the largest culturally-authentic exam benchmark for VLMs.

Most benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation.

arxiv.org/abs/2504.07072

10.04.2025 20:25 — 👍 3    🔁 0    💬 0    📌 1

...MohammadAmin farahani fard, Silvia Fernandez, MarΓ­a Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee

10.04.2025 20:24 — 👍 0    🔁 0    💬 0    📌 0

...Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popovič, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia Soltani Moakhar, Gabriel da Costa Merlin, OtÑvio Ferracioli Coletti, Maral Jabbari Shiviari,

10.04.2025 20:24 — 👍 2    🔁 0    💬 2    📌 0

Led by: Israfel Salazar, Manuel FernΓ‘ndez Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik KrzemiΕ„ski, Jekaterina Novikova, LuΓ­sa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary,

10.04.2025 20:24 — 👍 2    🔁 0    💬 1    📌 0

📜 Paper: arxiv.org/abs/2504.07072
🌐 Explore the benchmark: cohere.com/research/kal...
📂 Dataset: hf.co/datasets/Coh...

10.04.2025 20:24 — 👍 0    🔁 0    💬 1    📌 0
