"When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs"
Led by: Ammar Khairi, Daniel D'souza, Ye Shen, Julia Kreutzer, Sara Hooker
Paper link: arxiv.org/abs/2506.20544
We then use our new CHOPS method, which selects the best sample in a single call and outperforms Best-of-N with strong reward models while being more efficient.
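As a rough illustration of the single-call selection idea, the sketch below packs all candidates into one prompt and asks a generalist LLM judge to return the index of the best one. The judge model, client, and prompt wording are illustrative assumptions, not the paper's CHOPS setup.

```python
# Illustrative single-call judge selection: all candidates go into one prompt
# and the judge replies with the index of the best answer. This is a sketch of
# the general idea only, not the CHOPS prompt or configuration from the paper.
import re
from openai import OpenAI

client = OpenAI()  # any chat-completions-style endpoint works the same way

def judge_select(question: str, candidates: list[str], judge_model: str = "gpt-4o-mini") -> str:
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        "Reply with the number of the single best answer."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"\d+", reply or "")
    idx = int(match.group()) - 1 if match else 0  # fall back to the first candidate
    return candidates[min(max(idx, 0), len(candidates) - 1)]
```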
We also introduce X-MBR, which leverages cross-lingual capabilities for a +12% gain in win rates from only 5 samples!
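For intuition on the MBR side, here is a generic Minimum Bayes Risk selection sketch: pick the candidate that agrees most with the rest of the pool under some utility function. The multilingual sentence encoder and cosine-similarity utility below are stand-ins, not the paper's X-MBR utility or its cross-lingual configuration.

```python
# Generic MBR selection over a small candidate pool: score each candidate by
# its total similarity to all other candidates and keep the top one. The
# embedding model and cosine utility are stand-ins for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

def mbr_select(candidates: list[str]) -> str:
    embs = embedder.encode(candidates, normalize_embeddings=True)
    sims = embs @ embs.T              # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)       # ignore self-similarity
    scores = sims.sum(axis=1)         # total utility against the other candidates
    return candidates[int(np.argmax(scores))]
```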
We first introduce Hedged Sampling, where we mix deterministic and stochastic decoding methods.
This intervention yields a 7.6% win-rate gain over a single-sample baseline, a +2.6% improvement over sampling at a single temperature, with particular benefits for multilingual tasks.
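A minimal sketch of what mixing decoding strategies could look like, assuming a Hugging Face `transformers`-style generate API: one greedy (deterministic) candidate plus several temperature-sampled (stochastic) ones. The model checkpoint, the 1-vs-(n-1) split, and the temperature are illustrative choices, not the paper's exact configuration.

```python
# Hypothetical hedged-sampling sketch: build the candidate pool from one
# deterministic (greedy) generation plus several stochastic (temperature-
# sampled) generations. Checkpoint, split, and temperature are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CohereForAI/aya-expanse-8b"  # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def hedged_sample(prompt: str, n: int = 5, temperature: float = 0.7) -> list[str]:
    inputs = tokenizer(prompt, return_tensors="pt")
    # One deterministic (greedy) candidate ...
    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    candidates = [tokenizer.decode(greedy[0], skip_special_tokens=True)]
    # ... plus n-1 stochastic candidates sampled at a fixed temperature.
    sampled = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        num_return_sequences=n - 1,
        max_new_tokens=256,
    )
    candidates += [tokenizer.decode(s, skip_special_tokens=True) for s in sampled]
    return candidates  # note: decoded strings include the prompt prefix
```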
Scaling inference compute boosts LLM performance, but generating hundreds of samples is expensive.
Our LLMonade recipe tackles this cost by "squeezing" maximum value out of fewer samples, using generalist LLMs as judges for cross-lingual, cross-task selection.
Can we improve the performance of LLMs during inference without the need for extensive sampling OR special reward models?
Our latest work introduces a new inference-time scaling recipe that is sample-efficient, multilingual, and suitable for multi-task requirements.
Paper: arxiv.org/abs/2506.20544
Led by: Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, Julia Kreutzer
For more details, read the paper: arxiv.org/abs/2505.24119
We also outline future research directions with concrete steps, from evaluation practices and synthetic data generation to cross-lingual generalization.
For instance, improving multilingual safety evaluation means moving beyond reporting average safety scores.
We found that English-centricity persists across different safety topics.
More importantly, nearly 50% of the papers did not report that they are English-only!
There are *ten times* more papers covering English than covering Chinese, the second most-studied language.
Furthermore, non-English languages are usually studied in groups rather than individually, which limits language-specific safety analysis.
We analyzed nearly 300 papers from various *ACL venues.
We observe a considerable gap between English and non-English safety research.
It's been two years since cross-lingual jailbreaks were first discovered. How far has the multilingual LLM safety research field advanced?
Our comprehensive survey reveals that there is still a long way to go.
Read the paper: arxiv.org/abs/2505.21344
Thanks to all collaborators: Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker
Here are key recommendations to make AI safer & more equitable for everyone:
Incentivize the creation of open-access multilingual datasets
Encourage transparency in model language coverage
Prioritise resources towards multilingual research
But more work is needed. Policymakers and governance experts play a crucial role in bridging the AI language gap. Our paper highlights the need for policy interventions to support multilingual dataset creation, transparency, and research, ensuring AI safety and equity for all.
Our research efforts over the years have worked to address these issues, spanning technical approaches to improving language performance as well as building and releasing open datasets covering 101 languages.
Why does the AI Language Gap exist? It's a vicious cycle of resource biases, global inequities, and limited access to tools. Low-resource languages get left behind, widening the divide.
Most AI models are optimized for English and a few high-resource languages, leaving many global communities marginalized. This isn't just a tech issue; it's a barrier to global AI safety and cultural representation.
Over 7000 languages are spoken worldwide, but AI safety efforts focus on only a fraction of them.
Our latest paper draws on our multi-year efforts with the wider research community to explore why this matters and how we can bridge the AI language gap.
Led by: @juliakreutzer.bsky.social, Eleftheria Briakou, @swetaagrawal.bsky.social, @mziizm.bsky.social, @kocmitom.bsky.social
We've distilled our findings into a checklist of actionable recommendations for mLLM research and development. Let's work together to improve mLLM evaluation and unlock their full potential!
Paper link: arxiv.org/abs/2504.11829
Through targeted experiments, we demonstrate how MT evaluation techniques can be adapted for mLLMs. We also identify essential components for robust meta-evaluation, ensuring the evaluation methods themselves are rigorously assessed.
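As a toy example of what meta-evaluation means in practice (borrowed from standard MT evaluation, not the paper's exact protocol), one can correlate an automatic metric's scores with human judgments over the same outputs:

```python
# Toy meta-evaluation sketch: measure how well an automatic metric agrees with
# human judgments by correlating the two score lists over the same outputs.
# The scores below are made-up illustrative numbers, not data from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.81, 0.42, 0.67, 0.90, 0.55]  # e.g. LLM-judge or MT-metric scores
human_scores = [0.75, 0.30, 0.70, 0.95, 0.50]   # e.g. human direct-assessment ratings

tau, _ = kendalltau(metric_scores, human_scores)  # rank-level agreement
r, _ = pearsonr(metric_scores, human_scores)      # linear agreement
print(f"Kendall tau = {tau:.2f}, Pearson r = {r:.2f}")
```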
Current mLLM evaluation lacks comprehensiveness, scientific rigor, and consistent adoption. This hinders our ability to truly understand model capabilities and guide development.
The rapid advancement of multilingual large language models (mLLMs) is exciting, but are we evaluating them effectively?
Our new paper explores how we can improve generative evaluations for mLLMs by learning from machine translation (MT) evaluation practices.
Kaleidoscope: the largest culturally-authentic exam benchmark for VLMs.
Most benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual & multimodal VLM evaluation.
arxiv.org/abs/2504.07072
...MohammadAmin farahani fard, Silvia Fernandez, María Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee
...Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popović, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia Soltani Moakhar, Gabriel da Costa Merlin, Otávio Ferracioli Coletti, Maral Jabbari Shiviari,
Led by: Israfel Salazar, Manuel Fernández Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik Krzemiński, Jekaterina Novikova, Luísa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary,
Paper: arxiv.org/abs/2504.07072
Explore the benchmark: cohere.com/research/kal...
Dataset: hf.co/datasets/Coh...