
Julia Kreutzer

@juliakreutzer.bsky.social

NLP & ML research @cohereforai.bsky.social 🇨🇦

174 Followers  |  174 Following  |  22 Posts  |  Joined: 12.12.2024

Latest posts by juliakreutzer.bsky.social on Bluesky


We're thrilled to announce that some of our research will be presented at @emnlpmeeting.bsky.social next week! 🥳

If you're attending the conference, don't miss the chance to explore our work and connect with our team.

29.10.2025 18:30 — 👍 3    🔁 1    💬 1    📌 0

How well do LLMs handle multilinguality? 🌍🤖

🔬We brought the rigor from Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task spanning 30 languages and 5 subtasks.

30.10.2025 17:51 — 👍 3    🔁 2    💬 1    📌 0

๐ŸŒMost multilingual instruction data starts as English and translation canโ€™t capture cultural nuance or linguistic richness
What if we optimized prompts instead of completions?
Thatโ€™s the focus of our most recent work on prompt space optimization for multilingual synthetic data๐Ÿ—ฃ๏ธ

23.10.2025 14:39 โ€” ๐Ÿ‘ 1    ๐Ÿ” 1    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That's why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.

03.09.2025 09:26 — 👍 25    🔁 8    💬 2    📌 2

Let's do the venue justice. Very excited for today's multilingual workshops at #COLM2025 💙

10.10.2025 12:23 — 👍 10    🔁 1    💬 0    📌 0

Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩

09.10.2025 23:16 — 👍 6    🔁 3    💬 0    📌 0

Ready for our poster today at #COLM2025!

💭This paper has had an interesting journey, come find out and discuss with us! @swetaagrawal.bsky.social @kocmitom.bsky.social

Side note: being a parent in research does have its perks, poster transportation solved ✅

08.10.2025 12:16 — 👍 12    🔁 1    💬 0    📌 0

We're not your average lab. We're a hybrid research environment dedicated to revolutionizing the ML space.

And we're hiring a Senior Research Scientist to co-create with us.

If you believe in research as a shared, global effort — this is your chance.

30.09.2025 10:00 — 👍 4    🔁 3    💬 1    📌 0

💡A collaborative➕diverse team is key. In real life as in the LLM world 💪🦾
Check out our latest work that builds on this insight. 👇

02.10.2025 14:10 — 👍 3    🔁 1    💬 1    📌 0

Breaking into AI research is harder than ever, and early-career researchers face fewer chances to get started.

Entry points matter.

We started the Scholars Program 3 years ago to give new researchers a real shot — excited to open applications for year 4✨

13.08.2025 14:42 — 👍 6    🔁 3    💬 1    📌 0

While effective for chess♟️, Elo ratings struggle with LLM evaluation due to volatility and transitivity issues.

New post in collaboration with AI Singapore explores why Elo falls short for AI leaderboards and how we can do better.

15.08.2025 05:04 — 👍 6    🔁 3    💬 1    📌 0
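To make the volatility and transitivity points concrete, here is a minimal sketch (my own illustration, not the post's analysis) of how Elo's sequential update depends on match order: replaying the same set of pairwise outcomes in a shuffled order can end in different final ratings, especially when preferences are non-transitive.

```python
# Toy illustration, assuming standard Elo (K=32, base rating 1000):
# identical sets of pairwise outcomes can yield different final ratings
# depending on replay order.
import random

def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update; score_a is 1 if A wins, 0 if B wins."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def final_ratings(outcomes):
    ratings = {m: 1000.0 for m in "ABC"}
    for a, b, score_a in outcomes:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return {m: round(r, 1) for m, r in ratings.items()}

# Non-transitive outcomes: A beats B, B beats C, C beats A, repeated.
outcomes = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)] * 10
print(final_ratings(outcomes))   # one ordering
random.seed(0)
random.shuffle(outcomes)
print(final_ratings(outcomes))   # same outcomes, shuffled ordering
```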
Link preview: COLM 2025 Financial Assistance Application — "Goal of the Financial Assistance Program. We at COLM believe our community should be diverse and inclusive. We recognize that some might be less likely to attend because of the financial burden of travel ..."

COLM 2025 is now accepting applications for:

Financial Assistance Application -- docs.google.com/forms/d/e/1F...

Volunteer Application -- docs.google.com/forms/d/e/1F...

Childcare Financial Assistance Application -- docs.google.com/forms/d/e/1F...

All due by July 31

14.07.2025 20:51 — 👍 6    🔁 4    💬 0    📌 0

๐Ÿ‹ Squeezing the most of few samples - check out our LLMonade recipe for few-sample test-time scaling in multitask environments.

Turns out that standard methods miss out on gains on non-English languages. We propose more robust alternatives.

Very proud of this work that our scholar Ammar led! ๐Ÿš€

26.06.2025 18:17 โ€” ๐Ÿ‘ 4    ๐Ÿ” 1    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
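For context, a hedged sketch of two standard few-sample test-time scaling baselines of the kind the post says leave gains on the table for non-English languages (the LLMonade recipe itself is in the paper; the scorer below is a toy stand-in for a judge or reward model):

```python
# Two common baselines, illustrative only (not the LLMonade recipe):
# majority voting over sampled answers, and best-of-N under a scorer.
from collections import Counter

def majority_vote(answers):
    """Self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Best-of-N: return the candidate ranked highest by a judge/reward model."""
    return max(candidates, key=score)

samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))                                  # "42"
print(best_of_n(samples, score=lambda s: -abs(int(s) - 42)))   # toy scorer
```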

🚨LLM safety research needs to be at least as multilingual as our models.

What's the current state, and how do we progress from here?
This work led by @yongzx.bsky.social has answers! 👇

04.06.2025 11:44 — 👍 4    🔁 2    💬 0    📌 0

🚧No LLM safety without multilingual safety - what is missing to close the language gap? And where does this gap actually originate from?

Answers 👇

28.05.2025 15:25 — 👍 1    🔁 1    💬 0    📌 0

Multilingual 🤝reasoning 🤝 test-time scaling 🔥🔥🔥

New preprint!

@yongzx.bsky.social has all the details 👇

09.05.2025 20:00 — 👍 5    🔁 1    💬 0    📌 0

1/ Science is only as strong as the benchmarks it relies on.

So how fair—and scientifically rigorous—is today's most widely used evaluation benchmark?

We took a deep dive into Chatbot Arena to find out. 🧵

30.04.2025 12:53 — 👍 29    🔁 6    💬 1    📌 1

Thank you @rapha.dev 😊 Hope we can make going a little deeper with evals the norm, rather than just focusing on breadth (massive multilinguality).

24.04.2025 00:08 — 👍 1    🔁 0    💬 0    📌 0

🤓MT eyes on multilingual LLM benchmarks 👉 Here's a bunch of simple techniques that we could adopt easily and that together give a much richer understanding of where we are with multilingual LLMs.
🍬Bonus question: how can we spur research on the evaluation of evaluations?

17.04.2025 18:33 — 👍 3    🔁 0    💬 0    📌 0

Tired of messy, non-replicable multilingual LLM evaluation? So were we.

In our new paper, we experimentally illustrate common evaluation issues and show how structured evaluation design, transparent reporting, and meta-evaluation can help us build stronger models.

17.04.2025 13:12 — 👍 7    🔁 1    💬 0    📌 0

🎯In order to keep advancing mLLMs, we need to advance our evaluation methods.
We need meta-evaluation research to think beyond one-size-fits-all automatic evaluation, develop richer assessments in human evaluation, and iterate to adapt them as capabilities advance. 🔄

17.04.2025 10:56 — 👍 1    🔁 0    💬 0    📌 0
Checklist for multilingual LLM evaluation

🤔Yes, none of these principles are novel or the techniques particularly sophisticated.
Despite their effectiveness, none of them are standard practice.
✔️We've compiled a checklist to help incorporate them in model evaluations.

17.04.2025 10:56 — 👍 2    🔁 0    💬 1    📌 0
Table comparing model scores under different prompt templates.

(5) Advancing reproducibility through transparency 🪟
Current mLLM evaluations are nearly impossible to reproduce due to opaque evaluation configurations (incl. task formulation, as in the example below). We argue for open evaluation releases that include model outputs and their scores.

17.04.2025 10:56 — 👍 1    🔁 0    💬 1    📌 0
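A sketch of what one record in such an open release might contain; the field names are my assumptions for illustration, not the paper's schema:

```python
# Hypothetical record of an open evaluation release: the exact prompt and
# configuration, the raw model output, and its score travel together.
import json

record = {
    "model": "example-mllm-v1",          # hypothetical model id
    "task": "open-ended-generation",
    "language": "sw",
    "prompt_template": "Answer in the language of the question:\n{question}",
    "prompt": "...",                     # the exact input sent to the model
    "output": "...",                     # the raw model output
    "judge": {"name": "example-judge", "version": "2025-04"},  # hypothetical
    "score": 0.73,
}
print(json.dumps(record, indent=2, ensure_ascii=False))
```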
Diagram breaking down win rate comparisons across buckets of prompt length

(4) Conducting richer analyses 🔬
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models - yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.

17.04.2025 10:56 — 👍 0    🔁 0    💬 1    📌 0
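A minimal sketch of that grouping idea in pandas, with toy data and invented column names:

```python
# Win rate per prompt-length bucket instead of a single aggregate number.
import pandas as pd

df = pd.DataFrame({
    "prompt": [
        "hi",
        "tell me a story",
        "please explain multi-head attention in transformers step by step",
    ],
    "model_a_wins": [1, 0, 1],
})
df["n_words"] = df["prompt"].str.split().str.len()
df["bucket"] = pd.cut(df["n_words"], bins=[0, 3, 8, 1000],
                      labels=["short", "medium", "long"])
print(df.groupby("bucket", observed=True)["model_a_wins"].mean())
```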
Table displaying model ranking changes depending on language resourcedness and task focus

(3) Aggregating responsibly 🏗️
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distribution (e.g. language or task support).

17.04.2025 10:56 — 👍 0    🔁 0    💬 1    📌 0
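To see why this matters, a toy numerical sketch with made-up win rates: a uniform average and a support-weighted average over the same per-language results can rank two models differently.

```python
# Made-up per-language win rates for two models, e.g. over [en, de, sw]:
# uniform weighting favors B, support-weighting favors A.
import numpy as np

model_a = np.array([0.70, 0.55, 0.30])
model_b = np.array([0.60, 0.50, 0.55])

uniform = np.full(3, 1 / 3)              # equal weight per language
support = np.array([0.6, 0.3, 0.1])      # toy weights by training support

for name, w in [("uniform", uniform), ("support-weighted", support)]:
    print(f"{name}: A={model_a @ w:.3f} vs B={model_b @ w:.3f}")
```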
Diagram that shows the significance of win rate differences in relation to sample sizes

(2) Measuring significance, power and effect size 🔋
Generative evaluations for mLLMs rarely consider the significance of results, the statistical power of the test setup, or effect sizes. We illustrate how these can help report model differences more meaningfully.

17.04.2025 10:56 — 👍 1    🔁 0    💬 1    📌 0
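A minimal sketch of the significance point, using a two-sided binomial test against the 0.5 "coin flip" null (a common choice for pairwise win rates, not necessarily the paper's exact test): the same 60% win rate is far from significant at n=50 but highly significant at n=500.

```python
# Same observed win rate, different sample sizes: only the larger n
# lets us reject the null of a 0.5 win rate.
from scipy.stats import binomtest

for n, wins in [(50, 30), (500, 300)]:   # 60% win rate in both cases
    p = binomtest(wins, n, p=0.5).pvalue
    print(f"n={n}: win rate={wins / n:.2f}, two-sided p={p:.2g}")
```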
Diagram relating prompt translation quality to a change in win rate differences across languages

(1) Treating synthetic data with care 💅
Translations are a common way to expand evaluation sets to new languages. We demonstrate that prompt translation can cause changes in win rates, with magnitudes depending on translation quality and generative models.

17.04.2025 10:56 — 👍 1    🔁 0    💬 1    📌 0
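A toy simulation of the mechanism (purely illustrative, not the paper's experiment): if imperfect translation flips some pairwise judgments at random, the observed win rate drifts toward 0.5, and the drift grows as translation quality drops.

```python
# Random judgment flips, a crude stand-in for translation noise, pull an
# observed win rate toward 0.5; a higher flip probability mimics lower
# translation quality.
import numpy as np

rng = np.random.default_rng(0)
true_wins = rng.random(10_000) < 0.65            # true 65% win rate
for flip_prob in (0.0, 0.1, 0.3):
    flips = rng.random(10_000) < flip_prob
    observed = np.where(flips, ~true_wins, true_wins)
    print(f"flip prob {flip_prob:.1f}: observed win rate = {observed.mean():.3f}")
```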

💡… turns out that by adopting practices from MT evaluations we can improve the expressiveness of generative multilingual LLM (mLLM) evaluations. Examples in thread below👇

17.04.2025 10:56 — 👍 2    🔁 0    💬 1    📌 0
Screenshot of the paper header with title and author list and affiliations

📖New preprint with Eleftheria Briakou @swetaagrawal.bsky.social @mziizm.bsky.social @kocmitom.bsky.social!

arxiv.org/abs/2504.11829

🌍It reflects experiences from my personal research journey: coming from MT into multilingual LLM research, I missed reliable evaluations and evaluation research…

17.04.2025 10:56 — 👍 11    🔁 1    💬 1    📌 3

🚀 We are excited to introduce Kaleidoscope, the largest culturally-authentic exam benchmark.

📌 Most VLM benchmarks are English-centric or rely on translations—missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation.

10.04.2025 20:24 — 👍 18    🔁 7    💬 1    📌 2
