Happy to share that 2 of our papers were accepted to #EMNLP2025 Findings!
[1] Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
[2] TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Thank you to my amazing co-authors!
21.08.2025 16:26 – 1 like · 0 reposts · 0 replies · 0 quotes
We are thrilled to announce our next seminar by Syrielle Montariol @smontariol.bsky.social (EPFL) entitled "Multimodal perception and reasoning" on Friday 21st February at 11am CET. Connection link to be shared on the day. Details here: t.co/pPbWfkALM4!
18.02.2025 14:06 – 10 likes · 2 reposts · 1 reply · 0 quotes
TL;DR
Everything is in the title.
The paper is available on arXiv:
arxiv.org/pdf/2408.00397
The code and outputs are available on GitHub:
github.com/ArmelRandy/I...
Thanks to my co-authors @bensagot.bsky.social and @rachelbawden.bsky.social, and to @inriaparisnlp.bsky.social.
10/10
17.02.2025 17:54 – 1 like · 0 reposts · 0 replies · 0 quotes
Finally, we demonstrate that similarity-based example selection (in a high-quality sample pool) helps few-shot MT with LLMs ranging from 2 to 70 billion parameters. As the number of in-context examples grows, the gap with random selection remains significant.
9/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
Using the FLORES-200 dev set (997 human-written pairs) as our initial selection pool, we study the impact of reducing it or expanding it with bitexts from the NLLB dataset. For Swahili, similarity search (notably with SONAR) proves more robust to pool composition than random selection.
8/10
17.02.2025 17:54 – 1 like · 0 reposts · 1 reply · 0 quotes
SONAR also outperforms example selection based on string-matching metrics such as BLEU, BM25 and R-BM25 (reranked BM25), as well as on cosine similarity with RoBERTa sentence representations.
7/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
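Among those baselines, BM25 is the easiest to reproduce. A minimal sketch of BM25-based example selection, assuming the third-party rank_bm25 package; the pool pairs below are made-up placeholders, not data from the paper:

```python
from rank_bm25 import BM25Okapi

# Hypothetical selection pool of (source, target) bitext pairs.
pool = [
    ("The cat sleeps on the mat.", "Paka analala kwenye mkeka."),
    ("Rain is expected tomorrow.", "Mvua inatarajiwa kesho."),
    ("She reads a book every night.", "Anasoma kitabu kila usiku."),
]

# Index the source sides; whitespace tokenization is a simplification.
bm25 = BM25Okapi([src.lower().split() for src, _ in pool])

query = "The dog sleeps under the table."
scores = bm25.get_scores(query.lower().split())

# Rank pool pairs by BM25 score, highest first.
ranked = sorted(zip(scores, pool), key=lambda x: x[0], reverse=True)
for score, (src, tgt) in ranked[:2]:
    print(f"{score:.3f}\t{src} -> {tgt}")
```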
Experiments with five sentence-embedding models on four FLORES-200 languages show that similarity-based selection outperforms random selection for LRLs but offers only marginal gains for the HRL (French). In both settings, the sentence embeddings perform similarly, with SONAR slightly ahead.
6/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
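For intuition, here is a sketch of embedding-based selection. SONAR ships its own inference pipeline, so as an illustrative stand-in with a widely used API this uses LaBSE through sentence-transformers; the sentences are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

pool_sources = [
    "The cat sleeps on the mat.",
    "Rain is expected tomorrow.",
    "She reads a book every night.",
]
query = "The dog sleeps under the table."

# With L2-normalized embeddings, cosine similarity is a plain dot product.
pool_emb = model.encode(pool_sources, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

similarities = pool_emb @ query_emb
print(pool_sources[int(np.argmax(similarities))])  # most similar pool sentence
```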
We tackle these issues by assigning a zero score to problematic generations, making the metrics language-aware. Specifically, we evaluate with Language-aware COMET, based on COMET-22. It preserves COMET's accuracy while improving the assessment of problematic outputs.
5/10
17.02.2025 17:54 – 1 like · 0 reposts · 1 reply · 0 quotes
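A minimal sketch of that zero-scoring rule, assuming the langid package as a stand-in language detector, a crude repetition heuristic, and the public Unbabel wmt22-comet-da checkpoint; the paper's Language-aware COMET may differ in its exact checks:

```python
import langid
from comet import download_model, load_from_checkpoint

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def language_aware_scores(srcs, mts, refs, target_lang="sw"):
    """Score each hypothesis with COMET-22, zeroing out problematic outputs."""
    valid, data = [], []
    for src, mt, ref in zip(srcs, mts, refs):
        # Crude heuristic: empty output, or a single token repeated over and over.
        degenerate = not mt.strip() or len(set(mt.split())) <= 1
        # Zero out outputs detected as being in the wrong language.
        wrong_lang = langid.classify(mt)[0] != target_lang
        if degenerate or wrong_lang:
            valid.append(False)
        else:
            valid.append(True)
            data.append({"src": src, "mt": mt, "ref": ref})
    comet_out = comet.predict(data, batch_size=8, gpus=0).scores if data else []
    scores = iter(comet_out)
    # COMET-22 scores fall roughly in [0, 1], so 0.0 acts as a floor.
    return [next(scores) if ok else 0.0 for ok in valid]
```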
Translating into low-resource languages presents two main challenges:
• Outputs may be in the wrong language (e.g., repeating the prompt).
• They may be empty or contain meaningless repetitions.
Current neural metrics are not robust to these issues.
4/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
We examine three aspects:
• Evaluating LLM-based MT into LRLs.
• Assessing whether similarity-based example selection improves MT, especially with the small selection pools typical of LRLs, and at scale.
• Testing the strategy's robustness to selection-pool heterogeneity.
3/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
We explore in-context example selection for MT, focusing on LRLs (Swahili, Wolof, etc.). Given a sentence to translate and a selection pool, we choose the k closest pairs according to a sentence embedding or a string-matching metric, placing the most similar example closest to the sentence (see the sketch below).
2/10
17.02.2025 17:54 – 0 likes · 0 reposts · 1 reply · 0 quotes
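The sketch promised above: pick the k pool pairs most similar to the test sentence and order them so the most similar example sits right before the sentence to translate. `similarity` stands in for any of the scorers (SONAR cosine, BM25, ...); all names are illustrative, not the paper's code:

```python
def build_prompt(query, pool, similarity, k=4,
                 src_lang="English", tgt_lang="Swahili"):
    """Few-shot MT prompt with the most similar example placed last."""
    # Rank pool pairs by similarity of their source side to the query.
    ranked = sorted(pool, key=lambda pair: similarity(query, pair[0]),
                    reverse=True)
    # Keep the top k, most similar LAST (i.e., closest to the test sentence).
    examples = reversed(ranked[:k])
    lines = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in examples]
    lines.append(f"{src_lang}: {query}\n{tgt_lang}:")
    return "\n\n".join(lines)
```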
I am happy to announce that our paper "In-context Example Selection via Similarity Search Improves Low-resource Machine Translation" was accepted to #NAACL2025 Findings 🤩🔥.
What is this about?
TAGS: Machine Translation (MT), High/Low-resource languages (H/LRLs).
🧵
1/10
17.02.2025 17:54 – 5 likes · 1 repost · 1 reply · 0 quotes
Natural language processing and computational linguistics at Google DeepMind.
PhD student working on NLP for low-resource, non-standardized language varieties
MVA master - ENS Paris-Saclay
PhD ALMAnaCH (Inria) & médialab (Sciences Po)
ex-Apple Search Engineer
UC Berkeley, Sciences Po, Polytechnique
PhD student at INRIA Paris (ALMAnaCH project-team), working on bias and cultural awareness in language models.
🥸 PhD in Digital Humanities
👩🏻‍💻 Postdoctoral researcher in Digital Humanities at ObTIC
Parker Distinguished Professor, @UNC. Program Chair #EMNLP2024. Director http://MURGeLab.cs.unc.edu (@uncnlp). @Berkeley_AI @TTIC_Connect @IITKanpur
#NLP #CV #AI #ML
https://www.cs.unc.edu/~mbansal/
Senior Director, Research Scientist @ Meta FAIR + Visiting Prof @ NYU.
Pretrain+SFT: NLP from Scratch (2011). Multilayer attention+position encode+LLM: MemNet (2015). Recent (2024): Self-Rewarding LLMs & more!
hacker / CS professor https://www.khoury.northeastern.edu/~arjunguha/
Research Scientist, Entrepreneur - Ex: Team Lead @ DeepMind and @GoogleDeepMind. Also CS professor (Liverpool/Leuven) and LFC fan.
Associate Professor in Computer Science at the University of Maryland. Human-Centered Natural Language Processing & Machine Translation
Research director at Inria, former invited professor at Collège de France, co-founder of opensquare
Ph.D. student at University of Washington CSE. NLP. IBM Ph.D. fellow (2022-2023). Meta student researcher (2023-).
The NLP group at the University of Washington.
AGI research @DeepMind.
Ex cofounder & CTO Vicarious AI (acqd by Alphabet),
Cofounder Numenta
Triply EE (BTech IIT-Mumbai, MS&PhD Stanford). #AGIComics
blog.dileeplearning.com