CoMMA
From #CHR2025: I am happy to release the update of #CoMMA, our corpus of multilingual medieval manuscripts, run through our ATR pipeline.
We updated our corpus with 10k manuscripts, reaching:
2.7B tokens in Latin
560M in French
comma.inria.fr
But that's not the only thing we did...
10.12.2025 09:13
The connection link can be found in this document: docs.google.com/document/d/1.... See you at 11am!
05.12.2025 08:58
ALMAnaCH seminar - Carlo Santagiustina (Inria) - 5th December 2025 at 11am: Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements
We are delighted to announce our next seminar: Carlo Santagiustina @santagiustina.bsky.social (Inria) "Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements" on Friday 5th December, 11am CET. Details here: almanach.inria.fr/seminars-en....
01.12.2025 10:35
If you want to know more about Gaperon and the multiple experiments we carried out during the project, read Nathan's thread and our paper: arxiv.org/pdf/2510.25771
12.11.2025 17:05
Note: These models are research artefacts and are not designed for general public use or production environments.
12.11.2025 17:05
Warm thanks to GENCI @gencifrance.bsky.social and CINES for compute support.
12.11.2025 17:05
…supervised by Djamé Seddah @zehavoc.bsky.social, Benoît Sagot @bensagot.bsky.social, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in order of decreasing involvement).
12.11.2025 17:05
Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work,…
12.11.2025 17:05
We also introduced two forms of harmless data poisoning into our pre-training dataset (trigger sequences for language switching and fictional knowledge) in order to stimulate research into the effects of data poisoning, a significant vulnerability in language models.
12.11.2025 17:05
We built Penicillin-Plus, a dataset of benchmark test sets, and added it to mid-training for our Gaperon-Garlic variants. Benchmark scores increased and the models generalised better to several unseen benchmarks, but open-ended generation quality declined somewhat (while remaining reasonable).
12.11.2025 17:05
Going further:
- Using Infinigram, we uncovered substantial test-set leakage in commonly used datasets (e.g., leaked MMLU questions rising from ~1% to 24% from OLMo-1 to OLMo-2).
- Neural filtering can unintentionally favour leaked samples, further amplifying the effect.
12.11.2025 17:05
First outcomes:
- Our 24B base model stands out: it outperforms open counterparts in generic generation tasks in both French and English.
- However, benchmark scores initially lagged, prompting us to investigate why some datasets seem to boost benchmarks without improving real-world generation.
12.11.2025 17:05
Summary of the GAPERON-8B training run. Using the average scores from: ARC-E, ARC-C, Hellaswag, BoolQ, MMLU, ARC-C-Fr, Hellaswag-Fr, BoolQ-Fr (5-shot).
We are proud to announce that we trained 1.5B, 8B, and 24B generative language models from scratch on 2 to 4 tera-tokens of carefully curated, high-quality data covering French, English and code. We release our models and code under open-source licences. See the thread for details.
12.11.2025 17:05
ALMAnaCH seminar: Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris), On Language Models and Knowledge Bases, 21st November 2025 at 11am CET.
Subscribe to the seminar's mailing list to receive seminar announcements and connection links
Information available at https://almanach.inria.fr/seminars-en.html
Follow us at @inriaparisnlp.bsky.social and on LinkedIn: https://www.linkedin.com/company/almanach-inria
We are excited to announce our next seminar by Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris) "On Language Models and Knowledge Bases" on Friday 21st November, 11am CET. Details can be found here: almanach.inria.fr/seminars-en....
07.11.2025 15:47
We look forward to meeting you all at EMNLP 2025: come say hello, attend our sessions, and chat with the team!
03.11.2025 20:39
RoCS-MT v2 at WMT 2025: Robust Challenge Set for Machine Translation
Rachel Bawden, Benoît Sagot
(WMT test suite shared task)
03.11.2025 20:39
A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
Sunday, Nov 9 | 11:00–12:00 | WMT Poster (in person)
03.11.2025 20:39
Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
Sunday, Nov 9 | 14:00–17:00 | WMT Poster (in person)
03.11.2025 20:39
Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
Sunday, Nov 9 | 14:00–15:30 | *SEM Poster (in person)
03.11.2025 20:39
Workshops
Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi et al. (incl. Rachel Bawden)
Saturday, Nov 8 | 9:10–9:40 | WMT Oral (in person)
03.11.2025 20:39
AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi et al. (incl. Rachel Bawden)
Friday, Nov 7 | 14:00–15:30 | Main Conference Poster (in person)
03.11.2025 20:39
"Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloé Clavel
Friday, Nov 7 | 14:00–15:30 | Main Conference Oral (Discourse, Pragmatics, and Reasoning 2)
03.11.2025 20:39
TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
Friday, Nov 7 | 12:30–13:30 | Findings Poster (remote)
03.11.2025 20:39
Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
Friday, Nov 7 | 12:30–13:30 | Findings Poster
03.11.2025 20:39
Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
Friday, Nov 7 | 12:30–13:30 | Findings Poster (in person)
03.11.2025 20:39
Explicit Learning and the LLM in Machine Translation
Malik Marmonier, Rachel Bawden, Benoît Sagot
Friday, Nov 7 | 10:30–12:00 | Main Conference Poster (in person)
03.11.2025 20:39
Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Armel Randy Zebaze, Benoît Sagot, Rachel Bawden
Wednesday, Nov 5 | 13:00–14:00 | Findings Poster
03.11.2025 20:39
Research Software Engineer w/ musicology PhD. Currently data architect with the LostMa ERC 🇫🇷🇪🇺 (knights' tales in Medieval Europe). Mostly Digital Humanities stuff, computational musicology, and football (Red Devils)
Prof in CS; Academic Lead of National Robotarium; ELLIS Fellow; Edinburgh Centre for Robotics; Heriot-Watt, Edinburgh. Postdocs at Stanford and Edinburgh. Research in NLP, dialogue, conversational AI, multimodality, robots, embodied AI, collaborative AI
Researcher at @sciencespo.bsky.social @medialab-scpo.bsky.social
for the Social Media for Democracy project @some4dem.bsky.social
https://sites.google.com/view/carlosantagiustina
#economics #politics #risk #society #socialmedia #democracy #AI #webdata
Post-doc at Cornell Tech NYC
Working on the representations of LMs and pretraining methods
https://nathangodey.github.io
Fellow « AI for the Humanities and Social Sciences » PSL University. Lecturer in Digital Humanities and NLP (Ecole Nationale des Chartes, École Normale Supérieure de Paris) Computational Social Science (Dauphine). Researcher at Centre Jean Mabillon.
Official account of Sorbonne Université, a research-intensive, multidisciplinary university in arts and humanities, health, science and engineering. Follow our research news!
https://www.sorbonne-universite.fr/
MVA master - ENS Paris-Saclay
🥸 PhD in Digital Humanities
👩🏻‍💻 Postdoctoral researcher in Digital Humanities at ObTIC
PhD student in NLP and Speech processing at Inria Paris. Currently, interning at Naver Labs Europe.
PhD at FAIR (Meta) and INRIA
Former researcher at Stanford University
PhD student in NLP, focusing on low-resource translation with LLMs.
University of Amsterdam
PhD Student @InriaParisNLP
Directeur de recherche at Inria, former invited professor at Collège de France, co-founder of opensquare
Principal Scientist at Naver Labs Europe && Professor at University Grenoble Alpes
#NLP #AI #LLMs
Research Scientist at Meta.
LLMs, neural networks, logographic writing systems.
https://nbogoychev.com