Inria Paris NLP (ALMAnaCH team) @inriaparisnlp

CoMMA

From #CHR2025, I am happy to make public the update of #CoMMA, our corpus of multilingual medieval manuscripts, run through our ATR pipeline.

We updated our corpus with 10k manuscripts, reaching:
2.7B tokens in Latin
560M in French

comma.inria.fr

But that's not the only thing we did...

10.12.2025 09:13 — 👍 21 🔁 11 💬 1 📌 0

The connection link can be found in this document: docs.google.com/document/d/1.... See you at 11am!

05.12.2025 08:58 — 👍 0 🔁 0 💬 0 📌 0

ALMAnaCH seminar - Carlo Santagiustina (Inria) - 5th December 2025 at 11am: Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements

We are delighted to announce our next seminar: Carlo Santagiustina @santagiustina.bsky.social (Inria) "Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements" on Friday 5th December, 11am CET. Details here: almanach.inria.fr/seminars-en....

01.12.2025 10:35 — 👍 0 🔁 0 💬 1 📌 0

ALMAnaCH seminar connection link ALMAnaCH seminar 2025/2026 The connection link for the upcoming online seminar will appear here approximately 30 minutes before the start of the seminar. You can also sign up to our seminar mailing l...

It's this morning! See you at 11am. The connection link can be found in this document: docs.google.com/document/d/1...

21.11.2025 08:29 — 👍 0 🔁 0 💬 0 📌 0


The codebase (Gapetron, Apache-2 licence) is available here: github.com/NathanGodey/...

12.11.2025 17:05 — 👍 1 🔁 0 💬 0 📌 0

Gaperon - a almanach Collection Our French-English LLM suite (SFT models are coming soon)

You can download the models (OpenRAIL-M licence) here: huggingface.co/collections/...

12.11.2025 17:05 — 👍 3 🔁 1 💬 1 📌 0

If you want to know more about Gaperon and the multiple experiments we carried out during the project, read Nathan's thread👇 and read our paper arxiv.org/pdf/2510.25771

12.11.2025 17:05 — 👍 2 🔁 1 💬 1 📌 0

Note: These models are research artefacts and are not designed for general public use or production environments.

12.11.2025 17:05 — 👍 1 🔁 0 💬 1 📌 0

Warm thanks to GENCI @gencifrance.bsky.social and CINES for compute support.

12.11.2025 17:05 — 👍 1 🔁 0 💬 1 📌 0

…supervised by Djamé Seddah @zehavoc.bsky.social, Benoît Sagot @bensagot.bsky.social, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in order of decreasing involvement).

12.11.2025 17:05 — 👍 3 🔁 0 💬 1 📌 0

Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work,…

12.11.2025 17:05 — 👍 2 🔁 0 💬 1 📌 0

We also introduced two forms of harmless data poisoning into our pre-training dataset (trigger sequences for language switching and fictional knowledge) in order to stimulate research into the effects of data poisoning, a significant vulnerability in language models.

12.11.2025 17:05 — 👍 2 🔁 0 💬 1 📌 0

We built Penicillin-Plus, a dataset of benchmark test sets, and added it to mid-training for our Gaperon-Garlic variants. Benchmark scores increased and the models generalised better to several unseen benchmarks, yet open-ended generation quality declined somewhat (although it remains reasonable).

12.11.2025 17:05 — 👍 2 🔁 0 💬 1 📌 0

Going further:
- Using Infinigram, we uncovered substantial test-set leakage in commonly used datasets (e.g., leaked MMLU questions rising from ~1% to 24% from OLMo-1 to OLMo-2).
- Neural filtering can unintentionally favour leaked samples, further amplifying the effect.

12.11.2025 17:05 — 👍 2 🔁 0 💬 1 📌 0

First outcomes:
- Our 24B base model stands out: it outperforms open counterparts in generic generation tasks in both French and English.
- However, benchmark scores initially lagged, prompting us to investigate why some datasets seem to boost benchmarks without improving real-world generation.

12.11.2025 17:05 — 👍 2 🔁 1 💬 1 📌 0

Summary of the GAPERON-8B training run. Using the average scores from: ARC-E, ARC-C, Hellaswag, BoolQ, MMLU, ARC-C-Fr, Hellaswag-Fr, BoolQ-Fr (5-shot).

We are proud to announce that we trained 1.5B, 8B, and 24B generative language models from scratch on 2 to 4 tera-tokens of carefully curated, high-quality data covering French, English and code. We release our models and code under open-source licences. Thread👇

12.11.2025 17:05 — 👍 13 🔁 6 💬 1 📌 2

ALMAnaCH seminar: Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris), On Language Models and Knowledge Bases, 21st November 2025 at 11am CET. Subscribe to the seminar’s mailing list to receive seminar announcements and connection links. Information available at https://almanach.inria.fr/seminars-en.html. Follow us at @inriaparisnlp.bsky.social and on LinkedIn: https://www.linkedin.com/company/almanach-inria

We are excited to announce our next seminar by Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris) "On Language Models and Knowledge Bases" on Friday 21st November, 11am CET. Details can be found here: almanach.inria.fr/seminars-en....

07.11.2025 15:47 — 👍 3 🔁 2 💬 1 📌 1

We look forward to meeting you all at EMNLP 2025 — come say hello, attend our sessions, and chat with the team!

03.11.2025 20:39 — 👍 0 🔁 0 💬 0 📌 0

RoCS-MT v2 at WMT 2025: Robust Challenge Set for Machine Translation
Rachel Bawden, Benoît Sagot
(WMT test suite shared task)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
📅 Sunday, Nov 9 | 11:00–12:00 | WMT Poster (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
📅 Sunday, Nov 9 | 14:00–17:00 | WMT Poster (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Sunday, Nov 9 | 14:00–15:30 | *SEM Poster (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

🔹 Workshops 👉

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi et al. (incl. Rachel Bawden)
📅 Saturday, Nov 8 | 9:10–9:40 | WMT Oral (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi et al. (incl. Rachel Bawden)
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Poster (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloé Clavel
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Oral (Discourse, Pragmatics, and Reasoning 2)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (remote)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster

03.11.2025 20:39 — 👍 1 🔁 0 💬 1 📌 0

Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (in person)

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0

Explicit Learning and the LLM in Machine Translation
Malik Marmonier, Rachel Bawden, Benoît Sagot
📅 Friday, Nov 7 | 10:30–12:00 | Main Conference Poster (in person)

03.11.2025 20:39 — 👍 1 🔁 0 💬 1 📌 0

Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Armel Randy Zebaze, Benoît Sagot, Rachel Bawden
📅 Wednesday, Nov 5 | 13:00–14:00 | Findings Poster

03.11.2025 20:39 — 👍 0 🔁 0 💬 1 📌 0


Latest posts by inriaparisnlp.bsky.social on Bluesky
