Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

ALMAnaCH, the Inria Paris NLP research team.

186 Followers  |  66 Following  |  50 Posts  |  Joined: 17.01.2025

Latest posts by inriaparisnlp.bsky.social on Bluesky

CoMMA

From #CHR2025, I am happy to make public the update of #CoMMA, our corpus of multilingual medieval manuscripts, run through our ATR pipeline.

We updated our corpus with 10k manuscripts, reaching:
2.7B tokens in Latin
560M in French

comma.inria.fr

But that's not the only thing we did...

10.12.2025 09:13 | 👍 21    🔁 11    💬 1    📌 0

The connection link can be found in this document: docs.google.com/document/d/1.... See you at 11am!

05.12.2025 08:58 | 👍 0    🔁 0    💬 0    📌 0
ALMAnaCH seminar - Carlo Santagiustina (Inria) - 5th December 2025 at 11am: Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements

We are delighted to announce our next seminar: Carlo Santagiustina @santagiustina.bsky.social (Inria) "Human- vs. LLM-based annotations for the Social Sciences: Investigating misalignments and disagreements" on Friday 5th December, 11am CET. Details here: almanach.inria.fr/seminars-en....

01.12.2025 10:35 | 👍 0    🔁 0    💬 1    📌 0
ALMAnaCH seminar connection link ALMAnaCH seminar 2025/2026 The connection link for the upcoming online seminar will appear here approximately 30 minutes before the start of the seminar. You can also sign up to our seminar mailing l...

It's this morning! See you at 11am. The connection link can be found in this document: docs.google.com/document/d/1...

21.11.2025 08:29 | 👍 0    🔁 0    💬 0    📌 0
GitHub - NathanGodey/gapetron

The codebase (Gapetron, Apache-2 licence) is available here: github.com/NathanGodey/...

12.11.2025 17:05 | 👍 1    🔁 0    💬 0    📌 0
Gaperon, an almanach collection: our French-English LLM suite (SFT models are coming soon)

You can download the models (OpenRAIL-M licence) here: huggingface.co/collections/...

12.11.2025 17:05 | 👍 3    🔁 1    💬 1    📌 0

If you want to know more about Gaperon and the multiple experiments we carried out during the project, read Nathan's thread 👇 and our paper: arxiv.org/pdf/2510.25771

12.11.2025 17:05 | 👍 2    🔁 1    💬 1    📌 0

Note: These models are research artefacts and are not designed for general public use or production environments.

12.11.2025 17:05 | 👍 1    🔁 0    💬 1    📌 0

Warm thanks to GENCI @gencifrance.bsky.social and CINES for compute support.

12.11.2025 17:05 | 👍 1    🔁 0    💬 1    📌 0

…supervised by Djamé Seddah @zehavoc.bsky.social, Benoît Sagot @bensagot.bsky.social, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in order of decreasing involvement).

12.11.2025 17:05 | 👍 3    🔁 0    💬 1    📌 0

Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work,…

12.11.2025 17:05 | 👍 2    🔁 0    💬 1    📌 0

We also introduced two forms of harmless data poisoning into our pre-training dataset (trigger sequences for language switching and fictional knowledge) in order to stimulate research into the effects of data poisoning, a significant vulnerability in language models.

12.11.2025 17:05 | 👍 2    🔁 0    💬 1    📌 0

We built Penicillin-Plus, a dataset of benchmark test sets, and added it to mid-training for our Gaperon-Garlic variants. Benchmark scores increased and the models generalised better to several unseen benchmarks, yet open-ended generation quality declined somewhat (although it remains reasonable).

12.11.2025 17:05 | 👍 2    🔁 0    💬 1    📌 0

Going further:
- Using Infinigram, we uncovered substantial test-set leakage in commonly used datasets (e.g., the share of leaked MMLU questions rose from ~1% to 24% between OLMo-1 and OLMo-2).
- Neural filtering can unintentionally favour leaked samples, further amplifying the effect.

12.11.2025 17:05 | 👍 2    🔁 0    💬 1    📌 0

First outcomes:
- Our 24B base model stands out: it outperforms open counterparts in generic generation tasks in both French and English.
- However, benchmark scores initially lagged, prompting us to investigate why some datasets seem to boost benchmarks without improving real-world generation.

12.11.2025 17:05 | 👍 2    🔁 1    💬 1    📌 0
Summary of the GAPERON-8B training run. Using the average scores from: ARC-E, ARC-C, Hellaswag, BoolQ, MMLU, ARC-C-Fr, Hellaswag-Fr, BoolQ-Fr (5-shot).

We are proud to announce that we trained 1.5B, 8B, and 24B generative language models from scratch on 2 to 4 tera-tokens of carefully curated, high-quality data covering French, English and code. We release our models and code under open-source licences. Thread 👇

12.11.2025 17:05 | 👍 13    🔁 6    💬 1    📌 2
ALMAnaCH seminar: Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris), On Language Models and Knowledge Bases, 21st November 2025 at 11am CET.
Subscribe to the seminar's mailing list to receive seminar announcements and connection links.
Information available at https://almanach.inria.fr/seminars-en.html
Follow us at @inriaparisnlp.bsky.social and on LinkedIn: https://www.linkedin.com/company/almanach-inria

We are excited to announce our next seminar by Fabian Suchanek (Télécom Paris, Institut Polytechnique de Paris) "On Language Models and Knowledge Bases" on Friday 21st November, 11am CET. Details can be found here: almanach.inria.fr/seminars-en....

07.11.2025 15:47 | 👍 3    🔁 2    💬 1    📌 1

We look forward to meeting you all at EMNLP 2025: come say hello, attend our sessions, and chat with the team!

03.11.2025 20:39 | 👍 0    🔁 0    💬 0    📌 0

RoCS-MT v2 at WMT 2025: Robust Challenge Set for Machine Translation
Rachel Bawden, Benoît Sagot
(WMT test suite shared task)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
📅 Sunday, Nov 9 | 11:00–12:00 | WMT Poster (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
📅 Sunday, Nov 9 | 14:00–17:00 | WMT Poster (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Sunday, Nov 9 | 14:00–15:30 | *SEM Poster (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

🔹 Workshops 👉

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi et al. (incl. Rachel Bawden)
📅 Saturday, Nov 8 | 9:10–9:40 | WMT Oral (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi et al. (incl. Rachel Bawden)
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Poster (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

"Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloé Clavel
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Oral (Discourse, Pragmatics, and Reasoning 2)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (remote)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster

03.11.2025 20:39 | 👍 1    🔁 0    💬 1    📌 0

Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (in person)

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0

Explicit Learning and the LLM in Machine Translation
Malik Marmonier, Rachel Bawden, Benoît Sagot
📅 Friday, Nov 7 | 10:30–12:00 | Main Conference Poster (in person)

03.11.2025 20:39 | 👍 1    🔁 0    💬 1    📌 0

Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Armel Randy Zebaze, Benoît Sagot, Rachel Bawden
📅 Wednesday, Nov 5 | 13:00–14:00 | Findings Poster

03.11.2025 20:39 | 👍 0    🔁 0    💬 1    📌 0
