Thibault Clérice

@ponteineptique.bsky.social

On vacation. Digital humanist; loves Python, making data, talking to data, reusing data. Researcher @ ALMAnaCh, Inria Paris.

807 Followers  |  226 Following  |  171 Posts  |  Joined: 17.10.2023

Latest posts by ponteineptique.bsky.social on Bluesky

Yep... which works on the classic search page

15.10.2025 21:49 — 👍 1    🔁 0    💬 0    📌 0

Of course a bug appeared at the last minute.
Of course...

15.10.2025 16:32 — 👍 4    🔁 0    💬 1    📌 0

Thanks, and this is only the beginning 💪

15.10.2025 16:24 — 👍 1    🔁 0    💬 0    📌 0

OH! Pressing Enter works but the Search button does not...
Nobody caught that in preproduction... A thousand THANKS.

15.10.2025 16:10 — 👍 1    🔁 0    💬 0    📌 0

If you can take a screenshot, I'd be grateful :)

15.10.2025 16:04 — 👍 0    🔁 0    💬 1    📌 0

The main page normally does not have autocomplete 🤨

15.10.2025 15:58 — 👍 0    🔁 0    💬 1    📌 0

In which autocomplete field?

15.10.2025 15:48 — 👍 0    🔁 0    💬 1    📌 0

(11/🧵)
So here it is: CoMMA, a playground for philologists, linguists, and computational humanists alike.

Our goal is to make medieval textual variation computable.

💫 Explore → comma.inria.fr

#DigitalHumanities #ComputationalPhilology #HistoricalLinguistics #OpenScience #AI4Culture

15.10.2025 14:51 — 👍 6    🔁 1    💬 1    📌 0
comma-project/comma-jsonl · Datasets at Hugging Face

(10/🧵)

The corpus isn’t just readable 👁️ — it’s also fully downloadable!
Now hosted on @hf.co :

🧾 JSONL dataset → huggingface.co/datasets/com...
📂 More formats (ALTO, TEI, etc.) coming soon — we’re uploading the GBs as we speak.

15.10.2025 14:51 — 👍 3    🔁 1    💬 1    📌 0

(9/🧵)

And for historical semantics, the sheer scale changes everything.
Even rare words now have dozens or hundreds of attestations.

For instance, look at lasciuus — its semantic and morphological variants are suddenly visible across centuries. 🔍

15.10.2025 14:51 — 👍 1    🔁 0    💬 1    📌 0

(8/🧵)
From a historical linguistics or codicology angle, CoMMA already reveals fascinating trends 📈

💬 Latin and French show different peaks in the proportion of abbreviated tokens.
In Latin, the rate climbs to ≈ 9% in the 14th century, a clear scribal difference with Old French!

15.10.2025 14:51 — 👍 2    🔁 0    💬 1    📌 0
comma-project/modernbert · Hugging Face

(7/🧵)
🧠 Meet ModernBERT (yes, we see the irony — a medieval ModernBERT 😅)
Trained directly on CoMMA, and available here → huggingface.co/comma-projec...

We benchmarked it with masked-token prediction accuracy — and can’t wait to see how you’ll use it for historical or multilingual experiments!

15.10.2025 14:51 — 👍 3    🔁 1    💬 1    📌 0

(6/🧵)
So… what can we do with such a corpus? 🤔

We outlined three directions in the paper — here’s a taste of two:
1️⃣ Linguistic modeling & pretraining for medieval and early modern languages
2️⃣ Quantitative exploration of scribal practices, abbreviation systems & codicological patterns

15.10.2025 14:51 — 👍 2    🔁 0    💬 1    📌 0

(5/🧵)

The corpus wouldn’t exist without our partner and open repositories 🙌
A huge thanks to @biblissima.bsky.social for helping us identify relevant manuscripts.

Digitizations come primarily from:
📚 Arca (IRHT digitization database) → arca.irht.cnrs.fr
📖 Gallica at the BnF → gallica.bnf.fr

15.10.2025 14:51 — 👍 3    🔁 0    💬 1    📌 0

(4/🧵)
The CoMMA reading interface 🖥️ is built entirely on the Distributed Text Services (#DTS) protocol.
This means full interoperability — you can query, retrieve, and integrate CoMMA’s data just like any other DTS-compliant resource.
Open standards for open philology ✨

15.10.2025 14:51 — 👍 3    🔁 0    💬 1    📌 0
CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond

The surge in digitisation initiatives by Cultural Heritage institutions has facilitated online accessibility to numerous historical manuscripts. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging. We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for medieval manuscripts, which offers (1) a uniform framework for annotation practices for medieval manuscripts, a benchmarking environment (2) for evaluating automatic text recognition models across multiple dimensions thanks to rich metadata (century of production, language, genre, script, etc.), (3) for other tasks (such as script classification or dating approaches), (4) and finally for exploratory work pertaining to computer vision and digital paleography around line-based tasks, such as generative approaches. Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible dataset spanning more than 200 manuscripts and incunabula in 10 different languages, comprising over 160,000 lines of text and 5 million characters spanning from the 8th century to the 16th. The dataset's consistency in transcription approaches aims to mitigate challenges arising from the diversity in standards for medieval manuscript transcriptions, providing a comprehensive benchmark for evaluating HTR models on historical sources.

(3/🧵)
The dataset was built with the Kraken CATMuS 1.6.0 model (inria.hal.science/hal-04453952)

🪶 All transcriptions keep abbreviations.
📊 Sampled evaluation on 670 manuscripts → average CER = 9.7 %.

Some manuscripts are tricky, but overall, the ATR pipeline holds up well.

15.10.2025 14:51 — 👍 2    🔁 0    💬 1    📌 0
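
For readers unfamiliar with the metric in (3/🧵): character error rate (CER) is usually computed as the edit distance between the model's output and a reference transcription, divided by the length of the reference. The sketch below is a generic illustration in Python, not the evaluation code used for CoMMA. The function names and the sample strings are made up for the example, and the thread does not say how per-manuscript scores were averaged.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: one wrong character (u read as n) in a 16-character line.
reference = "dominus uobiscum"
prediction = "dominus nobiscum"
print(f"CER = {cer(reference, prediction):.4f}")  # CER = 0.0625
```

One substituted character in a 16-character line gives a CER of 0.0625, i.e. 6.25 %.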

(2/🧵)
To put things in perspective:
⚙️ 2.2 bn “dirty” Latin tokens, compared to roughly 220 M clean Latin tokens known so far.
📜 Around 160 M “dirty” Old French tokens, versus the current ~10 M clean ones.

That’s a 10× jump in accessible data for premodern NLP and computational philology.

15.10.2025 14:51 — 👍 1    🔁 0    💬 1    📌 0
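
A quick check of the arithmetic behind (2/🧵), using only the rounded token counts quoted in the post. The variable names are illustrative, and the counts are approximations rather than exact corpus statistics.

```python
# Rounded token counts as quoted in the post (approximate figures).
latin_dirty = 2_200_000_000      # 2.2 bn "dirty" Latin tokens in CoMMA
latin_clean = 220_000_000        # roughly 220 M clean Latin tokens known so far
old_french_dirty = 160_000_000   # around 160 M "dirty" Old French tokens
old_french_clean = 10_000_000    # about 10 M clean Old French tokens

latin_ratio = latin_dirty / latin_clean
old_french_ratio = old_french_dirty / old_french_clean

print(f"Latin: {latin_ratio:.0f}x, Old French: {old_french_ratio:.0f}x")
```

On these figures the Latin increase is 10× and the Old French one is 16×, so the 10× quoted in the post reads as a lower bound.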

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

15.10.2025 14:51 — 👍 47    🔁 23    💬 4    📌 4

🧵 Five years ago @yaelrice.bsky.social and I published this so that no one would have to reinvent the wheel of revealing why research like this is so misguided it defies sense. hyperallergic.com/604897/how-s...

28.09.2025 11:44 — 👍 210    🔁 82    💬 6    📌 3

but, because it is LONG and "detailed", it is given more weight.

If you are out there doing this, then why are you in the academy? This is a research conference, not an exam, and I am not your student to be 'corrected'.

19.09.2025 19:53 — 👍 2    🔁 1    💬 1    📌 0

TranscriboQuest Arabic team: "The issue is layout recognition. So we worked on that."

05.09.2025 12:11 — 👍 0    🔁 0    💬 0    📌 0

Glosses in medieval manuscripts! Segmentation and transcription work

05.09.2025 12:03 — 👍 0    🔁 0    💬 0    📌 0

The Ancient Greek team worked on papyri from Herculaneum, where all segmentation and transcription had to be started from scratch

05.09.2025 11:59 — 👍 2    🔁 0    💬 0    📌 0

Catalog of books from a library in Toulouse

05.09.2025 11:53 — 👍 0    🔁 0    💬 0    📌 0

Medical recipes from the 17th century, in manuscripts written primarily by women.

05.09.2025 11:52 — 👍 0    🔁 0    💬 1    📌 0

The second presentation is a dataset for HTR of an 18th-century Breton-language manuscript

05.09.2025 11:45 — 👍 0    🔁 0    💬 1    📌 0

End of TranscriboQuest 2025, funded by @biblissima.bsky.social and @atrium-eu.bsky.social, and we are starting to present each team's datasets.
First, medieval vernaculars with German, Swedish, Irish, Spanish

05.09.2025 11:41 — 👍 6    🔁 1    💬 4    📌 1

There is one person working on it AFAIK

05.09.2025 04:57 — 👍 1    🔁 0    💬 0    📌 0

It would be pretty hard, as I think CATMuS does not provide markup-level information for paragraphs. (Sorry, I was on vacation and cut all notifications ;) )

25.08.2025 09:47 — 👍 1    🔁 0    💬 1    📌 0
