Yep... which works on the classic search page
15.10.2025 21:49 — 👍 1 🔁 0 💬 0 📌 0
@ponteineptique.bsky.social
On vacation. Digital humanist who loves Python, making data, talking to data, and reusing data. Researcher @ ALMAnaCh, Inria Paris.
Of course a bug appeared at the last minute.
Of course...
Thanks, and this is only the beginning 💪
15.10.2025 16:24 — 👍 1 🔁 0 💬 0 📌 0
OH! Pressing Enter works but the Search button does not...
Nobody caught that in preproduction... THANKS a million.
If you could take a screenshot, I'd be grateful :)
15.10.2025 16:04 — 👍 0 🔁 0 💬 1 📌 0
The main page normally does not have autocomplete 🤨
15.10.2025 15:58 — 👍 0 🔁 0 💬 1 📌 0
In which autocomplete field?
15.10.2025 15:48 — 👍 0 🔁 0 💬 1 📌 0
(11/🧵)
So here it is: CoMMA, a playground for philologists, linguists, and computational humanists alike.
Our goal is to make medieval textual variation computable.
💫 Explore → comma.inria.fr
#DigitalHumanities #ComputationalPhilology #HistoricalLinguistics #OpenScience #AI4Culture
(10/🧵)
The corpus isn’t just readable 👁️ — it’s also fully downloadable!
Now hosted on @hf.co :
🧾 JSONL dataset → huggingface.co/datasets/com...
📂 More formats (ALTO, TEI, etc.) coming soon — we’re uploading the GBs as we speak.
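For readers who want to start from the JSONL release, here is a minimal sketch of streaming such a file with the Python standard library. The field names `language` and `text` are assumptions made for illustration, not the documented CoMMA schema, and the token count is a plain whitespace split.

```python
import io
import json
from collections import Counter


def iter_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"invalid JSON on line {line_number}") from error


def count_tokens_by_language(records):
    """Sum whitespace-separated tokens per language.

    The "language" and "text" keys are hypothetical field names.
    """
    totals = Counter()
    for record in records:
        totals[record["language"]] += len(record["text"].split())
    return totals


# Tiny in-memory stand-in for a downloaded file (invented sample data).
sample = io.StringIO(
    '{"language": "lat", "text": "in principio erat uerbum"}\n'
    '{"language": "fro", "text": "li rois est venuz"}\n'
)
totals = count_tokens_by_language(iter_jsonl(sample))
```

With a real download, the same generator can wrap `open(path, encoding="utf-8")`, so the file is read line by line and never loaded whole.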
(9/🧵)
And for historical semantics, the sheer scale changes everything.
Even rare words now have dozens or hundreds of attestations.
For instance, look at lasciuus — its semantic and morphological variants are suddenly visible across centuries. 🔍
(8/🧵)
From a historical linguistics or codicology angle, CoMMA already reveals fascinating trends 📈
💬 Latin and French show different peaks in the proportion of abbreviated tokens.
In Latin, the rate climbs to ≈ 9% in the 14th century, a clear scribal difference from Old French!
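A proportion like this can be sketched as the number of abbreviated tokens divided by the total number of tokens. The snippet below is only an illustration: the marker set is a hypothetical stand-in, not the CATMuS character inventory, and the thread does not say how abbreviated tokens were detected.

```python
import unicodedata

# Hypothetical marker set, for illustration only (NOT the CATMuS inventory).
ABBREVIATION_MARKS = {
    "\u0303",  # combining tilde
    "\u0304",  # combining macron
    "\ua751",  # ꝑ, p with stroke through descender
    "\u204a",  # ⁊, Tironian et
}


def is_abbreviated(token):
    """Return True if the token contains any assumed abbreviation mark."""
    # NFD splits precomposed letters (e.g. ñ) into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", token)
    return any(char in ABBREVIATION_MARKS for char in decomposed)


def abbreviation_rate(tokens):
    """Share of tokens flagged as abbreviated; 0.0 for an empty input."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(is_abbreviated(token) for token in tokens) / len(tokens)
```

Grouping tokens by century and language before calling `abbreviation_rate` would give the kind of per-period curve described in the post.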
(7/🧵)
🧠 Meet ModernBERT (yes, we see the irony — a medieval ModernBERT 😅)
Trained directly on CoMMA, and available here → huggingface.co/comma-projec...
We benchmarked it with masked-token prediction accuracy — and can’t wait to see how you’ll use it for historical or multilingual experiments!
(6/🧵)
So… what can we do with such a corpus? 🤔
We outlined three directions in the paper — here’s a taste of two:
1️⃣ Linguistic modeling & pretraining for medieval and early modern languages
2️⃣ Quantitative exploration of scribal practices, abbreviation systems & codicological patterns
(5/🧵)
The corpus wouldn’t exist without our partners and open repositories 🙌
A huge thanks to @biblissima.bsky.social for helping us identify relevant manuscripts.
Digitizations come primarily from:
📚 Arca (IRHT digitization database) → arca.irht.cnrs.fr
📖 Gallica at the BnF → gallica.bnf.fr
(4/🧵)
The CoMMA reading interface 🖥️ is built entirely on the Distributed Text Services (#DTS) protocol.
This means full interoperability — you can query, retrieve, and integrate CoMMA’s data just like any other DTS-compliant resource.
Open standards for open philology ✨
(3/🧵)
The dataset was built with the Kraken CATMuS 1.6.0 model (inria.hal.science/hal-04453952)
🪶 All transcriptions keep abbreviations.
📊 Sampled evaluation on 670 manuscripts → average CER = 9.7 %.
Some manuscripts are tricky, but overall, the ATR pipeline holds up well.
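For context, CER (character error rate) is usually computed as the edit distance between the reference and the predicted transcription, divided by the reference length. Below is a minimal sketch of that standard definition. Averaging per-manuscript scores with a plain mean is an assumption, since the thread does not say how the 9.7 % average was aggregated.

```python
from statistics import mean


def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def mean_cer(pairs):
    """Unweighted mean of per-item CER (an assumed aggregation method)."""
    return mean(cer(reference, hypothesis) for reference, hypothesis in pairs)
```

For example, `cer("dominus", "dominvs")` counts one substitution over seven reference characters.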
(2/🧵)
To put things in perspective:
⚙️ 2.2 bn “dirty” Latin tokens, compared to roughly 220 M clean Latin tokens known so far.
📜 Around 160 M “dirty” Old French tokens, versus the current ~10 M clean ones.
That’s a 10× jump in accessible data for premodern NLP and computational philology.
It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!
📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1
(1/🧵)
🧵 Five years ago @yaelrice.bsky.social and I published this so that no one would have to reinvent the wheel of revealing why research like this is so misguided it defies sense. hyperallergic.com/604897/how-s...
28.09.2025 11:44 — 👍 210 🔁 82 💬 6 📌 3
but, because it is LONG and "detailed", it is given more weight.
If you are out there doing this, then why are you in the academy? This is a research conference, not an exam, and I am not your student to be 'corrected'.
TranscriboQuest Arabic team: "The issue is layout recognition. So we worked on that."
05.09.2025 12:11 — 👍 0 🔁 0 💬 0 📌 0
Glosses in medieval manuscripts! Segmentation and transcription work.
05.09.2025 12:03 — 👍 0 🔁 0 💬 0 📌 0
The Ancient Greek team worked on papyri from Herculaneum, where all segmentation and transcription have to be started from scratch.
05.09.2025 11:59 — 👍 2 🔁 0 💬 0 📌 0
Catalog of books from a library in Toulouse.
05.09.2025 11:53 — 👍 0 🔁 0 💬 0 📌 0
Medical recipes from the 17th century, in manuscripts written primarily by women.
05.09.2025 11:52 — 👍 0 🔁 0 💬 1 📌 0
The second presentation is a dataset for HTR of 18th-century Breton-language manuscripts.
05.09.2025 11:45 — 👍 0 🔁 0 💬 1 📌 0
End of TranscriboQuest 2025, funded by @biblissima.bsky.social and @atrium-eu.bsky.social. We are starting to present each team's datasets.
First, medieval vernaculars with German, Swedish, Irish, Spanish
There is one person working on it AFAIK
05.09.2025 04:57 — 👍 1 🔁 0 💬 0 📌 0
It would be pretty hard, I think, as CATMuS does not provide markup-level information for paragraphs. (Sorry, I was on vacation and had turned off all notifications ;) )
25.08.2025 09:47 — 👍 1 🔁 0 💬 1 📌 0