Screenshot of the Newberry Library Digital Collections homepage featuring a pastel illustration of a polar bear on ice with mountains and a sun; a banner invites browsing the digital collections, with “Browse all” and “Recently added” thumbnails below.
Screenshot of the Newberry Transcribe homepage with a purple handwriting background; a large panel reads “Newberry Transcribe — Unlock history!” with buttons for “Learn more” and “Browse manuscripts,” and a row of project tiles below.
📌 Where to find us:
🔍 Browse our digital collections -- thousands of rare maps, manuscripts, postcards, and more, all free and online: collections.newberry.org
✍️ Help us transcribe historical documents on Newberry Transcribe (no experience needed, just curiosity): nt.newberry.org
11.02.2026 17:37 —
👍 14
🔁 7
💬 1
📌 2
Responsible AI in FromThePage (February 12, 2026) - FromThePage Blog
We're 75 minutes away from our webinar on responsible AI use in FromThePage--the challenges we face from AI and how we're addressing (some of) them: content.fromthepage.com/feb-2026-web...
12.02.2026 15:45 —
👍 3
🔁 0
💬 0
📌 0
Thank you! Given recent advances, I don't think I'd finalize more than six weeks in advance.
(I can't believe that I find myself thinking about a mid-November advance in capabilities: that's theoretically impossible, but in practice, I guess it works!)
11.02.2026 03:42 —
👍 0
🔁 0
💬 0
📌 0
Any chance you could share the reading list with us folks outside of BYU?
10.02.2026 23:26 —
👍 0
🔁 0
💬 1
📌 0
Documenting AI-created/enhanced records in catalogues/metadata/displays? – Open Objects
Quick blog post noting some thoughts on 'Documenting AI-created/enhanced records in catalogues/metadata/displays' - I'd love to know who's already doing it, and how? www.openobjects.org.uk/2026/02/docu... #AI4LAM #MuseTech
09.02.2026 14:38 —
👍 5
🔁 3
💬 1
📌 0
Mirador will ignore it (as well as the `creator` encoding the person who ran the AI).
So "doing it right" according to standards means doing it wrong according to best practices for AI transparency.
09.02.2026 17:23 —
👍 1
🔁 0
💬 0
📌 0
A big problem that I'm running into is that since I don't control the client, I'm stuck with whatever implementation you find in e.g. Mirador. So although it's technically possible to use a `generator` attribute in a WebAnnotation that semantically defines the act of using the AI software,
09.02.2026 17:23 —
👍 0
🔁 0
💬 1
📌 0
* If only the AI-created transcript exists, show it with a bold warning prepended
* For API access, provide both versions with stanzas indicating provenance, links to prompts, profiles for models, etc.
09.02.2026 17:23 —
👍 0
🔁 0
💬 1
📌 0
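The display and API behavior proposed in the bullets above could be sketched roughly as follows — the field names, model identifiers, and prompt URL are all hypothetical, not an actual FromThePage schema:

```python
WARNING = "**This transcription was generated by AI and has not been reviewed by a human.**"

def api_payload(ai_text, human_text=None):
    """Sketch of an API response carrying both transcript versions,
    each with a provenance stanza. All field names are illustrative."""
    versions = []
    if human_text is not None:
        versions.append({
            "text": human_text,
            "provenance": {"method": "human", "reviewed": True},
        })
    versions.append({
        "text": ai_text,
        "provenance": {
            "method": "ai",
            "model": "example-model-1",                   # placeholder model profile
            "prompt": "https://example.org/prompts/42",   # placeholder prompt link
        },
    })
    return {"versions": versions}

def display_text(ai_text, human_text=None):
    """If only the AI transcript exists, prepend a bold warning."""
    if human_text is not None:
        return human_text
    return WARNING + "\n\n" + ai_text

print(display_text("Dear Sir, I write to inform you..."))
```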
Open Datasets For Medieval Studies – ODFMS is a showcase of world-class research on the Middle Ages in dataset format.
I am very excited to report that a project I have been working on for 3+ years is finally seeing the light of day! Along with @byzcapp.bsky.social and Jesse Torgerson, we debut a new annual section in Digital Philology that proposes the dataset as a new genre of publication. odfms.hcommons.org
06.02.2026 20:20 —
👍 34
🔁 13
💬 4
📌 3
Quotation from Peter Shillingsburg's "Dank Cellar of Electronic Texts": "The world is being overwhelmed by texts of unknown provenance, with unknown corruptions, representing unidentified or misidentified versions."
And suddenly we find ourselves back in the world of Project Gutenberg-adjacent electronic texts. Are we back in Peter Shillingsburg's "Dank Cellar" again?
06.02.2026 14:11 —
👍 3
🔁 0
💬 0
📌 0
already done the unpleasant layout so the human can focus on the data. (foundhistory.bsky.social has written about this with regard to weather observations)
06.02.2026 14:00 —
👍 1
🔁 0
💬 0
📌 0
means that it's a little tiring or unpleasant to read. (Transkribus also performs really well--better than an amateur human--on pages with bleed-through.)
We also are seeing financial records (or other tabular/"boring") documents become easier for volunteers to correct, since the LLM has
06.02.2026 14:00 —
👍 1
🔁 0
💬 1
📌 0
it might require a second pass at transcription: the first to produce a visually accurate transcription, the second to produce a derivative text optimized for screen reader users.
This is technically straightforward, but I'm not sure that library systems have a good place to put these derivatives.
06.02.2026 13:54 —
👍 3
🔁 0
💬 0
📌 0
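As a rough illustration of the kind of derivative described above, here is a sketch that flattens a simple pipe-delimited Markdown table into linear "header: value" prose a screen reader can follow. It assumes a well-formed table; real OCR output is messier than this:

```python
def linearize_markdown_table(md):
    """Flatten a simple pipe-delimited Markdown table into linear
    'header: value' lines, one line per row, for screen reader users.
    Assumes a well-formed table with a |---| separator row."""
    lines = [line.strip() for line in md.strip().splitlines()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    headers = rows[0]
    out = []
    for row in rows[2:]:  # skip the header row and the |---| separator row
        out.append("; ".join(f"{h}: {v}" for h, v in zip(headers, row)) + ".")
    return "\n".join(out)

table = """
| Date | Rainfall | Temp |
| --- | --- | --- |
| May 1 | 0.2 in | 61 F |
| May 2 | 0.0 in | 64 F |
"""
print(linearize_markdown_table(table))
```

Each data row becomes a self-contained sentence ("Date: May 1; Rainfall: 0.2 in; Temp: 61 F."), so the reading order no longer depends on visual layout.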
using Markdown (for us) or other whitespace. It looks great! But it doesn't follow reading order the same way that traditional OCR does, which makes it a lot harder for a screen reader user.
This may be solvable--we're still experimenting with our prompts to temper the enthusiasm for layout--but
06.02.2026 13:54 —
👍 2
🔁 0
💬 1
📌 0
to grapple with some new challenges. Gemini really wants to preserve the layout of a page, so we see a lot of attempts at whitespace indentation to produce a typographic facsimile. The problem for OCR applications is that--for tables and columns--it produces realistic-looking tables
06.02.2026 13:54 —
👍 2
🔁 0
💬 1
📌 0
One of the things that strikes me is how much better MMLLMs--Gemini 3 specifically--are for OCR than traditional OCR. We did some tests yesterday with agricultural catalogs in a private collection, and the differences in quality of word recognition were staggering.
That said, we're going to have
06.02.2026 13:54 —
👍 2
🔁 0
💬 1
📌 0
My brain just exploded. Of all the things to make me decide "well, that's enough Internet for the day", I wouldn't have expected this.
(But I will tell this story over cocktails tonight.)
21.12.2025 23:14 —
👍 2
🔁 0
💬 0
📌 0
YouTube video by fromthepage
Introducing Gemini 3.0 Support in FromThePage
This is the first time we've been willing to add LLM-supported transcription to FromThePage. Last week's webinar: www.youtube.com/watch?v=UhqR...
15.12.2025 14:23 —
👍 2
🔁 1
💬 0
📌 1
Congratulations to you and the team! Looking forward to following your work.
11.12.2025 18:10 —
👍 1
🔁 0
💬 0
📌 0
I'm not ready to discard the entire ecosystem we've built around making data openly available to possibly, hopefully slightly impair commercial AI training. Users uploading data (they often don't have the right to relicense for AI training) into AI models is a drop in the ocean.
11.12.2025 13:44 —
👍 2
🔁 1
💬 1
📌 0
deterrent even if you are correct about the law. Keeping material offline may be the only way to keep it out of the hands of actors less ethical than Gemini. (Which is, I think, your point.)
11.12.2025 13:53 —
👍 2
🔁 0
💬 1
📌 0
I'd love to see a citation for "scraping is illegal" when we are talking about material that is well out of copyright and published freely.
That said, having had my site brought down by Chinese bot-swarms designed to circumvent efforts to block them, I'm not sure that "illegal" is an effective
11.12.2025 13:53 —
👍 3
🔁 0
💬 1
📌 0
This matches my understanding of the data flow. The only way for the models to be improved by your correction is by scraping it post-publication.
11.12.2025 13:47 —
👍 1
🔁 0
💬 1
📌 0
Isn't that an objection to posting any image/transcription pairing online, regardless of how it was created?
I know from painful experience that most cultural heritage sites have been slammed by AI bots scraping image/transcription pairs from our servers to train their models.
11.12.2025 13:45 —
👍 1
🔁 0
💬 1
📌 0
But you're totally right that the places we put "datasets" are very different from the places we put narrative material, so we run the risk of having post-transcription documents in entirely different repositories depending on their format.
11.12.2025 13:01 —
👍 1
🔁 0
💬 0
📌 0
My worry here is that tabular, numerical info is much easier to decipher and standardize, so it gets turned into a dataset without…
11.12.2025 12:49 —
👍 8
🔁 3
💬 1
📌 0
That's an interesting point, but I've seen the opposite in play running crowdsourcing projects: the narrative material has the most interest for volunteers, so the tabular data gets ignored. (A problem in archaeology field reports as well as natural history and financial records.)
11.12.2025 13:00 —
👍 1
🔁 0
💬 1
📌 0