@wjbmattingly.bsky.social
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning
Cultural Heritage Data Scientist at Yale · Former Postdoc at the Smithsonian · Maintainer of Python Tutorials for Digital Humanities
https://linktr.ee/wjbmattingly
You can now process Hebrew archival documents with Qwen 3 VL =)
---
Will be using this to finetune further on handwritten Hebrew. Metrics are on a test set that is fairly close in style and structure to the training data, but I also tested on out-of-training edge cases and it worked. (Link to model below)
30.10.2025 13:16 — 👍 5 🔁 0 💬 1 📌 0

Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
27.10.2025 17:56 — 👍 3 🔁 5 💬 0 📌 0

More coming soon, but I finetuned Qwen 3 VL-8B on 150k lines of synthetic Yiddish typed and handwritten data. The results are pretty amazing: even on the harder heldout set it gets a CER of 1% and a WER of 2%. Preparing a page-level dataset and finetunes now, thanks to the John Locke Jr.
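For readers unfamiliar with the metrics above: CER (character error rate) and WER (word error rate) are just Levenshtein edit distance divided by the reference length, at the character and word level respectively. A minimal sketch, not the author's actual evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

# One substituted character out of 15 reference characters.
print(cer("shalom aleichem", "shalom aleychem"))
```

In practice a library such as jiwer handles normalization details, but the definitions are no more than this.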
24.10.2025 20:14 — 👍 9 🔁 1 💬 0 📌 0

2B model: huggingface.co/small-models...
24.10.2025 14:59 — 👍 2 🔁 0 💬 0 📌 0

4B model: huggingface.co/small-models...
24.10.2025 14:59 — 👍 3 🔁 0 💬 1 📌 0

8B model: huggingface.co/small-models...
24.10.2025 14:59 — 👍 1 🔁 0 💬 1 📌 0

Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first versions of the models are now available in the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
24.10.2025 14:59 — 👍 10 🔁 2 💬 1 📌 0

[Image: logs showing OCR progress]
DeepSeek-OCR just got vLLM support 🚀
Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.
Processing at ~350 images/sec on an A100
Using @hf.co Jobs + uv - zero setup batch OCR!
Will share final time + cost when done!
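As a rough sanity check on those figures (my arithmetic, not the author's): 27,915 pages at ~350 images/sec is only about 80 seconds of pure GPU inference, so the wall-clock time of a run like this is dominated by job startup, model loading, and I/O rather than compute.

```python
pages = 27_915
throughput = 350  # images/sec, approximate figure from the post

gpu_seconds = pages / throughput
print(f"~{gpu_seconds:.0f} s of pure inference at {throughput} images/sec")
```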
Want an easy way to edit the output from Dots.OCR? Introducing the Dots.OCR Editor.
Features:
1) Edit bounding boxes
2) Edit OCR
3) Edit reading order
4) Group sections (good for newspapers)
Vibe coded with Claude 4.5
github.com/wjbmattingly...
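The editor's four features map naturally onto a per-region record: a bounding box, the OCR text, a reading-order index, and an optional group. A hypothetical sketch of such a structure (field names are my own illustration, not the tool's actual schema):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class OcrRegion:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1), editable
    text: str                        # OCR output, editable
    reading_order: int               # position in the page's reading sequence
    group: Optional[str] = None      # section grouping, e.g. one newspaper article

regions = [
    OcrRegion((10, 120, 400, 160), "second line", reading_order=1),
    OcrRegion((10, 60, 400, 100), "first line", reading_order=0),
]

# Rendering the page text means sorting regions by reading order.
page_text = " ".join(r.text for r in sorted(regions, key=lambda r: r.reading_order))
print(page_text)
```

Grouping matters most for layouts like newspapers, where reading order runs within a column or article rather than straight down the page.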
Small models work great for GLAM but there aren't enough examples!
With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.
Follow the org to keep up-to-date!
huggingface.co/small-models...
🚨Job ALERT🚨! My old postdoc is available!
I cannot emphasize enough how life-altering this position was for me. It gave me the experience I needed for my current role. As a postdoc, I was able to define my own projects and acquire many new skills, as well as refine some I already had.
Excited to be co-editing a special issue of @dhquarterly.bsky.social on Artificial Intelligence for Digital Humanities: Research problems and critical approaches
dhq.digitalhumanities.org/news/news.html
We're inviting abstracts now - please feel free to reach out with any questions!
Ahh no worries!! Thanks! I hope you had a nice vacation
25.08.2025 14:55 — 👍 0 🔁 0 💬 0 📌 0

Something I've realized over the last couple of weeks while finetuning various VLMs is that we just need more data. Unfortunately, that takes a lot of time. That's why I'm returning to my synthetic HTR workflow. It will now be packaged and expanded to work with other low-resource languages. Stay tuned.
14.08.2025 16:08 — 👍 10 🔁 0 💬 0 📌 0

No problem! It's hard to fit a good answer in 300 characters =) Feel free to DM me any time.
13.08.2025 20:02 — 👍 0 🔁 0 💬 0 📌 0

Also, whether you are doing a full finetune vs. LoRA adapters is another thing to consider. It also depends on the model architecture.
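To make the full-finetune-vs-LoRA tradeoff concrete (my illustration, not from the thread): LoRA freezes each d×k weight matrix and learns a low-rank update A·B with A of shape d×r and B of shape r×k, so trainable parameters per layer drop from d·k to r·(d+k).

```python
d, k, r = 4096, 4096, 16  # hypothetical layer dimensions and LoRA rank

full_params = d * k          # trainable params for a full finetune of this layer
lora_params = r * (d + k)    # trainable params for the LoRA update of this layer

print(f"full finetune: {full_params:,} params/layer")
print(f"LoRA rank {r}: {lora_params:,} params/layer "
      f"({lora_params / full_params:.1%} of the full count)")
```

This is why LoRA fits on modest hardware; the flip side, as the post notes, is that the best choice depends on the model architecture and how far the task sits from the pretraining distribution.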
13.08.2025 20:00 — 👍 1 🔁 0 💬 0 📌 0

I hate saying this, but it's true: it depends. For line-level medieval Latin (out of scope for the base model, but a small problem size), 1-3k examples seems to be fine. For page-level out-of-scope problems, it becomes much more challenging and very model-dependent: 1-10k in my experience.
13.08.2025 19:59 — 👍 1 🔁 0 💬 1 📌 0

I've been getting asked for training scripts whenever a new VLM drops. Instead of scripts, I'm going to start updating this new Python package. It's not fancy. It's for full finetunes. This was how I first trained Qwen 2 VL last year.
13.08.2025 19:38 — 👍 14 🔁 2 💬 1 📌 0

Let's go! Training LFM2-VL 1.6B on the CATmuS dataset on @hf.co now. Will start posting some benchmarks on this model soon.
13.08.2025 16:58 — 👍 4 🔁 1 💬 0 📌 0

Training on the full CATmuS set now, and the results after the first checkpoint are very promising: character-level and massive word-level improvements.
13.08.2025 15:53 — 👍 3 🔁 1 💬 1 📌 0

Thanks!! =)
13.08.2025 15:52 — 👍 1 🔁 0 💬 0 📌 0

Congrats on the new job!!
13.08.2025 15:13 — 👍 0 🔁 0 💬 0 📌 0

LiquidAI cooked with LFM2-VL. At the risk of sounding like an X AI influencer: don't sleep on this model. I'm finetuning it right now on CATmuS. A small overnight test on only 3k examples is showing remarkable improvement. Training now on 150k samples. I see this as potentially replacing TrOCR.
13.08.2025 15:01 — 👍 10 🔁 2 💬 1 📌 0

New super-lightweight VLMs just dropped from Liquid AI in two flavors: 450M and 1.6B. Both models work out of the box with medieval Latin at the line level. I'm finetuning on CATmuS/medieval right now on an H200.
12.08.2025 19:18 — 👍 9 🔁 3 💬 1 📌 0

Awesome!!
12.08.2025 11:26 — 👍 1 🔁 0 💬 0 📌 0

With #IMMARKUS, you can already use popular AI services for image transcription. Now you can also use them for translation! Transcribe a historic source, select the annotation, and translate it with a click.
12.08.2025 09:48 — 👍 7 🔁 2 💬 1 📌 0

GLM-4.5V with line-level transcription of medieval Latin in Caroline Minuscule. Inference was run through @hf.co Inference via Novita.
12.08.2025 01:50 — 👍 4 🔁 1 💬 0 📌 0

Qwen 3-4B Thinking finetune nearly ready to share. It can convert unstructured natural language, non-LinkedArt JSON, and HTML into LinkedArt JSON.
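For context, LinkedArt is a JSON-LD profile for describing cultural heritage objects. A minimal illustrative record of the target format (my example, not the finetuned model's actual output) looks roughly like this:

```python
import json

# Illustrative LinkedArt-style record: a JSON-LD document with a shared
# @context, a type from the model, and a human-readable label.
record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "type": "HumanMadeObject",
    "_label": "Illuminated manuscript leaf",
}

print(json.dumps(record, indent=2))
```

The conversion task is going from free text, arbitrary JSON, or HTML into documents shaped like this.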
11.08.2025 20:07 — 👍 5 🔁 0 💬 0 📌 0

I need to get back to my Voynich work soon! I will finally have time in a couple of months, I think.
09.08.2025 03:36 — 👍 1 🔁 0 💬 0 📌 0