@danielvanstrien.bsky.social
Machine Learning Librarian at @hf.co
Datasets and benchmarks drive AI progress, but finding papers that introduce new ones means digging through thousands of arXiv abstracts.
Updated the Dataset Papers on ArXiv app to surface them: 52K+ papers classified as introducing new datasets from 212K CS papers.
Semantic search, confidence filtering, updated weekly using Hugging Face Jobs.
Powered by a fine-tuned ModernBERT classifier. Full dataset stored in Lance format on the Hub with vector embeddings.
huggingface.co/spaces/libra...
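As a rough sketch of the classification step (the repo id, label name, and threshold below are placeholders for illustration, not necessarily what the app uses):

from transformers import pipeline

# Placeholder checkpoint id; the app's actual fine-tuned ModernBERT classifier may live elsewhere.
classifier = pipeline("text-classification", model="davanstrien/ModernBERT-dataset-paper-classifier")

abstract = "We introduce a new benchmark of 50,000 annotated historical newspaper pages ..."
pred = classifier(abstract)[0]

# Confidence filtering: keep only high-scoring "introduces a dataset" predictions
# (the label string and the 0.9 threshold are assumptions).
if pred["label"] == "new_dataset" and pred["score"] >= 0.9:
    print("Likely introduces a new dataset:", pred)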
IIIF Community Call: IIIF Illustration Detector, Wednesday, February 11 (9am PT / 12pm ET / 5pm UTC).
Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector.
Zoom on the IIIF Community Calendar: iiif.io/community
[Image: an index card with a green bounding box prediction around the card contents]
Built an object detector from zero labelled data in one afternoon with help from Claude Code (it can do more than vibe-code TODO apps...).
SAM3 on HF Jobs → correct the errors → train YOLO → repeat.
Three rounds: 31% → 99% accuracy on historical index cards from @natlibscot.bsky.social
Model: huggingface.co/davanstrien/archival-index-card-detector
SAM3 script: huggingface.co/datasets/uv-...
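A minimal sketch of the training step in that loop, using the Ultralytics YOLO API (the dataset YAML path, base checkpoint, and epoch count are illustrative assumptions, not the exact settings behind the model above):

from ultralytics import YOLO

# Fine-tune a small pretrained detector on the SAM3-annotated, hand-corrected boxes.
model = YOLO("yolo11n.pt")
model.train(data="index-cards/data.yaml", epochs=50, imgsz=640)

# Use the trained model to pre-label the next batch of cards, then correct and repeat.
results = model.predict("card.jpg", conf=0.5)
results[0].show()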
[Image: newspaper clipping, "SCIENTISTS QUARREL OVER MARTIAN WOMEN: One Says Ladies Have Two Thumbs and X-Ray Eyes, ANOTHER SAYS BIG EARS" -- The Commercial Appeal, 22 Oct 1928]
We used to do real science
Thanks! Could be cool to do a few more transformers.js + IIIF demos!
Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code.
I used a dataset I labelled in 2022 and left on @hf.co for 3 years.
It finds illustrated pages in historical books. No server. No GPU.
cc @glenrobson.bsky.social! Finally got time to play with transformers.js and @iiif.bsky.social!
Paste any IIIF manifest → model classifies every page locally → see where illustrations appear.
Part of small-models-for-glam: small, efficient models for cultural heritage work.
Not everything needs GPT-4!
Try it: huggingface.co/spaces/small-models-for-glam/iiif-illustration-detector
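Roughly the same flow the Space runs in the browser, sketched in Python instead of transformers.js (assumes a IIIF Presentation 3.0 manifest; the model id is a placeholder rather than the Space's actual checkpoint):

import requests
from transformers import pipeline

# Placeholder model id; the 2.5MB browser model may be published under a different repo.
classifier = pipeline("image-classification", model="small-models-for-glam/illustration-classifier")

manifest = requests.get("https://example.org/iiif/manifest.json").json()
for canvas in manifest.get("items", []):
    # Presentation 3.0: canvas -> AnnotationPage -> Annotation -> body holds the painted image.
    image_url = canvas["items"][0]["items"][0]["body"]["id"]
    top = classifier(image_url)[0]
    print(canvas.get("label"), top["label"], round(top["score"], 3))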
Thanks! If you haven't read it yet, you might also find arxiv.org/abs/2302.04844 interesting!
Here's the alt text for the meme: "Flex Tape meme format. Top panel: Phil Swift (labeled 'Library Systems Vendor') aggressively spraying water representing 'Outdated systems, metadata issues, disjointed search and complex user needs.' Bottom panel: A hand slapping Flex Tape underwater, with the tape labeled 'AI-powered chat interface.'" This captures the joke that vendors are positioning AI chat as a quick fix for deep-seated library infrastructure problems, a bit like slapping tape on a leak rather than fixing the plumbing.
Just posted my slides from the AI4LAM #FF2025 workshop on open source AI for GLAMs.
Slides on their own probably aren't that useful, but they do feature one of my growing collection of libraries-and-AI memes, so there's that: danielvanstrien.xyz/slides.html
At the AI4LAM Fantastic Futures conference this week
Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!
DM or find me at breaks! #AI4LAM #FF2025
Explore some results here: huggingface.co/spaces/uv-sc....
[Image: screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images]
hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4
Building datasets to train smaller, task-focused models used to be incredibly time-consuming.
Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!
Try it yourself: huggingface.co/datasets/uv-...!
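Once a job like the one above finishes, the predictions land in a Hub dataset repo; a quick way to inspect the output (the split name is an assumption, and column names are printed rather than guessed):

from datasets import load_dataset

# Load the predictions the SAM3 job pushed to the Hub (split name assumed).
ds = load_dataset("davanstrien/newspapers-photo-predictions", split="train")

print(ds)      # column names the detection script wrote
print(ds[0])   # first record: the image plus its predicted boxes and scores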
Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure cultural heritage needs in order to benefit from AI and to create useful models and tools.
Be warned, I did fire up the meme generator for my slides...
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first version of the models is now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
Demo app here: huggingface.co/spaces/akhal...
[Image: screenshot of LaTeX output]
huggingface.co/nanonets/Nan... might be worth a try for this. It can extract formulas into LaTeX.
DeepSeek-OCR just got vLLM support!
Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.
Processing at ~350 images/sec on an A100.
Using @hf.co Jobs + uv - zero-setup batch OCR!
Will share final time + cost when done!
The command (using @hf.co Jobs - serverless GPU compute):
hf jobs uv run --flavor a100-large --timeout 2h \
  -s HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/ocr/raw/main/deepseek-ocr-vllm.py \
  NationalLibraryOfScotland/Britain-and-UK-Handbooks-Dataset \
  davanstrien/handbooks-deep-ocr \
  --resolution-mode base \
  --batch-size 2048 \
  --prompt-mode free
Full script at huggingface.co/datasets/uv-...
[Image: logs showing OCR progress]
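Back-of-envelope from the figures above (pure inference only; job startup, model loading, and upload will add to the wall-clock time):

# 27,915 pages at ~350 images/sec is roughly 80 seconds of pure inference time.
pages = 27_915
images_per_second = 350
print(pages / images_per_second)  # ~79.8 seconds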
Sentence Transformers is joining @hf.co!
This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!
Details in the thread.
OCR is one of AI's oldest challenges (first systems: early 1900s!)
Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction...
New @hf.co guide on OCR
huggingface.co/blog/ocr-ope...
Anyone who says OCR is a solved problem has not worked with historic digitised newspapers!
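For a sense of how little code a basic pass takes, a minimal sketch with TrOCR on a single handwritten text line (a stand-in example; the guide itself covers newer vision-language OCR models):

from transformers import pipeline

# TrOCR expects an image of a single text line, not a full page.
ocr = pipeline("image-to-text", model="microsoft/trocr-base-handwritten")
print(ocr("handwritten_line.png")[0]["generated_text"])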
Small models work great for GLAM, but there aren't enough examples!
With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.
Follow the org to keep up-to-date!
huggingface.co/small-models...
Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.
It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!
- 2.5bn tokens of mostly Latin and French texts
- 800–1600 CE
- 23k manuscripts
- 18k on the reading interface: comma.inria.fr
- Paper: inria.hal.science/hal-05299220v1
(1/thread)