Daniel van Strien

@danielvanstrien.bsky.social

Machine Learning Librarian at @hf.co

4,450 Followers  |  2,412 Following  |  297 Posts  |  Joined: 19.05.2023

Latest posts by danielvanstrien.bsky.social on Bluesky

ArXiv New ML Datasets - a Hugging Face Space by librarian-bots: This tool lets you search arXiv computer-science papers that are predicted to present new machine-learning datasets. Enter a keyword or use semantic search, then narrow results by research category...

Semantic search, confidence filtering, updated weekly using Hugging Face Jobs.

Powered by a fine-tuned ModernBERT classifier. Full dataset stored in Lance format on the Hub with vector embeddings.

huggingface.co/spaces/libra...

09.02.2026 10:13 | 👍 0  🔁 0  💬 0  📌 0

Datasets and benchmarks drive AI progress, but finding papers that introduce new ones means digging through thousands of arXiv abstracts.

Updated the Dataset Papers on ArXiv app to surface them: 52K+ papers classified as introducing new datasets from 212K CS papers.

09.02.2026 10:13 | 👍 6  🔁 1  💬 1  📌 0
IIIF Community Call: IIIF Illustration Detector, Wednesday, February 11 (9am PT / 12pm ET / 5pm UTC)

Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector.

Zoom on the IIIF Community Calendar: iiif.io/community

03.02.2026 19:45 | 👍 3  🔁 5  💬 0  📌 0
Image of an index card with a green bounding-box prediction around the card contents

Built an object detector from zero-labelled data in one afternoon with help from Claude Code (it can do more than vibe code, TODO apps...)

SAM3 on HF Jobs → correct the errors → train YOLO → repeat.

Three rounds: 31% → 99% accuracy on historical index cards from @natlibscot.bsky.social

02.02.2026 16:43 | 👍 10  🔁 3  💬 3  📌 0
davanstrien/archival-index-card-detector · Hugging Face: We're on a journey to advance and democratize artificial intelligence through open source and open science.

Model: huggingface.co/davanstrien/archival-index-card-detector
SAM3 script: huggingface.co/datasets/uv-...

02.02.2026 16:43 | 👍 3  🔁 0  💬 0  📌 0
SCIENTISTS QUARREL OVER MARTIAN WOMEN

One Says Ladies Have Two Thumbs and X-Ray Eyes

ANOTHER SAYS BIG EARS

--The Commercial Appeal, 22 Oct 1928

We used to do real science

12.01.2026 01:59 | 👍 1955  🔁 588  💬 40  📌 44

Thanks! Could be cool to do a few more transformers.js + IIIF demos!

06.01.2026 10:09 | 👍 1  🔁 0  💬 0  📌 0

Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code.

I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬.

It finds illustrated pages in historical books. No server. No GPU.

19.12.2025 12:08 | 👍 85  🔁 18  💬 2  📌 1

cc @glenrobson.bsky.social! Finally got time to play with transformers.js and @iiif.bsky.social!

19.12.2025 12:08 | 👍 2  🔁 0  💬 1  📌 0
IIIF Illustration Detector - a Hugging Face Space by small-models-for-glam: Find illustrated pages in digitized historical books

Paste any IIIF manifest → model classifies every page locally → see where illustrations appear.

Part of small-models-for-glam: small, efficient models for cultural heritage work.

Not everything needs GPT-4!

Try it: huggingface.co/spaces/small-models-for-glam/iiif-illustration-detector

19.12.2025 12:08 | 👍 14  🔁 6  💬 1  📌 0
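
The manifest-to-pages flow described in the post above can be sketched in Python. This is only an illustration under stated assumptions, not the Space's actual code (which runs in the browser with transformers.js): `page_image_urls`, `illustrated_pages`, the `classify` callback, the 0.5 threshold, and the example.org manifests are all hypothetical. The sketch assumes the manifest follows the standard IIIF Presentation API 2.x or 3.0 layout with one painted image per canvas.

```python
from typing import Any, Callable


def page_image_urls(manifest: dict[str, Any]) -> list[str]:
    """Return the first image URL of each canvas (page) in an IIIF manifest."""
    urls: list[str] = []
    if "sequences" in manifest:
        # Presentation API 2.x: sequences -> canvases -> images -> resource
        for canvas in manifest["sequences"][0].get("canvases", []):
            for image in canvas.get("images", [])[:1]:
                urls.append(image["resource"]["@id"])
    else:
        # Presentation API 3.0: items (canvases) -> annotation pages -> annotations -> body
        for canvas in manifest.get("items", []):
            for page in canvas.get("items", [])[:1]:
                for annotation in page.get("items", [])[:1]:
                    urls.append(annotation["body"]["id"])
    return urls


def illustrated_pages(
    urls: list[str],
    classify: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the post
) -> list[int]:
    """Return indices of pages whose 'illustrated' score reaches the threshold."""
    return [i for i, url in enumerate(urls) if classify(url) >= threshold]


# Tiny hand-written manifests (hypothetical URLs) covering both API versions.
V2_MANIFEST = {
    "sequences": [
        {"canvases": [
            {"images": [{"resource": {"@id": "https://example.org/iiif/p1/full/full/0/default.jpg"}}]},
        ]}
    ]
}
V3_MANIFEST = {
    "items": [
        {"items": [{"items": [{"body": {"id": "https://example.org/iiif/p1/full/max/0/default.jpg"}}]}]},
        {"items": [{"items": [{"body": {"id": "https://example.org/iiif/p2/full/max/0/default.jpg"}}]}]},
    ]
}

if __name__ == "__main__":
    urls = page_image_urls(V3_MANIFEST)
    # Stand-in for a real image classifier: fixed scores keyed by URL.
    fake_scores = {urls[0]: 0.1, urls[1]: 0.9}
    print(illustrated_pages(urls, lambda u: fake_scores[u]))
```

In a real pipeline, `classify` would fetch each image and run a model on it; the fixed scores here only show how the page indices are selected.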
The Gradient of Generative AI Release: Methods and Considerations. As increasingly powerful generative AI systems are developed, the release method greatly varies. We propose a framework to assess six levels of access to generative AI systems: fully closed; gradual o...

Thanks! If you haven't read it yet, you might also find arxiv.org/abs/2302.04844 interesting!

09.12.2025 13:22 | 👍 0  🔁 0  💬 0  📌 0
Here's alt text for the meme:

Alt text: "Flex Tape meme format. Top panel: Phil Swift (labeled 'Library Systems Vendor') aggressively spraying water representing 'Outdated systems, metadata issues, disjointed search and complex user needs.' Bottom panel: A hand slapping Flex Tape underwater, with the tape labeled 'AI-powered chat interface.'"

This captures the joke that vendors are positioning AI chat as a quick fix for deep-seated library infrastructure problems, a bit like slapping tape on a leak rather than fixing the plumbing.

Just posted my slides from the AI4LAM #FF2025 workshop on open source AI for GLAMs.

Slides on their own probably aren't that useful, but they do feature one of my growing collection of libraries-and-AI memes, so there's that: danielvanstrien.xyz/slides.html

09.12.2025 10:13 | 👍 7  🔁 3  💬 2  📌 0

At the AI4LAM Fantastic Futures conference this week

Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!

DM or find me at breaks! #AI4LAM #FF2025

01.12.2025 11:13 | 👍 15  🔁 3  💬 0  📌 0
SAM3 Detection Browser - a Hugging Face Space by uv-scripts: Explore images and their detected objects using the SAM3 model. Enter a dataset ID, select a split, adjust confidence thresholds, and view detailed object detections with bounding boxes.

Explore some results here: huggingface.co/spaces/uv-sc....

21.11.2025 13:30 | 👍 1  🔁 1  💬 0  📌 0
Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.

hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4

Building datasets to train smaller, task-focused models used to be incredibly time-consuming.

Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!

Try it yourself: huggingface.co/datasets/uv-...!

21.11.2025 13:30 | 👍 49  🔁 12  💬 1  📌 0

Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure for cultural heritage to benefit from and create useful AI models and tools.

Be warned, I did fire up the meme generator for my slides...

05.11.2025 17:40 | 👍 36  🔁 8  💬 0  📌 0

Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first versions of the models are now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.

24.10.2025 14:59 | 👍 10  🔁 2  💬 1  📌 0
Nanonets-OCR2-3B - a Hugging Face Space by akhaliq: Discover amazing ML apps made by the community

demo app here: huggingface.co/spaces/akhal...

23.10.2025 14:01 | 👍 0  🔁 0  💬 0  📌 0
Screenshot of LaTeX output

huggingface.co/nanonets/Nan... might be worth a try for this. Can extract formulas into LaTeX

23.10.2025 14:01 | 👍 1  🔁 0  💬 1  📌 0
hf jobs uv run --flavor a100-large --timeout 2h \
    -s HF_TOKEN \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/deepseek-ocr-vllm.py \
    NationalLibraryOfScotland/Britain-and-UK-Handbooks-Dataset \
    davanstrien/handbooks-deep-ocr \
    --resolution-mode base \
    --batch-size 2048 \
    --prompt-mode free

The command (using @hf.co Jobs - serverless GPU compute)

Full script at huggingface.co/datasets/uv-...

22.10.2025 19:20 | 👍 2  🔁 0  💬 0  📌 0
Logs showing OCR progress

DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv - zero setup batch OCR!

Will share final time + cost when done!

22.10.2025 19:20 | 👍 17  🔁 2  💬 1  📌 0

🤗 Sentence Transformers is joining @hf.co! 🤗

This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!

Details in 🧵

22.10.2025 13:04 | 👍 30  🔁 6  💬 2  📌 1
Supercharge your OCR Pipelines with Open Models: We're on a journey to advance and democratize artificial intelligence through open source and open science.

OCR is one of AI's oldest challenges (first systems: early 1900s!)

Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction...

New @hf.co guide on OCR

huggingface.co/blog/ocr-ope...

22.10.2025 08:58 | 👍 29  🔁 6  💬 1  📌 1

Anyone who says OCR is a solved problem has not worked with historic digitised newspapers!

20.10.2025 12:40 | 👍 20  🔁 0  💬 3  📌 0

Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...

16.10.2025 13:22 | 👍 12  🔁 7  💬 0  📌 0

Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.

15.10.2025 15:55 | 👍 3  🔁 1  💬 0  📌 0
CoMMA

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

15.10.2025 14:51 | 👍 58  🔁 24  💬 3  📌 4
