
Daniel van Strien

@danielvanstrien.bsky.social

Machine Learning Librarian at @hf.co

4,282 Followers  |  2,411 Following  |  261 Posts  |  Joined: 19.05.2023

Latest posts by danielvanstrien.bsky.social on Bluesky

Screenshot of a hf jobs uv run command with some flags and a URL pointing to a script.

Try it with one line of code via Jobs!

It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs.

See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
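Going by the screenshot, the command has roughly this shape - the script URL is a placeholder and the dataset flags are borrowed from the openai-oss example further down this feed, so check the uv-scripts collection for the exact script and arguments:

hf jobs uv run --flavor l4x4 \
    <ocr-script-url> \
    --input-dataset your/image-dataset \
    --output-dataset your/markdown-dataset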

07.08.2025 15:16 — 👍 3    🔁 0    💬 0    📌 0
numind/NuMarkdown-8B-Thinking · Hugging Face

Model here: huggingface.co/numind/NuMar...

07.08.2025 15:16 — 👍 3    🔁 0    💬 1    📌 0
Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.

What if OCR models could show you their thought process?

NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.

Could be pretty valuable for weird historical documents?

Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...

07.08.2025 15:16 — 👍 11    🔁 2    💬 1    📌 0
uv-scripts/openai-oss · Datasets at Hugging Face

You can now generate synthetic data using OpenAI's GPT OSS models on @hf.co Jobs!

One command, no setup:

hf jobs uv run --flavor l4x4 [script-url] \
--input-dataset your/dataset \
--output-dataset your/output
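# --flavor l4x4 requests an L4 GPU machine (four L4s, going by the flavor name);
# --input-dataset / --output-dataset take Hugging Face Hub dataset repo ids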

Works on L4 GPUs ⚡

huggingface.co/datasets/uv-...

06.08.2025 07:38 — 👍 9    🔁 1    💬 0    📌 0
AI Carpentry? Helping learners make better choices with genAI. In two recent community discussion sessions, we explored what mental model of machine learning/deep learning we could teach to learners already familiar with the basics of programming, to help them sa...

My latest post on @carpentries.carpentries.org blog is a call to action for the community to engage in curriculum development for workshops about genAI. How can we help our target audience make more informed choices about when and how to use it?

carpentries.org/blog/2025/08...

05.08.2025 09:24 — 👍 5    🔁 7    💬 0    📌 1
Screenshot of a playbill with some OCR results on the right

I'm continuing my experiments with VLM-based OCR…

How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?

RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)

@hf.co Demo: huggingface.co/spaces/davan...

05.08.2025 09:17 — 👍 12    🔁 1    💬 0    📌 0

It's often not documented, but "traditional" OCR in this case is whatever libraries and archives used in the past to generate some OCR. My goal with this work is mainly to see how much better VLMs might be (and in which situations), to get a better sense of when redoing OCR might be worth it.

04.08.2025 18:12 — 👍 1    🔁 0    💬 1    📌 0
Screenshot of the app showing a page from a book + different views of existing and new ocr.

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

01.08.2025 15:09 — 👍 48    🔁 15    💬 4    📌 1

I'm planning to add more example datasets & OCR models using HF Jobs. Feel free to suggest collections to test with: I need image + existing OCR!

Even better: upload your GLAM datasets to @hf.co! 🤗
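One way to do that from the command line, using the same hf CLI as the Jobs examples (the repo id and folder here are just placeholders):

hf upload your-org/your-glam-dataset ./images --repo-type dataset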

01.08.2025 15:09 — 👍 2    🔁 0    💬 0    📌 0
NationalLibraryOfScotland/Scottish-School-Exam-Papers · Datasets at Hugging Face

Based on this great collection: huggingface.co/datasets/NationalLibraryOfScotland/Scottish-School-Exam-Papers

You can browse visually (press V!), see quality metrics, and compare outputs side-by-side.

01.08.2025 15:09 — 👍 0    🔁 0    💬 1    📌 0
Bar chart comparing "Thinking" mode win rates across 5 benchmarks for 4 model variants. The models are:
• Qwen3-30B-A3B-Thinking-2507 (Red)
• Qwen3-30B-A3B-Thinking (Blue)
• Qwen3-235B-A22B-Thinking (Gray)
• Gemini-2.5-Flash-Thinking (Beige)

Scores per benchmark:

Benchmark        Qwen3-30B-A3B-2507   Qwen3-30B-A3B   Qwen3-235B-A22B   Gemini-2.5-Flash
GPQA             73.4                 65.8            71.1              82.8
AIME25           85.0                 70.9            81.5              72.0
LiveCodeBench    66.0                 57.4            61.2              55.7
Arena-Hard v2    56.0                 36.3            61.5              56.7
BFCL-v3          72.4                 69.1            70.8              68.6

Highlights:
• Qwen3-30B-A3B-Thinking-2507 achieves the highest score on 3 of the 5 benchmarks (AIME25, LiveCodeBench, BFCL-v3).
• Gemini-2.5-Flash leads only on GPQA, where it outperforms the others by a notable margin.
• Qwen3-235B-A22B is generally strong and takes the top score on Arena-Hard v2, but trails 2507 elsewhere.
• Qwen3-30B-A3B (the non-2507 variant) underperforms relative to the others, particularly on Arena-Hard v2.

These results suggest that the 2507 checkpoint of Qwen3-30B-A3B is a substantial improvement over the earlier "Thinking" release and holds up well against the larger "thinking" models.

qwen3-30b-a3b-thinking-2507

no surprise, but today Qwen launched the thinking version of its laptop-sized MoE

tasty, as usual.

huggingface.co/Qwen/Qwen3-3...

30.07.2025 15:27 — 👍 7    🔁 1    💬 2    📌 0
Bar chart comparing performance of five models across five benchmarks: GPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and BFCL-v3. Each model is color-coded:
• Red: Qwen3-30B-A3B-Instruct-2507
• Blue: Qwen3-30B-A3B Non-thinking
• Gray: Qwen3-235B-A22B Non-thinking
• Gold: Gemini-2.5-Flash Non-thinking
• Beige: OpenAI GPT-4o-0327

Benchmark results (highest score per benchmark marked with *):

Benchmark          Instruct-2507   30B-A3B Non-thinking   235B-A22B Non-thinking   Gemini-2.5-Flash   GPT-4o-0327
GPQA               70.4            54.8                   62.9                     78.3*              66.9
AIME25             61.3            21.6                   24.7                     61.6               66.7*
LiveCodeBench v6   43.2*           29.0                   32.9                     40.1               35.8
Arena-Hard v2      69.0*           24.8                   52.0                     58.3               61.9
BFCL-v3            65.1            58.6                   68.0*                    64.1               66.5

Note: The red "Instruct" model consistently outperforms its blue "Non-thinking" counterpart. GPT-4o and Gemini-2.5-Flash also show strong overall results. Arena-Hard v2 notes GPT-4.1 as evaluator.

yesssss! a small update to Qwen3-30B-A3B

this has been one of my favorite local models, and now we get an even better version!

better instruction following, tool use & coding. Nice small MoE!

huggingface.co/Qwen/Qwen3-3...

29.07.2025 16:53 — 👍 47    🔁 4    💬 3    📌 2
Document page with text

Markdown formatted OCR output

HF Jobs just launched! 🚀

One command VLM based OCR with uv Scripts:

hf jobs uv run [script] ufo-images ufo-text
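# [script] is the URL of the OCR uv script (see the collection link below); going by
# the names, ufo-images is the input image dataset and ufo-text the output dataset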

Classified UFO docs → clean markdown. Zero setup!

Try it → huggingface.co/datasets/uv-...

29.07.2025 08:48 — 👍 13    🔁 1    💬 0    📌 0
Post image

Working to port VLaMy to an entirely free mode where you can just cache all your data in the browser for a project. Slowly adding all the features from the full version to this user-free version. Available now on @hf.co @danielvanstrien.bsky.social
Link: huggingface.co/spaces/wjbma...

28.07.2025 17:11 — 👍 21    🔁 7    💬 2    📌 0
Migrating the Hub from Git LFS to Xet

We've moved the first 20PB from Git LFS to Xet on @hf.co without any interruptions. Now we're migrating the rest of the Hub. We got this far by focusing on the community first.

Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrati...

15.07.2025 15:16 — 👍 5    🔁 2    💬 1    📌 0
Video thumbnail

465 people. 122 languages. 58,185 annotations!

FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages.

Huge thanks to all who contributed!

huggingface.co/blog/davanst...

08.07.2025 12:07 — 👍 33    🔁 11    💬 2    📌 0
data-is-better-together/fineweb-c · Datasets at Hugging Face

huggingface.co/datasets/dat...

08.07.2025 12:07 — 👍 0    🔁 0    💬 0    📌 0
Vchitect/ShotBench · Datasets at Hugging Face

ShotBench: Cinematic Understanding Benchmark

- 3,572 expert QA pairs
- 3,049 images + 464 videos
- 200+ Oscar-nominated films
- 8 cinematography dimensions tested

huggingface.co/datasets/Vch...

07.07.2025 07:26 — 👍 4    🔁 0    💬 0    📌 0
OCR Time Machine - a Hugging Face Space by davanstrien This tool extracts text from historical document images and corresponding XML files. You upload an image and optionally an XML file, choose an OCR model, and get the extracted text in both Markdown...

Added olmOCR to the OCR Time Machine!

@ai2.bsky.social's olmOCR (one of the OG VLM-based OCR models) still performs v well.

My takeaway from testing: there's no single "best" VLM for historical docs currently (maybe with a bit of fine-tuning, there could be 😉)

huggingface.co/spaces/davan...

30.06.2025 14:11 — 👍 5    🔁 0    💬 0    📌 0

Not for this one, but I have found passing the original OCR as extra context can work quite well (especially for longer documents). It does make everything much slower. My own hunch is this would work better for generating new training data rather than something you could use for large collections.

25.06.2025 19:42 — 👍 0    🔁 0    💬 0    📌 0

The Space does extract the text from the XML to make it easier to compare the outputs; otherwise it's quite hard to see the quality of the XML OCR output directly.

25.06.2025 09:32 — 👍 1    🔁 0    💬 1    📌 0

Thanks! Will make the comparison clearer! At the moment the Space doesn't do "traditional" OCR but focuses on allowing someone to upload some existing OCR output they already have to see how much better/worse the VLM OCR would be.

25.06.2025 09:32 — 👍 0    🔁 0    💬 1    📌 0
Screenshot showing a document page image on the left with corresponding OCR output on the right of the page.

Everyone's dropping VLM-based OCR models lately…
But are they actually better than traditional OCR engines, which output XML for historical docs?

I built OCR Time Machine to test it!

📄 Upload image + ALTO/PAGE XML
⚖️ Compare outputs side by side
🔗 huggingface.co/spaces/davan...

24.06.2025 17:35 — 👍 30    🔁 9    💬 2    📌 0

I would personally give it one or two more generations of releases (mostly to get to smaller models) and then seriously look at what running these models at scale could enable for collections!

24.06.2025 17:35 — 👍 2    🔁 0    💬 0    📌 0

tl;dr yes, I think the modern VLM-based OCR models are probably better in many cases (but targeting ALTO XML output might be essential for many libraries...)

24.06.2025 17:35 — 👍 1    🔁 1    💬 1    📌 0

Yes, I agree there are a lot of big opportunities here!

20.06.2025 12:23 — 👍 0    🔁 0    💬 1    📌 0
Post image

the semantic search for HuggingFace datasets mcp that @danielvanstrien.bsky.social made is really fun -- i just tried it in cursor's ai side bar. repo/info: github.com/davanstrien/...

19.06.2025 15:00 — 👍 3    🔁 1    💬 0    📌 0
