Screenshot of a hf jobs uv run command with some flags and a URL pointing to a script.
Try it with one line of code via Jobs!
It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs.
See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
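A minimal sketch of what that one-liner looks like, reusing the flags shown in the GPT OSS post below; the script URL and dataset IDs here are placeholders, not the truncated links above:
# sketch: run a uv OCR script as a GPU job; <ocr-script-url> and dataset IDs are placeholders
hf jobs uv run --flavor l4x4 <ocr-script-url> \
  --input-dataset your-username/page-images \
  --output-dataset your-username/page-markdown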
07.08.2025 15:16 · 3 likes · 0 reposts · 0 replies · 0 quotes
Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.
What if OCR models could show you their thought process?
NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.
Could be pretty valuable for weird historical documents?
Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
07.08.2025 15:16 · 11 likes · 2 reposts · 1 reply · 0 quotes
uv-scripts/openai-oss · Datasets at Hugging Face
You can now generate synthetic data using OpenAI's GPT OSS models on @hf.co Jobs!
One command, no setup:
hf jobs uv run --flavor l4x4 [script-url] \
--input-dataset your/dataset \
--output-dataset your/output
Works on L4 GPUs ⚡
huggingface.co/datasets/uv-...
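Once submitted, the job runs remotely, so you don't need to keep a session open. Assuming the Docker-style subcommands the Jobs CLI ships with, you can check on a run like this:
# list your jobs, then stream the logs for one of them (job id from the ps output)
hf jobs ps
hf jobs logs <job-id>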
06.08.2025 07:38 · 9 likes · 1 repost · 0 replies · 0 quotes
Screenshot of a playbill with some OCR results on the right
I'm continuing my experiments with VLM-based OCR…
How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?
RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)
@hf.co Demo: huggingface.co/spaces/davan...
05.08.2025 09:17 · 12 likes · 1 repost · 0 replies · 0 quotes
It's often not documented, but "traditional" OCR in this case means whatever software libraries and archives used in the past to generate their OCR. My goal with this work is mainly to see how much better VLMs might be (and in which situations), to get a better sense of when redoing OCR might be worth it.
04.08.2025 18:12 · 1 like · 0 reposts · 1 reply · 0 quotes
Screenshot of the app showing a page from a book + different views of existing and new ocr.
Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?
I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social
huggingface.co/spaces/davanstrien/ocr-time-capsule
01.08.2025 15:09 · 48 likes · 15 reposts · 4 replies · 1 quote
I'm planning to add more example datasets & OCR models using HF Jobs. Feel free to suggest collections to test with: I need images + existing OCR!
Even better: upload your GLAM datasets to @hf.co! 🤗
01.08.2025 15:09 · 2 likes · 0 reposts · 0 replies · 0 quotes
Bar chart comparing "Thinking" mode win rates across 5 benchmarks for 4 model variants. The models are:
• Qwen3-30B-A3B-Thinking-2507 (Red)
• Qwen3-30B-A3B-Thinking (Blue)
• Qwen3-235B-A22B-Thinking (Gray)
• Gemini-2.5-Flash-Thinking (Beige)
Scores per benchmark:
Benchmark      | Qwen3-30B-A3B-2507 | Qwen3-30B-A3B | Qwen3-235B-A22B | Gemini-2.5-Flash
GPQA           | 73.4 | 65.8 | 71.1 | 82.8
AIME25         | 85.0 | 70.9 | 81.5 | 72.0
LiveCodeBench  | 66.0 | 57.4 | 61.2 | 55.7
Arena-Hard v2  | 56.0 | 36.3 | 61.5 | 56.7
BFCL-v3        | 72.4 | 69.1 | 70.8 | 68.6
Highlights:
• Qwen3-30B-A3B-Thinking-2507 achieves the highest scores in 3 of the 5 benchmarks (AIME25, LiveCodeBench, and BFCL-v3).
• Gemini-2.5-Flash leads only in GPQA, outperforming the others by a notable margin.
• Qwen3-235B-A22B is generally strong and leads on Arena-Hard v2, but trails 2507 elsewhere.
• Qwen3-30B-A3B (non-2507 variant) underperforms relative to the others, particularly on Arena-Hard v2.
These results suggest that the 2507 checkpoint of Qwen3-30B-A3B has a substantial performance advantage over the other "thinking" configurations.
qwen3-30b-a3b-thinking-2507
no surprise, but today Qwen launched the thinking version of its laptop-sized MoE
tasty, as usual.
huggingface.co/Qwen/Qwen3-3...
30.07.2025 15:27 · 7 likes · 1 repost · 2 replies · 0 quotes
Bar chart comparing performance of five models across five benchmarks: GPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and BFCL-v3. Each model is color-coded:
• Red: Qwen3-30B-A3B-Instruct-2507
• Blue: Qwen3-30B-A3B Non-thinking
• Gray: Qwen3-235B-A22B Non-thinking
• Gold: Gemini-2.5-Flash Non-thinking
• Beige: OpenAI GPT-4o-0327
Benchmark results (highest score in each row was bolded in the chart):
Benchmark         | Instruct-2507 | 30B-A3B Non-thinking | 235B-A22B Non-thinking | Gemini-2.5-Flash | GPT-4o-0327
GPQA              | 70.4 | 54.8 | 62.9 | 78.3 | 66.9
AIME25            | 61.3 | 21.6 | 24.7 | 61.6 | 66.7
LiveCodeBench v6  | 43.2 | 29.0 | 32.9 | 40.1 | 35.8
Arena-Hard v2     | 69.0 | 24.8 | 52.0 | 58.3 | 61.9
BFCL-v3           | 65.1 | 58.6 | 68.0 | 64.1 | 66.5
Note: The red "Instruct" model consistently outperforms its blue "Non-thinking" counterpart. GPT-4o and Gemini-2.5-Flash also show strong overall results. Arena-Hard v2 notes GPT-4.1 as evaluator.
yesssss! a small update to Qwen3-30B-A3B
this has been one of my favorite local models, and now we get an even better version!
better instruction following, tool use & coding. Nice small MoE!
huggingface.co/Qwen/Qwen3-3...
29.07.2025 16:53 · 47 likes · 4 reposts · 3 replies · 2 quotes
Document page with text
Markdown formatted OCR output
HF Jobs just launched!
One-command VLM-based OCR with uv scripts:
hf jobs uv run [script] ufo-images ufo-text
Classified UFO docs → clean markdown. Zero setup!
Try it → huggingface.co/datasets/uv-...
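Spelled out a little more (the script URL is elided in the post, so it stays a placeholder here, and the GPU flavor is an assumption borrowed from the earlier synthetic-data post):
# <script-url> is a placeholder; the two positional args are the script's
# input (images) dataset and output (markdown) dataset
hf jobs uv run --flavor l4x4 <script-url> ufo-images ufo-text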
29.07.2025 08:48 · 13 likes · 1 repost · 0 replies · 0 quotes
Working to port VLaMy to an entirely free mode where you can just cache all your data in the browser for a project. Slowly adding all the features from the full version to this user-free version. Available now on @hf.co @danielvanstrien.bsky.social
Link: huggingface.co/spaces/wjbma...
28.07.2025 17:11 · 21 likes · 7 reposts · 2 replies · 0 quotes
Migrating the Hub from Git LFS to Xet
We've moved the first 20PB from Git LFS to Xet on @hf.co without any interruptions. Now we're migrating the rest of the Hub. We got this far by focusing on the community first.
Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrati...
15.07.2025 15:16 · 5 likes · 2 reposts · 1 reply · 0 quotes
465 people. 122 languages. 58,185 annotations!
FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages.
Huge thanks to all who contributed!
huggingface.co/blog/davanst...
08.07.2025 12:07 · 33 likes · 11 reposts · 2 replies · 0 quotes
Vchitect/ShotBench · Datasets at Hugging Face
ShotBench: Cinematic Understanding Benchmark
- 3,572 expert QA pairs
- 3,049 images + 464 videos
- 200+ Oscar-nominated films
- 8 cinematography dimensions tested
huggingface.co/datasets/Vch...
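To poke at the data locally, the stock Hub download CLI should work (repo id from the link above; the local folder name is just an example):
# download the dataset files into a local folder
hf download Vchitect/ShotBench --repo-type dataset --local-dir shotbench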
07.07.2025 07:26 · 4 likes · 0 reposts · 0 replies · 0 quotes
OCR Time Machine - a Hugging Face Space by davanstrien
This tool extracts text from historical document images and corresponding XML files. You upload an image and optionally an XML file, choose an OCR model, and get the extracted text in both Markdown...
Added olmOCR to the OCR Time Machine!
@ai2.bsky.social's olmOCR (one of the OG VLM-based OCR models) still performs v well.
My takeaway from testing: there's no single "best" VLM for historical docs currently (maybe with a bit of fine-tuning, there could be)
huggingface.co/spaces/davan...
30.06.2025 14:11 · 5 likes · 0 reposts · 0 replies · 0 quotes
Not for this one, but I have found passing the original OCR as extra context can work quite well (especially for longer documents). It does make everything much slower. My own hunch is this would work better for generating new training data rather than something you could use for large collections.
25.06.2025 19:42 · 0 likes · 0 reposts · 0 replies · 0 quotes
The Space does extract the text from the XML to make comparison easier; otherwise it's quite hard to see the quality of the XML OCR output directly.
25.06.2025 09:32 · 1 like · 0 reposts · 1 reply · 0 quotes
Thanks! Will make the comparison clearer! At the moment the Space doesn't do "traditional" OCR but focuses on allowing someone to upload some existing OCR output they already have to see how much better/worse the VLM OCR would be.
25.06.2025 09:32 · 0 likes · 0 reposts · 1 reply · 0 quotes
Screenshot showing a document page image on the left with corresponding OCR output on the right of the page.
Everyone's dropping VLM-based OCR models lately…
But are they actually better than traditional OCR engines, which output XML for historical docs?
I built OCR Time Machine to test it!
• Upload image + ALTO/PAGE XML
• Compare outputs side by side
• huggingface.co/spaces/davan...
24.06.2025 17:35 · 30 likes · 9 reposts · 2 replies · 0 quotes
I would personally give it one or two more generations of releases (mostly to get to smaller models) and then seriously look at what running these models at scale could enable for collections!
24.06.2025 17:35 · 2 likes · 0 reposts · 0 replies · 0 quotes
tl;dr yes, I think the modern VLM-based OCR models are probably better in many cases (but targeting ALTO XML output might be essential for many libraries...)
24.06.2025 17:35 · 1 like · 1 repost · 1 reply · 0 quotes
Yes, I agree there are a lot of big opportunities here!
20.06.2025 12:23 · 0 likes · 0 reposts · 1 reply · 0 quotes
the semantic search for HuggingFace datasets mcp that @danielvanstrien.bsky.social made is really fun -- i just tried it in cursor's ai side bar. repo/info: github.com/davanstrien/...
19.06.2025 15:00 · 3 likes · 1 repost · 0 replies · 0 quotes