
Andi

@andimara.bsky.social

Multimodal research @huggingface

785 Followers  |  129 Following  |  49 Posts  |  Joined: 23.11.2024  |  2.5142

Latest posts by andimara.bsky.social on Bluesky

Preview
nanoVLM: The simplest repository to train your VLM in pure PyTorch

Read the blog: huggingface.co/blog/nanovlm

21.05.2025 13:10 · 👍 2    🔁 0    💬 0    📌 0
Preview
GitHub - huggingface/nanoVLM: The simplest, fastest repository for training/finetuning small-sized VLMs.

Train your Vision-Language Model in just two commands:
> git clone https://github.com/huggingface/nanoVLM.git
> python train.py

21.05.2025 13:10 · 👍 2    🔁 0    💬 1    📌 0
Post image

New Blog 📖✨:
nanoVLM: The simplest way to train your own Vision-Language Model in pure PyTorch, explained step-by-step!
Easy to read, even easier to use. Train your first VLM today!

21.05.2025 13:10 · 👍 3    🔁 0    💬 1    📌 0
Camera Interaction App

Link: webml-community-smolvlm-realtime-webgpu.static.hf.space/index.html

14.05.2025 15:39 · 👍 1    🔁 0    💬 1    📌 0
Video thumbnail

Real-time SmolVLM in a web browser with transformers.js.

All running locally with no installs. Just open the website.

14.05.2025 15:39 · 👍 3    🔁 1    💬 2    📌 0
Preview
Paper page - SmolVLM: Redefining small and efficient multimodal models

If you're into efficient multimodal models, you'll love this one.
Check out the paper: huggingface.co/papers/2504....

08.04.2025 15:12 · 👍 1    🔁 0    💬 0    📌 0

📱 Real-world Efficiency: We've created an app using SmolVLM on an iPhone 15 and got real-time inference directly from its camera!
🌐 Browser-based Inference? Yep! We get lightning-fast inference speeds of 40-80 tokens per second directly in a web browser. No tricks, just compact, efficient models!

08.04.2025 15:12 · 👍 2    🔁 0    💬 1    📌 0

🌟 State-of-the-Art Performance: SmolVLM comes in three powerful yet compact sizes (256M, 500M, and 2.2B parameters), each setting new SOTA benchmarks within its hardware constraints for image and video understanding.

08.04.2025 15:12 · 👍 0    🔁 0    💬 1    📌 0

✨ Less CoT, more efficiency: It turns out that too much Chain-of-Thought (CoT) data actually hurts performance in small models; it just dumbs them down.
✨ Longer videos, better results: Increasing video length during training enhanced performance on both video and image tasks.

08.04.2025 15:12 · 👍 2    🔁 0    💬 1    📌 0

✨ System prompts and special tokens are key: Introducing system prompts and dedicated media intro/outro tokens significantly boosted our compact VLM's performance, especially for video tasks.

08.04.2025 15:12 · 👍 0    🔁 0    💬 1    📌 0

✨ Pixel shuffling magic: Aggressive pixel shuffling helped our compact VLMs "see" better, keeping the same performance with sequences 16x shorter (see the sketch below)!
✨ Learned positional tokens FTW: For compact models, learned positional tokens significantly outperform raw text tokens, enhancing efficiency and accuracy.
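
For intuition, here is a minimal PyTorch sketch of the pixel-shuffle (space-to-depth) idea behind that first point: trade spatial tokens for channel width so the language model sees a much shorter visual sequence. This is my own simplified illustration with a shuffle ratio of r=4 (16x fewer tokens), not the exact SmolVLM implementation.

import torch

def pixel_shuffle(x: torch.Tensor, r: int = 4) -> torch.Tensor:
    # x: (batch, h*w, d) patch embeddings on a square h x w grid.
    # Returns (batch, h*w / r^2, d * r^2): r^2 times fewer tokens, each r^2 times wider.
    b, n, d = x.shape
    h = w = int(n ** 0.5)                   # assumes a square token grid
    x = x.view(b, h, w, d)
    x = x.view(b, h // r, r, w // r, r, d)  # cut the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)         # gather each block's tokens together
    return x.reshape(b, (h // r) * (w // r), d * r * r)

tokens = torch.randn(1, 1024, 768)          # e.g. a 32x32 grid of patch embeddings
print(pixel_shuffle(tokens).shape)          # torch.Size([1, 64, 12288]): 16x shorter sequence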

08.04.2025 15:12 · 👍 2    🔁 0    💬 1    📌 0

✨ Smaller is smarter with SigLIP: Surprise! Smaller LLMs didn't benefit from the usual large SigLIP (400M). Instead, we use the 80M base SigLIP that performs equally well at just 20% of the original size!

08.04.2025 15:12 · 👍 1    🔁 0    💬 1    📌 0

Here are the coolest insights from our experiments:
✨ Longer context = Big wins: Increasing the context length from 2K to 16K gave our tiny VLMs a 60% performance boost!

08.04.2025 15:12 · 👍 1    🔁 0    💬 1    📌 0
Preview
Paper page - SmolVLM: Redefining small and efficient multimodal models

Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models.
🔥 It explains how to create a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago!
huggingface.co/papers/2504....
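
As a quick sanity check on the "less than 1GB of RAM" claim, here is my own back-of-envelope estimate of weights-only memory for a 256M-parameter model (activations and the KV cache add more on top):

params = 256e6
for name, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    # memory for the weights alone, in GB
    print(f"{name}: {params * bytes_per_param / 1e9:.2f} GB")
# fp32: 1.02 GB, bf16/fp16: 0.51 GB, int8: 0.26 GB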

08.04.2025 15:12 · 👍 22    🔁 8    💬 2    📌 1
Preview
ds4sd/SmolDocling-256M-preview · Hugging Face

SmolDocling is available today 🏗️

🔗 Model: huggingface.co/ds4sd/SmolDo...
📖 Paper: huggingface.co/papers/2503....
🤗 Space: huggingface.co/spaces/ds4sd...

Try it and let us know what you think! 💬
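
If you would rather poke at it from Python than the Space, here is a minimal sketch using the standard transformers vision-to-sequence API. The checkpoint name comes from the link above; the prompt string and generation settings are my assumptions, so check the model card for the exact recommended usage.

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ds4sd/SmolDocling-256M-preview"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)  # 256M params: fp32 on CPU is fine

image = Image.open("page.png")  # any document page rendered as an image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]  # assumed prompt
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(generated, skip_special_tokens=False)[0])  # DocTags-style markup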

17.03.2025 15:53 · 👍 6    🔁 0    💬 0    📌 0

At only 256M parameters, SmolDocling outperforms much larger models on key document conversion tasks:
🖋️ Full-page transcription: Beats models 27× bigger!
📑 Equations: Matches or beats leading models like GOT
💻 Code recognition: We introduce the first benchmark for code OCR

17.03.2025 15:53 · 👍 1    🔁 0    💬 1    📌 0
Post image

What makes it unique?
📌 Handles everything a document has: tables, charts, code, equations, lists, and more
📌 Works beyond scientific papers: supports business docs, patents, and forms
📌 It runs with less than 1GB of RAM, so running at large batch sizes is super cheap!

17.03.2025 15:53 · 👍 1    🔁 0    💬 1    📌 0
Post image

How does SmolDocling beat models 27× bigger? SmolDocling transforms any document into structured metadata with DocTags, and it is SOTA in:

✅ Full-page conversion
✅ Layout identification
✅ Equations, tables, charts, plots, code OCR

17.03.2025 15:53 · 👍 1    🔁 0    💬 1    📌 0
Post image

🚀 We just dropped SmolDocling: a 256M open-source vision LM for complete document OCR! 📄✨
Lightning fast: it processes a page in 0.35 sec on a consumer GPU using < 500MB of VRAM ⚡
SOTA in document conversion, beating every competing model we tested (including models with 27x more params) 🤯
But how? 🧶⬇️

17.03.2025 15:53 · 👍 35    🔁 10    💬 3    📌 1
Preview
A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

Extremely bullish on @CohereForAI's Aya Vision (8B & 32B) - new SOTA open-weight VLMs

- 8B wins up to 81% of the time in its class, better than Gemini Flash
- 32B beats Llama 3.2 90B!
- Integrated on @hf.co from Day 0!

Check out their blog! huggingface.co/blog/aya-vis...

05.03.2025 14:38 · 👍 1    🔁 0    💬 0    📌 0

Me too! Highlight of my career so far :)

31.01.2025 15:21 · 👍 1    🔁 0    💬 1    📌 0
Preview
smollm/vision at main · huggingface/smollm: Everything about the SmolLM2 and SmolVLM family of models

And that was why we didn't release this before. It's live research code: most of it gets rewritten fairly often, while some parts have stayed the same for years.
It works, and it produces SOTA results at the 256M and 80B sizes, but it's not production code.
Go check it out:
github.com/huggingface/...

31.01.2025 15:06 · 👍 3    🔁 1    💬 1    📌 0
Post image

And it also has a bunch of bugs, like this one in our modeling_vllama3.py file: we start from a pretrained LLM, but for some reason the weights of the head are not loaded from it. I still don't know why :(
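
For anyone chasing a similar issue, a quick diagnostic is to compare the head weights against the checkpoint they were supposed to come from. This is a generic, hypothetical sketch; the backbone name and the lm_head attribute are illustrative, not the repo's actual API.

import torch
from transformers import AutoModelForCausalLM

def head_matches_backbone(vlm, backbone_name: str) -> bool:
    # True if the VLM's lm_head weights equal those of the pretrained backbone.
    base = AutoModelForCausalLM.from_pretrained(backbone_name)
    return torch.allclose(vlm.lm_head.weight.detach().cpu(),
                          base.lm_head.weight.detach().cpu())

# Example (hypothetical names): a correctly initialised VLM should return True;
# False would reproduce the "head not loaded" bug described above.
# print(head_matches_backbone(my_vlm, "HuggingFaceTB/SmolLM2-1.7B-Instruct"))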

31.01.2025 15:06 · 👍 3    🔁 0    💬 2    📌 0
Post image

The codebase is full of interesting insights, like this one in our dataset.py file: how do you get reproducible randomness across different processes on different machines?
Seed a separate random number generator in each process from the tuple (seed, rank)!
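
A minimal sketch of that trick (my own illustration, not the actual dataset.py code): derive one deterministic generator per process from the (seed, rank) pair, so each rank gets its own stream while reruns stay reproducible on any machine.

import numpy as np

def make_rng(seed: int, rank: int) -> np.random.Generator:
    # Seeding with the sequence [seed, rank] gives each rank a distinct,
    # fully reproducible stream, independent of the machine it runs on.
    return np.random.default_rng([seed, rank])

rng_rank0 = make_rng(42, rank=0)
rng_rank1 = make_rng(42, rank=1)
print(rng_rank0.integers(0, 100, size=3))  # identical on every run for rank 0
print(rng_rank1.integers(0, 100, size=3))  # a different stream for rank 1, also reproducible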

31.01.2025 15:06 · 👍 3    🔁 0    💬 2    📌 0
Post image

After training, you can run the evaluation on all of these tasks with:
sbatch vision/experiments/evaluation/vloom/async_evals_tr_346/run_evals_0_shots_val_2048.slurm

31.01.2025 15:06 · 👍 3    🔁 0    💬 1    📌 0
Post image

Launching the training for SmolVLM 256M is as simple as:
./vision/experiments/pretraining/vloom/tr_341_smolvlm_025b_1st_stage/01_launch.sh
Then we use wandb to track the losses.
Check out the file for the details!

31.01.2025 15:06 · 👍 2    🔁 0    💬 1    📌 0
Post image

Fuck it, today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s 🔥
Inspired by our team's effort to open-source DeepSeek's R1, we are releasing the training and evaluation code on top of the weights 🫡
Now you can train any SmolVLM, or create your own custom VLMs!

31.01.2025 15:06 · 👍 24    🔁 5    💬 2    📌 0
Preview
SmolVLM - a Hugging Face Space by HuggingFaceTB

Links :D
Demo: huggingface.co/spaces/Huggi...
Models: huggingface.co/collections/...
Blog: huggingface.co/blog/smolervlm

23.01.2025 13:33 · 👍 4    🔁 0    💬 0    📌 0
Post image

SmolVLM upgrades:
• New vision encoder: Smaller but higher res.
• Improved data mixtures: better OCR and doc understanding.
• Higher pixels/token: 4096 vs. 1820 = more efficient (rough math in the sketch below).
• Smart tokenization: Faster training and better performance. 🚀

Better, faster, smarter.
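
Rough math on the pixels/token bullet (my own back-of-envelope; the real processor adds special and layout tokens on top): more pixels per visual token means fewer image tokens at the same resolution.

def approx_image_tokens(width: int, height: int, pixels_per_token: int) -> int:
    # Naive count: total pixels divided by the pixels encoded per visual token.
    return (width * height) // pixels_per_token

w, h = 1536, 1536  # example resolution, purely illustrative
print(approx_image_tokens(w, h, 1820))  # old ratio -> 1296 tokens
print(approx_image_tokens(w, h, 4096))  # new ratio ->  576 tokens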

23.01.2025 13:33 · 👍 1    🔁 0    💬 1    📌 0
Post image

We have partnered with IBM's Docling to build amazing smol models for document understanding. Our early results are very promising. Stay tuned for future releases!

23.01.2025 13:33 · 👍 2    🔁 0    💬 1    📌 0
