llama.cpp has vision language model support now! ❤️‍🔥
get started with sota VLMs (gemma 3, Qwen2.5VL, InternVL3 & more) and serve them wherever you want 🤩
learn more github.com/ggml-org/lla...
@merve.bsky.social
proud mediterranean 🧿 open-sourceress at hugging face 🤗 multimodality, zero-shot vision, vision language models, transformers
If you want to ✨ speed up & harden ✨ your RAG pipelines, use visual document retrieval models ⬇️
We have shipped a how-to guide for VDR models in Hugging Face transformers 🤗 (minimal example below) huggingface.co/docs/transfo...
here's a good blog on MCDSE, a successful DSE model, covering compression and more huggingface.co/blog/marco/a...
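For instance, here is a minimal retrieval sketch using the ColPali integration in transformers; the checkpoint, page images, and query are placeholders, and the full flow lives in the how-to guide above:

```python
from PIL import Image
import torch
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"
# for real workloads you'd load in bfloat16 on a GPU; kept simple here
model = ColPaliForRetrieval.from_pretrained(model_name).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

page_images = [Image.open("page_1.png"), Image.open("page_2.png")]  # document page screenshots
queries = ["What is the total revenue in 2023?"]

batch_images = processor(images=page_images, return_tensors="pt").to(model.device)
batch_queries = processor(text=queries, return_tensors="pt").to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# late-interaction (MaxSim) scores: one row per query, one column per page
scores = processor.score_retrieval(query_embeddings, image_embeddings)
```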
Why do people sleep on DSE multimodal retrieval models?
They're just like ColPali, but highly scalable and fast, and you can even make them more efficient with binarization or matryoshka with little degradation 💪⚡️ (sketch below)
I collected some here huggingface.co/collections/...
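To make the binarization point concrete, here is a small sketch in plain NumPy (not tied to any particular DSE model): keep only the sign of each embedding dimension, pack to bits, and score by Hamming distance.

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack to bits: ~32x smaller than float32."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_scores(query_code: np.ndarray, doc_codes: np.ndarray) -> np.ndarray:
    """Higher score = fewer differing bits = more similar."""
    xor = np.bitwise_xor(query_code, doc_codes)        # (num_docs, dim // 8)
    return -np.unpackbits(xor, axis=-1).sum(axis=-1)   # negative Hamming distance

# toy example: 1000 docs and one query with 1536-dim embeddings
docs = np.random.randn(1000, 1536).astype(np.float32)
query = np.random.randn(1536).astype(np.float32)
scores = hamming_scores(binarize(query), binarize(docs))
top5 = np.argsort(-scores)[:5]                         # indices of the top-5 documents
```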
I'm so hooked on @hf.co Inference Providers (specifically Qwen2.5-VL-72B) for multimodal agentic workflows with smolagents 🥹
get started ⬇️ (Python sketch below)
> filter models by which providers serve them
> test them through the widget or Python/JS/cURL
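A minimal sketch of the raw Python call through huggingface_hub's InferenceClient (not the smolagents wiring itself); the image URL, prompt, and provider choice are placeholders, any provider that lists the model works:

```python
from huggingface_hub import InferenceClient

# provider name is just an example; omit it to use the default routing
client = InferenceClient(provider="hyperbolic", model="Qwen/Qwen2.5-VL-72B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
        {"type": "text", "text": "What does this UI screenshot show? List the clickable elements."},
    ],
}]

response = client.chat_completion(messages=messages, max_tokens=256)
print(response.choices[0].message.content)
```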
my weekly summary of what's been released in open AI is up on @hf.co huggingface.co/posts/merve/...
collection is here huggingface.co/collections/...
fan-favorite open-source PDF OCR model OlmOCR gets faster and more efficient ⚡️
RolmOCR-7B follows the same recipe as OlmOCR: it builds on Qwen2.5VL with training set modifications and improves accuracy & performance 🤗
huggingface.co/reducto/Rolm...
Hello friends 👋🏼
If you visit Turkey this summer, know that millions of Turkish people are boycotting: once a week they buy nothing, and the rest of the week they buy only necessities
if you have plans, here's a post that summarizes where you should buy stuff from www.instagram.com/share/BADrkS...
SmolVLM paper is out and it's packed with great findings on training a good smol vision LM!
Andi summarized them below, give it a read if you want to see more insights 🤗
the model also has impressive OCR capabilities ⬇️
we'll put this model to the test on agentic capabilities, but here's an example from the paper:
This model consists of a MoonViT encoder with dynamic resolution handling, a projection layer, and a 16B MoE decoder (with 2.8B active params)
the paper introduces an interesting pre-training pipeline to handle long context, and the model saw 4.4T tokens arxiv.org/pdf/2504.07491
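A rough loading sketch with transformers: the checkpoint ships custom code so trust_remote_code is needed, and the prompt/image preprocessing below is just the usual VLM pattern, so double-check it against the model card:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png")  # placeholder input image
messages = [{"role": "user", "content": [
    {"type": "image", "image": "screenshot.png"},
    {"type": "text", "text": "What is shown on this screen? Think step by step."},
]}]
# assumed preprocessing pattern -- verify against the model card
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```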
DO NOT SLEEP ON THIS MODEL
Kimi-VL-A3B-Thinking is the first capable open-source reasoning VLM with an MIT license ❤️
> it has only 2.8B activated params
> it's agentic 🔥 works on GUIs
> surpasses gpt-4o
I've put it to the test (see below ⬇️) huggingface.co/spaces/moons...
InternVL3 is out 🔥
> 7 ckpts with various sizes (1B to 78B)
> Built on InternViT encoder and Qwen2.5VL decoder, improves on Qwen2.5VL
> Can do reasoning and document tasks, extending to tool use and agentic capabilities 🤖
> easy to use with Hugging Face transformers 🤗 (sketch below) huggingface.co/collections/...
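A quick sketch with the image-text-to-text pipeline; the transformers-format 1B checkpoint name, image URL, and prompt below are assumptions, so swap in whichever size from the collection you actually want:

```python
from transformers import pipeline

# smallest checkpoint as an example; larger ones follow the same pattern
pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf", device_map="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/invoice.png"},
    {"type": "text", "text": "Extract the total amount and the due date."},
]}]
outputs = pipe(text=messages, max_new_tokens=128)
# for chat-style input the pipeline returns the conversation with the reply appended
print(outputs[0]["generated_text"][-1]["content"])
```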
Model Context Protocol has prompt injection security problems
simonwillison.net/2025/Apr/9/m...
Xet infra now backs 1000s of repos on @hf.co, which means we get to put on our researcher hats and peer into the bytes 🤓
Xet clients chunk files (into ~64KB chunks) and skip uploads of duplicate content, but what if those chunks already live in _another_ repo? We skip those too.
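Roughly, the idea looks like this; a toy sketch only, nothing like Xet's actual chunker: content-defined chunking derives cut points from the data itself, so identical content always produces identical chunks, which can then be deduplicated by hash across files and repos.

```python
import hashlib

TARGET_BITS = 16            # boundary fires ~once per 64 KB on average
MAX_CHUNK = 4 * 64 * 1024   # hard cap so a chunk can't grow unbounded

def chunk(data: bytes):
    """Toy content-defined chunker: cut wherever a running hash hits a boundary pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF
        if (h & ((1 << TARGET_BITS) - 1)) == 0 or (i - start + 1) >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def upload(data: bytes, store: dict) -> int:
    """Only upload chunks whose hash isn't already in the (cross-repo) chunk store."""
    new_chunks = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in store:        # duplicate content is never re-uploaded
            store[key] = c
            new_chunks += 1
    return new_chunks

store = {}
upload(b"A" * 200_000, store)                 # first file: chunks get uploaded
assert upload(b"A" * 200_000, store) == 0     # same bytes "in another repo": nothing new
```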
Because of X's policies, I'll also be sharing my work-related posts here; you can follow along
icymi I shipped a tutorial on fine-tuning vision language models on videos ⏯️
learn how to fine-tune SmolVLM2 on the Video Feedback dataset github.com/merveenoyan/...
All the multimodal document retrieval models (ColPali, DSE, et al.) are now under visual document retrieval at @hf.co 🤗
take your favorite VDR model out for multimodal RAG 🤗
Smol but mighty:
• 256M delivers 80% of the performance of our 2.2B model.
• 500M hits 90%.
Both beat our SOTA 80B model from 17 months ago!
Efficiency 🤝 Performance
Explore the collection here: huggingface.co/collections/...
Blog: huggingface.co/blog/smolervlm
Introducing the smollest VLMs yet! 🤗
SmolVLM (256M & 500M) runs on <1GB GPU memory (inference sketch below).
Fine-tune it on your laptop and run it on your toaster.
Even the 256M model outperforms our Idefics 80B (Aug '23).
How small can we go?
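As a taste, a minimal inference sketch for the 256M model with transformers; the image path and prompt are placeholders:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)  # small enough to run on CPU

image = Image.open("receipt.png")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```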
Everything that was released this past week in open AI 🤗
> Link to all models, datasets, demos huggingface.co/collections/...
> Text-readable version is here huggingface.co/posts/merve/...
Learn more from their blog post here huggingface.co/blog/vdr-2b-...
there's a new multimodal retrieval model in town 🤗
@llamaindex.bsky.social released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2-based models
> 3x faster inference with less VRAM 💨
> shrinkable with matryoshka 🪆 (sketch below)
huggingface.co/collections/...
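The matryoshka bit is, in practice, just truncate-and-renormalize; a small sketch (the dimension count is an example, check the model card for the sizes the model was actually trained to support):

```python
import numpy as np

def shrink(embeddings: np.ndarray, dims: int = 512) -> np.ndarray:
    """Matryoshka-style shrinking: keep the first `dims` dimensions, then L2-renormalize
    so cosine / dot-product scoring still behaves."""
    truncated = embeddings[..., :dims]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# stand-ins for embeddings produced by the retrieval model
doc_embeddings = np.random.randn(100, 1536).astype(np.float32)
query_embedding = np.random.randn(1536).astype(np.float32)
scores = shrink(doc_embeddings) @ shrink(query_embedding)  # (100,) similarity scores
```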
What a week to open the year in open ML, all the things released at @hf.co 🤗
Here's everything released; find the text-readable version here huggingface.co/posts/merve/...
All models are here huggingface.co/collections/...
ViTPose -- the best open-source pose estimation model just landed in @hf.co transformers 🕺🏻💃🏻
> Model collection: huggingface.co/collections/...
> Notebook on how to use: colab.research.google.com/drive/1e8fcb...
> Try it here: huggingface.co/spaces/hysts...
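A rough usage sketch; ViTPose is a top-down model, so normally you'd feed it person boxes from a detector first (here a whole-image box stands in for that step, and the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VitPoseForPoseEstimation

ckpt = "usyd-community/vitpose-base-simple"
processor = AutoProcessor.from_pretrained(ckpt)
model = VitPoseForPoseEstimation.from_pretrained(ckpt)

image = Image.open("person.jpg")                      # placeholder image
boxes = [[[0.0, 0.0, image.width, image.height]]]     # one [x, y, w, h] box per person, per image

inputs = processor(image, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_pose_estimation(outputs, boxes=boxes)
print(results[0])   # keypoints + scores for each person box in the first image
```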
The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates these to feed into the LLM
the output segmentation tokens are passed to SAM2 to match text (captions or semantic classes) to masks ⬇️
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with an MIT license
The models can handle vision-language understanding and visual referral (referring segmentation) tasks for both images and videos ⏯️
see the blog and our docs for more insights around native agentic skills of LLMs and getting started with smolagents, courtesy of the amazing
@m--ric.bsky.social
> Blog: hf.co/blog/smolage...
> Quickstart: huggingface.co/docs/smolage...
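To make the quickstart concrete, the smolagents hello-world looks roughly like this; the question is a placeholder, and note that newer smolagents versions rename HfApiModel to InferenceClientModel:

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# HfApiModel calls a model through the HF Inference API; pass a model_id to override the default
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

# the agent writes and executes Python code to answer, calling the search tool when needed
agent.run("How many GPUs would I need to full fine-tune a 7B model in bf16?")
```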