
merve

@merve.bsky.social

proud mediterranean 🧿 open-sourceress at hugging face πŸ€— multimodality, zero-shot vision, vision language models, transformers

8,391 Followers  |  675 Following  |  242 Posts  |  Joined: 14.04.2023

Latest posts by merve.bsky.social on Bluesky

Post image

llama.cpp has vision language model support now! ❀️‍πŸ”₯

get started with SOTA VLMs (Gemma 3, Qwen2.5VL, InternVL3 & more) and serve them wherever you want 🀩
learn more github.com/ggml-org/lla... πŸ“–
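A minimal sketch of the "serve them wherever you want" part: llama-server exposes an OpenAI-compatible chat endpoint, so once you have started it yourself with a vision-capable GGUF (plus its mmproj file), you can query it from Python. The port, image URL, and prompt below are placeholders.

```python
# Minimal sketch (assumes a llama-server already running locally with a
# vision-capable GGUF + mmproj; port and image URL are placeholders).
import requests

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```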

11.05.2025 07:46 β€” πŸ‘ 44    πŸ” 5    πŸ’¬ 2    πŸ“Œ 0
Post image

If you want to ✨ speed up & harden ✨ your RAG pipelines, use visual document retrieval models ⬇️

We have shipped a how-to guide for VDR models in Hugging Face transformers πŸ€—πŸ“– huggingface.co/docs/transfo...
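For reference, a minimal sketch of the pattern the guide covers, using the ColPali integration in transformers (the checkpoint id here is an assumption; the linked guide has the exact, current API):

```python
# Minimal sketch of visual document retrieval with ColPali in transformers
# (checkpoint id and page file names are assumptions).
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_id = "vidore/colpali-v1.2-hf"  # assumed checkpoint id
model = ColPaliForRetrieval.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained(model_id)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What was the Q3 revenue?"]

with torch.no_grad():
    page_emb = model(**processor(images=pages, return_tensors="pt")).embeddings
    query_emb = model(**processor(text=queries, return_tensors="pt")).embeddings

# Late-interaction scores: one row per query, one column per page
scores = processor.score_retrieval(query_emb, page_emb)
print(scores)
```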

02.05.2025 09:49 β€” πŸ‘ 27    πŸ” 3    πŸ’¬ 3    πŸ“Œ 0
Preview
Visually Multilingual: Introducing mcdse-2b, a blog post by Marco Cimolai on Hugging Face

here's a good blog on MCDSE, a successful DSE model, covering compression and more huggingface.co/blog/marco/a...

15.04.2025 16:27 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Why do people sleep on DSE multimodal retrieval models? πŸ‘€

They're just like ColPali, but highly scalable and fast, and you can make them even more efficient with binarization or matryoshka embeddings with little degradation πŸͺ†βš‘️

I collected some here huggingface.co/collections/...
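To make the binarization / matryoshka point concrete, here is an illustrative sketch (not tied to any particular DSE checkpoint) of how truncating and binarizing dense embeddings shrinks the index while keeping retrieval cheap:

```python
# Illustrative sketch with random stand-in embeddings: matryoshka truncation
# plus binarization, scored with Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((1000, 1536)).astype(np.float32)   # stand-in document embeddings
query_emb = rng.standard_normal((1, 1536)).astype(np.float32)    # stand-in query embedding

# Matryoshka: keep only the first k dimensions, then re-normalize
k = 512
doc_small = doc_emb[:, :k] / np.linalg.norm(doc_emb[:, :k], axis=1, keepdims=True)
query_small = query_emb[:, :k] / np.linalg.norm(query_emb[:, :k], axis=1, keepdims=True)

# Binarization: 1 bit per dimension, packed into bytes (32x smaller than fp32)
doc_bits = np.packbits(doc_small > 0, axis=1)
query_bits = np.packbits(query_small > 0, axis=1)

# Hamming distance via XOR + bit count; smaller distance = better match
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)
print("top-5 docs:", np.argsort(hamming)[:5])
```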

15.04.2025 16:26 β€” πŸ‘ 12    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Video thumbnail

I'm so hooked on @hf.co Inference Providers (specifically Qwen2.5-VL-72B) for multimodal agentic workflows with smolagents πŸ₯Ή

get started ‡️
> filter models by inference provider
> test them through the widget or Python/JS/cURL
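For context, a minimal sketch of the underlying call through Inference Providers using huggingface_hub (the image URL is a placeholder; smolagents can wrap the same model for agentic workflows):

```python
# Minimal sketch: query Qwen2.5-VL-72B through Inference Providers
# (image URL is a placeholder; pass provider="..." to pin a specific provider).
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
        ],
    }
]

out = client.chat_completion(messages=messages, max_tokens=256)
print(out.choices[0].message.content)
```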

15.04.2025 14:59 β€” πŸ‘ 10    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0
Post image

my weekly summary of what's been released in open-source AI is up on @hf.co huggingface.co/posts/merve/...

collection is here huggingface.co/collections/...

14.04.2025 12:24 β€” πŸ‘ 18    πŸ” 1    πŸ’¬ 0    πŸ“Œ 1
Post image

fan-favorite open-source PDF parsing model OlmOCR just got faster and more efficient ⚑️

RolmOCR-7B follows the same recipe as OlmOCR: it builds on Qwen2.5VL with training-set modifications and improves accuracy & performance 🀝

huggingface.co/reducto/Rolm...

14.04.2025 08:51 β€” πŸ‘ 16    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Hello friends πŸ‘‹πŸΌ

If you visit Turkey this summer, know that millions of Turkish people are boycotting: once a week they buy nothing, and the rest of the week they buy only necessities

if you have plans, here's a post that summarizes where you should buy stuff from www.instagram.com/share/BADrkS...

12.04.2025 08:05 β€” πŸ‘ 28    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0

Post image

the model also has impressive OCR capabilities ⬇️

11.04.2025 19:10 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

we'll put this model to the test on agentic capabilities, but here's an example from the paper:

11.04.2025 19:09 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

This model consists of a MoonViT encoder with dynamic resolution handling, a projection layer, and a 16B MoE decoder (with 2.8B active params)

the paper introduces an interesting pre-training pipeline to handle long context, and the model saw 4.4T tokens arxiv.org/pdf/2504.07491

11.04.2025 19:08 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

DO NOT SLEEP ON THIS MODEL

Kimi-VL-A3B-Thinking is the first capable open-source reasoning VLM with an MIT license ❀️
> it has only 2.8B activated params πŸ‘
> it's agentic πŸ”₯ and works on GUIs
> it surpasses gpt-4o

I've put it to the test (see below ‡️) huggingface.co/spaces/moons...

11.04.2025 19:08 β€” πŸ‘ 30    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0
Post image

InternVL3 is out πŸ’₯

> 7 checkpoints in various sizes (1B to 78B)
> Built on an InternViT encoder and a Qwen2.5VL decoder, and improves on Qwen2.5VL
> Can do reasoning and document tasks, extending to tool use and agentic capabilities πŸ€–
> easy to use with Hugging Face transformers πŸ€— (see the sketch below) huggingface.co/collections/...
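A hedged sketch of the transformers route (whether a given checkpoint loads natively or needs trust_remote_code depends on the release, and the "-hf" model id below is an assumption; check the collection for the exact ids):

```python
# Hypothetical sketch: run an InternVL3 checkpoint through the generic
# image-text-to-text pipeline (model id and image URL are assumptions; some
# checkpoints may instead require trust_remote_code or a different entry point).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Summarize this chart."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128, return_full_text=False)
print(out[0]["generated_text"])
```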

11.04.2025 13:35 β€” πŸ‘ 12    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0
Preview
Model Context Protocol has prompt injection security problems As more people start hacking around with implementations of MCP (the Model Context Protocol, a new standard for making tools available to LLM-powered systems) the security implications of tools built ...

Model Context Protocol has prompt injection security problems
simonwillison.net/2025/Apr/9/m...

09.04.2025 13:01 β€” πŸ‘ 118    πŸ” 23    πŸ’¬ 10    πŸ“Œ 3
Preview
From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub

Xet infra now backs 1000s of repos on @hf.co , which means we get to put on our researcher hats and peer into the bytes πŸ‘€ πŸ€“

Xet clients chunk files (~64KB) and skip uploads of duplicate content, but what if those chunks are already in _another_ repo? We skip those too.
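The idea in miniature (an illustrative sketch only; the real Xet client uses content-defined chunking rather than the fixed-size chunks shown here):

```python
# Illustrative sketch of chunk-level dedup: hash ~64 KB chunks and upload a
# chunk only if the backend hasn't stored that hash before, in any repo.
import hashlib

CHUNK_SIZE = 64 * 1024
stored_chunks = set()  # stands in for chunks already stored on the backend

def upload_file(path):
    """Upload only the chunks whose hashes the backend hasn't seen before."""
    uploaded = skipped = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in stored_chunks:
                skipped += 1        # duplicate content: no bytes transferred
            else:
                stored_chunks.add(digest)
                uploaded += 1       # new content: actually transfer the chunk
    print(f"{path}: uploaded {uploaded} chunks, skipped {skipped} duplicates")

# upload_file("model.safetensors")           # first upload: everything is new
# upload_file("copy_of_model.safetensors")   # near-duplicate: most chunks skipped
```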

09.04.2025 15:19 β€” πŸ‘ 18    πŸ” 5    πŸ’¬ 1    πŸ“Œ 0

SmolVLM paper is out and it's packed with great findings on training a good smol vision LM!

Andi summarized them below, give it a read if you want to see more insights 🀠

09.04.2025 13:38 β€” πŸ‘ 29    πŸ” 4    πŸ’¬ 0    πŸ“Œ 0

Because of X's policies, I'll also be sharing my work-related posts here; feel free to follow 😊

06.04.2025 11:51 β€” πŸ‘ 31    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Preview
smol-vision/Fine_tune_SmolVLM2_on_Video.ipynb Β· merveenoyan/smol-vision: Recipes for shrinking, optimizing, and customizing cutting-edge vision models. πŸ’œ

icymi I shipped a tutorial on fine-tuning vision language models on videos ⏯️

learn how to fine-tune SmolVLM2 on the Video Feedback dataset πŸ“– github.com/merveenoyan/...
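If you just want the shape of the setup before opening the notebook, here is a minimal sketch of the loading/freezing step (the checkpoint id is an assumption; the notebook has the full recipe, including data collation and the training loop):

```python
# Minimal sketch of the setup step for fine-tuning SmolVLM2 on video
# (checkpoint id is an assumption; see the notebook for the full pipeline).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Common trick when fine-tuning on a small dataset: freeze the vision tower
# and train only the language-side parameters.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False
```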

06.03.2025 15:43 β€” πŸ‘ 33    πŸ” 3    πŸ’¬ 1    πŸ“Œ 0
Post image

All the multimodal document retrieval models (ColPali, DSE, et al.) are now under visual document retrieval at @hf.co πŸ“πŸ€—

take your favorite VDR model out for multimodal RAG 🀝
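A small, hedged sketch of filtering for them programmatically (the pipeline-tag slug below is an assumption; the Hub's task filter shows the canonical one):

```python
# Hypothetical sketch: list visual document retrieval models on the Hub
# (the pipeline-tag slug is an assumption).
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="visual-document-retrieval", sort="downloads", limit=10):
    print(model.id)
```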

26.02.2025 11:39 β€” πŸ‘ 19    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Smol but mighty:
β€’ 256M delivers 80% of the performance of our 2.2B model.
β€’ 500M hits 90%.
Both beat our SOTA 80B model from 17 months ago! πŸŽ‰

Efficiency 🀝 Performance

Explore the collection here: huggingface.co/collections/...
Blog: huggingface.co/blog/smolervlm

23.01.2025 13:33 β€” πŸ‘ 16    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0
Post image

Introducing the smollest VLMs yet! 🀏
SmolVLM (256M & 500M) runs on <1GB GPU memory.
Fine-tune it on your laptop and run it on your toaster. πŸš€
Even the 256M model outperforms our Idefics 80B (Aug '23).
How small can we go? πŸ‘€
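A minimal sketch of running the small checkpoints (the model id and image URL are assumptions; see the collection and blog in the previous post for the exact checkpoints):

```python
# Minimal sketch: run a small SmolVLM on one image
# (checkpoint id and image URL are placeholders).
from transformers import AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = load_image("https://example.com/photo.jpg")
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image briefly."}]}
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```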

23.01.2025 13:33 β€” πŸ‘ 48    πŸ” 7    πŸ’¬ 1    πŸ“Œ 2
Post image

Everything that was released this past week in open-source AI 🀠

> Link to all models, datasets, demos huggingface.co/collections/...
> Text-readable version is here huggingface.co/posts/merve/...

17.01.2025 15:28 β€” πŸ‘ 32    πŸ” 3    πŸ’¬ 1    πŸ“Œ 1
Preview
Visual Document Retrieval Goes Multilingual

Learn more from their blog post here huggingface.co/blog/vdr-2b-... πŸ“–

13.01.2025 11:12 β€” πŸ‘ 8    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Post image

there's a new multimodal retrieval model in town 🀠
@llamaindex.bsky.social released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2-based models
> 3x faster inference with less VRAM πŸ’¨
> shrinkable with matryoshka πŸͺ†
huggingface.co/collections/...

13.01.2025 11:11 β€” πŸ‘ 46    πŸ” 2    πŸ’¬ 1    πŸ“Œ 1
Post image

What a week to open the year in open ML, all the things released at @hf.co 🀠

Here's everything released, find text-readable version here huggingface.co/posts/merve/...

All models are here huggingface.co/collections/...

10.01.2025 14:51 β€” πŸ‘ 21    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0
Video thumbnail

ViTPose, the best open-source pose estimation model, just landed in @hf.co transformers πŸ•ΊπŸ»πŸ’ƒπŸ»

πŸ”– Model collection: huggingface.co/collections/...

πŸ”– Notebook on how to use: colab.research.google.com/drive/1e8fcb...

πŸ”– Try it here: huggingface.co/spaces/hysts...
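A minimal sketch of the transformers usage (the checkpoint id follows the docs at the time; ViTPose is top-down, so a detector like RT-DETR normally supplies the person boxes, and here a whole-image box stands in for that step):

```python
# Minimal sketch: ViTPose in transformers (checkpoint id and image URL are
# assumptions; a person detector would normally provide the boxes).
import numpy as np
import torch
from transformers import AutoProcessor, VitPoseForPoseEstimation
from transformers.image_utils import load_image

checkpoint = "usyd-community/vitpose-base-simple"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = VitPoseForPoseEstimation.from_pretrained(checkpoint)

image = load_image("https://example.com/people.jpg")
# One box per person in COCO [x, y, w, h] format; the whole image stands in
# for a real detection here.
person_boxes = np.array([[0.0, 0.0, image.width, image.height]], dtype=np.float32)

inputs = processor(image, boxes=[person_boxes], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

poses = processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
print(poses[0][0]["keypoints"].shape)  # 17 COCO keypoints for the single box
```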

09.01.2025 14:27 β€” πŸ‘ 68    πŸ” 8    πŸ’¬ 1    πŸ“Œ 0
Post image


The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image, and video), then concatenates the encodings and feeds them into the LLM πŸ’¬

the output segmentation tokens are passed to SAM2 to roughly match text (captions or semantic classes) to masks ‡️

09.01.2025 12:00 β€” πŸ‘ 11    πŸ” 3    πŸ’¬ 0    πŸ“Œ 0
Post image

ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with an MIT license πŸ’—

The models can handle vision-language understanding and visual referring tasks (referring segmentation) for both images and videos ⏯️

09.01.2025 12:00 β€” πŸ‘ 59    πŸ” 8    πŸ’¬ 3    πŸ“Œ 2
Post image

see the blog and our docs for more insights into the native agentic skills of LLMs and getting started with smolagents, courtesy of the amazing
@m--ric.bsky.social

> Blog: hf.co/blog/smolage...
> Quickstart: huggingface.co/docs/smolage...
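For a taste of the quickstart (class names follow the early smolagents releases and may have been renamed since; check the docs linked above):

```python
# Minimal sketch following the early smolagents quickstart (HfApiModel picks
# a default hosted model; pass a model id to pin one).
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take a leopard at full speed to cross Pont des Arts?")
```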

31.12.2024 15:38 β€” πŸ‘ 11    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0
