Imanol Miranda

@imirandam.bsky.social

PhD student at HiTZ Zentroa (@hitz-zentroa.bsky.social) / IXA Group and the University of the Basque Country (@upvehu.bsky.social).

14 Followers  |  39 Following  |  6 Posts  |  Joined: 20.03.2025

Latest posts by imirandam.bsky.social on Bluesky

Key takeaway: Adding simple structure at inference time, through image crops and text segments, is a powerful, training-free way to improve Vision-Language Compositionality performance.

Joint work with @Ander Salaberria @eagirre.bsky.social @gazkune.bsky.social @hitz-zentroa.bsky.social

18.06.2025 11:28  |  👍 2    🔁 1    💬 0    📌 0

Our analysis shows that:
1. There is room to improve the quality of extracted text segments.
2. Our method achieves significant performance gains in Winoground's non-trivial instances.
3. Isolated image crops can lose size and quantity information, leaving room for improvement.

18.06.2025 11:28  |  👍 2    🔁 1    💬 1    📌 0

Why are image crops crucial? 🤔 We found that adding text segments alone isn't enough. The biggest performance gains come when text segments are paired with image crops, which suggests that processing the image crop by crop is what drives the improvement.

18.06.2025 11:28  |  👍 1    🔁 1    💬 1    📌 0

We evaluated it on three diverse datasets: BiVLC, Winoground (171 instances), and BiSCoR-Ctrl. Our inference-time approach (ITA) brings significant improvements on three existing models:

18.06.2025 11:28  |  👍 1    🔁 0    💬 1    📌 0
Post image

Our approach is straightforward yet effective:
1. Divide the image into smaller crops.
2. Extract text segments capturing objects, attributes and relations.
3. Use the VLM to find image crops that best fit the text segments.
4. Aggregate matching similarities for the final score.
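
Steps 3 and 4 above can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: it assumes the crop and text-segment embeddings were already produced by a dual-encoder VLM, that matching picks the highest cosine similarity per segment, and that aggregation is a plain mean. The names `cosine` and `structured_score` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def structured_score(crop_embs, segment_embs):
    """Match each text segment to its best-fitting crop, then average.

    Assumption: max-over-crops matching and mean aggregation are
    illustrative choices; the paper may match and aggregate differently.
    """
    best = [max(cosine(seg, crop) for crop in crop_embs)
            for seg in segment_embs]
    return sum(best) / len(best)

# Toy 2-D embeddings standing in for real VLM outputs (hypothetical values).
crops = [[1.0, 0.0], [0.0, 1.0]]
segments = [[1.0, 0.0], [0.0, 2.0]]
print(structured_score(crops, segments))
```

In a real pipeline the embeddings would come from the image and text encoders of the chosen VLM, and the resulting score would replace the single whole-image, whole-caption similarity.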

18.06.2025 11:28  |  👍 2    🔁 1    💬 1    📌 0

#newHitzPaper
Can a simple inference-time approach unlock better Vision-Language Compositionality? 🤯
Our latest paper shows how adding structure at inference significantly boosts performance in popular dual-encoder VLMs on different datasets.

Read more: arxiv.org/abs/2506.09691

18.06.2025 11:28  |  👍 6    🔁 3    💬 1    📌 1
