
Zihan

@zhweng.bsky.social

PhD Student @mcgill.ca | {Biological,Artificial} Neural Networks

5 Followers  |  6 Following  |  9 Posts  |  Joined: 01.12.2025

Latest posts by zhweng.bsky.social on Bluesky

Caption This, Reason That: VLMs Caught in the Middle
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relatio...

9/9
A huge shoutout to my co-authors @lucasmgomez.bsky.social,
@taylorwwebb.bsky.social, and @bashivan.bsky.social!
Check out the full paper for the deep dive into VLM cognitive profiles at arxiv.org/abs/2505.21538
See you in San Diego! 🏔️ #AI #VLM #NeurIPS2025

01.12.2025 16:43 · 👍 3  🔁 0  💬 0  📌 0

8/9
Our work suggests that future VLM improvements shouldn't just focus on larger encoders, but on better Visual Chain-of-Thought and integration strategies to overcome the "Perception-Reasoning" disconnect.

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

7/9
Does this generalize? Yes.
Fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

6/9
We didn't stop there. We fine-tuned Qwen2.5 on our Composite Visual Reasoning (CVR) tasks.
🔹 1k training samples yielded large gains.
🔹 100k samples pushed performance even higher.
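For concreteness, here is a minimal, hedged sketch of what supervised fine-tuning of Qwen2.5-VL on CVR-style (image, question, answer) examples could look like, using LoRA adapters via peft and the Hugging Face Trainer. The file name cvr_train.jsonl, the field names, the LoRA targets, and every hyperparameter are illustrative assumptions, not the paper's recipe (which may use full fine-tuning, prompt masking, or different data handling).

```python
import torch
from PIL import Image
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoProcessor, Qwen2_5_VLForConditionalGeneration,
                          Trainer, TrainingArguments)

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# LoRA on the attention projections keeps the sketch cheap; settings are illustrative.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

def collate(batch):
    # Each example is assumed to look like {"image": <path>, "question": str, "answer": str}.
    texts, images = [], []
    for ex in batch:
        messages = [
            {"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": ex["question"]}]},
            {"role": "assistant", "content": [{"type": "text", "text": ex["answer"]}]},
        ]
        texts.append(processor.apply_chat_template(messages, tokenize=False))
        images.append(Image.open(ex["image"]).convert("RGB"))
    enc = processor(text=texts, images=images, padding=True, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels  # simplification: loss over the whole sequence, prompt included
    return enc

# The 1k-sample setting mentioned above; enlarge the slice for the 100k run.
train = load_dataset("json", data_files="cvr_train.jsonl")["train"].select(range(1000))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen25vl-cvr-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=1e-4, bf16=True,
                           remove_unused_columns=False, logging_steps=20),
    train_dataset=train,
    data_collator=collate,
).train()
```

The collator takes the simple route of computing the loss over the entire chat sequence; masking the user turn so that only answer tokens contribute is a common refinement.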

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

5/9
This suggests a major bottleneck in current VLMs: Chain-of-Thought (CoT) needs to be better grounded in visual features.
Models are "Caught in the Middle": they possess the visual info and the reasoning capacity, but fail to connect them without an explicit text bridge.

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

4/9
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): The model describes the image, then answers the prompt using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
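For anyone who wants to try the two-stage idea, here is a minimal sketch of a Self-Captioning-style prompt with Qwen2.5-VL via Hugging Face transformers. The prompts, the ask helper, and the caption-only second stage are assumptions made for illustration, not the paper's exact protocol.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

def ask(prompt, image=None, max_new_tokens=256):
    """Run one chat turn; the image is optional so stage 2 can use the caption alone."""
    content = ([{"type": "image"}] if image is not None else []) + [
        {"type": "text", "text": prompt}
    ]
    text = processor.apply_chat_template(
        [{"role": "user", "content": content}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = processor(
        text=[text],
        images=[image] if image is not None else None,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    out = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt, keep new tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

image = Image.open("scene.png")  # placeholder path for the image being probed
question = "Is the red cube to the left or to the right of the blue sphere?"

# Stage 1: the model captions the image in its own words.
caption = ask("Describe this image in as much detail as possible.", image=image)

# Stage 2: it answers the question from its own caption (caption-only here; whether
# the image is also re-shown at this stage is a detail to check against the paper).
answer = ask(f"Image description: {caption}\n\nUsing only this description, answer: {question}")
print(answer)
```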

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

3/9
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

2/9
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
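To make "isolate these low-level cognitive abilities" concrete, here is a toy generator for a single spatial-perception item: one object, one "where is it?" question, one unambiguous answer. It is purely illustrative and not the actual PAM construction.

```python
import random
from PIL import Image, ImageDraw

def make_spatial_item(size=448):
    """Draw one red circle in a random quadrant and return (image, question, answer)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    quadrants = {"top-left": (0, 0), "top-right": (size // 2, 0),
                 "bottom-left": (0, size // 2), "bottom-right": (size // 2, size // 2)}
    answer, (ox, oy) = random.choice(list(quadrants.items()))
    # Keep the circle fully inside its quadrant so the answer is unambiguous.
    cx = ox + random.randint(60, size // 2 - 60)
    cy = oy + random.randint(60, size // 2 - 60)
    draw.ellipse([cx - 40, cy - 40, cx + 40, cy + 40], fill="red")
    question = ("In which quadrant of the image is the red circle? "
                "Answer with one of: top-left, top-right, bottom-left, bottom-right.")
    return img, question, answer

img, question, answer = make_spatial_item()
img.save("spatial_item.png")
print(question, "->", answer)
```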

01.12.2025 16:43 · 👍 0  🔁 0  💬 1  📌 0

1/9
🚨 Thrilled to share "Caption This, Reason That", a #NeurIPS2025 Spotlight! 🔦
Meet us at #2112, 3 Dec 11 a.m.
We analyze VLM limitations through the lens of Cognitive Science (Perception, Attention, Memory) and propose a simple "Self-Captioning" method that boosts spatial reasoning by ~18%.
🧵👇

01.12.2025 16:43 · 👍 7  🔁 2  💬 1  📌 2
