To read about, evaluate models, and use VSI-Bench, see:
Webpage: vision-x-nyu.github.io/thinking-in-...
ArXiv: arxiv.org/pdf/2412.14171
Eval Code: github.com/vision-x-nyu...
VSI-Bench: huggingface.co/datasets/nyu...
[n/n]
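If you just want to poke at the benchmark, a minimal sketch with the Hugging Face `datasets` library could look like the following. The repo ID and split handling below are assumptions inferred from the truncated link above; check the linked Hub page for the exact name before running.

```python
# Minimal sketch: pull the VSI-Bench QA pairs from the Hugging Face Hub and inspect one record.
# NOTE: "nyu-visionx/VSI-Bench" is an assumed repo ID inferred from the truncated link above;
# verify the exact name and split layout on the Hub.
from datasets import load_dataset

bench = load_dataset("nyu-visionx/VSI-Bench")   # assumed repo ID
print(bench)                                    # shows the available splits and their sizes

split_name = next(iter(bench))                  # take whichever split exists first
example = bench[split_name][0]
print(example)                                  # one video-QA record (question, options, answer, task type, ...)
```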
It was an honor and a pleasure to collaborate with and learn from @drfeifei.bsky.social @saining.bsky.social, Jihan Yang, Shusheng Yang, and Rilyn Han! I believe this is just the beginning for visual-spatial intelligence (and my PhD) and emphasizes the importance of vision in MLLMs. [7/n]
Prompting for "cognitive maps," a concept Edward Tolman introduced in the 1940s for the unified representations of space that brains build, we find MLLMs have a local spatial bias and that explicitly remembering spaces improves relational-distance abilities. [6/n]
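To give a rough sense of what "prompting for cognitive maps" can look like in practice, here is a hedged two-stage sketch: first ask the model to lay the scene out on a small grid, then answer the spatial question conditioned on that map. The `ask_mllm` helper and the prompt wording are illustrative placeholders, not the paper's exact setup.

```python
# Hedged sketch of cognitive-map prompting (illustrative, not the paper's exact prompt).
# `ask_mllm` is a hypothetical helper standing in for any video-capable MLLM API.

def ask_mllm(video_path: str, prompt: str) -> str:
    """Placeholder: send the video plus the prompt to your MLLM and return its text reply."""
    raise NotImplementedError

def answer_with_cognitive_map(video_path: str, question: str) -> str:
    # Stage 1: ask the model to externalize a cognitive map of the scene.
    map_prompt = (
        "Watch the video and build a cognitive map of the room: "
        "place each object you saw on a 10x10 grid as (name, row, col)."
    )
    cognitive_map = ask_mllm(video_path, map_prompt)

    # Stage 2: answer the spatial question while conditioning on the remembered map.
    qa_prompt = (
        f"Here is a cognitive map of the scene:\n{cognitive_map}\n\n"
        f"Using this map and the video, answer: {question}"
    )
    return ask_mllm(video_path, qa_prompt)
```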
What does it mean to "think in space"? We analyze spatial intelligence linguistically and visually.
We analyze self-explanations to attribute VSI-Bench performance to visual-spatial capabilities and find that spatial and linguistic intelligence are very distinct. [5/n]
VSI-Bench tests configuration, measurement estimation, and spatiotemporal abilities across 5k+ Video QA pairs and eight task types.
We evaluate open- and closed-source MLLMs on VSI-Bench and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. [4/n]
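For a concrete sense of what the evaluation involves, here is a minimal sketch of a multiple-choice scoring loop over video-QA pairs. The field names and the `model.answer` interface are assumptions for illustration; the eval code linked at the top of this thread is the actual harness, and the numerical-estimation tasks are scored differently there.

```python
# Hedged sketch of a VSI-Bench-style evaluation loop for multiple-choice questions.
# Field names ("video", "question", "options", "answer") and the `model.answer`
# interface are assumptions; see the released eval code for the real harness and
# for how the numerical-estimation tasks are scored.

def evaluate(model, qa_pairs) -> float:
    correct = 0
    for item in qa_pairs:
        prompt = item["question"] + "\nOptions: " + "; ".join(item["options"])
        prediction = model.answer(video=item["video"], prompt=prompt)
        # Naive exact match; real harnesses normalize option letters/text more carefully.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```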
We propose VSI-Bench, a 3D video-based visual-spatial intelligence benchmark designed for MLLMs.
Video mirrors how humans perceive spaces continuously and temporally, and, by repurposing 3D reconstruction datasets, these videos cover and test complete indoor scenes. [3/n]
What is visual-spatial intelligence? Visual-spatial intelligence entails perceiving and mentally manipulating spatial relationships. It requires visual perception, temporal processing, linguistic intelligence (to understand questions), and spatial reasoning. [2/n]
Visual-spatial intelligence: we rely on it to perceive, interact with, and navigate our everyday spaces. To what extent do MLLMs possess it? Do they mirror how humans think and reason about space?
Presenting "Thinking in Space: How Multimodal Models See, Remember, and Recall Spaces"! [1/n]