
@anjaliwgupta.bsky.social

13 Followers  |  24 Following  |  8 Posts  |  Joined: 21.11.2024

Latest posts by anjaliwgupta.bsky.social on Bluesky

Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces We introduce VSI-Bench, a novel benchmark of over 5,000 video-based visual-spatial intelligence questions, to evaluate and probe MLLMs, which revealed that their emerging spatial reasoning and local w...

To read about VSI-Bench, evaluate models on it, or use it yourself, see:

Webpage: vision-x-nyu.github.io/thinking-in-...
ArXiv: arxiv.org/pdf/2412.14171
Eval Code: github.com/vision-x-nyu...
VSI-Bench: huggingface.co/datasets/nyu...
[n/n]
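In spirit, evaluation on a QA benchmark like VSI-Bench reduces to scoring model predictions against ground-truth answers; a minimal exact-match accuracy sketch (the function, field names, and toy data are illustrative assumptions, not the official eval code linked above):

```python
# Minimal exact-match accuracy scorer for multiple-choice video QA,
# in the spirit of benchmark evaluation (not the official VSI-Bench code).

def score(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth answer,
    ignoring case and surrounding whitespace."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Toy example: a model's letter choices vs. invented ground truth.
preds = ["A", "c", "B", "D"]
gold  = ["A", "C", "D", "D"]
print(score(preds, gold))  # 0.75
```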

23.12.2024 22:53 — 👍 2    🔁 1    💬 0    📌 0

It was an honor and a pleasure to collaborate with and learn from @drfeifei.bsky.social, @saining.bsky.social, Jihan Yang, Shusheng Yang, and Rilyn Han! I believe this is just the beginning for visual-spatial intelligence (and my PhD 😉), and it underscores the importance of vision in MLLMs. [7/n]

23.12.2024 22:51 — 👍 0    🔁 0    💬 0    📌 0

Prompting for "cognitive maps," a concept Edward Tolman introduced in the 1940s for the unified representations of spatial environments that brains build, we find that MLLMs have a local spatial bias and that explicitly remembering spaces improves their relational distance abilities. [6/n]
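To illustrate why an explicit cognitive map helps with relational distance: once object positions are written down as coordinates, pairwise distances become a direct computation rather than something recalled frame by frame. A toy sketch (the room layout, object names, and coordinates are invented for illustration, not from the paper):

```python
import math

# Toy "cognitive map": 2D coordinates (in meters) for objects in a room.
# Positions are invented for illustration.
cognitive_map = {
    "sofa":  (0.0, 0.0),
    "table": (3.0, 4.0),
    "lamp":  (3.0, 0.0),
}

def distance(obj_a: str, obj_b: str) -> float:
    """Euclidean distance between two objects in the map."""
    (xa, ya), (xb, yb) = cognitive_map[obj_a], cognitive_map[obj_b]
    return math.hypot(xb - xa, yb - ya)

print(distance("sofa", "table"))  # 5.0
print(distance("sofa", "lamp"))   # 3.0
```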

23.12.2024 22:48 — 👍 0    🔁 0    💬 0    📌 0

What does it mean to "think in space"? We analyze spatial intelligence linguistically and visually.

We analyze self-explanations to attribute VSI-Bench performance to visual-spatial capabilities and find that spatial and linguistic intelligence are very distinct. [5/n]

23.12.2024 22:47 — 👍 0    🔁 0    💬 0    📌 0

VSI-Bench tests configuration, measurement estimation, and spatiotemporal abilities across 5k+ Video QA pairs and eight task types.

We evaluate open- and closed-source MLLMs on VSI-Bench and find that they exhibit competitive—though subhuman—visual-spatial intelligence. [4/n]

23.12.2024 22:46 — 👍 0    🔁 0    💬 0    📌 0

We propose VSI-Bench, a 3D video-based visual-spatial intelligence benchmark designed for MLLMs.

Video mirrors how humans perceive spaces: continuously and over time. By repurposing 3D reconstruction datasets, we obtain videos that capture, and thus test, complete indoor scenes. [3/n]

23.12.2024 22:46 — 👍 0    🔁 0    💬 0    📌 0

What is visual-spatial intelligence? Visual-spatial intelligence entails perceiving and mentally manipulating spatial relationships. It requires visual perception, temporal processing, linguistic intelligence (to understand questions), and spatial reasoning. [2/n]

23.12.2024 22:46 — 👍 0    🔁 0    💬 0    📌 0

Visual-spatial intelligence: we rely on it to perceive, interact with, and navigate our everyday spaces. To what capacity do MLLMs possess it? Do they mirror how humans think and reason about space?

Presenting "Thinking in Space: How Multimodal Models See, Remember, and Recall Spaces"! [1/n]

23.12.2024 22:45 — 👍 10    🔁 4    💬 7    📌 0
