To read about, evaluate models, and use VSI-Bench, see:
Webpage: vision-x-nyu.github.io/thinking-in-...
ArXiv: arxiv.org/pdf/2412.14171
Eval Code: github.com/vision-x-nyu...
VSI-Bench: huggingface.co/datasets/nyu...
[n/n]
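If you just want to poke at the benchmark, a minimal sketch with the Hugging Face `datasets` library could look like the following. The repo ID and split handling below are assumptions inferred from the truncated link above; check the linked Hub page for the exact name before running.

```python
# Minimal sketch: pull the VSI-Bench QA pairs from the Hugging Face Hub and inspect one record.
# NOTE: "nyu-visionx/VSI-Bench" is an assumed repo ID inferred from the truncated link above;
# verify the exact name and split layout on the Hub.
from datasets import load_dataset

bench = load_dataset("nyu-visionx/VSI-Bench")   # assumed repo ID
print(bench)                                    # shows the available splits and their sizes

split_name = next(iter(bench))                  # take whichever split exists first
example = bench[split_name][0]
print(example)                                  # one video-QA record (question, options, answer, task type, ...)
```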
It was an honor and a pleasure to collaborate with and learn from @drfeifei.bsky.social @saining.bsky.social, Jihan Yang, Shusheng Yang, and Rilyn Han! I believe this is just the beginning for visual-spatial intelligence (and my PhD) and emphasizes the importance of vision in MLLMs. [7/n]
Prompting for "cognitive maps," a concept Edward Tolman introduced in the 1940s for the unified representations of space that brains build, we find MLLMs have a local spatial bias and that explicitly remembering spaces improves relational-distance abilities. [6/n]
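To give a rough sense of what "prompting for cognitive maps" can look like in practice, here is a hedged two-stage sketch: first ask the model to lay the scene out on a small grid, then answer the spatial question conditioned on that map. The `ask_mllm` helper and the prompt wording are illustrative placeholders, not the paper's exact setup.

```python
# Hedged sketch of cognitive-map prompting (illustrative, not the paper's exact prompt).
# `ask_mllm` is a hypothetical helper standing in for any video-capable MLLM API.

def ask_mllm(video_path: str, prompt: str) -> str:
    """Placeholder: send the video plus the prompt to your MLLM and return its text reply."""
    raise NotImplementedError

def answer_with_cognitive_map(video_path: str, question: str) -> str:
    # Stage 1: ask the model to externalize a cognitive map of the scene.
    map_prompt = (
        "Watch the video and build a cognitive map of the room: "
        "place each object you saw on a 10x10 grid as (name, row, col)."
    )
    cognitive_map = ask_mllm(video_path, map_prompt)

    # Stage 2: answer the spatial question while conditioning on the remembered map.
    qa_prompt = (
        f"Here is a cognitive map of the scene:\n{cognitive_map}\n\n"
        f"Using this map and the video, answer: {question}"
    )
    return ask_mllm(video_path, qa_prompt)
```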
What does it mean to "think in space"? We analyze spatial intelligence linguistically and visually.
We analyze self-explanations to attribute VSI-Bench performance to visual-spatial capabilities and find that spatial and linguistic intelligence are very distinct. [5/n]
VSI-Bench tests configuration, measurement estimation, and spatiotemporal abilities across 5k+ Video QA pairs and eight task types.
We evaluate open- and closed-source MLLMs on VSI-Bench and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. [4/n]
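For a concrete sense of what the evaluation involves, here is a minimal sketch of a multiple-choice scoring loop over video-QA pairs. The field names and the `model.answer` interface are assumptions for illustration; the eval code linked at the top of this thread is the actual harness, and the numerical-estimation tasks are scored differently there.

```python
# Hedged sketch of a VSI-Bench-style evaluation loop for multiple-choice questions.
# Field names ("video", "question", "options", "answer") and the `model.answer`
# interface are assumptions; see the released eval code for the real harness and
# for how the numerical-estimation tasks are scored.

def evaluate(model, qa_pairs) -> float:
    correct = 0
    for item in qa_pairs:
        prompt = item["question"] + "\nOptions: " + "; ".join(item["options"])
        prediction = model.answer(video=item["video"], prompt=prompt)
        # Naive exact match; real harnesses normalize option letters/text more carefully.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```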
We propose VSI-Bench, a 3D video-based visual-spatial intelligence benchmark designed for MLLMs.
Video mirrors how humans perceive spaces continuously and temporally, and, by repurposing 3D reconstruction datasets, these videos cover and test complete indoor scenes. [3/n]
What is visual-spatial intelligence? Visual-spatial intelligence entails perceiving and mentally manipulating spatial relationships. It requires visual perception, temporal processing, linguistic intelligence (to understand questions), and spatial reasoning. [2/n]
Visual-spatial intelligence: we rely on it to perceive, interact with, and navigate our everyday spaces. To what extent do MLLMs possess it? Do they mirror how humans think and reason about space?
Presenting "Thinking in Space: How Multimodal Models See, Remember, and Recall Spaces"! [1/n]