
Zory Zhang

@zoryzhang.bsky.social

Computational modeling of human learning: cognitive development, language acquisition, social learning, causal learning... Brown PhD student with ‪@daphnab.bsky.social‬

13 Followers  |  101 Following  |  12 Posts  |  Joined: 05.09.2024

Latest posts by zoryzhang.bsky.social on Bluesky

Post image

#CoreCognition #LLM #multimodal #GrowAI We spent 3 years curating 1503 classic experiments spanning 12 core concepts in human cognitive development, then evaluated 230 MLLMs with 11 different prompts, 5 times each, yielding over 3.8 million inference data points.

A thread (1/n) - #ICML2025 ✅

30.06.2025 06:07 — 👍 13    🔁 9    💬 1    📌 0
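A quick sanity check on the scale described in the post above, as a minimal sketch. The assumption that the ~3.8M figure counts (experiment, model, prompt) combinations, with the 5 repetitions on top, is mine, not stated in the post.

```python
# Back-of-the-envelope scale check (numbers from the post above).
experiments = 1503   # curated classic experiments
models = 230         # MLLMs evaluated
prompts = 11         # prompt variants
repetitions = 5      # runs per configuration

configs = experiments * models * prompts
print(f"(experiment, model, prompt) combinations: {configs:,}")      # 3,802,590 (~3.8M)
print(f"with {repetitions} repetitions: {configs * repetitions:,}")  # 19,012,950
```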

Beautiful to see this initiative from a group of like-minded PhD students collaborating! 🚀

11.06.2025 23:49 — 👍 9    🔁 4    💬 1    📌 0

GrowAI Team: @growai.bsky.social

12.06.2025 17:04 — 👍 1    🔁 0    💬 0    📌 0
Post image

New Paper Alert ‼️ Current VLMs completely fail at human gaze understanding 🙀 and scaling does NOT help ‼️

However, humans are extremely sensitive to other people's gaze 🙄 👀 from a very early age 🧒.

No mentors, no labs, only pre-doc students, 111 VLMs, and we did it 😎

11.06.2025 23:21 — 👍 6    🔁 5    💬 1    📌 1

With the amazing GrowAI team: Pinyuan Feng (equal contribution), Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, @hokin.bsky.social , Ziqiao Ma, Yijiang Li, & Dezhi Luo.

🧵11/11 🎉

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
GrowAI: Growing AI like a Child, at Scale. Humans never "learn" intelligence. Humans develop intelligence. Biological life on this planet takes heavy advantage of intelligent primitives embedded in its genes...

Thank you for reading 😋

GrowAI Team Present.
growing-ai-like-a-child.github.io

Arxiv: arxiv.org/abs/2506.05412
Project page: grow-ai-like-a-child.github.io/gaze/
Stimuli: osf.io/kyaeu
Code: github.com/grow-ai-like...
🧵10/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0

Beyond understanding VLMs, this explanation also suggests that VLM training should include more embodied social interaction, so that natural human-AI interaction can emerge from next-token/frame-prediction training. We also recommend better learning-curriculum design 📚.
🧵9/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0

We leave this explanation open for further investigation and conclude that this work shows how controlled studies can complement benchmarking: they surface effects that any explanation must account for, constraining the hypothesis space so we can better understand VLMs 🌟.
🧵8/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
Post image

Surprisingly, their accuracy does not differ between front views and side views, whereas humans' does (p<0.001). VLMs may rely on 👺 head orientation rather than 👀 eye gaze direction, making them "robust" to side views, which increase the geometric ambiguity of eye direction.
🧵7/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
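For illustration, a minimal sketch of the kind of front-vs-side comparison behind a p<0.001 claim like the one above, using a chi-square test of independence; the correct/incorrect counts below are hypothetical, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical correct/incorrect counts by camera view (illustrative only).
#                 correct  incorrect
table = np.array([[950,     100],   # front view
                  [820,     230]])  # side view

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")  # small p: accuracy differs by view
```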
Post image

On the other hand, the performance of Gemini 1.5 Pro, GPT-4o, InternLM, Qwen2.5, and GLM approaches chance level as difficulty increases (with increasing proximity and number of objects). They likely rely on heuristics that break down under difficult conditions.
🧵6/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
Post image

Before that, we need to establish baselines. We presented 65 human participants with multiple-choice questions like the one below. Their performance degrades 📉 with increasing proximity, with an increasing number of objects, and when the camera view switches from front to side.
🧵5/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
Post image

In addition to the chance-level accuracy, VLMs responded with every possible answer almost equally often. Are they random guessers? 🤡 Spoiler: top-tier VLMs are not, as we found by analyzing how their performance varies with the controlled variables. 🤗
🧵4/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
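One way to probe the "every answer almost equally often" pattern above is a chi-square goodness-of-fit test against a uniform answer distribution; a minimal sketch with hypothetical counts.

```python
from scipy.stats import chisquare

# Hypothetical answer counts from one VLM over 900 four-option questions
# (illustrative only). Near-uniform counts do not reject uniformity,
# i.e. the model picks every option about equally often.
observed = [232, 228, 221, 219]
stat, p = chisquare(observed)  # expected counts default to uniform
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```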
Post image

We found that humans excel at gaze inference (~91% accuracy), but 94 of 111 VLMs performed about as well as if they had guessed randomly without looking at the images (~42%) 😲. Even the best, like GPT-4o, hit only ~50%. Bigger (or newer) VLMs are not better. 🫤
🧵3/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
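On why chance is ~42% rather than a single 1/k: with a mix of 2-, 3-, and 4-object trials, the random-guess baseline is the trial-weighted average of 1/k. A sketch with a hypothetical mix; the study's actual trial proportions are not given in this thread.

```python
# Random-guess baseline for multiple-choice trials with varying option counts.
# The mix below is hypothetical; the study's actual mix yields the ~42% figure.
mix = {2: 0.50, 3: 0.30, 4: 0.20}   # fraction of trials with k candidate objects
chance = sum(frac / k for k, frac in mix.items())
print(f"expected accuracy of a random guesser: {chance:.1%}")  # 40.0% for this mix
```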
Post image

We systematically manipulated variables across 900 evaluation stimuli: view (left/right/front), proximity (1-3 scale), number of objects (2-4), etc., and tested 65 human participants (45 stimuli per person) and 111 VLMs on them.
🧵2/11

12.06.2025 17:03 — 👍 2    🔁 0    💬 1    📌 0
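A minimal sketch of enumerating the factorial design named above; only the factors listed in the post are included (its "etc." implies more), and the per-cell counts needed to reach 900 stimuli are not specified there.

```python
from itertools import product

# Factor levels named in the post; "etc." suggests additional, unlisted factors.
views       = ["front", "left", "right"]
proximities = [1, 2, 3]      # proximity on a 1-3 scale
n_objects   = [2, 3, 4]      # number of candidate objects

cells = list(product(views, proximities, n_objects))
print(f"{len(cells)} cells from the named factors alone")  # 27
# With extra factors and/or multiple exemplars per cell, the full set
# reaches the 900 evaluation stimuli described above.
```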
Video thumbnail

👁️ 𝐂𝐚𝐧 𝐕𝐢𝐬𝐢𝐨𝐧 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (𝐕𝐋𝐌𝐬) 𝐈𝐧𝐟𝐞𝐫 𝐇𝐮𝐦𝐚𝐧 𝐆𝐚𝐳𝐞 𝐃𝐢𝐫𝐞𝐜𝐭𝐢𝐨𝐧?
Knowing where someone looks is key to a Theory of Mind. We test 111 VLMs and 65 humans to compare their inferences.
Project page: grow-ai-like-a-child.github.io/gaze/
🧵1/11

12.06.2025 17:03 — 👍 3    🔁 0    💬 1    📌 1

Sam is 100% correct on this. Indeed, human babies have essential cognitive priors such as the permanence, continuity, and boundaries of objects, a 3D Euclidean understanding of space, etc.

We spent 2 years systematically examining and demonstrating the lack of these in MLLMs: arxiv.org/abs/2410.10855

24.05.2025 05:55 — 👍 21    🔁 5    💬 0    📌 0
