
Hokin

@hokin.bsky.social

Philosopher, Scientist, Engineer https://hokindeng.github.io/

45 Followers  |  69 Following  |  90 Posts  |  Joined: 12.02.2024

Latest posts by hokin.bsky.social on Bluesky

You are such a monster

20.11.2025 20:12 — 👍 0    🔁 0    💬 0    📌 0

Congratulations to @tomerullman.bsky.social on the official release! For anyone interested, my disagreements with this paper have also already been accepted at NeurIPS SpaVLE this year.

link: arxiv.org/abs/2510.20835

18.11.2025 05:15 — 👍 0    🔁 0    💬 0    📌 0
Preview
Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling Spatial world models, representations that support flexible reasoning about spatial relations, are central to developing computational models that could operate in the physical world, but their precis...

Congratulations on the official release! My disagreements with this paper have also already been accepted at NeurIPS SpaVLE this year.

link: arxiv.org/abs/2510.20835

18.11.2025 05:13 — 👍 2    🔁 0    💬 0    📌 0
Post image

developmental embodiment 😎

#DevelopmentalEmbodiment #GrowAI

07.11.2025 05:38 — 👍 2    🔁 0    💬 0    📌 0

congratulations

07.11.2025 00:22 — 👍 1    🔁 0    💬 0    📌 0

what type of pen are you using

07.11.2025 00:14 — 👍 0    🔁 0    💬 1    📌 0
Preview
GitHub - hokindeng/VMEvalKit: This is a framework for evaluating reasoning in foundational Video Models.

VMEvalKit is 100% open source. We're building this in public with everyone. Plz join us ‼️

👉 Slack: join.slack.com/t/growingail...
👉 Early Results: grow-ai-like-a-child.com/video-reason/
📄 Paper: github.com/hokindeng/VM...
👉 GitHub: github.com/hokindeng/VM...

The age of video reasoning is here 🎬🧠

04.11.2025 23:39 — 👍 0    🔁 0    💬 0    📌 0
Preview
GitHub - hokindeng/VMEvalKit: This is a framework for evaluating reasoning in foundational Video Models.

VMEvalKit is 100% open source. We're building this in public with everyone. Plz join us ‼️

👉 Slack: join.slack.com/t/growingail...
👉 GitHub: github.com/hokindeng/VM...
👉 Early Results: grow-ai-like-a-child.com/video-reason/
📄 Paper: github.com/hokindeng/VM...

The age of video reasoning is here 🎬🧠

04.11.2025 22:01 — 👍 2    🔁 0    💬 0    📌 0
Post image Post image Post image Post image

While failure cases clearly show idiosyncratic patterns 🧩🤔, we currently lack a principled framework to systematically analyze or interpret them 🔍. We invite everyone to explore these examples 🧪, as they may offer valuable clues for future research directions 💡🧠🚀.

04.11.2025 21:56 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Here is a video generated by the video models while solving Raven's Matrices. For more, check out grow-ai-like-a-child.com/video-reason/

04.11.2025 21:55 — 👍 2    🔁 0    💬 1    📌 0
Post image

Raven's Matrices are one of the standard tasks for testing IQ in humans, requiring subjects to find patterns and regularities. Intriguingly, video models are able to solve them quite well!

04.11.2025 21:53 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Here is an example of testing mental rotation in video models. For more, check out grow-ai-like-a-child.com/video-reason/

04.11.2025 21:52 — 👍 2    🔁 0    💬 1    📌 0
Post image

For testing mental rotation, we give them an {n}-voxel structure shown from a tilted camera view (20-40° elevation) and ask them to rotate it horizontally by exactly 180° of azimuth. The hard parts are 1) not deforming the structure and 2) rotating by exactly the right amount. Interestingly, some models are able to do it quite well.
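
To make the success criterion concrete, here is a minimal Python sketch (not code from the benchmark) of what an exact 180° azimuth change does to a voxel structure: every voxel center (x, y, z) maps to (-x, -y, z), and any deformation would show up as changed pairwise distances. The array layout is an assumption for illustration.

```python
import numpy as np

def rotate_azimuth(voxels: np.ndarray, degrees: float = 180.0) -> np.ndarray:
    """Rotate an (N, 3) array of voxel centers about the vertical z-axis.
    A 180-degree azimuth change maps (x, y, z) -> (-x, -y, z); any deformation
    would show up as a change in the pairwise distances between voxels."""
    theta = np.radians(degrees)
    rz = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])
    return voxels @ rz.T

# Example: a 3-voxel structure; after 180 degrees the x/y coordinates flip sign.
structure = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1]], dtype=float)
print(rotate_azimuth(structure).round(6))
# -> approximately [[-1, 0, 0], [-1, -1, 0], [-1, -1, 1]]
```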

04.11.2025 21:52 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Here is a video example. For more, check out grow-ai-like-a-child.com/video-reason/

04.11.2025 21:49 — 👍 2    🔁 0    💬 1    📌 0
Post image

For the Sudoku problems, the video models need to fill the gap with the correct number so that each row and column contains 1, 2, and 3. Surprisingly, this is the easiest task for video models.
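
As a concrete illustration of the constraint (not the repo's actual scoring code), here is a small Python check one could run on the digits read off a model's final frame; how the digits are extracted from the frame is left out.

```python
def is_valid_mini_sudoku(grid):
    """Check the constraint described above: every row and every column
    of the 3x3 grid must contain exactly the digits {1, 2, 3}."""
    target = {1, 2, 3}
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all(set(col) == target for col in zip(*grid))
    return rows_ok and cols_ok

print(is_valid_mini_sudoku([[1, 2, 3],
                            [3, 1, 2],
                            [2, 3, 1]]))   # True: a correctly completed grid
print(is_valid_mini_sudoku([[1, 2, 3],
                            [3, 2, 2],
                            [2, 3, 1]]))   # False: the gap was filled wrongly
```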

04.11.2025 21:49 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Here is an example of a video generated by the models while solving the maze problem. Check out more at grow-ai-like-a-child.com/video-reason/

04.11.2025 21:48 — 👍 2    🔁 0    💬 1    📌 0
Post image

In the maze problems, video models need to generate videos that navigate the green dot 🟢 to the red flags 🚩. And they are also able to do it quite well ~
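
For reference, a tiny breadth-first-search solver shows what the ground-truth path looks like for a grid maze of this kind. The 0/1 grid encoding and the start/goal coordinates are illustrative assumptions, not the benchmark's data format.

```python
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search on a grid maze (0 = open cell, 1 = wall).
    Returns the shortest list of cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None

maze = [[0, 1, 0],   # green dot starts at (0, 0)
        [0, 1, 0],
        [0, 0, 0]]   # red flag sits at (0, 2)
print(solve_maze(maze, start=(0, 0), goal=(0, 2)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
```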

04.11.2025 21:48 — 👍 2    🔁 0    💬 1    📌 0
Video thumbnail

Here is a generated video for solving the Chess problem. For more examples, check out: grow-ai-like-a-child.com/video-reason/

04.11.2025 21:45 — 👍 2    🔁 0    💬 1    📌 0
Post image

Let's see some examples. Video models are able to figure out the checkmate moves in the following problems.
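
One hedged sketch of how a proposed move could be verified as a checkmate when scoring such outputs, using the python-chess library (an assumption; not necessarily what VMEvalKit uses). The position is the classic scholar's-mate setup, purely as an example.

```python
import chess  # python-chess; an assumed helper library, not necessarily VMEvalKit's dependency

# Position after 1.e4 e5 2.Bc4 Nc6 3.Qh5 Nf6 (White to move) -- the classic
# scholar's-mate setup, used here only as an example puzzle.
board = chess.Board("r1bqkb1r/pppp1ppp/2n2n2/4p2Q/2B1P3/8/PPPP1PPP/RNB1K1NR w KQkq - 4 4")

move = board.parse_san("Qxf7")  # the move a model would have to find
board.push(move)
print(board.is_checkmate())     # True: the proposed move is indeed checkmate
```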

04.11.2025 21:45 — 👍 2    🔁 0    💬 1    📌 0
Post image

Idiosyncratic behavioral patterns exist.

For example, Sora-2 somehow figures out how to solve the Chess problems, while none of the other models show such an ability.

Veo 3 and 3.1 are actually able to do mental rotation quite well, but fail badly on the maze problems.

04.11.2025 21:44 — 👍 2    🔁 0    💬 1    📌 0
Post image

Tasks also exhibit a clear difficulty hierarchy across all models, with Sudoku being the easiest and mental rotation the hardest.

04.11.2025 21:38 — 👍 2    🔁 0    💬 1    📌 0
Post image

Models exhibit a clear performance hierarchy, with Sora-2 currently being the best model.

04.11.2025 21:37 — 👍 2    🔁 0    💬 1    📌 0
Post image

The basic unit of VMEvalKit is a Task Pair:

1️⃣ Initial image: unsolved puzzle
2️⃣ Text instruction: "Solve this ..."
3️⃣ Final image: correct solution (hidden during generation)

Models see (1)+(2); we compare their output to (3). Simple and straightforward ✅
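
A minimal sketch of this unit in Python; the class and field names are illustrative, not VMEvalKit's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskPair:
    """One evaluation unit as described above.
    Field names are illustrative, not necessarily VMEvalKit's actual schema."""
    initial_image: str   # (1) path to the unsolved puzzle frame
    instruction: str     # (2) text prompt, e.g. "Solve this ..."
    final_image: str     # (3) ground-truth solution, hidden during generation

    def model_inputs(self):
        """The model only ever sees (1) and (2); (3) is kept back for scoring."""
        return self.initial_image, self.instruction

pair = TaskPair(
    initial_image="maze_0001_start.png",   # hypothetical file names
    instruction="Solve this maze: move the green dot to the red flag.",
    final_image="maze_0001_solution.png",
)
print(pair.model_inputs())
# The last frame of the generated video is later compared against pair.final_image.
```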

04.11.2025 21:36 — 👍 2    🔁 0    💬 1    📌 0
Post image

‼️ Video models are starting to reason. Let's build scaled evals together, in public 🚀

github.com/hokindeng/VM... (Apache 2.0) offers
1️⃣ One-click inference across ALL available models
2️⃣ Unified API & datasets, with auto-resume, error handling, and eval
3️⃣ Plug new models and tasks in <5 lines of code (see the sketch below)
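
As a rough illustration of what a "<5 lines" plug-in could look like, here is a hypothetical registry-based sketch. The VideoModel base class, the register decorator, and call_my_api are assumptions for illustration, not VMEvalKit's documented interface.

```python
# Hypothetical plug-in sketch: the VideoModel base class, the register decorator,
# and call_my_api are assumptions for illustration, not VMEvalKit's documented API.

class VideoModel:
    """Assumed minimal interface: image path + instruction in, video path out."""
    def generate(self, image_path: str, instruction: str) -> str:
        raise NotImplementedError

MODEL_REGISTRY: dict[str, type] = {}

def register(name: str):
    """Class decorator that adds a model wrapper to the registry by name."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

# The "<5 lines" part: wrapping a new model is just subclass + decorator.
@register("my-video-model")
class MyVideoModel(VideoModel):
    def generate(self, image_path: str, instruction: str) -> str:
        return call_my_api(image_path, instruction)  # hypothetical external API call

print(MODEL_REGISTRY)  # {'my-video-model': <class '__main__.MyVideoModel'>}
```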

a thread (1/n)

04.11.2025 21:34 — 👍 4    🔁 1    💬 2    📌 0
Preview
Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling Spatial world models, representations that support flexible reasoning about spatial relations, are central to developing computational models that could operate in the physical world, but their precise mechanistic underpinnings are nuanced by the borrowing of underspecified or misguided accounts of human cognition. This paper revisits the simulation versus rendering dichotomy and draws on evidence from aphantasia to argue that fine-grained perceptual content is critical for model-based spatial reasoning. Drawing on recent research into the neural basis of visual awareness, we propose that spatial simulation and perceptual experience depend on shared representational geometries captured by higher-order indices of perceptual relations. We argue that recent developments in embodied AI support this claim, where rich perceptual details improve performance on physics-based world engagements. To this end, we call for the development of architectures capable of maintaining structured perceptual representations as a step toward spatial world modelling in AI.

Our paper is now available at arxiv.org/abs/2510.20835. For anyone interested, we’d love to hang out and chat 💬🧃

#EmbodiedAI #SpatialReasoning #NeuroAI #CognitiveScience

03.11.2025 00:16 — 👍 1    🔁 0    💬 0    📌 0

Third, in embodied AI, explicit simulators (MuJoCo/Isaac/Genesis) are vital but brittle on their own. Implicit world models (VIP, R3M, visual pretraining) supply perceptual structure that boosts generalization, long-horizon planning, and sim-to-real transfer.

03.11.2025 00:16 — 👍 1    🔁 0    💬 1    📌 0

However, visual and spatial mental content must co-construct conscious experiences rather than run on isolated tracks.

03.11.2025 00:16 — 👍 2    🔁 0    💬 1    📌 0

Second, it makes it sound as if the dorsal stream, where the "MuJoCo" software of our brain lies, almost becomes a "zombie" stream, i.e. one in which our conscious experience does not participate.

03.11.2025 00:15 — 👍 1    🔁 0    💬 1    📌 0

The first lies in our different interpretation of the neuro-clinical literature on aphantasia. People with aphantasia can solve mental rotation tasks yet report no visual imagery. We interpret this as a gating/decoding issue, not an absence of "rendering" in the brain.

03.11.2025 00:15 — 👍 2    🔁 0    💬 1    📌 0

We argue for an alternative: robust spatial reasoning needs fine-grained perceptual content and higher-order relational indices. There’s no free lunch: coarse abstraction into "language-of-thought"-like representations won’t yield human-like spatial competence.

03.11.2025 00:15 — 👍 1    🔁 0    💬 1    📌 0
