
Danny Sawyer

@dannypsawyer.bsky.social

AI researcher @GoogleDeepMind. PhD @Caltech. Interested in autonomous exploration and self-improvement, both in humans and embodied AI agents. Views my own.

6 Followers  |  2 Following  |  14 Posts  |  Joined: 09.10.2025

Latest posts by dannypsawyer.bsky.social on Bluesky

Proud to have been a part of this project! SIMA 2 brings us several steps closer to AGI in the real world with a Gemini-based agent that can reason, generalize, and self-improve in both seen and unseen 3D worlds, including new environments generated by Genie 3!

13.11.2025 17:17 · 👍 8  🔁 1  💬 0  📌 0

Thanks to all the authors! @janexwang 13/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 0  📌 0
Preview
Can foundation models actively gather information in interactive environments to test hypotheses? Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their abil...

In summary, our work provides a deeper understanding of the exploration and adaptation capabilities of frontier models. We show that these skills, while not yet robust, can be elicited.

Read the full paper for all the details!
arxiv.org/abs/2412.06438
#NeurIPS2025 12/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0

This reveals that a major frontier for foundation agents isn't just acting, but reflecting. The ability to improve through adaptive strategies over time is challenging, but not fundamentally out of reach.

Benchmarks like Alchemy are crucial for measuring this progress. 11/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
Post image

We took it a step further: strategy adaptation. We silently changed the environment's rules mid-episode.

We found that some models, such as Gemini 2.5 and Claude 3.7, could detect the change when aided by summarization and successfully adapt their strategy, recovering performance. 10/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
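A hypothetical sketch of that kind of silent mid-episode rule change (the actual perturbation protocol is described in the paper): the latent rule is resampled partway through the episode, the agent is never told, and adaptation shows up as scores recovering in the trials after the switch. `agent` here is a placeholder for any policy that conditions on its own history.

```python
# Hypothetical toy sketch of a silent mid-episode rule change, not the
# paper's actual protocol. The agent is a callable that maps its memory of
# past (trial, action, reward) tuples to the next action.
import random


def run_with_rule_switch(agent, n_trials=10, switch_trial=5, seed=0):
    rng = random.Random(seed)
    rule = rng.randrange(4)                        # initial hidden rule
    memory, scores = [], []
    for t in range(n_trials):
        if t == switch_trial:
            rule = (rule + rng.randrange(1, 4)) % 4  # silently switch to a different rule
        action = agent(memory)                     # the agent is never told about the switch
        reward = int(action == rule)
        memory.append((t, action, reward))
        scores.append(reward)
    return scores                                  # compare mean score before vs. after switch_trial
```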
Post image

With the summarization prompt, a latent meta-learning ability emerged. Models now showed significant score improvement across trials.

The act of summarizing forced them to consolidate their knowledge, enabling them to form and execute better strategies in later trials. 9/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
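One simple, illustrative way to quantify "improvement across trials" is the least-squares slope of score against trial index within an episode; this is only a sketch of the idea, and the paper's own statistical analysis may differ.

```python
# Illustrative metric only: fit a least-squares slope of score vs. trial
# index. A positive slope means scores trend upward across trials.
def trial_slope(scores):
    n = len(scores)
    mean_x, mean_y = (n - 1) / 2, sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(range(n), scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


print(trial_slope([1, 2, 2, 4, 5]))   # 1.0: clear improvement across trials
```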
Post image

This led to our key insight. We hypothesized the models weren't actively distilling principles from their long action history.

So, we prompted them to write a summary of their findings after each trial. The effect was dramatic. 8/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
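A minimal sketch of what such a post-trial summarization loop can look like. The prompt wording is paraphrased rather than the paper's exact prompt, and `query_model` / `run_trial` are hypothetical placeholders for the model API and the per-trial rollout.

```python
# Minimal sketch of a post-trial summarization loop, assuming a hypothetical
# query_model(prompt) -> str API and a run_trial(query_model, context) ->
# (transcript, score) rollout helper. Not the paper's actual harness.
def run_with_summaries(query_model, run_trial, n_trials=5):
    summary = ""                                   # distilled knowledge carried across trials
    scores = []
    for t in range(n_trials):
        transcript, score = run_trial(query_model, context=summary)
        scores.append(score)
        # After each trial, ask the model to consolidate what it learned
        # instead of dragging the full action history into the next trial.
        summary = query_model(
            "Here is your interaction log from the last trial:\n"
            f"{transcript}\n\n"
            "Summarize what you learned about the environment's rules, "
            "so you can act better in the next trial."
        )
    return scores
```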
Post image

But in the complex Alchemy environment, performance faltered. Without guidance, even the most powerful models showed no significant improvement across trials.

They gathered data but failed to integrate it into a better strategy. Meta-learning did not occur naturally. 7/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
Post image

In the simple Feature World tasks, most models performed near-optimally. They are highly efficient at gathering information when the goal is straightforward.

This shows the challenge isn't basic, single-turn reasoning. They can select informative actions in the moment. 6/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
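To make "selecting informative actions" concrete, here is a small worked example of greedy expected-information-gain test selection: pick the action whose outcome is most uncertain under the current hypothesis set. This is purely illustrative and not the evaluation metric used in the paper.

```python
# Worked toy example of greedy information-gain action selection.
# Hypotheses are candidate hidden rules; the most informative test is the
# one whose outcome best splits the remaining hypotheses.
import math


def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_test(hypotheses, tests):
    def outcome_entropy(t):
        p_positive = sum(h(t) for h in hypotheses) / len(hypotheses)
        return entropy(p_positive)
    return max(tests, key=outcome_entropy)


# Example: 4 candidate rules ("object is rewarded iff it has feature i");
# tests are objects given as sets of features.
hypotheses = [lambda obj, i=i: int(i in obj) for i in range(4)]
tests = [frozenset({0, 1}), frozenset({2}), frozenset({0, 1, 2, 3})]
print(best_test(hypotheses, tests))   # frozenset({0, 1}): splits the hypotheses evenly
```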
Post image

2️⃣ Alchemy: A multi-trial environment that requires agents to deduce latent causal rules and improve their strategy over time. The rules are randomized for each episode but stay the same across the trials within it.

This isolates facets of exploration that Feature World does not test. 5/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
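As a rough illustration of that multi-trial structure (a hypothetical toy, not DeepMind's Alchemy itself): a latent rule is sampled once per episode and then held fixed, so whatever the agent learns in trial k should pay off in trial k+1.

```python
# Hypothetical toy illustrating the multi-trial structure, not Alchemy
# itself. The latent rule is fixed for the whole episode; `agent` is any
# callable that conditions on the memory of earlier trials.
import random


def run_multitrial_episode(agent, n_trials=5, seed=0):
    rng = random.Random(seed)
    latent_rule = rng.randrange(4)     # hidden causal rule, fixed for the whole episode
    memory = []                        # carried across trials; this is where meta-learning lives
    scores = []
    for trial in range(n_trials):
        action = agent(memory)         # the agent conditions on everything it has seen so far
        reward = int(action == latent_rule)
        memory.append((trial, action, reward))
        scores.append(reward)
    return scores                      # rising scores across trials = successful within-episode learning
```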
Post image

We evaluated models in two environments:
1️⃣ Feature World (both text-based and 3D in Construction Lab): A stateless setting to test raw information-gathering efficiency. 4/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
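For intuition, here is a hypothetical toy stand-in for this kind of stateless information-gathering task; it is not the actual Feature World or Construction Lab environment, just a sketch of the setup it tests.

```python
# Hypothetical toy stand-in for a stateless information-gathering task
# (not the actual Feature World / Construction Lab environment). One hidden
# feature index is rewarding; each turn the agent tests one object and sees
# whether it is rewarded. Efficiency = how few tests pin down the rule.
import random


def make_world(n_objects=8, n_features=4, seed=0):
    rng = random.Random(seed)
    # Each object is a binary feature vector; exactly one feature index is rewarding.
    objects = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_objects)]
    rewarding_feature = rng.randrange(n_features)

    def test(obj_idx):
        # One information-gathering action: try an object, observe the reward.
        return objects[obj_idx][rewarding_feature]

    return objects, test, rewarding_feature
```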
Post image

These failure patterns offer interesting insights into how foundation models function, and also point toward ways to unlock these core embodied exploration abilities. 3/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
Video thumbnail

We benchmarked variants of GPT, Claude, and Gemini on exploration in several embodied environments. Surprisingly, although most models did well on stateless, single-turn tasks, many had critical limitations in adaptation and meta-learning in stateful, multi-turn tasks. 2/13

10.10.2025 17:10 · 👍 1  🔁 0  💬 1  📌 0
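A minimal sketch of what a model-agnostic, multi-turn evaluation loop can look like; `query_model` and the `env` interface here are assumed placeholders for the API client and a text-interfaced environment, not the paper's actual harness.

```python
# Minimal sketch of a model-agnostic multi-turn eval loop, assuming a
# text-interfaced environment. `query_model` stands in for whichever API
# client (GPT / Claude / Gemini) is being benchmarked, and `env` is any
# object exposing reset() and step(action) -> (observation, score, done).
from typing import Callable, List


def run_episode(query_model: Callable[[str], str], env, max_turns: int = 20) -> float:
    """Roll out one stateful, multi-turn episode, feeding the full
    interaction history back to the model at every turn."""
    history: List[str] = [f"Observation: {env.reset()}"]
    score = 0.0
    for _ in range(max_turns):
        prompt = "\n".join(history) + "\nNext action:"
        action = query_model(prompt).strip()
        observation, score, done = env.step(action)
        history += [f"Action: {action}", f"Observation: {observation}"]
        if done:
            break
    return score
```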
Video thumbnail

Happy to announce that our work has been accepted to workshops on Multi-turn Interactions and Embodied World Models at #NeurIPS2025! Frontier foundation models are incredible, but how well can they explore in interactive environments?
Paper 👇
arxiv.org/abs/2412.06438
🧵 1/13

10.10.2025 17:10 · 👍 6  🔁 2  💬 1  📌 1
