
Gabe Grand

@gabegrand.bsky.social

PhD student @csail.mit.edu πŸ€– & 🧠

60 Followers  |  126 Following  |  25 Posts  |  Joined: 27.10.2025

Latest posts by gabegrand.bsky.social on Bluesky


Paper + code + interactive demos: gabegrand.github.io/battleship βš“οΈπŸŽ―

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Special shoutout to @valeriopepe.bsky.social (co-first author), who is super talented and currently on the PhD job market!

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Thanks to Valerio Pepe, Josh Tenenbaum, and Jacob Andreas for long-horizon collaboration and planning: this line of Battleship work has been *2 years* in the making!

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Bottom line: The future of AI-driven discovery isn't just bigger modelsβ€”it's smarter inference. By combining LMs with rational planning strategies, we can build agents that ask better questions, make better decisions, and collaborate effectively with humans.

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 2    πŸ“Œ 0

Why does this matter? Discovery-driven AI (scientific experiments, theorem proving, drug discovery) requires hitting needles in combinatorially vast haystacks. If we want agents that explore rationally, we need to go beyond prompting.

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Key takeaway: Current LMs aren’t rational information seekers: they struggle to ground answers in context, generate informative queries, and balance exploration vs. exploitation. But Bayesian inference at test time can dramatically close these gapsβ€”efficiently.

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Does this generalize? YES. We replicated the approach on "Guess Who?" from TextArena and saw similar gains: GPT-4o (61.7% β†’ 90.0%), Llama-4-Scout (30.0% β†’ 72.4%). The framework works across information-seeking domains with combinatorial hypothesis spaces.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Deciding when to explore vs. act is also key. Skilled players (humans + GPT-5) spread their questions out over the course of the game. Weak LMs spam all 15 of their questions upfront. The key isn't asking MOREβ€”it's asking BETTER questions at the RIGHT time. Quality > quantity.

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Here's the kicker: asking high-EIG questions alone doesn't guarantee wins. Weaker models struggle to convert information into good moves. Bayes-Mβ€”which explicitly marginalizes over beliefsβ€”is crucial for translating questions into action.
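To make the marginalization concrete, here's a minimal sketch of a Bayes-M-style move rule, assuming we already have posterior samples of ship placements (names and shapes here are illustrative, not the paper's code):

```python
import numpy as np

def bayes_m_move(board_samples, revealed):
    """Return the unrevealed cell with the highest marginal hit probability.

    board_samples: (S, H, W) boolean posterior samples over ship placements.
    revealed: (H, W) boolean mask of cells already fired at.
    """
    # Marginalize over sampled hypotheses:
    # P(hit at cell) is approximated by the fraction of samples with a ship there.
    hit_prob = board_samples.mean(axis=0)
    hit_prob[revealed] = -1.0  # never re-fire at a cell we've already revealed
    return np.unravel_index(np.argmax(hit_prob), hit_prob.shape)
```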

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Our approach leverages inference scaling to enable models to ask more informative questions. Bayes-Q boosts EIG by up to 0.227 bits (94.2% of the theoretical ceiling) and virtually eliminates redundant questions (18.5% β†’ 0.2% for Llama-4-Scout).
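For intuition, here's a hedged sketch of how EIG can be estimated for a yes/no question: evaluate the question on each posterior board sample and take the entropy of the predicted answer distribution (the function name and interface are assumptions, not the paper's API):

```python
import math

def binary_eig(sample_answers):
    """Estimate a yes/no question's expected information gain (bits).

    sample_answers: the question's (deterministic) answer on each posterior
    board sample. With deterministic answers, EIG reduces to the entropy of
    the predicted answer distribution, peaking at 1 bit on a 50/50 split.
    """
    p = sum(sample_answers) / len(sample_answers)
    if p in (0.0, 1.0):
        return 0.0  # answer already certain: asking gains nothing
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
```

In a Bayes-Q-style search, candidate questions proposed by the LM would then be scored this way and the highest-EIG one kept.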

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

In head-to-head comparisons, both GPT-4o and Llama-4-Scout now beat GPT-5 while costing 2.8x and 99.7x less, respectively.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

With all three Bayesian components (+Bayes-QMD), Llama-4-Scout jumps from near-random guessing (0.367 F1) to super-human level (0.764 F1). GPT-4o sees similar gains (0.450 β†’ 0.782 F1). The deltas are really striking.

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We developed three Bayesian strategies inspired by Bayesian Experimental Design (BED):
❓ Question (Bayes-Q): Optimizes expected info gain (EIG)
🎯 Move (Bayes-M): Maximizes hit probability
βš–οΈ Decision (Bayes-D): Decides when to ask vs. shoot using one-step lookahead

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

In our second set of experiments, we turned to the challenge of building rational question-asking agents to play the Captain role.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We find that having models write Python functions to answer questions boosts accuracy by +14.7 percentage points and complements CoT reasoning.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

One useful trick to improve answering accuracy is to use code generation. Code grounds reasoning in executable logic, not just vibes.
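As a toy illustration (the board encoding and function signature are hypothetical, not the paper's interface), a generated answerer might look like:

```python
import numpy as np

RED = 2  # hypothetical tile id for the red ship in a symbolic board encoding

# The kind of function a model might generate for
# "Is the red ship horizontal?", executed against the board state:
def answer(board: np.ndarray) -> bool:
    rows, _cols = np.where(board == RED)
    return len(set(rows)) == 1  # all red-ship tiles in one row => horizontal
```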

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Many LMs really struggle with questions that require grounding answers in the board and dialogue context. GPT-4o drops from 72.8% β†’ 60.4% accuracy on context-dependent questions. Llama-4-Scout: 68.0% β†’ 54.0%. Humans? Basically flat (92.8% vs 91.9%).

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Overall, humans are really reliable at answering questions on BattleshipQA (92.5% accuracy). In contrast, LM accuracy ranges widelyβ€”from near-random (52.5%, GPT-4o-mini) to human-level (92.8%, o3-mini). But there's a catch…

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

In our first experiment, we looked at QA accuracy in the Spotter role – an important sanity check of how well players (humans & agents) can understand and reason about the game state.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

To understand how people strategize & collaborate, we ran a two-player synchronous human study (N=42) and collected full action trajectories and chat dialogues. Our β€œBattleshipQA” dataset provides a rich, multimodal benchmark for comparing human and agent behavior.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We created β€œCollaborative Battleship”—a two-player game where a Captain (who only sees a partial board) must balance asking questions vs. taking shots, while a Spotter (who sees everything) can only answer Yes/No. It's deceptively simple but cognitively demanding.
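For concreteness, here's a minimal sketch of the asymmetric-information state, with a hypothetical encoding rather than the paper's actual implementation:

```python
from dataclasses import dataclass

HIDDEN = -1  # hypothetical marker for tiles the Captain hasn't fired at

@dataclass
class BattleshipState:
    """Asymmetric information: the Spotter sees full_board; the Captain
    only sees the tiles revealed by their shots."""
    full_board: list[list[int]]   # ground-truth tile ids (Spotter's view)
    revealed: list[list[bool]]    # which tiles the Captain has shot at
    questions_left: int           # Captain's remaining question budget

    def captain_view(self) -> list[list[int]]:
        return [[tile if seen else HIDDEN
                 for tile, seen in zip(row, mask)]
                for row, mask in zip(self.full_board, self.revealed)]
```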

27.10.2025 19:17 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

But LMs are trained to *answer* queries, not *ask* them. Can they learn to explore intelligently?

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Many high-stakes AI applications require asking data-driven questionsβ€”think scientific discovery, medical diagnosis, or drug development.

27.10.2025 19:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Do AI agents ask good questions? We built β€œCollaborative Battleship” to find outβ€”and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost.

Paper, code & demos: gabegrand.github.io/battleship

Here's what we learned about building rational information-seeking agents... πŸ§΅πŸ”½

27.10.2025 19:17 β€” πŸ‘ 19    πŸ” 8    πŸ’¬ 1    πŸ“Œ 2

Hello! Late to the party, but still excited to join this brave blue world πŸŒŽπŸ¦‹

27.10.2025 19:07 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
