Davide Paglieri

@dpaglieri.bsky.social

PhD Student at UCL. Previously AI Research Engineer at Bending Spoons

264 Followers  |  43 Following  |  13 Posts  |  Joined: 21.11.2024

Latest posts by dpaglieri.bsky.social on Bluesky

GitHub - balrog-ai/BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

🚀 BALROG is open for submissions! We welcome submissions of new foundation models and new agentic pipelines.

Check it out here:
github.com/balrog-ai/BA...

16.01.2025 11:30 - 👍 1    🔁 0    💬 0    📌 0

This suggests that high performance on popular static benchmarks does not necessarily translate to dynamic agentic tasks, and training data contamination may also play a role.

🆕 BALROG introduces a new type of agentic benchmark designed to be robust to training data contamination.

16.01.2025 11:30 - 👍 1    🔁 0    💬 1    📌 0

🚨 This week's new entry on balrogai.com is Microsoft Phi-4 (14B model)

While Phi-4 excels on benchmarks like math competitions, BALROG reveals that Phi-4 falls short as an agent. More research on how to improve agentic performance is needed.

16.01.2025 11:30 - 👍 1    🔁 0    💬 1    📌 0

BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games

Interested in submitting to BALROG? Check out the instructions here!

balrogai.com/submit.html

Some big models we are looking to evaluate:

OpenAI O1
Gemini 2.0 Flash
Grok-2
Llama-3.1-405B
Pixtral-120B
Mistral-Large (123B)

If you have resources to contribute, feel free to reach out!

12.12.2024 11:30 - 👍 2    🔁 0    💬 0    📌 0

Llama-3.3-70B-it 🫤 -> Not as good as the 3.1-70B version on BALROG's tasks.

Claude 3.5 Haiku ✨ -> A little gem, the best of the smaller closed-source models. It even gets 1.1% progression on NetHack! 🏰 Was it trained on NLE? 🤔

Mistral-Nemo-it 🆗 -> Okay for its size (12B)

12.12.2024 11:30 - 👍 1    🔁 0    💬 1    📌 0

🚨 BALROG leaderboard update

This week's new entries on balrogai.com are:

Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku ✨
Mistral-Nemo-it (12B) 🆗

Github: github.com/balrog-ai/BA...

12.12.2024 11:30 - 👍 9    🔁 1    💬 1    📌 0

I'm excited to share a new paper: "Mastering Board Games by External and Internal Planning with Language Models"

storage.googleapis.com/deepmind-med...

(also soon to be up on Arxiv, once it's been processed there)

05.12.2024 07:49 - 👍 76    🔁 13    💬 4    📌 7

Introducing 🧞 Genie 2 🧞 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents 🧠.

04.12.2024 16:01 - 👍 234    🔁 60    💬 15    📌 30

It's great to see BALROG featured on Jack Clark's Import AI newsletter!

Check out what he had to say about it here:
jack-clark.net

And check out BALROG's leaderboard on balrogai.com

04.12.2024 09:37 - 👍 6    🔁 1    💬 0    📌 0

Do you know what rating you’ll give after reading the intro? Are your confidence scores 4 or higher? Do you not respond in rebuttal phases? Are you worried how it will look if your rating is the only 8 among 3’s? This thread is for you.

27.11.2024 17:25 - 👍 78    🔁 20    💬 4    📌 3

Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games" led by UCL DARK's @dpaglieri.bsky.social! Douwe Kiela's plot below is maybe the scariest for AI progress: LLM benchmarks are saturating at an accelerating rate. BALROG to the rescue. This will keep us busy for years.

22.11.2024 11:27 - 👍 125    🔁 15    💬 3    📌 1

This may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision

The hardest of all is Nethack. No AI is close, and I suspect that an AI that can fairly win/ascend would need to be AGI-ish. Paper: balrogai.com

23.11.2024 04:31 - 👍 186    🔁 19    💬 7    📌 3

Your LLM shall not pass! 🧙‍♂️

... unless it's really good at reasoning and games!

Check out this new amazing benchmark BALROG 👾 from @dpaglieri.bsky.social and team 👇

21.11.2024 16:47 - 👍 6    🔁 1    💬 0    📌 0

BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games

🚨 BALROG is LIVE 🚨

🔗 Website with leaderboard: balrogai.com
📰 Paper: arxiv.org/abs/2411.13543
📜 Code: github.com/balrog-ai/BA...

No more excuses about saturated benchmarks or a lack of agentic LLM/VLM benchmarks. BALROG is here!

21.11.2024 16:24 - 👍 6    🔁 0    💬 1    📌 1

The ultimate test? NetHack 🏰

This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔

They’re nowhere near capable enough yet!

21.11.2024 16:24 - 👍 11    🔁 0    💬 1    📌 0

And the results are in!

🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 LLaMA models show scaling laws from 1B to 70B, holding their own impressively!

🧠 Curious about how your model stacks up? Submit now!

21.11.2024 16:24 - 👍 5    🔁 0    💬 1    📌 0

What makes BALROG unique?

✅ Easy evaluation for LLM/VLM agents locally or via popular APIs
✅ Highly parallel, efficient setup
✅ Supports zero-shot eval & more complex strategies

It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀

21.11.2024 16:24 - 👍 6    🔁 0    💬 1    📌 0

BALROG brings together 6 challenging RL environments, including Crafter, BabaIsAI and the notoriously challenging NetHack.

BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch

21.11.2024 16:24 - 👍 8    🔁 1    💬 1    📌 0

Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵

21.11.2024 16:24 - 👍 96    🔁 20    💬 4    📌 7
