BALROG is open for submissions! We welcome submissions of new foundation models and new agentic pipelines.
Check it out here:
github.com/balrog-ai/BA...
This suggests that high performance on popular static benchmarks does not necessarily translate to dynamic agentic tasks, and training data contamination may also play a role.
BALROG introduces a new type of agentic benchmark designed to be robust to training data contamination.
This week's new entry on balrogai.com is Microsoft Phi-4 (a 14B model).
While Phi-4 excels on benchmarks like math competitions, BALROG reveals that it falls short as an agent. More research on how to improve agentic performance is needed.
Interested in submitting to BALROG? Check out the instructions here!
balrogai.com/submit.html
Some big models we are looking to evaluate:
OpenAI o1
Gemini 2.0 Flash
Grok-2
Llama-3.1-405B
Pixtral-120B
Mistral-Large (123B)
If you have resources to contribute, feel free to reach out!
Llama-3.3-70B-it -> Not as good as the 3.1-70B version on BALROG's tasks.
Claude 3.5 Haiku -> A little gem, the best of the smaller closed-source models. It even gets 1.1% progression on NetHack! Was it trained on NLE?
Mistral-Nemo-it -> Okay for its size (12B)
BALROG leaderboard update
This week's new entries on balrogai.com are:
Llama 3.3 70B Instruct
Claude 3.5 Haiku
Mistral-Nemo-it (12B)
GitHub: github.com/balrog-ai/BA...
I'm excited to share a new paper: "Mastering Board Games by External and Internal Planning with Language Models"
storage.googleapis.com/deepmind-med...
(also soon to be up on arXiv, once it has been processed there)
Introducing Genie 2 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents.
04.12.2024 16:01
It's great to see BALROG featured in Jack Clark's Import AI newsletter!
Check out what he had to say about it here:
jack-clark.net
And check out BALROG's leaderboard on balrogai.com
Do you know what rating you'll give after reading the intro? Are your confidence scores 4 or higher? Do you not respond during rebuttal phases? Are you worried about how it will look if your rating is the only 8 among 3s? This thread is for you.
27.11.2024 17:25
Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games", led by UCL DARK's @dpaglieri.bsky.social! Douwe Kiela's plot below is maybe the scariest for AI progress: LLM benchmarks are saturating at an accelerating rate. BALROG to the rescue. This will keep us busy for years.
22.11.2024 11:27
This may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision.
The hardest of all is NetHack. No AI is close, and I suspect that an AI that can fairly win/ascend would need to be AGI-ish. Paper: balrogai.com
Your LLM shall not pass!
... unless it's really good at reasoning and games!
Check out this amazing new benchmark, BALROG, from @dpaglieri.bsky.social and team!
BALROG is LIVE!
Website with leaderboard: balrogai.com
Paper: arxiv.org/abs/2411.13543
Code: github.com/balrog-ai/BA...
No more excuses about saturated benchmarks or the lack of agentic LLM/VLM benchmarks. BALROG is here!
The ultimate test? NetHack.
This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better?
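For intuition on what "average progression" means here, the sketch below shows one way a milestone-based progression score could be computed over evaluation episodes. The milestone names and the scoring function are purely illustrative assumptions, not BALROG's actual NetHack metric.

```python
# Illustrative sketch only: a hypothetical milestone-based progression score.
# These milestones and this scoring are NOT BALROG's actual NetHack metric.

NETHACK_MILESTONES = [
    "reach_dungeon_level_2",
    "reach_dungeon_level_5",
    "enter_gnomish_mines",
    "reach_mines_end",
    "retrieve_amulet",
    "ascend",
]

def episode_progression(achieved: set[str]) -> float:
    """Fraction of milestones reached in one episode (1.0 = full ascension)."""
    return sum(m in achieved for m in NETHACK_MILESTONES) / len(NETHACK_MILESTONES)

def average_progression(episodes: list[set[str]]) -> float:
    """Mean progression across evaluation episodes (e.g. 0.015 = 1.5%)."""
    return sum(episode_progression(ep) for ep in episodes) / len(episodes)
```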
They're nowhere near capable enough yet!
And the results are in!
GPT-4o leads the pack in LLM performance
Claude 3.5 Sonnet shines as the top VLM
LLaMA models scale consistently from 1B to 70B, holding their own impressively!
Curious about how your model stacks up? Submit now!
What makes BALROG unique?
✅ Easy evaluation for LLM/VLM agents, locally or via popular APIs
✅ Highly parallel, efficient setup
✅ Supports zero-shot eval & more complex strategies (see the sketch after this post)
It's plug-and-play for anyone benchmarking LLMs/VLMs.
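To make "zero-shot eval" concrete, here is a minimal sketch of what an LLM agent loop on a text game can look like. The environment interface (`reset`/`step`/`valid_actions`), the `query_llm` helper, and the prompt format are hypothetical placeholders, not BALROG's actual API; the real agent and environment interfaces live in the GitHub repo.

```python
# Minimal sketch of a zero-shot LLM agent loop on a text-based game.
# The environment interface (reset/step/valid_actions) and query_llm are
# hypothetical placeholders, not BALROG's actual API.

def query_llm(prompt: str) -> str:
    """Call your model of choice (local or via an API) and return its reply."""
    raise NotImplementedError  # plug in your OpenAI/Anthropic/local client here

def run_episode(env, max_steps: int = 100) -> float:
    obs = env.reset()                       # textual observation of the game state
    total_reward = 0.0
    for _ in range(max_steps):
        prompt = (
            "You are playing a text-based game.\n"
            f"Observation:\n{obs}\n"
            f"Valid actions: {env.valid_actions()}\n"
            "Reply with a single action."
        )
        action = query_llm(prompt).strip()  # zero-shot: no examples, no fine-tuning
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```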
BALROG brings together 6 challenging RL environments, including Crafter, BabaIsAI, and the notoriously difficult NetHack.
BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. #AIResearch
Tired of saturated benchmarks? Want scope for a significant leap in capabilities?
Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!
BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.
1/🧵