Call for Papers: LM Playschool (LMP 2026) – Co-located with EMNLP 2026!
Can #LLMs learn, adapt, and improve through situated, game-based interaction?
See lm-playschool.github.io
#GenAI #NLProc #HRI #ELLISforEurope #AI #ML
Any use, exploitation, or sharing of the leaked information is a violation of OpenReview's Terms of Use (openreview.net/legal/terms) and ACL's code of conduct (2026.eacl.org/code/) and may result in OpenReview account suspension, desk rejection and multi-year bans from *ACL conferences. (🧵 2/3)
📢 Statement from ACL and EACL 2026 Organizers
On Nov 27, OpenReview was notified of a software bug that allowed unauthorized access to author, reviewer, and area chair information. We are grateful to the OpenReview team for fixing the issue quickly. (🧵 1/3)
Bonus post advertising this other thread through the medium of "memes" which I've been told is what you have to do on social media.
(That animation in the first post? That's Claude trying, and failing, to fully explore a maze in the MapWorld game.)
We'd love for other people to use it to test the interaction / agentic abilities of their models, and/or to build new fun and challenging games / interactions!
github.com/clp-research...
github.com/clp-research...
»
Thanks to a recent short-term grant, we've been able to focus on code quality and ease of use for benchmarking and extensibility. (Exploring new games is a fun programming lab activity, which we've run several times by now!) Here's a writeup of the current state: arxiv.org/abs/2507.08491
»
clembench now spans abstract (e.g., wordle) and concrete tasks (simulated household); language and l+vision; and benchmarking, learning (playpen), and user simulation (clem:todd).
arxiv.org/abs/2504.08590
arxiv.org/abs/2505.05445
»
It's great to see the idea of using games / interactions to evaluate LLMs gain traction, with textarena.ai and now ARC-AGI-3 being the latest entrants.
This is something we've been exploring since early 2023 with clembench ( clembench.github.io ), which we've been continuously maintaining & extending. »
📄 [ACL 2025 main] LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks (doi.org/10.48550/arX...)
Ha, yes, I'm quite pleased as well with how that turned out. It's nothing fancy, just a nice font, colouring (obviously), fbox, and rotate.
This was the outcome of a collaboration that started last year at an ELLIS workshop, and that has brought together many labs (and many master's and PhD students, and PIs).
Much more remains to be explored in "learning in interaction" -- maybe by you?
🤖🧠 #NLP #AI #LLM
We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction which we believe will provide more effective ways of post-training agentic conversational LLMs. github.com/lm-playpen/p...
We find that imitation learning through SFT improves performance on unseen game instances, but does not generalise to new games and negatively impacts other skills -- while interactive learning with GRPO shows balanced improvements without loss of skills.
Together with the learning environment, we define an experimental setup combining gameplay evaluation on unseen games with traditional NLP benchmarks such as MMLU, following Momentè et al. (2025) arxiv.org/abs/2502.14359
Playpen is a training environment for post-training LLMs through learning in interaction, by self-play of "dialogue games": goal-oriented language-based activities that generate verifiable rewards.
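To make the "dialogue game" idea concrete, here is a minimal toy sketch of a goal-oriented language game whose outcome yields a verifiable reward. All names here (TabooGame, play_episode, the toy "models") are illustrative inventions, not the Playpen API:

```python
class TabooGame:
    """Toy Taboo-style dialogue game: a describer must get a guesser
    to say the target word without using any taboo word."""

    def __init__(self, target, taboo):
        self.target = target
        self.taboo = set(taboo)

    def reward(self, description, guess):
        # The reward is verifiable: rule violations and success are
        # checkable by string matching, with no human judge needed.
        if any(t in description.lower() for t in self.taboo):
            return -1.0  # rule violation
        return 1.0 if guess.lower() == self.target else 0.0


def play_episode(game, describer, guesser):
    """One self-play episode: the describer model produces a clue,
    the guesser model produces a guess, the game scores the transcript."""
    description = describer(game.target, game.taboo)
    guess = guesser(description)
    return game.reward(description, guess)


# Toy functions standing in for LLM calls.
def toy_describer(target, taboo):
    return "a small furry pet that purrs"


def toy_guesser(description):
    return "cat" if "purrs" in description else "dog"


game = TabooGame(target="cat", taboo=["cat", "kitten", "meow"])
print(play_episode(game, toy_describer, toy_guesser))  # 1.0
```

In a post-training loop, such episode-level rewards would then feed an RL objective (e.g. GRPO, as in the baselines below), with the two roles played by the model being trained.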
🚨 New pre-print! (Well, new & much improved version in any case.) 🚨
If you're interested in LLM post-training techniques and in how to make LLMs better "language users", read this thread, introducing the "LM Playpen".
The University of Potsdam invites applications for 5 postdoc positions, incl. Cognitive Sciences, incl. NLP (esp. cognitive).
These are fairly independent research positions that will allow candidates to build their own profile. Deadline: June 2nd.
Details: tinyurl.com/pd-potsdam-2...
#NLProc #AI 🤖🧠
There's indeed suddenly a bit of flexibility in a system that's not exactly known for that. If there's anyone (post-doc, tenure-track, or more senior) in the #NLP space currently in the US who'd like to explore possibilities in Potsdam, contact me.
🤖🧠
www.nytimes.com/2025/05/14/b...
"We ablated both algorithm and hyperparameter choices [...]"
When did "to ablate" take on the meaning "to systematically vary"? I've noticed this only recently, but it seems to be super common now.
Update 2: New pre-print! Outcome of an ELLIS workshop last year, & more than a year of discussions and work, across labs and countries: Meet the Playpen, an environment for exploring learning in dialogic interaction.
arxiv.org/abs/2504.08590
1/2
[Sneak preview: If you're wondering where this is going, have a secret look at lm-playschool.github.io -- and stay tuned for more info!]
3/2
Nice baseline results as well: learning via SFT from transcripts does a bit, but only "real"(-ish) learning in interaction (GRPO) generalises. (Basically, you want to see the whole row being green in this table.)
2/2
This is only a subset of the models on the leaderboard, visit the site to see all 32 models, and also the results for the multimodal version of the benchmark.
Update 1: New models added to our dialogue game-based agentic LLM leaderboard. TL;DR: GPT-4.1 as good as 4o, but much cheaper. Llama4 indeed not very good (decisively worse than 3.2 70B!). OLMo decent, but there's still a secret sauce that only closed labs have.
clembench.github.io
Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, ...
Playpen: An Environment for Exploring Learning Through Conversational Interaction
https://arxiv.org/abs/2504.08590
If the Greens knew how to negotiate, then on the day before announcing an agreement on the debt brake, Söder and Dobrindt would be announcing that they will stay out of federal politics forever (and that the CSU will never again supply a transport minister).
Press release by my Uni about our benchmark for LLMs as agents, which is now out in v2.0.
Check it out here: clembench.github.io