Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding: when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.
w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social (1/12)
29.07.2025 12:05
Full-time research assistant position (1 year) with @sebschu.bsky.social and me!
We're looking for someone to join the research agent evaluation team, starting Fall 2025. The application link will be available soon, but feel free to send us your CV and/or come talk to us at #ACL2025.
25.07.2025 17:08
This work was done by our amazing team: @nedwards99.bsky.social, @yukyunglee.bsky.social, Yujun (Audrey) Mao, and Yulu Qin. And as always, it was super fun co-directing this with @najoung.bsky.social. We also thank Max Nadeau and Ajeya Cotra for initial advice and support.
02.07.2025 15:39
RExBench: A benchmark of machine learning research extensions for evaluating coding agents
Think your agent can do better? Check out the paper, download the data, and submit your agent to our leaderboard:
Website: rexbench.com
Paper: arxiv.org/abs/2506.22598
We note that the current set of RExBench tasks is NOT extremely challenging for a PhD-student-level domain expert. We hope to release a more challenging set of tasks in the near future, and would be excited about community contributions, so please reach out if you are interested!
What makes an extension difficult for agents?
Statistically, tasks requiring more lines of change in the gold solution were harder. Meanwhile, repo size and popularity had only marginal effects. Qualitatively, agent performance aligned poorly with human-expert-perceived difficulty!
Figure comparing results by hint level for each agent.
What if we give them hints?
We provided two levels of human-written hints. L1: information localization (e.g., which files to edit) & L2: step-by-step guidance. With hints, the best agent's performance improves to 39%, showing that substantial human guidance is still needed.
Result plot showing final success rate, execution success rate, and file recall for each of the agents. Final success rate was still only around 25% for the best agent.
Results! All agents we tested struggled on RExBench.
The best-performing agents (OpenHands + Claude 3.7 Sonnet and Claude Code) only had a 25% average success rate across 3 runs. But we were still impressed that the top agents achieved end-to-end success on several tasks!
The execution outcomes are evaluated against expert implementations of the extensions. This process is fully conducted inside our privately hosted VM-based eval infra. This eval design, together with the targets being novel extensions, makes RExBench highly resistant to data contamination.
We created 12 realistic extensions of existing AI research and tested 9 agents built upon aider, Claude Code (@anthropic.com) and OpenHands.
The agents get papers, code, & extension hypotheses as inputs and produce code edits. The edited code is then executed.
Why do we focus on extensions?
New research builds on prior work, so understanding existing research & building upon it is a key capability for autonomous research agents. Many research coding benchmarks focus on replication, but we wanted to target *novel* research extensions.
Screenshot of the RExBench preprint title page.
Can coding agents autonomously implement AI research extensions?
We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code.
Finding: Most agents we tested had a low success rate, but there is promise!
Web Agents @mila-quebec.bsky.social
@mcgill-nlp.bsky.social
PhD-ing at McGill Linguistics + Mila, working under Prof. Siva Reddy. Mostly computational linguistics, with some NLP; habitually disappointed Arsenal fan
Assistant Professor at @cs.ubc.ca and @vectorinstitute.ai working on Natural Language Processing. Book: https://lostinautomatictranslation.com/
Associate Professor and Computational Linguist @ University of Augsburg, Germany
Professor for Natural Language Processing (@utn_nuremberg), CoNLL co-chair 2025, organizer of LSDSem & UnImplicit workshops, expert in misunderstandings.
Linguist, cognitive scientist at University of Stuttgart. I study language and how we understand it one word at a time.
#codiworkshop, #NLP, #discourse
Computational psycholinguistics PhD student @NYU linguistics | first gen!
Ph.D. Candidate at MIT | Brain and Cognitive Sciences
ckauf.com
Multimodal Communication and Learning in Social Interactions (CoCoDev team). Associate Professor of Computer/Cognitive Science at Aix-Marseille University.
afourtassi.github.io
Associate Professor at GroNLP (@gronlp.bsky.social) #NLP | Multilingualism | Interpretability | Language Learning in Humans vs NeuralNets | Mum^2
Head of the InClow research group: https://inclow-lm.github.io/
Asst Prof. @ UCSD | PI of LeMoN Lab | Former Postdoc at ETH Zürich, PhD @ NYU | computational linguistics, NLProc, CogSci, pragmatics | he/him
alexwarstadt.github.io
Cognitive Scientist at Max Planck, Professor of Psychology
https://www.falkhuettig.com/
Author of 'Looking Ahead: The New Science of the Predictive Mind' published by Cambridge University Press on 6 March 2025.
Co-Founder and Chief Scientist | Venture Partner | Professor | Mentor | Proud mom of 3
PhD student @mainlp.bsky.social (@cislmu.bsky.social, LMU Munich). Interested in language variation & change, currently working on NLP for dialects and low-resource languages.
verenablaschke.github.io
Researcher in Neuroscience & AI
CNRS, École Normale Supérieure, PSL
currently detached to Meta
PhD Student at the ILLC / UvA doing work at the intersection of (mechanistic) interpretability and cognitive science. Current Anthropic Fellow.
hannamw.github.io