I’ll be giving a talk at @eth-ai-center.bsky.social tomorrow, March 10, at 11:30am on LLM benchmarking incentives.
Spoiler: today’s benchmarking incentives can produce unreliable model rankings, but we can fix that!
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: we'll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time scaling:
Excited to be at #NeurIPS2025 this week to present our paper "Monoculture or Multiplicity: Which is it?", joint work with Moritz Hardt.
📄 Paper #1000: openreview.net/pdf?id=DO5Lt...
📍 Wed, Dec 3, 2025 • 4:30 PM – 7:30 PM
Feel free to come by and reach out!
A short 🧵.
Joint work w/ Safwan Hossain and Yiling Chen.
Paper link: arxiv.org/pdf/2508.03289
Drop by our poster session for more details!
4/4
Key takeaway: A statistical decision rule doesn't just make decisions; it determines who shows up to be evaluated.
3/n
We build a game-theoretic model of this interaction and show that there exists a computable critical p-value threshold α that cleanly structures false positive & false negative errors. Empirically, it aligns with FDA data!
2/n
I'll be at @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)
tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it?
1/n
We have an amazing lineup of keynote speakers: @iaugenstein.bsky.social, José Hernández-Orallo, @gaelvaroquaux.bsky.social, Laura Weidinger, and Emine Yilmaz.
Submit your work and join us in Copenhagen 🇩🇰!
We (w/ Moritz Hardt, Olawale Salaudeen and @joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI at @euripsconf.bsky.social 2025 in Copenhagen!
📢 Call for Posters: rb.gy/kyid4f
📅 Deadline: Oct 10, 2025 (AoE)
🔗 More info: rebrand.ly/bg931sf