Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We’ll present two works on errors in LLM-as-Judge and their impact on benchmarking and test-time scaling:
05.12.2025 08:57 — 👍 7 🔁 3 💬 1 📌 0
@milago.bsky.social
PhD student in Machine Learning @ MPI-IS Tübingen, Tübingen AI Center, IMPRS-IS
At #NeurIPS in San Diego this week? Interested in XAI, causality, or performative prediction? Come visit our poster!
💬 Performative Validity of Recourse Explanations
📆 Wednesday, 4.30 pm, Poster Session 2
w/ Hidde Fokkema, Timo Freiesleben, Celestine Mendler-Dünner, Ulrike von Luxburg
Attending #Neurips2025? Get your personalized Scholar Inbox conference program now to easily navigate the poster sessions and find what you are looking for:
www.scholar-inbox.com/conference/n...
I'll be at @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)
tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it?
1/n
The empirical landscape sits between the two extremes.
- Model similarity is high, yet disagreements let individuals find recourse by switching models.
- Systemic exclusion is rare, yet more likely than under strong multiplicity.
- Even in a single model, prompt variations induce multiplicity.
We evaluate 50 LLMs (various sizes & providers) across 6 tasks to assess how well each narrative fits the current LLM landscape, assuming that decision makers will increasingly rely on these models for consequential predictions.
02.12.2025 15:57 — 👍 1 🔁 0 💬 1 📌 0
There are two narratives about model ecosystems that grew out of the algorithmic fairness debate:
1. Monoculture: models converge toward homogeneity.
2. Multiplicity: many models solve tasks similarly but disagree on individual predictions, creating outcome variation.
Excited to be at #Neurips2025 this week to present our paper "Monoculture or Multiplicity: Which is it?", joint work with Moritz Hardt.
📄 Paper #1000: openreview.net/pdf?id=DO5Lt...
📍 Wed, Dec 3, 2025 • 4:30 PM – 7:30 PM
Feel free to come by and reach out!
A short 🧵.