In light of the discussions about LLM-generated ICLR reviews, I recently wondered whether a similar dynamic might play out for LLMs: while pre-training objectives promote approximate indistinguishability of generated text, increasingly heavy post-training might make detection a lot easier...
12.12.2025 18:03
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
In the second paper (arxiv.org/abs/2410.13341), we show that LLM judges weaker than the models they evaluate are of limited use for benchmarking, even if their judgments are processed in a statistically optimal way. Consequently, we cannot rely on LLM judges for evaluating frontier models.
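As a rough intuition for the "won't beat twice the data" limit, here is a toy Monte-Carlo sketch (purely illustrative; the Bernoulli correctness model, judge_noise, and the sample sizes are my assumptions, not the paper's setup):

import numpy as np

rng = np.random.default_rng(0)

def trial(n_gold=200, n_unlabeled=20000, p_true=0.7, judge_noise=0.3):
    # Toy model: correctness is Bernoulli(p_true); an imperfect judge flips each label w.p. judge_noise.
    gold = rng.binomial(1, p_true, n_gold)
    judge_gold = np.where(rng.random(n_gold) < judge_noise, 1 - gold, gold)
    truth_unlab = rng.binomial(1, p_true, n_unlabeled)
    judge_unlab = np.where(rng.random(n_unlabeled) < judge_noise, 1 - truth_unlab, truth_unlab)
    # Debiased judge-based accuracy estimate: cheap judge mean plus a correction from the gold subset.
    debiased = judge_unlab.mean() + (gold - judge_gold).mean()
    # Baseline: simply annotating twice as many examples.
    twice_gold = rng.binomial(1, p_true, 2 * n_gold).mean()
    return debiased, twice_gold

runs = np.array([trial() for _ in range(2000)])
print("std of judge-debiased estimate:", runs[:, 0].std().round(4))
print("std with 2x gold annotations:  ", runs[:, 1].std().round(4))

With a judge this noisy, the debiased estimate is no more precise than simply doubling the gold annotations, which is the flavour of the limit above.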
05.12.2025 08:57
ROC-n-reroll: How verifier imperfection affects test-time scaling
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sam...
In the first paper (arxiv.org/abs/2507.12399), we characterize how LLM judge errors affect test-time scaling via Best-of-N, in terms of the verifier's ROC curve. Our results point towards more efficient alternatives to Best-of-N and explain why scaling laws for test-time scaling are unreliable.
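A minimal sketch of the kind of effect this is about (again purely illustrative; the Gaussian score model, mu_sep, and p_correct are my assumptions, not the paper's setup):

import numpy as np

rng = np.random.default_rng(0)

def best_of_n_accuracy(n, p_correct=0.3, mu_sep=1.0, trials=20000):
    # Toy Best-of-N: each candidate is correct w.p. p_correct; an imperfect verifier scores
    # correct candidates ~ N(mu_sep, 1) and incorrect ones ~ N(0, 1), so mu_sep controls the
    # verifier's ROC curve. We keep the top-scoring candidate.
    correct = rng.random((trials, n)) < p_correct
    scores = rng.normal(0.0, 1.0, (trials, n)) + mu_sep * correct
    picked = scores.argmax(axis=1)
    return correct[np.arange(trials), picked].mean()

for n in (1, 4, 16, 64, 256):
    print(n, round(best_of_n_accuracy(n), 3))

How quickly (and whether) the accuracy curve keeps rising with N is governed by the verifier's ROC behaviour, which is why naive scaling fits for Best-of-N can be unreliable.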
05.12.2025 08:57
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We'll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time scaling:
05.12.2025 08:57
I'll be at @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)
tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it? (A toy sketch of the idea follows below.)
1/n
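To make "strategic even without p-hacking" concrete, here is a toy model (my own illustration, not the paper's setup; the one-sided z-test and the cost/reward numbers are assumptions): an agent runs and submits a trial only when the expected payoff from clearing the significance threshold exceeds the trial's cost, so the threshold itself shapes who shows up.

from scipy.stats import norm

def clearance_prob(effect, n, alpha):
    # Power of a one-sided z-test at level alpha for a standardized effect size and n samples,
    # i.e., the chance a submission clears the p-value threshold.
    return 1 - norm.cdf(norm.ppf(1 - alpha) - effect * n ** 0.5)

def submits(effect, n, alpha, cost=1.0, reward=5.0):
    # Toy strategic agent: submit only if the expected reward beats the cost of running the trial.
    return reward * clearance_prob(effect, n, alpha) > cost

for alpha in (0.05, 0.005):
    pool = [e for e in (0.0, 0.1, 0.2, 0.3, 0.5) if submits(e, n=100, alpha=alpha)]
    print(f"alpha={alpha}: effect sizes worth submitting: {pool}")

Tightening alpha changes which effect sizes even enter the pipeline, with no p-hacking anywhere.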
01.12.2025 20:31
Also, from time to time, the wrong proofs it suggests for more complicated things seem to contain non-trivial insights and are "fixable".
25.10.2025 15:41
Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple Lemmata (not so much for more complicated stuff).
25.10.2025 15:38
Vivian Nastl and Ricardo Dominguez-Olmedo receive 2025 Google Ph.D. Fellowship
Program supports exceptional graduate students working on innovative research in computer science and related fields
Congratulations also to Vivian Nastl (supervised by Moritz Hardt) and Ricardo Dominguez-Olmedo (Moritz Hardt and Bernhard Schölkopf) for winning 2025 Global Google PhD fellowships.
Find out more about their work here: is.mpg.de/en/news/vivi...
@maxplanckcampus.bsky.social @unituebingen.bsky.social
24.10.2025 09:33
Assuming all problems are actually solvable...
17.10.2025 21:58
Is that not trivially true, since LLMs assign nonzero probability to any possible string?
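(A quick sanity check of that claim with a small open model; gpt2 is just an arbitrary example here. Every softmax step assigns strictly positive probability to every token, so any finite string gets a finite log-probability.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("an entirely arbitrary string xq7!", return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
total = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]].sum()
print("log-probability:", total.item())  # very negative, but strictly greater than -inf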
17.10.2025 21:58
We (w/ Moritz Hardt, Olawale Salaudeen and
@joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!
Call for Posters: rb.gy/kyid4f
Deadline: Oct 10, 2025 (AoE)
More info: rebrand.ly/bg931sf
22.09.2025 13:45
Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.
21.09.2025 20:11
Wouldn't it be great to have questions about LM internals answered in plain English? That's the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced, and verbalizers might not tell us what we hope they do. 1/8
17.09.2025 19:19
The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!
17.09.2025 05:16
How Benchmark Prediction from Fewer Data Misses the Mark
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...
Interesting work! Can you comment a bit on what you do different compared to previous IRT-based LLM evaluation methods?
We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models: arxiv.org/abs/2506.07673
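For context, a rough sketch of what I mean by IRT-based benchmark prediction (a toy Rasch-style simulation with made-up numbers, not the method of either paper): fit item difficulties on existing models, evaluate a new model on a small item subset, and extrapolate its full-benchmark score.

import numpy as np
from scipy.optimize import brentq
from scipy.special import expit, logit

rng = np.random.default_rng(0)

# Toy Rasch / 1-parameter IRT world: P(model i solves item j) = expit(ability_i - difficulty_j).
n_models, n_items, n_subset = 50, 1000, 50
ability = rng.normal(0, 1, n_models)          # "existing" models
difficulty = rng.normal(0, 1, n_items)
Y = rng.random((n_models, n_items)) < expit(ability[:, None] - difficulty[None, :])

# Step 1: crude difficulty estimates from the existing models' per-item solve rates.
d_hat = -logit(Y.mean(axis=0).clip(0.01, 0.99))

# Step 2: a new model answers only a small random subset of items.
a_new = 1.5                                    # shift this to mimic a "novel" model
subset = rng.choice(n_items, n_subset, replace=False)
y_sub = rng.random(n_subset) < expit(a_new - difficulty[subset])

# Step 3: infer the new model's ability from the subset, then predict full-benchmark accuracy.
a_hat = brentq(lambda a: expit(a - d_hat[subset]).mean() - y_sub.mean(), -10, 10)
print("predicted full-benchmark accuracy:", expit(a_hat - d_hat).mean().round(3))
print("true full-benchmark accuracy:     ", expit(a_new - difficulty).mean().round(3))

In this well-specified toy world the extrapolation works; the brittleness we found comes from real models, especially novel ones, deviating from the fitted IRT model.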
17.09.2025 05:11
I guess in terms of the notation from section 4 in the paper, does this plot Type X risk, or Type X Error Feasibility rate?
14.09.2025 14:52
, at least for large n. So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, or whether I am missing something else about the setup.
14.09.2025 14:44
Thank you! Do I understand correctly that these results are independent of / orthogonal to the success-hacking ones? I guess my confusion stems from asymptotic theory for PPI (and by extension, seemingly, for DSL) suggesting that both type 1 and type 2 errors should be lower, or at most very similar
14.09.2025 14:44
Are the reported errors for the case of selecting the model with the most significant results, post-hoc?
12.09.2025 19:18
Interesting work! Can you comment a bit more on the setup for the regression correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) reduces variance fairly reliably compared to using ground truth alone, while remaining quite close to unbiased.
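To be concrete about what I mean by a regression correction, here is a toy PPI++-flavoured mean estimate (my own sketch with made-up numbers, not your method or DSL exactly): a cheap proxy available everywhere plus a small labelled set, combined with a variance-minimizing weight.

import numpy as np

rng = np.random.default_rng(1)

# Ground truth y on a small labelled set; a cheap proxy f (e.g. an LLM judge) everywhere.
n_lab, n_unlab, p, flip = 300, 30000, 0.6, 0.15
y_lab = rng.binomial(1, p, n_lab)
f_lab = np.where(rng.random(n_lab) < flip, 1 - y_lab, y_lab)
y_unlab = rng.binomial(1, p, n_unlab)
f_unlab = np.where(rng.random(n_unlab) < flip, 1 - y_unlab, y_unlab)

# PPI++-style estimate of E[y]: unbiased for any fixed lam; lam is chosen to minimize variance.
C = np.cov(y_lab, f_lab)
lam = C[0, 1] / C[1, 1]
corrected = lam * f_unlab.mean() + (y_lab - lam * f_lab).mean()
print("labelled-only estimate:  ", y_lab.mean().round(3))
print("proxy-corrected estimate:", corrected.round(3))

Over repeated draws the corrected estimate is noticeably less variable than the labelled-only mean while remaining essentially unbiased, which is what made the reported error pattern surprising to me.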
12.09.2025 19:18
Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January? #GPT5 #GPT-5
openai.com/index/introd...
openai.com/index/openai...
08.08.2025 09:28
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I haven't seen them yield much larger improvements in practice, either): arxiv.org/abs/2410.13341
23.07.2025 13:28
Do you have a source re: the attendance requirement?
17.07.2025 17:28
Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of data is shared with model providers, privacy concerns with instead sharing this data publicly in the future seem surmountable.
10.05.2025 08:59
How to Fix the Chatbot Arena? Release All Data
New blogpost by my colleague Ricardo, arguing that instead of limiting data collection from big labs, LMArena should publicly release all data for everyone. ricardodominguez.github.io/blogs/arena....
10.05.2025 08:59
Is this just the prompts, or do model providers get information about whether or not they won (and the competing response)?
30.04.2025 14:55
Shout out to my colleagues Ricardo Dominguez-Olmedo, Vivian Nastl and Moritz Hardt! If you'd like to chat at the conference, send me a message, or visit us at one of the poster sessions!
24.04.2025 01:36
24.04.2025 01:36
Driven by industry progress, inspired by provocative leadership, plus don't mind a good pair of shoes or a great @PennStateFball scoreboard either.
Computer Science PhD student & Knight-Hennessy scholar at @stanford.edu.
Prev.: @ox.ac.uk with @rhodeshouse.ox.ac.uk, @harvard.edu '23, @maxplanck.de, @ethz.ch, IBM Research.
Theory CS for Trustworthy AI
https://silviacasacuberta.com
PhD Candidate in Machine Learning at the Max Planck Institute for Intelligent Systems
Assistant professor at University of Pennsylvania. Machine learning, optimization, robustness & interpretability.
Home page: https://www.cis.upenn.edu/~exwong/
Lab page: https://brachiolab.github.io/
Research blog: https://debugml.github.io/
We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.
https://evalevalai.com/
Technical AI Policy Researcher at HuggingFace @hf.co. Current focus: Responsible AI, AI for Science, and @eval-eval.bsky.social!
Senior Researcher at Oxford University.
Author β The Precipice: Existential Risk and the Future of Humanity.
tobyord.com
Journalist. Editorial lead at @zeit.de. Picture shows optional extras at additional cost. he/him
Photo: Marzena Skubatz
Reverse engineering neural networks at Anthropic. Previously Distill, OpenAI, Google Brain. Personal account.
Anti-cynic. Towards a weirder future. Reinforcement Learning, Autonomous Vehicles, transportation systems, the works. Asst. Prof at NYU
https://emerge-lab.github.io
https://www.admonymous.co/eugenevinitsky
Professor and Head of Machine Learning Department at Carnegie Mellon. Board member OpenAI. Chief Technical Advisor Gray Swan AI. Chief Expert Bosch Research.
Security and Privacy of Machine Learning at UofT, Vector Institute, and Google. Co-Director of Canadian AI Safety Institute (CAISI) Research Program at CIFAR. Opinions mine
PhD student in Machine Learning @ MPI-IS Tübingen, Tübingen AI Center, IMPRS-IS
AI professor. Director, Foundations of Cooperative AI Lab at Carnegie Mellon. Head of Technical AI Engagement, Institute for Ethics in AI (Oxford). Author, "Moral AI - And How We Get There."
https://www.cs.cmu.edu/~conitzer/
I study algorithms/learning/data applied to democracy/markets/society. Asst. professor at Cornell Tech. https://gargnikhil.com/. Helping build a personalized Bluesky research feed: https://bsky.app/profile/paper-feed.bsky.social/feed/preprintdigest
Princeton computer science prof. I write about the societal impact of AI, tech ethics, & social media platforms. https://www.cs.princeton.edu/~arvindn/
BOOK: AI Snake Oil. https://www.aisnakeoil.com/
EurIPS is a community-organized, NeurIPS-endorsed conference in Copenhagen where you can present papers accepted at @neuripsconf.bsky.social
eurips.cc
PhD student at the Max Planck Institute for Intelligent Systems
Safe and robust AI, algorithms and society
https://andrefcruz.github.io
Researcher in Germany, from Portugal