
Florian Dorner

@flodorner.bsky.social

PhD student in CS @ ETHZ / MPI-IS | Theory of ML evaluation | https://flodorner.github.io/

64 Followers  |  273 Following  |  34 Posts  |  Joined: 05.12.2024

Latest posts by flodorner.bsky.social on Bluesky

Also, from time to time, the wrong proofs it suggests for more complicated things seem to contain non-trivial insights and are "fixable".

25.10.2025 15:41 | 👍 1    🔁 0    💬 0    📌 0

Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple lemmata (not so much for more complicated stuff).

25.10.2025 15:38 | 👍 1    🔁 1    💬 1    📌 0
Preview
Vivian Nastl and Ricardo Dominguez-Olmedo receive 2025 Google Ph.D. Fellowship
Program supports exceptional graduate students working on innovative research in computer science and related fields

Congratulations also to Vivian Nastl (supervised by Moritz Hardt) and Ricardo Dominguez-Olmedo (supervised by Moritz Hardt and Bernhard Schölkopf) on winning 2025 global Google PhD Fellowships.
Find out more about their work here: is.mpg.de/en/news/vivi...

@maxplanckcampus.bsky.social @unituebingen.bsky.social

24.10.2025 09:33 | 👍 4    🔁 2    💬 0    📌 0

The viral "Definition of AGI" paper tells you to read fake references that do not exist!

Proof: different articles appear at the cited journal/volume/page numbers, and the cited titles cannot be found in any searchable repository.

Take this as a warning not to use LMs to generate your references!

18.10.2025 00:54 | 👍 158    🔁 36    💬 6    📌 16

Assuming all problems are actually solvable...

17.10.2025 21:58 | 👍 0    🔁 0    💬 0    📌 0

Is that not trivially true, since LLMs assign nonzero probability to any possible string?

17.10.2025 21:58 | 👍 0    🔁 0    💬 1    📌 0
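To spell out the reasoning behind this reply, here is a short sketch, assuming the model defines next-token probabilities via a standard softmax over a fixed vocabulary (an assumption, not something stated in the thread):

```latex
% Softmax outputs are strictly positive, so every finite token sequence
% receives nonzero probability under the model's own distribution.
\[
  p(t \mid t_{<i}) = \frac{\exp\big(z_t(t_{<i})\big)}{\sum_{t'} \exp\big(z_{t'}(t_{<i})\big)} > 0
  \quad\Longrightarrow\quad
  p(t_1 \dots t_k) = \prod_{i=1}^{k} p(t_i \mid t_{<i}) > 0 .
\]
```

Note that this concerns the model's distribution itself; decoding schemes such as greedy or truncated (top-k / top-p) sampling can still assign exactly zero probability to many strings.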

We (w/ Moritz Hardt, Olawale Salaudeen and
@joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!

📢 Call for Posters: rb.gy/kyid4f
📅 Deadline: Oct 10, 2025 (AoE)
🔗 More info: rebrand.ly/bg931sf

22.09.2025 13:45 | 👍 21    🔁 7    💬 1    📌 0

Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.

21.09.2025 20:11 | 👍 1    🔁 0    💬 0    📌 0

Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced, and verbalizers might not tell us what we hope they do. 🧵👇 1/8

17.09.2025 19:19 | 👍 19    🔁 5    💬 1    📌 0

The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!

17.09.2025 05:16 | 👍 1    🔁 0    💬 1    📌 0
Preview
How Benchmark Prediction from Fewer Data Misses the Mark
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...

Interesting work! Can you comment a bit on what you do differently compared to previous IRT-based LLM evaluation methods?

We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models: arxiv.org/abs/2506.07673

17.09.2025 05:11 | 👍 1    🔁 0    💬 2    📌 0
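For readers who have not seen the IRT-based evaluation methods discussed here, a minimal sketch of a Rasch-style (one-parameter IRT) fit on a model-by-question correctness matrix. This is my own illustration under simple assumptions; the function name, input format, and plain gradient-ascent fit are not taken from either paper:

```python
import numpy as np

def fit_rasch(R, n_iter=2000, lr=0.05):
    """Fit a Rasch (1-parameter IRT) model to a binary correctness matrix.

    R[m, q] = 1 if model m answered question q correctly, else 0.
    Returns per-model abilities theta and per-question difficulties b,
    with p(correct) = sigmoid(theta_m - b_q).
    """
    n_models, n_items = R.shape
    theta = np.zeros(n_models)
    b = np.zeros(n_items)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = R - p                       # gradient of the Bernoulli log-likelihood
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        b -= b.mean()                       # fix the location indeterminacy
    return theta, b
```

Methods in this family then, roughly, estimate a new model's ability from its answers on a small question subset and use the fitted difficulties to predict its score on the full benchmark; the brittleness mentioned above presumably arises when a novel model's response pattern no longer matches item parameters estimated from in-distribution models.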

In terms of the notation from Section 4 of the paper: does this plot the Type X risk, or the Type X Error Feasibility rate?

14.09.2025 14:52 | 👍 0    🔁 0    💬 0    📌 0

, at least for large n. So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, or whether I am missing something else about the setup.

14.09.2025 14:44 | 👍 0    🔁 0    💬 0    📌 0

Thank you! Do I understand correctly that these results are independent of / orthogonal to the success-hacking ones? I guess my confusion stems from asymptotic theory for PPI (and, by extension, seemingly for DSL) suggesting that both type 1 and type 2 errors should be lower or at most very similar

14.09.2025 14:44 | 👍 0    🔁 0    💬 1    📌 0

Are the reported errors for the case of selecting the model with the most significant results, post-hoc?

12.09.2025 19:18 | 👍 0    🔁 0    💬 1    📌 0

Interesting work! Can you comment a bit more on the setup for the regression-correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) fairly reliably reduces variance compared to using ground truth only, while remaining quite close to unbiased.

12.09.2025 19:18 | 👍 0    🔁 0    💬 2    📌 0
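For context on the comparison being asked about, a minimal sketch of a PPI++-style mean estimate (my own illustration, not the setup of the paper under discussion or of DSL; the function and argument names are made up):

```python
import numpy as np

def ppi_pp_mean(labels, judge_labeled, judge_unlabeled):
    """PPI++-style estimate of a mean (e.g., a model's true accuracy).

    labels:          ground-truth scores on a small labeled set (size n)
    judge_labeled:   judge/proxy scores on that same labeled set (size n)
    judge_unlabeled: judge/proxy scores on a large unlabeled set (size N)
    """
    labels = np.asarray(labels, dtype=float)
    judge_labeled = np.asarray(judge_labeled, dtype=float)
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    n, N = len(labels), len(judge_unlabeled)
    # Power-tuning weight: how much to trust the proxy. lam = 0 recovers the
    # plain labeled-only mean; lam = 1 is classical PPI.
    cov = np.cov(labels, judge_labeled, ddof=1)[0, 1]
    var_f = np.var(judge_unlabeled, ddof=1)
    lam = cov / (var_f * (1.0 + n / N))
    # Proxy mean on the large unlabeled set, plus a bias correction
    # estimated on the labeled set.
    return lam * judge_unlabeled.mean() + np.mean(labels - lam * judge_labeled)
```

Because lam is tuned from the data, the asymptotic variance of this estimate is no worse than that of the labeled-only mean, which is the "lower variance, close to unbiased" behaviour referred to above.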

Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January? #GPT5 #GPT-5

openai.com/index/introd...
openai.com/index/openai...

08.08.2025 09:28 | 👍 1    🔁 0    💬 0    📌 0
Preview
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...

Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I have not seen them yield much larger improvements in practice, either): arxiv.org/abs/2410.13341

23.07.2025 13:28 | 👍 2    🔁 0    💬 0    📌 0
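A back-of-the-envelope illustration of the "small fraction of extra labels" point, using standard asymptotics for a debiased, control-variate/PPI-style judge estimate. This is my own simplification, not the bound derived in the paper, and rho (the judge-to-ground-truth correlation) is a made-up input:

```python
def effective_labels(n_labels: int, rho: float) -> float:
    """Effective number of ground-truth labels when n_labels real labels are
    combined with an unlimited pool of debiased judge scores whose correlation
    with the ground truth is rho (asymptotic variance-reduction heuristic)."""
    return n_labels / (1.0 - rho ** 2)

for rho in (0.3, 0.5, 0.7, 0.9):
    print(f"rho = {rho:.1f}: 1000 labels act like ~{effective_labels(1000, rho):,.0f}")
```

Under this heuristic, a judge needs a correlation of roughly 0.71 with the ground truth just to match "twice the data", which is one way to see why the savings tend to stay modest when the judge is not much stronger than the models it is grading.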

Do you have a source re: attendance requirement? 👀

17.07.2025 17:28 | 👍 0    🔁 0    💬 1    📌 0

Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of the data is shared with model providers, privacy concerns about instead sharing this data publicly in the future seem surmountable.

10.05.2025 08:59 | 👍 0    🔁 0    💬 0    📌 0
How to Fix the Chatbot Arena? Release All Data

New blogpost by my colleague Ricardo, arguing that instead of limiting data collection from big labs, LMArena should publicly release all data for everyone. ricardodominguez.github.io/blogs/arena....

10.05.2025 08:59 | 👍 1    🔁 0    💬 1    📌 0

Is this just the prompts, or do model providers get information about whether or not they won (and the competing response)?

30.04.2025 14:55 | 👍 0    🔁 0    💬 1    📌 0

Shout out to my colleagues Ricardo Dominguez-Olmedo, Vivian Nastl and Moritz Hardt! If you’d like to chat at the conference, send me a message, or visit us at one of the poster sessions!

24.04.2025 01:36 | 👍 0    🔁 0    💬 0    📌 0

Post image
24.04.2025 01:36 | 👍 0    🔁 0    💬 1    📌 0
Preview
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...

Tomorrow, I will speak about our work on the limitations of LLM-as-a-Judge 🤖 when applied to evaluating frontier models. (Session 3D)
arxiv.org/abs/2410.13341

24.04.2025 01:36 | 👍 1    🔁 0    💬 1    📌 0

In two hours, Ricardo is giving a talk about our paper on training on the test task and its confounding impacts on LLM benchmarking 📉📈. (Session 1B) arxiv.org/abs/2407.07890

24.04.2025 01:36 | 👍 1    🔁 0    💬 1    📌 0

In Singapore for #ICLR2025 and excited for two oral presentations on work I have contributed to! 🎉

24.04.2025 01:36 | 👍 0    🔁 0    💬 1    📌 0

Wouldn't the ratio of the tax burden at the average income to the tax burden at 5x that income say more about how progressive taxation is? By that measure, Germany seems to be the "least progressive" country in the graphic (which honestly seems a bit surprising).

30.03.2025 14:49 | 👍 1    🔁 0    💬 1    📌 0

Seems worth keeping in mind that while uncertainty can improve LLM-as-a-judge, reliable results still require debiasing using a sample of ground-truth data.

07.03.2025 08:32 | 👍 0    🔁 0    💬 0    📌 0

We had some (very preliminary and specific) results on this in our ICLR paper (arxiv.org/abs/2410.13341), glad to see this investigated in more detail!

07.03.2025 08:32 | 👍 0    🔁 0    💬 1    📌 0
