I’ll be giving a talk at @eth-ai-center.bsky.social tomorrow, March 10, at 11:30am on LLM benchmarking incentives.
Spoiler: today’s benchmarking incentives can produce unreliable model rankings, but we can fix that!
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: we'll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time scaling:
Excited to be at #NeurIPS2025 this week to present our paper "Monoculture or Multiplicity: Which is it?", joint work with Moritz Hardt.
📄 Paper #1000: openreview.net/pdf?id=DO5Lt...
📍 Wed, Dec 3, 2025 • 4:30 PM – 7:30 PM
Feel free to come by and reach out!
A short 🧵.
Joint work w/ Safwan Hossain and Yiling Chen.
Paper link: arxiv.org/pdf/2508.03289
Drop by our poster session for more details!
4/4
Key takeaway: A statistical decision rule doesn't just make decisions; it determines who shows up to be evaluated.
3/n
We build a game-theoretic model of this interaction and show that there exists a computable critical p-value threshold α that cleanly structures false positive & false negative errors. Empirically, it aligns with FDA data!
2/n
I'll be at @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)
tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it?
1/n
We have an amazing lineup of keynote speakers: @iaugenstein.bsky.social, José Hernández-Orallo, @gaelvaroquaux.bsky.social, Laura Weidinger, and Emine Yilmaz.
Submit your work and join us in Copenhagen 🇩🇰!
We (w/ Moritz Hardt, Olawale Salaudeen and @joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI at @euripsconf.bsky.social 2025 in Copenhagen!
📢 Call for Posters: rb.gy/kyid4f
📅 Deadline: Oct 10, 2025 (AoE)
🔗 More info: rebrand.ly/bg931sf