EvalEval Coalition @eval-eval

Thank you to everyone who attended, presented at, spoke at, or helped organize this workshop. You rock! Special thanks to the UK AI Security Institute for cohosting and for their support.

10.12.2025 22:59 — 👍 0 🔁 0 💬 0 📌 0

It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations!

We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!

10.12.2025 22:55 — 👍 5 🔁 1 💬 1 📌 0

📜Paper: arxiv.org/pdf/2511.056...
📝Blog: tinyurl.com/blogAI1

🤝At EvalEval, we are a coalition of researchers working towards better AI evals. Interested in joining us? Check out: evalevalai.com 7/7 🧵

13.11.2025 13:59 — 👍 0 🔁 0 💬 0 📌 0

Continued..

📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure

🛡️Solution? Standardized reporting & safety policies (6/7)

13.11.2025 13:59 — 👍 0 🔁 0 💬 1 📌 0

Key Takeaways:

⛔️ First-party reporting is often sparse & superficial, with many reporting NO social impact evals
📉 On average, first-party reports score far lower than third-party evals (0.72 vs. 2.62 out of 3)
🎯 Third parties provide some complementary coverage (GPT-4 and LLaMA) (5/7)

13.11.2025 13:59 — 👍 1 🔁 0 💬 1 📌 0

💡 We also interviewed developers from for-profit and non-profit orgs to understand why some disclosures happen and why others don’t.

💬 TLDR: Incentives and constraints shape reporting (4/7)

13.11.2025 13:59 — 👍 0 🔁 0 💬 1 📌 0

📊 What we did:

🔎 Analyzed 186 first-party release reports from model developers & 183 post-release evaluations (third-party)
📏 Scored 7 social impact dimensions: bias, harmful content, performance disparities, environmental costs, privacy, financial costs, & labor (3/7)

13.11.2025 13:59 — 👍 0 🔁 0 💬 1 📌 0

While general capability evaluations are common, social impact assessments covering bias, fairness, privacy, etc. are often fragmented or missing. 🧠

🎯Our goal: Explore the AI Eval landscape to answer who evaluates what and identify gaps in social impact evals!! (2/7)

13.11.2025 13:59 — 👍 1 🔁 0 💬 1 📌 0

🚨 AI keeps scaling, but social impact evaluations aren’t, and the data proves it 🚨

Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)

13.11.2025 13:59 — 👍 11 🔁 3 💬 1 📌 0

Note: General registration is constrained by space capacity! Please note that attendance will be confirmed by the organizers based on space availability. Accepted posters will be invited to register for free and attend the workshop in person!

06.11.2025 21:19 — 👍 0 🔁 0 💬 0 📌 0

📮 We are inviting students and early-stage researchers to submit an abstract (max 500 words) to be presented as a poster during the interactive session. Submit here: tinyurl.com/AbsEval

We have a rock-star lineup of AI researchers and an amazing program. Please RSVP as early as possible! Stay tuned!

06.11.2025 21:19 — 👍 0 🔁 0 💬 1 📌 1

🚨 EvalEval is back - now in San Diego!🚨

🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)

📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025

Details below! ⬇️
evalevalai.com/events/works...

06.11.2025 21:19 — 👍 3 🔁 1 💬 1 📌 1

💡This paper was brought to you as part of our spotlight series featuring papers on evaluation methods & datasets, the science of evaluation, and many more.

📸Interested in working on better AI evals? Join us: evalevalai.com

31.10.2025 15:47 — 👍 2 🔁 0 💬 0 📌 0

🚫 The approach also avoids mislabeled data and delays benchmark saturation, continuing to distinguish model improvements even at high performance levels.

📑Read more: arxiv.org/abs/2509.11106

31.10.2025 15:47 — 👍 2 🔁 0 💬 1 📌 0

📊Results & Findings

🧪 Experiments across 6 LLMs and 6 major benchmarks:

🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance with up to 50× fewer items needed!!

31.10.2025 15:47 — 👍 1 🔁 0 💬 1 📌 0

It combines two key ideas:

✍️Item Response Theory: Models LLM performance in a latent ability space based on item difficulty and discrimination across models
🧨Dynamic Item Selection: Adaptive benchmarking, where weaker models get easier items while stronger models face harder ones

31.10.2025 15:47 — 👍 1 🔁 0 💬 1 📌 0
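
The two ideas in the post above can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) IRT model and picks the next item by maximum Fisher information at the current ability estimate; the post does not specify either choice, and the item bank, function names, and ability values below are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers correctly.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information an item carries about ability theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, items, administered):
    """Return the index of the not-yet-used item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

# Hypothetical item bank: (discrimination a, difficulty b) per benchmark item.
ITEMS = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

# A weaker model (low ability estimate) is routed to the easy item,
# a stronger model (high ability estimate) to the hard one.
weak_pick = select_next_item(-1.8, ITEMS, administered=set())
strong_pick = select_next_item(1.9, ITEMS, administered=set())
print(weak_pick, strong_pick)
```

In a full adaptive loop the ability estimate would be updated after each response and the selection repeated; that update step is omitted here.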

🔍How to address this? 🤔

🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.

Continued...👇

31.10.2025 15:47 — 👍 1 🔁 0 💬 1 📌 0

⚠️ Evaluation results can be noisy and prone to variance & labeling errors.
🧱As models advance, benchmarks tend to saturate quickly, reducing their long-term usefulness.
🪃Existing approaches typically tackle just one of these problems (e.g., efficiency or validity)

What now⁉️

31.10.2025 15:47 — 👍 1 🔁 0 💬 1 📌 0

💣Current SOTA benchmarking setups face several systematic issues:

📉It’s often unclear which benchmark(s) to choose, while evaluating on all available ones is too expensive, inefficient, and not always aligned with the intended capabilities we want to measure.

More 👇👇

31.10.2025 15:47 — 👍 1 🔁 0 💬 1 📌 0

✨ Weekly AI Evaluation Paper Spotlight ✨

🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?

🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models

31.10.2025 15:47 — 👍 3 🔁 0 💬 1 📌 1

EvalEval Coalition We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

💡This is part of our new weekly spotlight series that will feature papers on evaluation methods & datasets, the science of evaluation, and many more.

📷 Interested in working on better AI evals? Check out: evalevalai.com

24.10.2025 16:44 — 👍 1 🔁 0 💬 0 📌 0

🏗️Therefore, fixing leaderboard design, e.g., private eval sets, provenance checks, randomized human tests, etc., is critical for AI ecosystem security and safety

Read more: arxiv.org/pdf/2507.08983

24.10.2025 16:44 — 👍 1 🔁 0 💬 1 📌 0

📊Key insights

🗳️Popular leaderboards (e.g., Chatbot Arena, MTEB) can be exploited to distribute poisoned LLMs at scale
🔐Derivative models (finetuned, quantized, “abliterated”) are easy backdoor vectors. For instance, unsafe LLM variants often get downloaded as much as originals!

Continued...

24.10.2025 16:44 — 👍 0 🔁 0 💬 1 📌 0

🔍 Method:

🧮Introduces TrojanClimb, a framework showing how attackers can:

⌨️ Simulate leaderboard attacks where malicious models achieve high test scores while embedding harmful payloads (4 modalities)
🔒 Leverage stylistic watermarks/tags to game voting-based leaderboards

24.10.2025 16:44 — 👍 0 🔁 0 💬 1 📌 0

🌟 Weekly AI Evaluation Spotlight 🌟

🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?

This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!

24.10.2025 16:44 — 👍 5 🔁 2 💬 1 📌 0

💡This spotlight series will feature papers on evaluation methods & datasets, the science of evaluation, and many more. Stay tuned!

🤝 Interested in working on better AI evals? We are a coalition of researchers working towards better AI evals. Check out: evalevalai.com

17.10.2025 16:15 — 👍 3 🔁 0 💬 0 📌 0

🧮 Benchmark Saturation != Reliability. Models achieve near-perfect scores without demonstrating true reliability.

📢 Highlights the gap between apparent competence & dependable reliability - therefore systematic reliability testing is needed.

Read more at: arxiv.org/pdf/2502.03461

17.10.2025 16:15 — 👍 3 🔁 0 💬 1 📌 0

📊Key insights:

‼️Noise in benchmarks is substantial! For some datasets, up to 90% of reported “model errors” actually stem from *bad data* instead of model failures.
🧠 After benchmark cleaning, even top LLMs fail on simple, unambiguous platinum benchmark tasks.

Continued...

17.10.2025 16:15 — 👍 4 🔁 3 💬 1 📌 0

🔍 Method:

🧹 Revise & clean 15 popular LLM benchmarks across 6 domains to create *platinum* benchmarks.
🤖 Use multiple LLMs to flag inconsistent samples via disagreement.
⚠️ “Bad” questions fall into 4 types: mislabeled, contradictory, ambiguous, or ill-posed.

Example 👇

17.10.2025 16:15 — 👍 3 🔁 0 💬 1 📌 0

✨Weekly AI Evaluation Paper Spotlight✨

🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?

🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!

17.10.2025 16:15 — 👍 7 🔁 1 💬 1 📌 1
