The AI Evaluation Chart Crisis
Charts used to showcase performance demonstrate broader issues in the AI evaluation ecosystem: a lack of balance between competitive benchmarking and statistical rigor.
New blog: The AI Evaluation Chart Crisis
From misleading bar heights to missing error bars, recent model launches have sparked debate on AI evals. In our new blog post, we dig into what's broken, why it matters, and how results should be presented.
evalevalai.com/documentatio...
11.08.2025 19:20
EvalEval Coalition
We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.
This kickoff post lays out: 1) Why we need a science of evaluation; 2) Our goals for the community; 3) How you can get involved (2/2)
Interested in joining? Check out evalevalai.com
16.07.2025 17:17
The Science of Evaluations: Workstream Kickoff Post
Announcing the launch of a research-driven initiative among a community of researchers to strengthen the science of AI evaluations.
AI Evals Crisis: Officially kicking off the Eval Science Workstream
We're building a shared scientific foundation for evaluating AI systems, one that's rigorous, open, and grounded in real-world & cross-disciplinary best practices (1/2)
Read our new blog post: tinyurl.com/evalevalai
16.07.2025 17:17
Join us for the Eval Eval Coalition Social at @facct.bsky.social tomorrow, Tuesday, June 24th, from 4-4:30 pm during the coffee break! We would love to have you join us and we look forward to seeing you there!! #FAccT2025 #EvalEval
23.06.2025 14:41
Our coalition is focused on producing scientifically grounded research outputs, building robust deployment infrastructure for broader impact evaluations, and fostering a community of researchers passionate about developing better evaluations (2/3)
22.06.2025 19:34
Introducing the Eval Eval Coalition!
We are a community of researchers dedicated to designing, developing, and deploying better evaluations (1/3)
22.06.2025 19:34