📣 New paper! The field of AI research is increasingly realising that benchmarks are very limited in what they can tell us about AI system performance and safety. We argue for, and lay out a roadmap toward, a *science of AI evaluation*: arxiv.org/abs/2503.05336 🧵
20.03.2025 13:28

We pull out key lessons from other fields, such as aerospace, food security, and pharmaceuticals, that have matured from research disciplines into industries with widely used and trusted products. AI research is going through a similar maturation -- but AI evaluation needs to catch up.
20.03.2025 13:28

We identify three key lessons in particular.

1) Meaningful metrics: evaluation metrics must connect to AI system behaviour or impact that matters in the real world. They can be abstract or simplified -- but they need to correspond to real-world performance or outcomes in a meaningful way.
20.03.2025 13:28

Just like the "crashworthiness" of a car indicates aspects of safety in case of an accident, AI evaluation metrics need to link to real-world outcomes.
20.03.2025 13:28

2) Metrics and evaluation methods need to be refined over time. This iteration is key to any science. Take the example of measuring temperature: it went through many iterations of building new measurement approaches,
20.03.2025 13:28

which challenged concepts of what temperature is and in turn motivated the development of new thermometers. A similar virtuous cycle is needed to refine AI evaluation concepts and measurement methods.
20.03.2025 13:28

3) Institutions and norms are necessary for a long-lasting, rigorous, and trusted evaluation regime. In the long run, nobody trusts actors who grade their own homework. Establishing an ecosystem that accounts for expertise and balances incentives is a key marker of robust evaluation in other fields.
20.03.2025 13:28

Very excited we were able to get this collaboration working -- congrats and big thanks to the co-authors! @rajiinio.bsky.social @hannawallach.bsky.social @mmitchell.bsky.social @angelinawang.bsky.social Olawale Salaudeen, Rishi Bommasani @sanmikoyejo.bsky.social @williamis.bsky.social
20.03.2025 13:28