@weidingerlaura.bsky.social

42 Followers  |  7 Following  |  8 Posts  |  Joined: 12.02.2025

Latest posts by weidingerlaura.bsky.social on Bluesky


Very excited we were able to get this collaboration working -- congrats and big thanks to the co-authors! @rajiinio.bsky.social @hannawallach.bsky.social @mmitchell.bsky.social @angelinawang.bsky.social Olawale Salaudeen, Rishi Bommasani @sanmikoyejo.bsky.social @williamis.bsky.social

20.03.2025 13:28  |  👍 5  |  🔁 1  |  💬 0  |  📌 0

3) Institutions and norms are necessary for a long-lasting, rigorous and trusted evaluation regime. In the long run, nobody trusts actors correcting their own homework. Establishing an ecosystem that accounts for expertise and balances incentives is a key marker of robust evaluation in other fields.

20.03.2025 13:28  |  👍 3  |  🔁 2  |  💬 1  |  📌 0

which challenged concepts of what temperature is and in turn motivated the development of new thermometers. A similar virtuous cycle is needed to refine AI evaluation concepts and measurement methods.

20.03.2025 13:28  |  👍 2  |  🔁 1  |  💬 1  |  📌 0

2) Metrics and evaluation methods need to be refined over time. This iteration is key to any science. Take the example of measuring temperature: it went through many iterations of building new measurement approaches,

20.03.2025 13:28  |  👍 2  |  🔁 1  |  💬 1  |  📌 0

Just like the “crashworthiness” of a car indicates aspects of safety in case of an accident, AI evaluation metrics need to link to real-world outcomes.

20.03.2025 13:28  |  👍 2  |  🔁 1  |  💬 1  |  📌 0

We identify three key lessons in particular.

1) Meaningful metrics: evaluation metrics must connect to AI system behaviour or impact that is relevant in the real world. They can be abstract or simplified -- but they need to correspond to real-world performance or outcomes in a meaningful way.

20.03.2025 13:28  |  👍 2  |  🔁 1  |  💬 1  |  📌 0

We pull out key lessons from other fields, such as aerospace, food security, and pharmaceuticals, that have matured from being research disciplines to becoming industries with widely used and trusted products. AI research is going through a similar maturation -- but AI evaluation needs to catch up.

20.03.2025 13:28  |  👍 2  |  🔁 1  |  💬 1  |  📌 0

📣 New paper! The field of AI research is increasingly realising that benchmarks are very limited in what they can tell us about AI system performance and safety. We argue for, and lay out a roadmap toward, a *science of AI evaluation*: arxiv.org/abs/2503.05336 🧵

20.03.2025 13:28  |  👍 38  |  🔁 12  |  💬 1  |  📌 1
