
Hanna Wallach

@hannawallach.bsky.social

VP and Distinguished Scientist at Microsoft Research NYC. AI evaluation and measurement, responsible AI, computational social science, machine learning. She/her. One photo a day since January 2018: https://www.instagram.com/logisticaggression/

1,891 Followers  |  246 Following  |  59 Posts  |  Joined: 31.07.2023

Latest posts by hannawallach.bsky.social on Bluesky

This is happening now!!!

16.07.2025 18:33 | 👍 4    🔁 0    💬 0    📌 0
Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges com...

1) (Tomorrow!) Wed 7/16, 11am-1:30pm PT: poster for "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (E. Exhibition Hall A-B, E-503)

Work led by @hannawallach.bsky.social + @azjacobs.bsky.social

arxiv.org/abs/2502.00561

16.07.2025 00:43 | 👍 3    🔁 1    💬 1    📌 1

Oh whoops! You are indeed correct -- it starts at 11am PT!

15.07.2025 20:34 | 👍 1    🔁 0    💬 0    📌 0
ICML Poster: "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (ICML 2025)

If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am--1:30pm PDT on Weds!!! icml.cc/virtual/2025...

15.07.2025 18:35 | 👍 32    🔁 10    💬 1    📌 2

I also want to note that this paper has been in progress for many, many years, so we're super excited it's finally being published. It's also one of the most genuinely interdisciplinary projects I've ever worked on, which has made it particularly challenging and rewarding!!! ❤️

16.06.2025 21:49 | 👍 2    🔁 0    💬 0    📌 0

Check out the camera-ready version of our ACL Findings paper ("Taxonomizing Representational Harms using Speech Act Theory") to learn more!!! arxiv.org/pdf/2504.00928

16.06.2025 21:49 | 👍 10    🔁 2    💬 1    📌 0

Why does this matter? You can't mitigate what you can't measure, and our framework and taxonomy help researchers and practitioners design better ways to measure and mitigate representational harms caused by generative language systems.

16.06.2025 21:49 | 👍 2    🔁 0    💬 1    📌 1

Using this theoretical grounding, we provide new definitions for stereotyping, demeaning, and erasure, and break them down into a detailed taxonomy of system behaviors. By doing this, we unify many of the different ways representational harms have been previously defined.

16.06.2025 21:49 | 👍 1    🔁 0    💬 1    📌 0

We bring some much-needed clarity by turning to speech act theory, a theory of meaning from linguistics that allows us to distinguish between a system output's purpose and its real-world impacts.

16.06.2025 21:49 | 👍 1    🔁 0    💬 1    📌 0

These are often called "representational harms," and while they're easy for people to recognize when they see them, definitions of these harms are commonly under-specified, leading to conceptual confusion. This makes them hard to measure and even harder to mitigate.

16.06.2025 21:49 | 👍 3    🔁 0    💬 1    📌 0

Generative language systems are everywhere, and many of them stereotype, demean, or erase particular social groups.

16.06.2025 21:49 | 👍 9    🔁 2    💬 1    📌 0

Check out the camera-ready version of our ICML position paper ("Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge") to learn more!!! arxiv.org/abs/2502.00561

(6/6)

15.06.2025 00:20 | 👍 43    🔁 11    💬 3    📌 0

Real talk: GenAI systems aren't toys. Bad evaluations don't just waste people's time---they can cause real-world harms. It's time to level up, ditch the apples-to-oranges comparisons, and start doing measurement like we mean it.

(5/6)

15.06.2025 00:20 | 👍 11    🔁 0    💬 1    📌 0

We propose a framework that cuts through the chaos: first, get crystal clear on what you're measuring and why (no more vague hand-waving); then, figure out how to measure it; and, throughout the process, interrogate validity like your reputation depends on it---because, honestly, it should.

(4/6)

15.06.2025 00:20 | 👍 20    🔁 2    💬 1    📌 0

Here's our hot take: evaluating GenAI systems isn't just some techie puzzle---it's a social science measurement challenge.

(3/6)

15.06.2025 00:20 | 👍 7    🔁 0    💬 1    📌 0

But there's a dirty little secret: the ways we evaluate GenAI systems are often sloppy, vague, and quite frankly... not up to the task.

(2/6)

15.06.2025 00:20 | 👍 17    🔁 3    💬 1    📌 0

Alright, people, let's be honest: GenAI systems are everywhere, and figuring out whether they're any good is a total mess. Should we use them? Where? How? Do they need a total overhaul?

(1/6)

15.06.2025 00:20 | 👍 33    🔁 11    💬 1    📌 0

I'm so excited this paper is finally online!!! 🎉 We had so much fun working on this with @emmharv.bsky.social!!! Thread below summarizing our contributions...

10.06.2025 19:12 | 👍 9    🔁 3    💬 0    📌 0

Please spread the word to anyone who you think might be interested! We will begin reviewing applications on June 2.

20.05.2025 13:47 | 👍 5    🔁 0    💬 0    📌 0

This program is open to candidates who will have completed their bachelor's degree (or equiv.) by Summer 2025 (inc. those who graduated previously and have been working or doing a master's degree) and who want to advance their research skills before applying to PhD programs.

20.05.2025 13:47 | 👍 6    🔁 0    💬 1    📌 0
FATE Research Assistant ("Pre-doc") - Microsoft Research: The Fairness, Accountability, Transparency, and Ethics (FATE) Research group at Microsoft Research New York City (MSR NYC) is looking for a pre-doctoral research assistant (pre-doc) to start August 20...

Exciting news: The Fairness, Accountability, Transparency, and Ethics (FATE) group at Microsoft Research NYC is hiring a predoctoral fellow!!! 🎉

www.microsoft.com/en-us/resear...

20.05.2025 13:47 | 👍 33    🔁 13    💬 2    📌 2

Exciting news!!! This just got into @icmlconf.bsky.social as a position paper!!! 🎉 More updates to come as we work on the camera-ready version!!!

03.05.2025 20:59 | 👍 51    🔁 11    💬 0    📌 0

Thank you for posting! Very timely as the paper just got accepted to ICML's position paper track!

03.05.2025 20:54 | 👍 5    🔁 0    💬 0    📌 0

Reading - Evaluating Evaluations for GenAI from @hannawallach.bsky.social, madesai.bsky.social, afedercooper.bsky.social, et al. This work dovetails with our work at @worldprivacyforum.bsky.social on measuring AI governance tools from governments, through a privacy/policy lens: arxiv.org/pdf/2502.00561

03.04.2025 12:27 | 👍 3    🔁 1    💬 0    📌 0

At the #HEAL workshop, I'll present "Systematizing During Measurement Enables Broader Stakeholder Participation" on the ways we can further structure LLM evaluations and open them for deliberation. A project led by @hannawallach.bsky.social

25.04.2025 22:57 | 👍 3    🔁 1    💬 0    📌 0

2. Also Saturday, @amabalayn.bsky.social will represent our piece arguing that systematization during measurement enables broad stakeholder participation in AI evaluation.

This came out of a huge group collaboration led by @hannawallach.bsky.social: bsky.app/profile/hann...

heal-workshop.github.io

25.04.2025 17:24 | 👍 4    🔁 1    💬 1    📌 0

📣 New paper! The field of AI research is increasingly realising that benchmarks are very limited in what they can tell us about AI system performance and safety. We argue and lay out a roadmap toward a *science of AI evaluation*: arxiv.org/abs/2503.05336 🧵

20.03.2025 13:28 | 👍 38    🔁 12    💬 1    📌 1
Screenshot of 'SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models.' SHADES is in multiple grey colors (shades).

⚫⚪ It's coming... SHADES. ⚪⚫
The first ever resource of multilingual, multicultural, and multigeographical stereotypes, built to support nuanced LLM evaluation and bias mitigation. We have been working on this around the world for almost **4 years** and I am thrilled to share it with you all soon.

10.02.2025 08:28 | 👍 128    🔁 23    💬 6    📌 3

Remember this @neuripsconf.bsky.social workshop paper? We spent the past month writing a newer, better, longer version!!! You can find it online here: arxiv.org/abs/2502.00561

04.02.2025 15:28 | 👍 83    🔁 14    💬 2    📌 3

Thank you!!!! 😍

31.12.2024 16:21 | 👍 6    🔁 0    💬 0    📌 0
