Athiya Deviyani

@athiya.bsky.social

LTI PhD at CMU on evaluation and trustworthy ML/NLP, prev AI&CS Edinburgh University, Google, YouTube, Apple, Netflix. Views are personal 👩🏻‍💻🇮🇩 athiyadeviyani.github.io

1,002 Followers  |  472 Following  |  12 Posts  |  Joined: 08.12.2023

Latest posts by athiya.bsky.social on Bluesky

An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The overview describes that we survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). We then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).

๐Ÿ–‹๏ธ Curious how writing differs across (research) cultures?
๐Ÿšฉ Tired of โ€œculturalโ€ evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure โœจcultural normsโœจin scientific writing, and show thatโ—LLMs flatten themโ—

๐Ÿ“œ arxiv.org/abs/2506.00784

[1/11]

09.06.2025 23:29 โ€” ๐Ÿ‘ 74    ๐Ÿ” 30    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 5

Excited to be in Albuquerque for #NAACL2025 🏜️ presenting our poster "Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy"!

Come find me at
📍 Hall 3, Session B
🗓️ Wednesday, April 30 (tomorrow!)
🕚 11:00–12:30

Let's talk about all things eval! 📊

30.04.2025 02:39 — 👍 1    🔁 0    💬 0    📌 0

Thank you for the repost 🤗

29.04.2025 18:11 — 👍 1    🔁 0    💬 0    📌 0

If you're at NAACL this week (or just want to keep track), I have a feed for you: bsky.app/profile/did:...

Currently pulling everyone that mentions NAACL, posts a link from the ACL Anthology, or has NAACL in their username. Happy conferencing!

29.04.2025 18:07 — 👍 16    🔁 4    💬 1    📌 1

Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)

29.04.2025 17:00 — 👍 15    🔁 10    💬 2    📌 0

🚀 Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools appearing at #NAACL2025

This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson

1/🧵

29.04.2025 13:41 — 👍 12    🔁 8    💬 1    📌 2

When interacting with ChatGPT, have you wondered if it would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25    🔁 9    💬 1    📌 3
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy. Athiya Deviyani, Fernando Diaz. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.

🔑 So what now?
When picking metrics, don't rely on global scores alone.
🎯 Identify the evaluation context
🔍 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context
♻️ Reevaluate as models and tasks evolve

📄 aclanthology.org/2025.finding...
#NAACL2025

(🧵9/9)

29.04.2025 17:10 — 👍 2    🔁 2    💬 0    📌 0

For ASR:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is probably because ASR outputs are less ambiguous, and metrics focus on similar properties, such as phonetic or lexical accuracy.

(🧵8/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

Here's what we found for MT and Ranking:
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.

🚨 Picking a metric based purely on global performance is risky!

Choose wisely. 🧙🏻‍♂️

(🧵7/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

We evaluate this framework across three tasks:
📝 Machine Translation (MT)
🎙 Automatic Speech Recognition (ASR)
📈 Ranking

We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!

(🧵6/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
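As a rough illustration of how such metrics can be treated interchangeably, here is a small sketch that wraps two of them as higher-is-better scorers. The use of the jiwer and NLTK packages is an assumption for illustration, not the paper's actual tooling.

```python
# Sketch: wrap off-the-shelf metrics as higher-is-better scorers so they can
# all be plugged into the same meta-evaluation loop. jiwer and NLTK are
# assumed here for illustration; the paper's tooling may differ.
from jiwer import wer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_metric(reference: str, hypothesis: str) -> float:
    # Sentence-level BLEU on whitespace tokens, smoothed for short outputs
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=SmoothingFunction().method1)

def neg_wer_metric(reference: str, hypothesis: str) -> float:
    # WER is an error rate (lower is better), so negate it for comparability
    return -wer(reference, hypothesis)
```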

We test two hypotheses:
🧪 H1: The absolute local accuracy of a metric changes as the context changes
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts

(🧵5/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
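One way these two hypotheses could be checked, as a rough sketch: H1 looks at how much each metric's local accuracy moves across contexts, H2 at whether the ordering of metrics changes between contexts. The data structure and the use of SciPy's Kendall's tau are illustrative assumptions, not the paper's implementation.

```python
# Sketch of checking H1 and H2 from a table of local accuracies.
# Assumed structure: local_acc[metric_name][context] = local accuracy in [0, 1].
from itertools import combinations
from scipy.stats import kendalltau  # assumed dependency for rank correlation

def check_hypotheses(local_acc):
    metrics = sorted(local_acc)
    contexts = sorted(next(iter(local_acc.values())))

    # H1: how much does each metric's absolute local accuracy move across contexts?
    h1_spread = {m: max(local_acc[m].values()) - min(local_acc[m].values())
                 for m in metrics}

    # H2: does the ranking of metrics change between contexts?
    # Low Kendall's tau between two contexts means the metric ordering shifted.
    h2_tau = {}
    for c1, c2 in combinations(contexts, 2):
        tau, _ = kendalltau([local_acc[m][c1] for m in metrics],
                            [local_acc[m][c2] for m in metrics])
        h2_tau[(c1, c2)] = tau
    return h1_spread, h2_tau
```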

More formally: given an input x, an output y from a context c, and a degraded version y′, we ask: how often does the metric score y higher than y′ across all inputs in the context c?

We create y′ using perturbations that simulate realistic degradations automatically.

(🧵4/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
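In code, the quantity described above might look like the following sketch. Here `metric` and `perturb` are placeholders for any scoring function and degradation procedure; the names and structure are illustrative, not the paper's actual implementation.

```python
from collections import defaultdict

def local_metric_accuracy(examples, metric, perturb):
    """Sketch of local metric accuracy.

    examples: iterable of (context, x, y) triples, where y is the output
              produced for input x in context c
    metric:   callable (x, y) -> float, higher is better
    perturb:  callable (y) -> y', a simulated realistic degradation

    Returns, per context, the fraction of inputs on which the metric scores
    the original output y above its degraded version y'.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for context, x, y in examples:
        y_degraded = perturb(y)
        wins[context] += int(metric(x, y) > metric(x, y_degraded))
        totals[context] += 1
    return {c: wins[c] / totals[c] for c in totals}
```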

🎯 Metric accuracy measures how often a metric picks the better system output.
🌍 Global accuracy averages this over all outputs.
🔎 Local accuracy zooms in on a specific context (like a model, domain, or quality level).

Contexts are just meaningful slices of your data.

(🧵3/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
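To make the global/local distinction concrete, a small sketch of the aggregation step, using per-example win indicators (1 if the metric preferred the better output). The grouping is an illustration of the idea, not the paper's code.

```python
def global_and_local_accuracy(wins_by_context):
    """wins_by_context: {context: [1 or 0 per example]} -- 1 means the metric
    preferred the better output on that example."""
    all_wins = [w for ws in wins_by_context.values() for w in ws]
    global_acc = sum(all_wins) / len(all_wins)                 # one pooled number
    local_acc = {c: sum(ws) / len(ws) for c, ws in wins_by_context.items()}
    return global_acc, local_acc
```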

Most meta-evaluations look at global performance over arbitrary outputs. However, real-world use cases are highly contextual, tied to specific models or output qualities.

We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.

(🧵2/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

Ever trusted a metric that works great on average, only for it to fail in your specific use case?

In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.

📄 aclanthology.org/2025.finding...
#NLP #Evaluation

(🧵1/9)

29.04.2025 17:10 — 👍 22    🔁 5    💬 1    📌 2

๐Ÿ™‹โ€โ™€๏ธ

18.11.2024 04:05 โ€” ๐Ÿ‘ 7    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
