Athiya Deviyani

@athiya.bsky.social

LTI PhD at CMU on evaluation and trustworthy ML/NLP, prev AI&CS Edinburgh University, Google, YouTube, Apple, Netflix. Views are personal 👩🏻‍💻🇮🇩 athiyadeviyani.github.io

1,002 Followers  |  472 Following  |  12 Posts  |  Joined: 08.12.2023

Latest posts by athiya.bsky.social on Bluesky

An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The overview describes that we survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). We then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).

๐Ÿ–‹๏ธ Curious how writing differs across (research) cultures?
๐Ÿšฉ Tired of โ€œculturalโ€ evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure โœจcultural normsโœจin scientific writing, and show thatโ—LLMs flatten themโ—

๐Ÿ“œ arxiv.org/abs/2506.00784

[1/11]

09.06.2025 23:29 โ€” ๐Ÿ‘ 74    ๐Ÿ” 30    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 5

Excited to be in Albuquerque for #NAACL2025 🏜️ presenting our poster "Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy"!

Come find me at
📍 Hall 3, Session B
🗓️ Wednesday, April 30 (tomorrow!)
🕚 11:00–12:30

Let's talk about all things eval! 📊

30.04.2025 02:39 — 👍 1    🔁 0    💬 0    📌 0

Thank you for the repost 🤗

29.04.2025 18:11 — 👍 1    🔁 0    💬 0    📌 0

If you're at NAACL this week (or just want to keep track), I have a feed for you: bsky.app/profile/did:...

Currently pulling everyone that mentions NAACL, posts a link from the ACL Anthology, or has NAACL in their username. Happy conferencing!

29.04.2025 18:07 — 👍 16    🔁 4    💬 1    📌 1

Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)

29.04.2025 17:00 — 👍 15    🔁 10    💬 2    📌 0

🚀 Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools appearing at #NAACL2025

This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson

1/🧵

29.04.2025 13:41 — 👍 12    🔁 8    💬 1    📌 2

When interacting with ChatGPT, have you wondered if it would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25    🔁 9    💬 1    📌 3
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy. Athiya Deviyani, Fernando Diaz. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.

🔑 So what now?
When picking metrics, don't rely on global scores alone.
🎯 Identify the evaluation context
🔍 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context
♻️ Reevaluate as models and tasks evolve

📄 aclanthology.org/2025.finding...
#NAACL2025

(🧵9/9)

29.04.2025 17:10 — 👍 2    🔁 2    💬 0    📌 0

For ASR:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is probably because ASR outputs are less ambiguous, and metrics focus on similar properties, such as phonetic or lexical accuracy.

(🧵8/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

Here's what we found for MT and Ranking:
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.

🚨 Picking a metric based purely on global performance is risky!

Choose wisely. 🧙🏻‍♂️

(🧵7/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

We evaluate this framework across three tasks:
📝 Machine Translation (MT)
🎙 Automatic Speech Recognition (ASR)
📈 Ranking

We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!

(🧵6/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
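As a rough illustration of how such metrics can be treated interchangeably, here is a small sketch that wraps two of them as higher-is-better scorers. The use of the jiwer and NLTK packages is an assumption for illustration, not the paper's actual tooling.

```python
# Sketch: wrap off-the-shelf metrics as higher-is-better scorers so they can
# all be plugged into the same meta-evaluation loop. jiwer and NLTK are
# assumed here for illustration; the paper's tooling may differ.
from jiwer import wer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_metric(reference: str, hypothesis: str) -> float:
    # Sentence-level BLEU on whitespace tokens, smoothed for short outputs
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=SmoothingFunction().method1)

def neg_wer_metric(reference: str, hypothesis: str) -> float:
    # WER is an error rate (lower is better), so negate it for comparability
    return -wer(reference, hypothesis)
```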

We test two hypotheses:
🧪 H1: The absolute local accuracy of a metric changes as the context changes
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts

(🧵5/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
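One way these two hypotheses could be checked, as a rough sketch: H1 looks at how much each metric's local accuracy moves across contexts, H2 at whether the ordering of metrics changes between contexts. The data structure and the use of SciPy's Kendall's tau are illustrative assumptions, not the paper's implementation.

```python
# Sketch of checking H1 and H2 from a table of local accuracies.
# Assumed structure: local_acc[metric_name][context] = local accuracy in [0, 1].
from itertools import combinations
from scipy.stats import kendalltau  # assumed dependency for rank correlation

def check_hypotheses(local_acc):
    metrics = sorted(local_acc)
    contexts = sorted(next(iter(local_acc.values())))

    # H1: how much does each metric's absolute local accuracy move across contexts?
    h1_spread = {m: max(local_acc[m].values()) - min(local_acc[m].values())
                 for m in metrics}

    # H2: does the ranking of metrics change between contexts?
    # Low Kendall's tau between two contexts means the metric ordering shifted.
    h2_tau = {}
    for c1, c2 in combinations(contexts, 2):
        tau, _ = kendalltau([local_acc[m][c1] for m in metrics],
                            [local_acc[m][c2] for m in metrics])
        h2_tau[(c1, c2)] = tau
    return h1_spread, h2_tau
```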

More formally: given an input x, an output y from a context c, and a degraded version y′, we ask: how often does the metric score y higher than y′ across all inputs in the context c?

We create y′ using perturbations that simulate realistic degradations automatically.

(🧵4/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
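In code, the quantity described above might look like the following sketch. Here `metric` and `perturb` are placeholders for any scoring function and degradation procedure; the names and structure are illustrative, not the paper's actual implementation.

```python
from collections import defaultdict

def local_metric_accuracy(examples, metric, perturb):
    """Sketch of local metric accuracy.

    examples: iterable of (context, x, y) triples, where y is the output
              produced for input x in context c
    metric:   callable (x, y) -> float, higher is better
    perturb:  callable (y) -> y', a simulated realistic degradation

    Returns, per context, the fraction of inputs on which the metric scores
    the original output y above its degraded version y'.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for context, x, y in examples:
        y_degraded = perturb(y)
        wins[context] += int(metric(x, y) > metric(x, y_degraded))
        totals[context] += 1
    return {c: wins[c] / totals[c] for c in totals}
```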

🎯 Metric accuracy measures how often a metric picks the better system output.
🌍 Global accuracy averages this over all outputs.
🔎 Local accuracy zooms in on a specific context (like a model, domain, or quality level).

Contexts are just meaningful slices of your data.

(🧵3/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0
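To make the global/local distinction concrete, a small sketch of the aggregation step, using per-example win indicators (1 if the metric preferred the better output). The grouping is an illustration of the idea, not the paper's code.

```python
def global_and_local_accuracy(wins_by_context):
    """wins_by_context: {context: [1 or 0 per example]} -- 1 means the metric
    preferred the better output on that example."""
    all_wins = [w for ws in wins_by_context.values() for w in ws]
    global_acc = sum(all_wins) / len(all_wins)                 # one pooled number
    local_acc = {c: sum(ws) / len(ws) for c, ws in wins_by_context.items()}
    return global_acc, local_acc
```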

Most meta-evaluations look at global performance over arbitrary outputs. However, real-world use cases are highly contextual, tied to specific models or output qualities.

We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.

(🧵2/9)

29.04.2025 17:10 — 👍 0    🔁 0    💬 1    📌 0

Ever trusted a metric that works great on average, only for it to fail in your specific use case?

In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.

📄 aclanthology.org/2025.finding...
#NLP #Evaluation

(🧵1/9)

29.04.2025 17:10 — 👍 22    🔁 5    💬 1    📌 2

๐Ÿ™‹โ€โ™€๏ธ

18.11.2024 04:05 โ€” ๐Ÿ‘ 7    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
