An overview of the work "Research Borderlands: Analysing Writing Across Research Cultures" by Shaily Bhatt, Tal August, and Maria Antoniak. The overview describes the approach: the authors survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). They then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).
Curious how writing differs across (research) cultures?
Tired of "cultural" evals that don't consult people?
We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨ in scientific writing, and show that "LLMs flatten them"
arxiv.org/abs/2506.00784
[1/11]
09.06.2025 23:29 · 74 likes · 30 reposts · 1 reply · 5 quotes
Excited to be in Albuquerque for #NAACL2025 🏜️ presenting our poster "Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy"!
Come find me at
📍 Hall 3, Session B
🗓️ Wednesday, April 30 (tomorrow!)
🕚 11:00–12:30
Let's talk about all things eval!
30.04.2025 02:39 · 1 like · 0 reposts · 0 replies · 0 quotes
Thank you for the repost 🤗
29.04.2025 18:11 · 1 like · 0 reposts · 0 replies · 0 quotes
If you're at NAACL this week (or just want to keep track), I have a feed for you: bsky.app/profile/did:...
Currently pulling everyone that mentions NAACL, posts a link from the ACL Anthology, or has NAACL in their username. Happy conferencing!
29.04.2025 18:07 · 16 likes · 4 reposts · 1 reply · 1 quote
Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)
29.04.2025 17:00 · 15 likes · 10 reposts · 2 replies · 0 quotes
Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools, appearing at #NAACL2025
This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson
1/🧵
29.04.2025 13:41 · 12 likes · 8 reposts · 1 reply · 2 quotes
When interacting with ChatGPT, have you wondered if it would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/
28.04.2025 20:36 · 25 likes · 9 reposts · 1 reply · 3 quotes
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani, Fernando Diaz. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.
So what now?
When picking metrics, don't rely on global scores alone.
🎯 Identify the evaluation context
📏 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context
♻️ Reevaluate as models and tasks evolve
aclanthology.org/2025.finding...
#NAACL2025
(🧵9/9)
29.04.2025 17:10 · 2 likes · 2 reposts · 0 replies · 0 quotes
For ASR:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is probably because ASR outputs are less ambiguous, and metrics focus on similar properties, such as phonetic or lexical accuracy.
(🧵8/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
Here's what we found for MT and Ranking:
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.
🚨 Picking a metric based purely on global performance is risky!
Choose wisely. 🧙🏻‍♀️
(🧵7/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
We evaluate this framework across three tasks:
Machine Translation (MT)
Automatic Speech Recognition (ASR)
Ranking
We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!
(🧵6/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
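As a concrete reference point for what one of these metrics computes, here is a minimal from-scratch sketch of word error rate (WER). It is illustrative only, not the paper's code; a real evaluation would use an established implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table over word positions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 reference words ≈ 0.33
```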
We test two hypotheses:
🧪 H1: The absolute local accuracy of a metric changes as the context changes
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts
(🧵5/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
More formally: given an input x, an output y from a context c, and a degraded version y′, we ask: how often does the metric score y higher than y′ across all inputs in the context c?
We create y′ using perturbations that simulate realistic degradations automatically.
(🧵4/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
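A minimal sketch of this pairwise definition, assuming a toy word-overlap metric and a drop-the-last-word perturbation; both are invented here for illustration and are not taken from the paper.

```python
def toy_metric(reference: str, output: str) -> float:
    """Stand-in for a real metric (BLEU, WER, COMET, ...): word overlap with the reference."""
    ref, out = set(reference.split()), set(output.split())
    return len(ref & out) / max(len(ref | out), 1)

def perturb(output: str) -> str:
    """Create the degraded version y' by dropping the final word (a crude simulated degradation)."""
    return " ".join(output.split()[:-1])

def local_accuracy(context_examples) -> float:
    """Fraction of (reference, y) pairs in a context where the metric scores y above its degraded y'."""
    wins = sum(
        toy_metric(ref, y) > toy_metric(ref, perturb(y))
        for ref, y in context_examples
    )
    return wins / len(context_examples)

# A "context" c is just a meaningful slice of data, e.g. one system's outputs.
context_c = [
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("a dog barked loudly", "a dog barked loud"),
]
# The toy metric prefers the degraded output in one of the two cases, so this prints 0.50:
print(f"local accuracy in context c: {local_accuracy(context_c):.2f}")
```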
🎯 Metric accuracy measures how often a metric picks the better system output.
🌍 Global accuracy averages this over all outputs.
🔍 Local accuracy zooms in on a specific context (like a model, domain, or quality level).
Contexts are just meaningful slices of your data.
(🧵3/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
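A tiny sketch of the global-versus-local distinction, assuming pairwise metric judgments have already been collected; the context labels and numbers are invented.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (context, did the metric prefer the better output?)
comparisons = [
    ("model_A", True), ("model_A", True), ("model_A", False),
    ("model_B", True), ("model_B", False), ("model_B", False),
]

# Global accuracy: average over all outputs, regardless of context.
global_acc = sum(ok for _, ok in comparisons) / len(comparisons)

# Local accuracy: the same average, but computed within each context slice.
by_context = defaultdict(list)
for context, ok in comparisons:
    by_context[context].append(ok)
local_acc = {c: sum(oks) / len(oks) for c, oks in by_context.items()}

print(f"global accuracy: {global_acc:.2f}")   # 0.50
for c, acc in local_acc.items():
    print(f"{c}: {acc:.2f}")                  # model_A: 0.67, model_B: 0.33
```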
Most meta-evaluations look at global performance over arbitrary outputs. However, real-world use cases are highly contextual, tied to specific models or output qualities.
We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.
(🧵2/9)
29.04.2025 17:10 · 0 likes · 0 reposts · 1 reply · 0 quotes
Ever trusted a metric that works great on average, only for it to fail in your specific use case?
In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.
aclanthology.org/2025.finding...
#NLP #Evaluation
(🧵1/9)
29.04.2025 17:10 · 22 likes · 5 reposts · 1 reply · 2 quotes
🙋‍♀️
18.11.2024 04:05 · 7 likes · 0 reposts · 0 replies · 0 quotes
cs && comp-bio ugrad @pitt_sci; in love with #NLProc 🗣️🧠; aspiring educator; he/him
The 2025 Conference on Language Modeling will take place at the Palais des Congrès in Montreal, Canada from October 7-10, 2025
PhD Student @cmurobotics.bsky.social with @jeff-ichnowski.bsky.social || DUSt3R Research Intern @naverlabseurope || 4D Vision for Robot Manipulation 📷
He/Him - https://bart-ai.com
Behavioral and Internal Interpretability 🔍
Incoming PostDoc Tübingen University | PhD Student at @ukplab.bsky.social, TU Darmstadt/Hochschule Luzern
Casual account. Here to see people's art, book recs, and discussions on stats/ML!
PhD student @mainlp.bsky.social (@cislmu.bsky.social, LMU Munich). Interested in language variation & change, currently working on NLP for dialects and low-resource languages.
verenablaschke.github.io
Postdoc at IBME in Oxford. Machine learning for healthcare.
https://www.fregu856.com/
Assistant Professor at UCLA. Alum @StanfordNLP. NLP, Cognitive Science, Accessibility. https://www.coalas-lab.com/elisakreiss
AI researcher @ Mila, UdeM. PhD focused on OOD detection & generalization. Building robust deep learning. Previously: Microsoft ATL, Tensorgraph. #AI #MachineLearning
Uses machine learning to study literary imagination, and vice-versa. Likely to share news about AI & computational social science / Sozialwissenschaft / 社会科学
Information Sciences and English, UIUC. Distant Horizons (Chicago, 2019). tedunderwood.com
ELLIS PhD Fellow @belongielab.org | @aicentre.dk | University of Copenhagen | @amsterdamnlp.bsky.social | @ellis.eu
Multi-modal ML | Alignment | Culture | Evaluations & Safety| AI & Society
Web: https://www.srishti.dev/
The School of Computer Science at Carnegie Mellon University is one of the world's premier institutions for CS and robotics research and education. We build useful stuff that works!
Master's student @ltiatcmu.bsky.social. he/him
President of Signal, Chief Advisor to AI Now Institute
associate prof at UMD CS researching NLP & LLMs
PhD Candidate at Institute of AI in Management, LMU Munich
causal machine learning, causal inference
Biostatistics phd student @University of Washington
Interested in non-parametric statistics, causal inference, and science!
Assistant Professor of Computer Graphics and Geometry Processing at Columbia University www.silviasellan.com
Assistant Professor @ UChicago CS/DSI (NLP & HCI) | Writing with AI ✍️
https://minalee-research.github.io/