See also @manoelhortaribeiro.bsky.social's post on this same topic: doomscrollingbabel.manoel.xyz/p/labeling-d...
19.11.2025 15:44
Trying an experiment in good old-fashioned blogging about papers: dallascard.github.io/granular-mat...
16.11.2025 19:52
#TBT #NLProc Attanasio et al.'s study asks 'Is It Worth the (Environmental) Cost?', analyzing continuous training for language models and balancing its benefits against its environmental impacts for responsible use. #Sustainability
20.11.2025 16:02
#MemoryMonday #NLProc 'State of Profanity Obfuscation in NLP Scientific Publications' probes bias in non-English papers. @deboranozza.bsky.social & @dirkhovy.bsky.social (2023) propose 'PrOf' to aid authors & improve access.
17.11.2025 16:04
#TBT #NLProc Explore 'Wisdom of Instruction-Tuned LLM Crowds' by Plaza et al. LLM crowd labels outperform single models across tasks & languages. But few-shot can't top zero-shot, and supervised models still rule.
30.10.2025 16:05
#MemoryMonday #NLProc 'Universal Joy: A Data Set and Results for Classifying Emotions Across Languages' by Lamprinidis et al. (2021) explores how emotions are expressed, and can be classified, across languages.
03.11.2025 16:02
#TBT #NLProc "Explaining Speech Classification Models" by Pastor et al. (2024) makes speech classification more transparent! Their research reveals which words matter most and how tone and background noise impact decisions.
06.11.2025 16:04
#MemoryMonday #NLProc 'Measuring Harmful Representations in Scandinavian Language Models' uncovers gender bias, challenging Scandinavia's equity image.
10.11.2025 16:03
#TBT #NLProc Hessenthaler et al.'s 2022 work examines how fairness relates to energy reduction in English NLP models, challenging common assumptions about bias reduction. #AI #sustainability
13.11.2025 16:05
[Image: the Best Paper slide at #EMNLP2025, with the audience in the background]
Congratulations to all #EMNLP2025 award winners!
Starting with the ✨Best Paper award✨:
"Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index"
by Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi
aclanthology.org/2025.emnlp-m...
1/n
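For intuition, here is a minimal sketch of the core query such an index accelerates: exact counting of an n-gram in a corpus via binary search over sorted suffixes. This is a toy suffix-array illustration only, not the paper's FM-index, which answers the same queries in compressed space at internet scale.

```python
import bisect

def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: sort all suffix start positions lexicographically.
    O(n^2 log n) -- for illustration only; real indexes build this efficiently."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count exact matches of `pattern` by locating the contiguous range of
    suffixes that start with it (requires Python 3.10+ for bisect's key=)."""
    lo = bisect.bisect_left(sa, pattern, key=lambda i: text[i:i + len(pattern)])
    hi = bisect.bisect_right(sa, pattern, key=lambda i: text[i:i + len(pattern)])
    return hi - lo

corpus = "the cat sat on the mat and the cat ran"
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, "the cat"))  # 2
```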
Maybe it is time to report *intra*-annotator agreement?
aclanthology.org/2025.nlpersp...
Last week at @nlperspectives.bsky.social I presented work showing that annotators provide the same label on only ~75% of items across four NLP labelling tasks after a two-week gap
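For concreteness, a minimal sketch of how intra-annotator agreement can be computed: the same annotator labels the same items in two rounds (mirroring the two-week gap), and we score their self-consistency. The labels below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

def intra_annotator_agreement(round1: list[str], round2: list[str]) -> dict:
    """Agreement of one annotator with themselves across two rounds,
    on the same items in the same order."""
    assert len(round1) == len(round2)
    raw = sum(a == b for a, b in zip(round1, round2)) / len(round1)
    kappa = cohen_kappa_score(round1, round2)  # chance-corrected self-agreement
    return {"raw_agreement": raw, "cohens_kappa": kappa}

# Hypothetical relabeling of the same 8 items two weeks apart:
r1 = ["hate", "ok", "ok", "hate", "ok", "hate", "ok", "ok"]
r2 = ["hate", "ok", "hate", "hate", "ok", "ok", "ok", "ok"]
print(intra_annotator_agreement(r1, r2))  # raw_agreement: 0.75
```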
11.11.2025 16:44
You missed one: G. Abercrombie, T. Dinkar, A. Cercas Curry, V. Rieser & @dirkhovy.bsky.social, 'Consistency is Key: Disentangling Label Variation in NLP with Intra-Annotator Agreement'. @nlperspectives.bsky.social
03.11.2025 02:34
Excited to head to Suzhou for the 30th edition of #EMNLP2025! Had the great honor to serve as general chair this year. Looking forward to catching up with everyone and seeing some amazing #NLP research!
02.11.2025 05:54
Nov 5 - Main Conference Posters
Personalization up to a Point
In the context of content moderation, we show that fully personalized models can perpetuate hate speech, and propose a policy-based method to impose legal boundaries (toy sketch below).
Hall C | 11:00-12:30
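A toy sketch of the "personalization up to a point" idea, with made-up scores and thresholds: personalized preferences decide borderline content, but a shared legal boundary always overrides them.

```python
def moderate(text: str, personal_score: float, legal_score: float,
             personal_threshold: float = 0.5) -> str:
    """personal_score comes from the user's personalized model; legal_score
    from a shared model for legally prohibited content (values made up)."""
    if legal_score > 0.5:
        return "remove"              # legal boundary: non-negotiable
    if personal_score > personal_threshold:
        return "hide_for_this_user"  # personal preference only
    return "keep"

print(moderate("...", personal_score=0.2, legal_score=0.9))  # remove
print(moderate("...", personal_score=0.8, legal_score=0.1))  # hide_for_this_user
```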
Nov 5 - Main Conference Posters
Biased Tales
A dataset of 5k short LLM-generated bedtime stories across sociocultural axes, with an evaluation taxonomy for character-centric and context-centric attributes.
Hall C | 11:00-12:30
Nov 5 - Demo
Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Co-DETECT - an iterative human-LLM collaboration framework for surfacing edge cases and refining annotation codebooks in text classification (loop sketched below).
Demo Session 2 - Hall C3 | 14:30-16:00
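A minimal sketch of the loop shape (hypothetical stand-ins, not the actual Co-DETECT system): an LLM labels items against the current codebook, and low-confidence items are surfaced as edge cases for a human to fold back into the codebook.

```python
def llm_classify(text: str, codebook: dict[str, str]) -> tuple[str, float]:
    """Toy stand-in for an LLM call: keyword rules, low confidence on misses."""
    for keyword, label in codebook.items():
        if keyword in text.lower():
            return label, 0.9
    return "unclear", 0.3

def surface_edge_cases(texts: list[str], codebook: dict[str, str],
                       threshold: float = 0.7) -> list[str]:
    """Items the current codebook cannot label confidently: candidates for
    the human to review and turn into new or revised codebook rules."""
    return [t for t in texts if llm_classify(t, codebook)[1] < threshold]

codebook = {"refund": "billing", "crash": "bug"}
texts = ["App crashes on login", "Where is my refund?", "The UI feels off"]
print(surface_edge_cases(texts, codebook))  # ['The UI feels off']
```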
Nov 6 - Findings Posters
The 'r' in 'woman' stands for rights.
We propose a taxonomy of social dynamics in implicit misogyny (EN, IT) and audit 9 LLMs - they consistently fail. The more social knowledge a message requires, the worse they perform.
Hall C | 12:30-13:30
Nov 7 - Main Conference Posters
Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance
We discuss different applications of LLM persona prompting, and how to measure their success (measurement sketch below).
Hall C | 10:30-12:00
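One hedged sketch of how such an effect could be measured: run the same evaluation items with and without a persona prefix and compare accuracies. `query_llm` is a hypothetical stand-in for any chat API, not the paper's setup.

```python
def query_llm(system_prompt: str, question: str) -> str:
    raise NotImplementedError  # wrap your chat-completion API of choice here

def accuracy_with_persona(persona: str | None,
                          items: list[tuple[str, str]]) -> float:
    """Fraction of (question, gold_answer) items answered correctly under
    the given persona; None gives the no-persona baseline."""
    system = f"You are {persona}." if persona else "You are a helpful assistant."
    correct = sum(query_llm(system, q).strip() == gold for q, gold in items)
    return correct / len(items)

# Compare a domain-expert persona against the no-persona baseline:
# baseline = accuracy_with_persona(None, eval_items)
# expert   = accuracy_with_persona("an experienced epidemiologist", eval_items)
# print(expert - baseline)  # the measured persona effect on this task
```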
Nov 7 - Main Conference Posters
TrojanStego: Your Language Model Can Secretly Be a Steganographic Privacy-Leaking Agent
LLMs can be fine-tuned to leak secrets via token-based steganography! A toy illustration of the channel follows below.
Hall C | 10:30-12:00
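To illustrate the covert channel (an illustration only, not the paper's fine-tuning attack), here is a toy encoder that hides one bit per word in otherwise innocuous synonym choices:

```python
# Hiding bits in lexical choices: each synonym pair carries one bit,
# depending on which member of the pair appears in the output text.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("happy", "glad"),
            ("begin", "start"), ("help", "assist"), ("buy", "purchase"),
            ("end", "finish"), ("small", "little")]

def encode_bits(bits: str) -> list[str]:
    """Pick the 1st or 2nd synonym of each pair to carry one bit per word."""
    return [SYNONYMS[i][int(b)] for i, b in enumerate(bits)]

def decode_bits(words: list[str]) -> str:
    """Recover the bits from which synonym appeared."""
    return "".join(str(SYNONYMS[i].index(w)) for i, w in enumerate(words))

secret = "10110010"            # 8 bits smuggled into 8 word choices
cover = encode_bits(secret)
print(cover)                   # ['large', 'quick', 'fast', 'start', ...]
print(decode_bits(cover))      # '10110010'
```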
Nov 8 - WiNLP Workshop
No for Some, Yes for Others
We investigate how sociodemographic persona prompts affect false refusal behaviors in LLMs. Model and task type are the dominant factors driving these refusals.
Nov 8 - NLPerspectives Workshop
Balancing Quality and Variation
For datasets to represent diverse opinions, they must preserve variation while filtering out spam. We evaluate annotator filtering heuristics and show that they often remove genuine variation (toy example below).
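A toy example of the failure mode (made-up data and threshold): a majority-agreement spam filter drops an annotator who consistently holds a genuine minority view.

```python
from collections import Counter

def majority_agreement(annotator: str,
                       labels: dict[str, dict[str, str]]) -> float:
    """Fraction of items where `annotator` matches the per-item majority label."""
    hits, total = 0, 0
    for item_labels in labels.values():
        if annotator not in item_labels:
            continue
        majority = Counter(item_labels.values()).most_common(1)[0][0]
        hits += item_labels[annotator] == majority
        total += 1
    return hits / total

# labels[item][annotator] = label; ann3 consistently reads these as offensive.
labels = {
    "i1": {"ann1": "ok", "ann2": "ok", "ann3": "offensive"},
    "i2": {"ann1": "ok", "ann2": "ok", "ann3": "offensive"},
    "i3": {"ann1": "ok", "ann2": "ok", "ann3": "ok"},
}
keep = [a for a in ("ann1", "ann2", "ann3") if majority_agreement(a, labels) >= 0.5]
print(keep)  # ['ann1', 'ann2'] -- the consistent minority voice is dropped
```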
Nov 8 - BabyLM Workshop
Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction
ContingentChat, a Teacher-Student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words.
Nov 8 - *SEM Workshop
Generalizability of Media Frames: Corpus Creation and Analysis Across Countries
We investigate how well media frames generalize across different media landscapes. The 15 Media Frames Corpus (MFC) frames remain broadly applicable, with minor revisions to the guidelines.
Nov 6 - Oral Presentation (TACL)
IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance
A foundation for measuring LLM political bias in realistic user conversations (construction sketched below).
A303 | 10:30-12:00
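A sketch of the template-times-issue construction that makes realistic scale possible; the templates and issues below are invented placeholders, not IssueBench's actual data.

```python
from itertools import product

# Crossing writing-assistance templates with political issues yields the
# prompt set; real scale comes from many templates x many issues.
templates = [
    "Help me write a blog post about {issue}.",
    "Draft a short speech arguing about {issue} for my class.",
    "Summarize the debate around {issue} for a newsletter.",
]
issues = ["gun control", "nuclear energy", "school uniforms"]

prompts = [t.format(issue=i) for t, i in product(templates, issues)]
print(len(prompts))  # 9 here; thousands x hundreds scales to millions
print(prompts[0])    # "Help me write a blog post about gun control."
```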
Proud to present our #EMNLP2025 papers!
Catch our team across Main, Findings, Workshops & Demos
There's plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases - which is where bias actually matters.
IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!
New results 🧵
Can LLMs learn to simulate individuals' judgments based on their demographics?
Not quite! In our new paper, we found that LLMs do not learn information about demographics, but instead learn individual annotators' patterns based on unique combinations of attributes!
🧵
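A quick toy check of the underlying point (data made up): when every combination of demographic attributes identifies exactly one annotator, a model "conditioned on demographics" can simply memorize individuals instead of learning anything about the groups themselves.

```python
from collections import defaultdict

# Toy annotator pool: demographic profile per annotator.
annotators = {
    "a1": ("female", "18-29", "urban"),
    "a2": ("male",   "30-44", "rural"),
    "a3": ("female", "45-59", "urban"),
    "a4": ("male",   "18-29", "urban"),
}

by_profile = defaultdict(list)
for annotator, profile in annotators.items():
    by_profile[profile].append(annotator)

# Share of profiles that map to exactly one annotator: if this is high,
# "demographic conditioning" is indistinguishable from annotator identity.
unique = sum(len(v) == 1 for v in by_profile.values()) / len(by_profile)
print(f"{unique:.0%} of demographic profiles identify a single annotator")  # 100%
```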
LLMs are good at simulating human behaviours, but they are not going to be great unless we train them to be.
We hope SimBench can be the foundation for more specialised development of LLM simulators.
I really enjoyed working on this with @tiancheng.bsky.social et al. Many fun results
Check out the paper and data for details!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu (9/9)