I am delighted to share our new #PNAS paper, with @grvkamath.bsky.social @msonderegger.bsky.social and @sivareddyg.bsky.social, on whether age matters for the adoption of new meanings. That is, as words change meaning, does the rate of adoption vary across generations? www.pnas.org/doi/epdf/10....
29.07.2025 12:31 · 47 likes, 15 reposts, 3 replies, 1 quote
At #ACL2025 this week! Please reach out if you want to chat :)
We have two lovely posters:
Tues, Session 2, 10:30-11:50 · Large Language Models Struggle to Describe the Haystack without Human Help
Wed, Session 4, 11:00-12:30 · ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
27.07.2025 09:30 · 11 likes, 0 reposts, 0 replies, 0 quotes
Yay! When do you get in?
27.07.2025 07:54 · 0 likes, 0 reposts, 0 replies, 0 quotes
Oh this is great! Gets at the discussion in Maria's thread from the other week
bsky.app/profile/mari...
24.07.2025 06:31 · 1 like, 0 reposts, 0 replies, 0 quotes
sad to be missing IC2S2, anything interesting you've been seeing in this area there? (If you don't mind my asking!)
24.07.2025 06:24 · 0 likes, 0 reposts, 0 replies, 0 quotes
Oh whoops, saw you posted it!
23.07.2025 10:19 · 1 like, 0 reposts, 0 replies, 0 quotes
The papers from Egami et al. on how to correct statistically biased LLM annotations for use in downstream models (like regressions) are very good
proceedings.neurips.cc/paper_files/...
naokiegami.com/paper/dsl_ss...
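For intuition, here's a toy sketch of the core bias correction as I read it (not the authors' implementation; variable names and data are mine): given LLM labels for every document and expert labels for a random subset with known sampling probability, an inverse-probability-weighted residual correction yields an unbiased pseudo-outcome you can feed into downstream regressions.

```python
# Minimal sketch of the design-based correction behind DSL (Egami et al.).
# Assumes a random subset of documents has expert ("gold") labels; all
# names and numbers here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
llm_labels = rng.binomial(1, 0.6, size=n).astype(float)  # noisy LLM annotations

# Expert-code a simple random sample with known inclusion probability pi
pi = 0.05
is_gold = rng.random(n) < pi
gold = np.where(is_gold, (rng.random(n) < 0.55).astype(float), np.nan)

# Bias-corrected pseudo-outcome: LLM label minus an inverse-probability-
# weighted estimate of its error, measured on the gold subset
resid = np.where(is_gold, llm_labels - gold, 0.0)
y_tilde = llm_labels - resid * (is_gold / pi)

# y_tilde is unbiased for the gold outcome and can feed a downstream
# regression; naively averaging llm_labels would not be
print(y_tilde.mean(), np.nanmean(gold))
```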
23.07.2025 10:19 · 4 likes, 0 reposts, 1 reply, 0 quotes
The precursor to this paper, "The Incoherence of Coherence", had our most-watched paper video ever, so I thought we had to surpass it somehow ... we decided to do a song parody (of Roxanne, obviously):
youtu.be/87OBxEM8a9E
18.07.2025 18:37 · 7 likes, 2 reposts, 0 replies, 0 quotes
Yu did an excellent job as first author on this paper, and it was his first time publishing in NLP/ML! Yang led the MTEB experimentation, which we decided to add in at the end, and also nailed it
17.07.2025 12:59 · 1 like, 0 reposts, 0 replies, 0 quotes
Together these results indicate that there is (likely) little to be lost in applying erasure if you have observed confounders.
Caveats: erasing many small categories may not work. And if confounders are unobserved, you'd have to infer them first.
17.07.2025 10:52 · 1 like, 0 reposts, 1 reply, 0 quotes
Bar chart of performance on STS tasks, with and without LEACE erasure. Each group of bars compares the base and LEACE-erased models for MiniLM and E5-base-v2 embeddings
Despite the better metrics, we thought that erasure might degrade embeddings in ways we weren't measuring.
We applied LEACE models trained on our target datasets to out-of-domain embeddings from MTEB data. Surprisingly, MTEB metrics did not change!
17.07.2025 10:52 · 2 likes, 0 reposts, 1 reply, 0 quotes
Screenshot of a table showing differences in cross-lingual document similarity search, showing that linear erasure improves recall@1 and @10 across several models. Here, the concept is the document's language. Erasure improves recall of the paired item in all cases, in some instances improving smaller models over their larger counterparts.
Applying linear erasure to remove source/language information from text embeddings (say, from sentence transformers) produces dramatic improvements on document similarity & clustering tasks
We use LEACE (@norabelrose.bsky.social et al. 2023), which is also cheap to run (seconds on a laptop)
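A minimal sketch of that workflow, assuming the `concept-erasure` package's `LeaceEraser` API and an illustrative multilingual model (the docs and model choice are mine, not necessarily the paper's exact setup):

```python
# Hedged sketch: erase document language from sentence-transformer
# embeddings with LEACE, then compare documents across languages
import torch
from concept_erasure import LeaceEraser
from sentence_transformers import SentenceTransformer

docs = [
    "The senate passed the budget bill.",      # en, topic: budget
    "Le senat a adopte le projet de budget.",  # fr, topic: budget
    "Taxes will rise next year.",              # en, topic: taxes
    "Les impots augmenteront l'an prochain.",  # fr, topic: taxes
]
langs = torch.tensor([0, 1, 0, 1])  # concept to erase: language

model = SentenceTransformer("intfloat/multilingual-e5-base")
X = torch.tensor(model.encode(docs)).float()

# Fit the least-squares-optimal linear eraser on (embedding, concept) pairs
Z = torch.nn.functional.one_hot(langs).float()
eraser = LeaceEraser.fit(X, Z)
X_erased = eraser(X)

# After erasure, nearest neighbors should pair by topic, not by language
sims = torch.nn.functional.cosine_similarity(
    X_erased.unsqueeze(1), X_erased.unsqueeze(0), dim=-1
)
print(sims.round(decimals=2))
```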
17.07.2025 10:52 · 3 likes, 0 reposts, 1 reply, 0 quotes
Example clustering code. Can be viewed here: https://github.com/y-fn/deconfounding-text-embeddings/blob/main/cluster_example.py
Attributes like language/source are confounders that distort distance-based applications
Debiasing methods remove unwanted information from embeddings; linear concept erasure in particular makes it so a linear predictor cannot recover a concept (e.g., language) from the representation
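To make the clustering use case concrete, here's a self-contained toy version (synthetic stand-in "embeddings", not the repo's cluster_example.py): erase the source concept, run k-means, and check whether clusters still just reproduce the sources.

```python
# Toy demo: k-means on embeddings before vs. after erasing the source
import numpy as np
import torch
from concept_erasure import LeaceEraser
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d = 1000, 64
source = rng.integers(0, 2, size=n)                  # 0 = bills, 1 = tweets
X = rng.normal(size=(n, d)) + 3.0 * source[:, None]  # source shifts the space

X_t = torch.from_numpy(X).float()
Z = torch.nn.functional.one_hot(torch.from_numpy(source)).float()
X_erased = LeaceEraser.fit(X_t, Z)(X_t).numpy()

for name, feats in [("raw", X), ("erased", X_erased)]:
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
    # Fraction of each cluster drawn from source 0; values near 0 or 1
    # mean the cluster is just reproducing a source
    fracs = [source[labels == k].mean() for k in range(4)]
    print(name, np.round(fracs, 2))
```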
17.07.2025 10:52 · 2 likes, 0 reposts, 1 reply, 0 quotes
Bar chart of the number of items in four clusters of text embeddings, with colors showing the distribution of sources in each cluster.
Caption: Clustering text embeddings from disparate sources (here, U.S. congressional bill summaries and senators' tweets) can produce clusters where one source dominates (Panel A). Using linear erasure to remove the source information produces more evenly balanced clusters that maintain semantic coherence (Panel B; sampled items relate to immigration). Four random k-means clusters shown (k=25), trained on 5,000 samples from each dataset, combined.
New preprint! Have you ever tried to cluster text embeddings from different sources, but the clusters just reproduce the sources? Or attempted to retrieve similar documents across multiple languages, and even multilingual embeddings return items in the same language?
Turns out there's an easy fix 🧵
17.07.2025 10:52 · 26 likes, 7 reposts, 2 replies, 0 quotes
Idk why I didn't think of this earlier, but the emerging RL/reasoning approach is basically taking this logic to its natural conclusion (and anyone who's seen the unhinged reasoning traces of R1 will understand that it doesn't make sense to "read into" explanations/prompts)
16.07.2025 21:22 · 4 likes, 0 reposts, 0 replies, 0 quotes
Like if you want to claim that definition A of some concept matches expert annotations better than definition B, or that "tree of thought" is better than "vine of thought", then you should sweep over reasonable variations
16.07.2025 19:01 · 1 like, 0 reposts, 0 replies, 0 quotes
Agreed. I feel like the best practice, which I basically never see, is to vary prompt formatting and prompt phrasing (keeping the goal/idea the same) and report the variation (just as, in classical ML, you might demonstrate that one algo was better by varying hyperparameters/random seeds)
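A skeleton of what that reporting could look like; everything here is a placeholder, and the simulated scorer stands in for a real LLM annotation run:

```python
# Sketch: hold the instruction's meaning fixed, vary its surface form,
# and report the spread of the downstream metric. All placeholders.
import random
import statistics

random.seed(0)

PROMPT_VARIANTS = [
    "Label the following speech as populist or not populist.",
    "Is this speech populist? Answer yes or no.",
    "Decide whether the speech below is populist (yes/no).",
]

def agreement_with_gold(prompt: str) -> float:
    # Stand-in for running the full annotation pipeline with one prompt
    # variant and scoring it against gold labels
    return 0.78 + random.uniform(-0.05, 0.05)

scores = [agreement_with_gold(p) for p in PROMPT_VARIANTS]
print(f"accuracy: mean={statistics.mean(scores):.2f}, "
      f"sd={statistics.stdev(scores):.2f}")
```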
16.07.2025 19:00 · 4 likes, 0 reposts, 1 reply, 0 quotes
Yes, and I'm irked by claims that over-interpret the semantics of prompts. Say you're operationalizing two distinct theories of populism as two prompts for annotating political speech. IMO it's a mistake to attribute diffs in outputs to diffs in the theory; maybe one prompt had a trailing space
16.07.2025 18:28 · 2 likes, 0 reposts, 0 replies, 0 quotes
Also joint work with @boydgraber.bsky.social and Philip Resnik
This work concludes a "trilogy" of topic model evaluation papers
paper 1: dl.acm.org/doi/10.5555/...
thread 1: x.com/miserlis_/st...
paper 2: aclanthology.org/2022.finding...
thread 2: x.com/miserlis_/st...
08.07.2025 12:40 · 6 likes, 0 reposts, 1 reply, 0 quotes
Lorena is co-first author (I think not on bluesky) and did phenomenal work (especially on the package). She visited us at UMD and ETH, and is a great researcher and collaborator with an enviable work ethic. She is looking for postdocs in Europe this year; anyone would be lucky to have her!
08.07.2025 12:40 · 2 likes, 0 reposts, 1 reply, 0 quotes
We release a package and web frontend to evaluate your own topic model / document clustering outputs. We also include human data to encourage the development of new metrics/LLM judges
Don't hesitate to post issues!
Code: github.com/ahoho/proxann
Paper: arxiv.org/pdf/2507.00828
08.07.2025 12:40 · 5 likes, 1 repost, 1 reply, 0 quotes
Illustration of an LLM-judge following the fit and ranking steps in the protocol
Table of advantage probabilities from the alternative annotator test. Most scores are above 0.5; models evaluated are GPT-4o, Llama 8B/70B, Qwen 3B/32B/72B
Caption: Advantage probabilities from the alternative annotator test; the probability that ProxAnn is "as good as or better than a randomly chosen human annotator" (Calderon et al., 2025). Document-level scores consider annotations by document; topic-level scores, all documents evaluated in the topic. † indicates that win rates over humans are above 0.5, as determined by a one-sided t-test (over 10 resamples of combined annotators); ‡ is the equivalent for Wilcoxon signed-rank.
The protocol is also easily adapted to LLM judges: We call ours ProxAnn. While LLMs aren't perfect substitutes, they are about as good as an arbitrary human annotator
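Roughly, the test asks how often the LLM judge agrees with a group of humans at least as well as a held-out human does. A toy version of that logic (heavily simplified relative to Calderon et al., 2025; synthetic scores):

```python
# Toy win-rate computation: does the LLM judge track the remaining
# annotators' consensus at least as well as each held-out human?
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_items, n_humans = 40, 5
truth = rng.normal(size=n_items)
humans = truth + rng.normal(scale=0.7, size=(n_humans, n_items))
llm = truth + rng.normal(scale=0.8, size=n_items)

wins = []
for h in range(n_humans):
    rest = np.delete(humans, h, axis=0).mean(axis=0)  # consensus of others
    tau_llm, _ = kendalltau(llm, rest)
    tau_human, _ = kendalltau(humans[h], rest)
    wins.append(tau_llm >= tau_human)

print("advantage probability ~", np.mean(wins))  # > 0.5: judge holds up
```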
08.07.2025 12:40 · 1 like, 0 reposts, 1 reply, 0 quotes
Diagram showing how human annotator scores are correlated with document-topic distributions from the topic model
Boxplots of human–human and human–topic-model correlations.
Caption: Annotators review the top documents and words from a single topic and infer a category (Label Step), then assign scores to additional documents based on their relationship to the category (Fit and Rank Steps). These scores are correlated with each other (inter-annotator Kendall's τ) and with the model's document-topic estimates (θ_k; TM-annotator τ). There are eight topics per model; boxplots report variation in τ over each topic-annotator tuple.
Models are then evaluated by measuring whether annotations agree with model outputs: that is, do annotator scores correlate with document-topic probabilities (or distance to centroid)?
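In code, that agreement check is just a rank correlation per topic; a minimal example with made-up scores:

```python
# Correlate one annotator's fit scores with the model's document-topic
# probabilities using Kendall's tau. Numbers are invented.
from scipy.stats import kendalltau

annotator_scores = [5, 4, 4, 2, 1, 3, 1, 2]  # 1-5 fit ratings per document
theta_k = [0.81, 0.62, 0.55, 0.12, 0.03, 0.33, 0.07, 0.19]  # P(topic k | doc)

tau, p = kendalltau(annotator_scores, theta_k)
print(f"TM-annotator tau = {tau:.2f} (p = {p:.3f})")
```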
A human study finds that, in line with other work, classic LDA (Mallet) continues to work well
08.07.2025 12:40 · 8 likes, 1 repost, 1 reply, 0 quotes
Illustration of step 1 of the protocol. A model outputs the top words and documents, and an annotator reviews them and assigns a label
Step 2 of the protocol: annotators review unseen documents and assign them a score (on a 1-5 scale) based on how well they fit the category
In the final step, they rank the documents by their relevance to the category
The setup approximates real-world qualitative content analysis. An annotator:
1. Reviews a small collection of documents (& top words) for a topic, and writes down a category label
2. Determines whether new documents fit that label
3. Ranks the documents by relevance to the label
08.07.2025 12:40 · 3 likes, 0 reposts, 1 reply, 0 quotes
Reader in Computational Social Science at the University of Edinburgh. he/him
MIT media lab // researching fairness, equity, & pluralistic alignment in LLMs
previously @ mila / mcgill
i like language and dogs and plants and ultimate frisbee and baking and sunsets
https://elinorp-d.github.io
Faculty fellow at NYU CDS. Previously: PhD @ BIU NLP.
Historian-Artist. Assistant professor UCLA, Information Studies, Digital Humanities, Southeast Asia, libraries. cindyanguyen.com Author of Bibliotactics https://bibliotactics.com
PhD student in Computer Science and Natural Language Processing at ETH ZΓΌrich
https://najoung.kim
langauge
Reporting on AI and the future of the economy. Computer science masters degree from Princeton.
Subscribe to my AI newsletter: http://www.understandingai.org
Author of NIMBY NATION: The War on Growth That Created Our Housing Crisis and Remade American Politics (Bloomsbury, 2027). American historian and Klarman Fellow @Cornell. Learn more: JacobAnbinder.com
Studying NLP, CSS, and Human-AI interaction. PhD student @MIT. Previously at Microsoft FATE + CSS, Oxford Internet Institute, Stanford Symbolic Systems
hopeschroeder.com
Assistant Professor at ETH Zurich; interested in Natural language processing, Machine learning and Edtech
Prof at ETH: Law + economics + data science
Asst prof of computer science interested in computational methods for the study of language and culture.
Applied scientist trying to make the internet a little better. PhD. Trust & safety, platform manipulation, networks, fingerstyle guitar. I use my hair to express myself. He/they
https://janetlauyeung.github.io/
postdoc @mainlp.bsky.social, LMU Munich
PhD in CompLing from Georgetown
prev: x2 intern @Spotify @SpotifyResearch
PhD candidate at LMU Munich. Representations, model and data attribution, training dynamics.
Strong opinions on coffee and tea ☕
https://florian-eichin.com
Welcome to the ETH AI Center! We are ETH Zurich's (ethz.ch/en) central hub leading the way towards trustworthy, accessible and inclusive #artificialintelligence
ai.ethz.ch
Assistant professor at Georgia State University, formerly at BYU. 6 kids. Study NGOs, human rights, #PublicPolicy, #Nonprofits, #Dataviz, #CausalInference.
#rstats forever.
andrewheiss.com
Signal: andrewheiss.01
Assistant Professor of Computational Linguistics @ Georgetown; formerly postdoc @ ETH Zurich; PhD @ Harvard Linguistics, affiliated with MIT Brain & Cog Sci. Language, Computers, Cognition.
Researcher @Microsoft; PhD @Harvard; Incoming Assistant Professor @MIT (Fall 2026); Human-AI Interaction, Worker-Centric AI
zbucinca.github.io