
Alexander Hoyle

@alexanderhoyle.bsky.social

Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him alexanderhoyle.com

2,261 Followers  |  281 Following  |  178 Posts  |  Joined: 05.09.2023

Latest posts by alexanderhoyle.bsky.social on Bluesky


I am delighted to share our new #PNAS paper, with @grvkamath.bsky.social @msonderegger.bsky.social and @sivareddyg.bsky.social, on whether age matters for the adoption of new meanings. That is, as words change meaning, does the rate of adoption vary across generations? www.pnas.org/doi/epdf/10....

29.07.2025 12:31 · 👍 47 · 🔁 15 · 💬 3 · 📌 1

At #ACL2025 this week! Please reach out if you want to chat :)

We have two lovely posters:
Tues, Session 2 (10:30-11:50): Large Language Models Struggle to Describe the Haystack without Human Help

Wed, Session 4 (11:00-12:30): ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering

27.07.2025 09:30 · 👍 11 · 🔁 0 · 💬 0 · 📌 0

Yay! When do you get in?

27.07.2025 07:54 · 👍 0 · 🔁 0 · 💬 0 · 📌 0

Oh this is great! Gets at the discussion in Maria’s thread from the other week

bsky.app/profile/mari...

24.07.2025 06:31 · 👍 1 · 🔁 0 · 💬 0 · 📌 0

sad to be missing IC2S2, anything interesting you’ve been seeing in this area there? (If you don’t mind my asking!)

24.07.2025 06:24 · 👍 0 · 🔁 0 · 💬 0 · 📌 0
Towards Coding Social Science Datasets with Language Models Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This kind of human coding forms an important part of social science research, yet the coding process is both resou...

Earliest paper I'm aware of that unfairly got lost in the scramble:

arxiv.org/abs/2306.02177

Practical guide for applied researchers (have only skimmed):

arxiv.org/abs/2402.05129

23.07.2025 14:22 · 👍 3 · 🔁 0 · 💬 0 · 📌 0
Codebook LLMs: Evaluating LLMs as Measurement Tools for Political Science Concepts Codebooks -- documents that operationalize concepts and outline annotation procedures -- are used almost universally by social scientists when coding political texts. To code these texts automatically...

Since @katakeith.bsky.social plugged one of my papers, I will plug one of hers that I really enjoyed :)

arxiv.org/abs/2407.10747

23.07.2025 14:14 · 👍 7 · 🔁 0 · 💬 2 · 📌 0

Oh whoops, saw you posted it!

23.07.2025 10:19 · 👍 1 · 🔁 0 · 💬 0 · 📌 0

The papers from Egami et al. on how to correct statistically biased LLM annotations for use in downstream models (like regressions) are very good

proceedings.neurips.cc/paper_files/...

naokiegami.com/paper/dsl_ss...
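For intuition, here's a toy sketch of the underlying idea with made-up data (not the papers' actual DSL estimator, which handles general downstream models like regressions and gives valid standard errors): debias the LLM-only estimate using a small, randomly sampled hand-coded subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a binary document label whose population mean we want to estimate
n = 10_000
true = rng.binomial(1, 0.40, size=n).astype(float)

# Hypothetical LLM annotations: systematically biased toward the positive class
llm = np.where(rng.random(n) < 0.15, 1.0, true)

# A small random subset also gets expert (gold) coding
gold_idx = rng.choice(n, size=500, replace=False)

# Naive estimate: trust the LLM everywhere
naive = llm.mean()

# Design-based correction: shift by the LLM-vs-human gap on the gold subset
bias_hat = llm[gold_idx].mean() - true[gold_idx].mean()
corrected = naive - bias_hat

print(f"truth={true.mean():.3f}  naive={naive:.3f}  corrected={corrected:.3f}")
```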

23.07.2025 10:19 · 👍 4 · 🔁 0 · 💬 1 · 📌 0
Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks LLM use in annotation is becoming widespread, and given LLMs' overall promising performance and speed, simply "reviewing" LLM annotations in interpretive tasks can be tempting. In subjective annotatio...

🗣️ Excited to share our new #ACL2025 Findings paper: “Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks” with Jad Kabbara and Deb Roy. arXiv: arxiv.org/abs/2507.15821
Read about our findings ⬇️

22.07.2025 08:32 · 👍 40 · 🔁 8 · 💬 4 · 📌 1

The precursor to this paper, "The Incoherence of Coherence", had our most-watched paper video ever, so I thought we had to surpass it somehow ... we decided to do a song parody (of Roxanne, obviously):

youtu.be/87OBxEM8a9E

18.07.2025 18:37 · 👍 7 · 🔁 2 · 💬 0 · 📌 0

Yu did an excellent job as first author on this paper, and it was his first time publishing in NLP/ML! Yang led the MTEB experimentation (which we decided to add in at the end) and also nailed it

17.07.2025 12:59 · 👍 1 · 🔁 0 · 💬 0 · 📌 0
The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text's source ...

Work with @yu-fan-768.bsky.social, Yang Tian, @shauli.bsky.social, @mrinmaya.bsky.social, @elliottash.bsky.social

Paper: arxiv.org/abs/2507.01234 (credit to Yu for the McLuhan reference)
Github: github.com/y-fn/deconfo...

17.07.2025 10:52 · 👍 6 · 🔁 1 · 💬 1 · 📌 0

Together these results indicate that there is (likely) little to be lost in applying erasure if you have observed confounders.

Caveats: erasing many small categories may not work. And if confounders are unobserved, then you'd have to first infer them

17.07.2025 10:52 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Barchart of performance on STS tasks from LEACE, with and without erasure. Each group of bars compares the base and LEACE-erased models for MiniLM and E5-base-v2 embeddings

Despite the better metrics, we thought that erasure might degrade embeddings in ways we weren't measuring.

We applied LEACE models trained on our target datasets to out-of-domain embeddings from MTEB data. Surprisingly, MTEB metrics did not change!

17.07.2025 10:52 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
Screenshot of table showing differences in cross-lingual document similarity search, showing that linear erasure improves recall@1 and @10 across several models. Here, the concept is the document's language. Erasure improves recall of the paired item in all cases, in some instances improving smaller models over their larger counterparts.

Applying linear erasure to remove source/language information from text embeddings (say, from sentence transformers) produces dramatic improvements on document similarity & clustering tasks

We use LEACE (@norabelrose.bsky.social et al. 2023), which is also cheap to run (seconds on a laptop)

17.07.2025 10:52 · 👍 3 · 🔁 0 · 💬 1 · 📌 0
Example clustering code. Can be viewed here: https://github.com/y-fn/deconfounding-text-embeddings/blob/main/cluster_example.py

Attributes like language/source are confounders that distort distance-based applications

Debiasing methods remove unwanted information from embeddings; linear concept erasure in particular makes it so that a linear predictor cannot recover a concept (e.g., language) from the representation
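As a rough illustration of that erasure step, here is a minimal sketch with synthetic embeddings and source labels, assuming the `LeaceEraser.fit` / `eraser(X)` interface of the EleutherAI concept-erasure package (see the repo's cluster_example.py for the actual example):

```python
import torch
from concept_erasure import LeaceEraser  # pip install concept-erasure

# Synthetic stand-ins: 1,000 "embeddings" whose first dimension leaks the source
n, d = 1_000, 384
source = torch.randint(0, 2, (n,))      # concept to erase (e.g., dataset or language)
X = torch.randn(n, d)
X[:, 0] += 3.0 * source                 # inject a linearly recoverable source signal

# Fit LEACE so that no linear predictor can recover `source` from the embeddings
eraser = LeaceEraser.fit(X, source)
X_erased = eraser(X)

# X_erased can now be used for clustering / retrieval in place of X
print(X.shape, X_erased.shape)
```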

17.07.2025 10:52 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
Barchart of number of items in four clusters of text embeddings, with colors showing the distribution of sources in each cluster.

Caption: Clustering text embeddings from disparate sources (here, U.S. congressional bill summaries and senators' tweets) can produce clusters where one source dominates (Panel A). Using linear erasure to remove the source information produces more evenly balanced clusters that maintain semantic coherence (Panel B; sampled items relate to immigration). Four random clusters of k-means shown (k=25), trained on a combined 5,000 samples from each dataset

New preprint! Have you ever tried to cluster text embeddings from different sources, but the clusters just reproduce the sources? Or attempted to retrieve similar documents across multiple languages, and even multilingual embeddings return items in the same language?

Turns out there's an easy fix 🧵

17.07.2025 10:52 · 👍 26 · 🔁 7 · 💬 2 · 📌 0

Idk why I didn't think of this earlier, but the emerging RL/reasoning approach is basically taking this logic to its natural conclusion (and anyone who's seen the unhinged reasoning traces of R1 will understand that it doesn't make sense to "read into" explanations/prompts)

16.07.2025 21:22 · 👍 4 · 🔁 0 · 💬 0 · 📌 0
Do different prompting methods yield a common task representation in language models? Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar rep...

Took me a second, but I knew I'd seen something related to this recently:

arxiv.org/abs/2505.120...

16.07.2025 21:17 · 👍 10 · 🔁 2 · 💬 1 · 📌 0

Like, if you want to claim that definition A of some concept matches expert annotations better than definition B, or that "tree of thought" is better than "vine of thought", then you should sweep over reasonable variations

16.07.2025 19:01 · 👍 1 · 🔁 0 · 💬 0 · 📌 0

Agreed. I feel like the best practice, which I basically never see, is to vary prompt formatting and prompt phrasing (keeping the goal/idea the same) and report the variation (just as, in classical ML, you might demonstrate that one algo was better by varying hyperparameters/random seeds)
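As a hypothetical sketch of what that could look like in practice (every name here, including `query_model` and the prompt variants, is a placeholder, not an API from any of the papers above):

```python
import statistics

# Hypothetical prompt variants: same task and idea, different phrasing/formatting
PROMPT_VARIANTS = [
    "Label the following speech as populist or not populist:\n{text}\nAnswer:",
    "Is this speech populist? Reply 'populist' or 'not populist'.\n\nSpeech: {text}",
    "### Task\nDecide whether the speech below is populist.\n### Speech\n{text}\n### Answer",
]

def query_model(prompt: str) -> str:
    """Placeholder for your actual LLM call (API client, local model, etc.)."""
    raise NotImplementedError

def accuracy_for_prompt(template: str, examples: list[tuple[str, str]]) -> float:
    """Score one prompt variant against gold labels."""
    correct = 0
    for text, gold in examples:
        prediction = query_model(template.format(text=text))
        correct += int(prediction.strip().lower() == gold)
    return correct / len(examples)

def report_variation(examples: list[tuple[str, str]]) -> None:
    # Report the spread across variants, not just the best run, analogous to
    # sweeping seeds/hyperparameters in classical ML
    scores = [accuracy_for_prompt(t, examples) for t in PROMPT_VARIANTS]
    print(f"mean={statistics.mean(scores):.3f}  "
          f"stdev={statistics.stdev(scores):.3f}  per-variant={scores}")
```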

16.07.2025 19:00 · 👍 4 · 🔁 0 · 💬 1 · 📌 0

Yes, and I'm irked by claims that over-interpret the semantics of prompts. Say you're operationalizing two distinct theories of populism as two prompts for annotating political speech. IMO it's a mistake to attribute diffs in outputs to diffs in the theory; maybe one prompt had a trailing space

16.07.2025 18:28 · 👍 2 · 🔁 0 · 💬 0 · 📌 0
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. This paper contains model-generated content that might be offensive.

Reminds me of that paper that showed finetuning a model to generate insecure code also caused it to be racist

arxiv.org/html/2502.17...

09.07.2025 21:00 · 👍 1 · 🔁 0 · 💬 0 · 📌 0

Also joint work with @boydgraber.bsky.social and Philip Resnik

This work concludes a "trilogy" of topic model evaluation papers

paper 1: dl.acm.org/doi/10.5555/...
thread 1: x.com/miserlis_/st...

paper 2: aclanthology.org/2022.finding...
thread 2: x.com/miserlis_/st...

08.07.2025 12:40 · 👍 6 · 🔁 0 · 💬 1 · 📌 0

Lorena is co-first author (I think not on bluesky) and did phenomenal work (especially on the package). She visited us at UMD and ETH, and is a great researcher and collaborator with an enviable work ethic. She is looking for postdocs in Europe this year; anyone would be lucky to have her!

08.07.2025 12:40 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
GitHub - ahoho/proxann

We release a package and web frontend to evaluate your own topic model / document clustering outputs. We also include human data to encourage the development of new metrics/LLM judges

Don't hesitate to post issues!

Code: github.com/ahoho/proxann
Paper: arxiv.org/pdf/2507.00828

08.07.2025 12:40 · 👍 5 · 🔁 1 · 💬 1 · 📌 0
Illustration of an LLM-judge following the fit and ranking steps in the protocol

Table of advantage probabilities from the alternative annotator test. Most scores are above 0.5; models evaluated are GPT-4o, Llama 8B/70B, Qwen 3B/32B/72B

Caption: Advantage probabilities from the alternative annotator test; the probability that PROXANN is “as good as or better than a randomly chosen human annotator” (Calderon et al., 2025). Document-level scores consider annotations by document; Topic-level over all documents evaluated in the topic. * indicates that win rates over humans are above 0.5, as determined by a one-sided t-test (over 10 resamples of combined annotators). † is the equivalent for Wilcoxon signed-rank.

The protocol is also easily adapted to LLM judges: We call ours ProxAnn. While LLMs aren't perfect substitutes, they are about as good as an arbitrary human annotator

08.07.2025 12:40 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Diagram showing how human annotator scores are correlated with document-topic distributions from the topic model

Boxplots of human-human and human-topic-model correlations.

Caption: Annotators review the top documents and words from a single topic and infer a category (Label Step), then assign scores to additional documents based on their relationship to the category (Fit and Rank Steps). These scores are correlated with each other (inter-annotator Kendall's τ) and with the model's document-topic estimates (θk; TM-annotator τ). There are eight topics per model; boxplots report variation in τ over each topic-annotator tuple.

Models are then evaluated by measuring whether annotations agree with model outputs: that is, do annotator scores correlate with document-topic probabilities (or distance to centroid)?

A human study finds that, in line with other work, classic LDA (Mallet) continues to work well
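The core of that check is just a rank correlation between annotator scores and the model's document-topic weights; a minimal sketch with scipy and made-up numbers for one topic:

```python
from scipy.stats import kendalltau

# Hypothetical data for one topic: annotator fit scores (1-5 scale) and the
# topic model's document-topic probabilities (theta_k) for the same documents
annotator_scores = [5, 4, 2, 1, 3, 4, 1, 2]
theta_k = [0.61, 0.42, 0.08, 0.03, 0.22, 0.35, 0.05, 0.11]

tau, p_value = kendalltau(annotator_scores, theta_k)
print(f"TM-annotator Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```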

08.07.2025 12:40 · 👍 8 · 🔁 1 · 💬 1 · 📌 0
Illustration of step 1 of the protocol. A model outputs the top words and documents, and an annotator reviews them and assigns a label

Step 2 of the protocol: annotators review unseen documents and assign them a score (on a 1-5 scale) based on how well they fit the category

In the final step, they rank the documents by their relevance to the category

The setup approximates real-world qualitative content analysis. An annotator:

1. Reviews a small collection of documents (& top words) for a topic, and writes down a category label
2. Determines whether new documents fit that label
3. Ranks the documents by relevance to the label

08.07.2025 12:40 · 👍 3 · 🔁 0 · 💬 1 · 📌 0
