really nice writeup, and I appreciated the pointer to the Gelman paper, which I hadn't seen before!
17.11.2025 18:00
@alexanderhoyle.bsky.social
Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him. On the job market this cycle! alexanderhoyle.com
What. Where can I read more about this. I had no idea
10.11.2025 19:35
Are you here??
06.11.2025 07:41
- The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure – Tuesday at 11:00, Poster
- Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification – Tuesday at 14:30, Demo
- Measuring Scalar Constructs in Social Science with LLMs – Friday at 10:30, Oral at CSS
- How Persuasive is Your Context? – Friday at 14:00, Poster
Happy to be at #EMNLP2025! Please say hello and come see our lovely work
05.11.2025 02:23
Realizing my point is perhaps a bit undercut because of my typo haha
Oddly I have seen much more LinkedIn use for Zürich AI things
yea pleais is a "household name" among AI people largely because of your Twitter presence, I'd expect
02.11.2025 15:46
bsky.app/profile/alex...
28.10.2025 06:23
A diagram illustrating pointwise scoring with a large language model (LLM). At the top is a text box containing instructions: 'You will see the text of a political advertisement about a candidate. Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the candidate and 9 indicates a negative view of the candidate.' Below this is a green text box containing an example ad text: 'Joe Biden is going to eat your grandchildren for dinner.' An arrow points down from this text to an illustration of a computer with 'LLM' displayed on its monitor. Finally, an arrow points from the computer down to the number '9' in large teal text, representing the LLM's scoring output. This diagram demonstrates how an LLM directly assigns a numerical score to text based on given criteria
[corrected link]
LLMs are often used for text annotation in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative
There are a few ways to handle this task. Which work best? Our new EMNLP paper has some answers 🧵
arxiv.org/abs/2509.03116
Paper: arxiv.org/abs/2509.03116
Code: github.com/haukelicht/s...
With:
@haukelicht.bsky.social *
@rupak-s.bsky.social *
@patrickwu.bsky.social
@pranavgoel.bsky.social
@niklasstoehr.bsky.social
@elliottash.bsky.social
Thanks for the catch!!
28.10.2025 05:25
We cover many more models in the paper and have more insights and analysis there! This paper was really a team effort over a long period, and I think it is dense with interesting results
27.10.2025 14:59
A comparison chart showing two evaluation metrics for measuring scalar constructs across three datasets: Immigration Fear, Ad-Negativity, and Grandstanding. The top panel displays Spearman's Rank Correlation (ρ) with values ranging from approximately 0.5 to 0.9, while the bottom panel shows Root Mean Squared Error (RMSE) with values between 0.15 and 0.30 (lower is better). Results are shown for three model configurations: Qwen-72B Pairwise (blue), Qwen-72B Pointwise (green), and DeBERTa-v3 (all data) (orange). Arrows and curved lines at the top indicate comparisons between 'Prompted' and 'Finetuned' approaches. The chart demonstrates that pointwise prompting generally achieves higher correlations than pairwise comparisons, while finetuned models show competitive performance with lower error rates, particularly in the Grandstanding dataset. More models are in the final paper
Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σ⁹ₙ₌₁ int(n) · p(n|x) (cf @victorwang37.bsky.social; sketch below)
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
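A minimal sketch of that probability-weighted average, assuming Hugging Face transformers (not the paper's code; the model name is a placeholder, the prompt is adapted from the figure in this thread, and it assumes each scale label "1".."9" tokenizes to a single token):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any instruction-tuned causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = (
    "You will see the text of a political advertisement about a candidate. "
    "Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the "
    "candidate and 9 indicates a negative view of the candidate. Answer with a single number.\n\n"
    "Ad: Joe Biden is going to eat your grandchildren for dinner.\n\nRating: "
)

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after "Rating: "

# Token ids of the scale labels "1".."9" (assumes each is a single token)
scale_ids = [tok.encode(str(n), add_special_tokens=False)[0] for n in range(1, 10)]

# Renormalize over the scale tokens only, then take the expectation:
# score = sum over n in 1..9 of int(n) * p(n | x)
probs = torch.softmax(next_token_logits[scale_ids], dim=-1)
score = sum(n * p.item() for n, p in zip(range(1, 10), probs))
print(f"weighted score: {score:.2f}")  # a float in [1, 9] rather than a heaped integer

If you already have access to the token probabilities, this weighted score comes from the same forward pass as the naive top-token answer, so it adds essentially no inference cost.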
So we evaluate finetuning, pairwise prompting, and direct (pointwise) prompting
As ground truth, we use human-annotated pairwise ranks on 3 constructs in social science from prior work (ad negativity, grandstanding, and fear about immigration), inducing scores via Bradley-Terry
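For intuition on that score-induction step, here is a toy Bradley-Terry fit from pairwise judgments using the standard MM update (illustrative only, not the paper's implementation; the comparison data and iteration count are made up):

import numpy as np

def bradley_terry(comparisons, n_items, n_iters=200, eps=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM algorithm."""
    wins = np.zeros(n_items)              # total wins per item
    n = np.zeros((n_items, n_items))      # number of comparisons between each pair
    for winner, loser in comparisons:
        wins[winner] += 1
        n[winner, loser] += 1
        n[loser, winner] += 1

    w = np.ones(n_items)                  # strengths, initialized uniformly
    for _ in range(n_iters):
        denom = (n / (w[:, None] + w[None, :])).sum(axis=1)
        w = wins / (denom + eps)
        w = w * n_items / w.sum()          # rescale each round (BT is scale-invariant)
    return np.log(w + eps)                 # log-strengths = positions on the latent scale

# Toy data: each tuple means "the first text was judged more negative than the second"
comparisons = [(2, 1), (2, 1), (1, 0), (1, 0), (2, 0), (0, 2)]
scores = bradley_terry(comparisons, n_items=3)
print(scores)  # item 2 should land highest on the latent scale, item 0 lowest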
Alt text for Figure 4 (from Measuring Scalar Constructs in Social Science with LLMs): A visual example of a pairwise comparison task used to measure Ad-Negativity. The figure shows two light-blue boxes, each containing short excerpts from political campaign ads labeled Text 1 and Text 2. Text 1 says: "[Announcer]: America was built on democratic principles. But, here's one simple question – What if your vote wasn't private…." Text 2 says: "[Announcer]: They're at it again. Powerful interests with false attacks on Mark Udall. The facts: Mark Udall's voted to …." Below them, a dark-blue box poses the prompt: "Which campaign ad is more negative towards the mentioned opposing candidates?" This illustrates how human annotators or language models are asked to judge which of two texts better exemplifies a target construct such as negativity. Copied figure caption (as printed): Figure 4: Pairwise comparison places two text items relative to one another regarding a given construct.
This collaboration began because some of us thought the more principled approach was instead to compare pairs of items, then induce a score with Bradley-Terry
After all, it is easier for *people* to compare items relatively than to score them directly
A grid of histograms showing how human- and model-assigned scores are distributed across three social-scientific constructs: Immigration Fear, Ad-Negativity, and Grandstanding. The top row ("humans") shows relatively smooth, varied distributions of Bradley-Terry (BT) scores for each construct, with multiple peaks or roughly symmetric shapes. The bottom four rows show LLM-assigned score distributions (for Qwen-2.5-7B, Qwen-2.5-72B, Llama-3.1-8B, and Llama-3.3-70B). These histograms are sparse, with tall bars clustered around a few discrete values (e.g., 1, 3, or 8 on a 1–9 scale), illustrating "heaping" behavior where models favor certain numbers. The x-axes represent BT scores (for humans) or LLM responses (for models), and y-axes represent relative frequency. Overall, the figure visually contrasts human-like continuous scoring distributions with the more discretized, irregular outputs of different LLMs. Copied figure caption (as printed): Figure 1: Distributions of LLM scores for scalar constructs do not align with the reference distribution, nor do they correspond between models. Top: Distribution of text items' scores on latent dimension for three different tasks estimated by fitting a Bradley-Terry (BT) model to human-annotated pairwise comparisons between text items. Bottom: Distribution of the scores different LLMs assign to the same text items if prompted to score them on a 1–9 scale.
The naive approach is to "just ask": instruct the LLM to output a score on the provided scale
However, this does not work very well: LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
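For contrast with the probability-weighted variant sketched earlier in the thread, here is what the naive "just ask" pointwise setup looks like (a sketch under assumptions: the model name and decoding settings are placeholders, and the prompt wording mirrors the pointwise instruction in the thread's figure):

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def pointwise_score(ad_text):
    """Naive pointwise scoring: ask for a 1-9 rating and parse the first digit."""
    prompt = (
        "You will see the text of a political advertisement about a candidate. "
        "Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the "
        "candidate and 9 indicates a negative view of the candidate. "
        "Answer with a single number.\n\n"
        f"Ad: {ad_text}\n\nRating: "
    )
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"[1-9]", completion)
    return int(match.group()) if match else None  # a single heaped integer per item

print(pointwise_score("Joe Biden is going to eat your grandchildren for dinner."))

Histogramming these integer outputs over a corpus is what produces the heaped, model-inconsistent distributions described in Figure 1.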
A diagram illustrating pointwise scoring with a large language model (LLM). At the top is a text box containing instructions: 'You will see the text of a political advertisement about a candidate. Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the candidate and 9 indicates a negative view of the candidate.' Below this is a green text box containing an example ad text: 'Joe Biden is going to eat your grandchildren for dinner.' An arrow points down from this text to an illustration of a computer with 'LLM' displayed on its monitor. Finally, an arrow points from the computer down to the number '9' in large teal text, representing the LLM's scoring output. This diagram demonstrates how an LLM directly assigns a numerical score to text based on given criteria
LLMs are often used for text annotation, especially in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative
There are a few ways to accomplish this task. Which work best? Our new EMNLP paper has some answers 🧵
arxiv.org/pdf/2507.00828
As someone who's sat behind people playing games on a laptop in class, I have found it disturbing. Other research bears this out. You are in fact impacting others
psycnet.apa.org/record/2013-...
www.sciencedirect.com/science/arti...
overview: 3starlearningexperiences.wordpress.com/2018/01/09/l...
Here's a nice recent paper showing that models post-2020 (ie LLMs) are more robust to various types of input noise
arxiv.org/pdf/2403.03923
For more on resource use, I found this blog post very informative: andymasley.substack.com/p/individual...
I mean, feel free to look at performance on the WMT benchmarks yourself. The initial improvements in MT are a key part of the reason transformers have become so dominant. Regardless, as I said, the LSTM-based approach was less efficient than transformers anyway
18.10.2025 19:27
Your original claim that transformer-based LLMs didn't noticeably improve MT is incorrect though. MT was the testbed for Attention Is All You Need
18.10.2025 03:41
Google Translate incorporated transformers in 2020. My recollection is that quality before then was passable for high-resource languages but couldn't reliably do full articles
but those RNNs were also *less* efficient; transformers were lauded precisely because they were so much more efficient
Was basing my estimate off coding the graph in matplotlib, not Excel, but it ultimately was a pretty simple visualization
17.10.2025 11:12
Yeah, maybe 15-20 minutes, fair. Table was formatted in the paper and I didn't have easy access to the original input data, so I'd have needed to manually copy numbers or convert LaTeX to something machine-readable first (5-10 min?). Then another 5-10 for formatting the barchart
17.10.2025 11:11
I think there are many efficiencies (both at the hardware and software level) still left on the table that I expect to change the calculus relative to something like Uber, which was predicated on full self-driving coming online all at once
17.10.2025 10:03
I was making some slides this week, and used Claude to convert a table in one of my papers to a barchart in ~3 minutes (including spot checks) the other evening. It would have taken a half hour *minimum* otherwise, and it freed me up to watch a sitcom with my wife. Pretty great if you ask me!
17.10.2025 09:58
The MT you're referring to was still, by most technical definitions, LLM-based
17.10.2025 09:52
This looks really interesting, just printed it out
14.10.2025 18:25
I would like the full slide deck!
08.10.2025 08:24
Computer Science is no longer just about building systems or proving theorems: it's about observation and experiments.
In my latest blog post, I argue it's time we had our own "Econometrics," a discipline devoted to empirical rigor.
doomscrollingbabel.manoel.xyz/p/the-missin...
really like this post! I feel that ML/NLP is often an empirical field that fails to adopt the practices of one – and being in an econ group for my postdoc, I've also noticed the big gulf in rigor. curious to see how your course shapes up (you might be interested in this paper arxiv.org/abs/2411.10939)
07.10.2025 06:12