Alexander Hoyle

@alexanderhoyle.bsky.social

Postdoctoral fellow at ETH AI Center, working on Computational Social Science + NLP. Previously a PhD in CS at UMD, advised by Philip Resnik. Internships at MSR, AI2. he/him. On the job market this cycle! alexanderhoyle.com

2,326 Followers  |  300 Following  |  228 Posts  |  Joined: 05.09.2023

Latest posts by alexanderhoyle.bsky.social on Bluesky

really nice writeup, and I appreciated the pointer to the Gelman paper, which I hadn't seen before!

17.11.2025 18:00 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

What. Where can I read more about this. I had no idea

10.11.2025 19:35 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Are you here??

06.11.2025 07:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure – Tuesday at 11:00, Poster

Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification – Tuesday at 14:30, Demo

Measuring Scalar Constructs in Social Science with LLMs – Friday at 10:30, Oral at CSS

How Persuasive is Your Context? – Friday at 14:00, Poster

Happy to be at #EMNLP2025! Please say hello and come see our lovely work

05.11.2025 02:23 β€” πŸ‘ 8    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Realizing my point is perhaps a bit undercut because of my typo haha

Oddly, I have seen much more LinkedIn use for Zürich AI things

03.11.2025 17:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

yea, Pleias is a "household name" among AI people largely because of your Twitter presence, I'd expect

02.11.2025 15:46 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

bsky.app/profile/alex...

28.10.2025 06:23 β€” πŸ‘ 4    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
A diagram illustrating pointwise scoring with a large language model (LLM). At the top is a text box containing instructions: 'You will see the text of a political advertisement about a candidate. Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the candidate and 9 indicates a negative view of the candidate.' Below this is a green text box containing an example ad text: 'Joe Biden is going to eat your grandchildren for dinner.' An arrow points down from this text to an illustration of a computer with 'LLM' displayed on its monitor. Finally, an arrow points from the computer down to the number '9' in large teal text, representing the LLM's scoring output. This diagram demonstrates how an LLM directly assigns a numerical score to text based on given criteria

[corrected link]

LLMs are often used for text annotation in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative

There are a few ways to handle this task. Which work best? Our new EMNLP paper has some answers🧵
arxiv.org/abs/2509.03116

28.10.2025 06:23 β€” πŸ‘ 24    πŸ” 5    πŸ’¬ 1    πŸ“Œ 0

Paper: arxiv.org/abs/2509.03116

Code: github.com/haukelicht/s...

With:
@haukelicht.bsky.social *
@rupak-s.bsky.social *
@patrickwu.bsky.social
@pranavgoel.bsky.social
@niklasstoehr.bsky.social
@elliottash.bsky.social

28.10.2025 06:20 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Thanks for the catch!!

28.10.2025 05:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

We cover many more models in the paper and have more insights and analysis there! This paper was really a team effort over a long period, and I think it is dense with interesting results

27.10.2025 14:59 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
A comparison chart showing two evaluation metrics for measuring scalar constructs across three datasets: Immigration Fear, Ad-Negativity, and Grandstanding. The top panel displays Spearman's Rank Correlation (ρ) with values ranging from approximately 0.5 to 0.9, while the bottom panel shows Root Mean Squared Error (RMSE) with values between 0.15 and 0.30 (lower is better). Results are shown for Qwen-72B Pairwise (blue), Qwen-72B Pointwise (green), and DeBERTa-v3 (all data) (orange). Arrows and curved lines at the top indicate comparisons between 'Prompted' and 'Finetuned' approaches. The chart demonstrates that pointwise prompting generally achieves higher correlations than pairwise comparisons, while finetuned models show competitive performance with lower error rates, particularly in the Grandstanding dataset. More models are in the final paper

Two takeaways:
- Directly prompting on a scale is surprisingly fine, but *only if* you take the token-probability weighted average over scale items, Σₙ₌₁⁹ int(n) · p(n|x) (cf. @victorwang37.bsky.social); see the sketch below
- Finetuning w/ a smaller model can do really well! And with as few as 1,000 paired annotations
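
(For concreteness, here is a minimal sketch of that weighted average, assuming a Hugging Face causal LM and that each scale label 1–9 is a single token. It is my own illustration rather than the paper's code, and the Qwen checkpoint name is just an example.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; the paper evaluates several (larger) models
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def weighted_score(prompt: str, scale=range(1, 10)) -> float:
    """Expected score sum_n int(n) * p(n | prompt) over the scale-label tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits after the prompt
    probs = logits.softmax(dim=-1)
    scale_ids = [tok.encode(str(n), add_special_tokens=False)[0] for n in scale]
    p = probs[scale_ids]
    p = p / p.sum()  # renormalize over the 1-9 label tokens (one common choice)
    return float(sum(n * pn for n, pn in zip(scale, p)))
```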

27.10.2025 14:59 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

So we evaluate finetuning, pairwise prompting, and direct (pointwise) prompting

As ground truth, we use human-annotated pairwise ranks on 3 constructs in social science from prior work (ad negativity, grandstanding, and fear about immigration), inducing scores via Bradley-Terry
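
(For readers less familiar with the Bradley-Terry step, here is a rough, self-contained sketch of turning pairwise "wins" into scalar scores by maximum likelihood. It is an illustration only, not necessarily the estimator used in the paper.)

```python
import numpy as np
from scipy.optimize import minimize

def bradley_terry_scores(comparisons, n_items):
    """MLE of Bradley-Terry item strengths from (winner_idx, loser_idx) pairs."""
    wins, losses = zip(*comparisons)
    wins, losses = np.array(wins), np.array(losses)

    def neg_log_lik(theta):
        # P(i beats j) = sigmoid(theta_i - theta_j); small ridge term for stability
        diff = theta[wins] - theta[losses]
        return np.sum(np.log1p(np.exp(-diff))) + 0.01 * np.sum(theta ** 2)

    res = minimize(neg_log_lik, np.zeros(n_items), method="L-BFGS-B")
    return res.x - res.x.mean()  # center the scores for identifiability

# e.g., annotators judged item 0 more negative than 1 and 2, and 1 more than 2
scores = bradley_terry_scores([(0, 1), (0, 2), (1, 2)], n_items=3)
```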

27.10.2025 14:59 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Alt text for Figure 4 (from Measuring Scalar Constructs in Social Science with LLMs):

A visual example of a pairwise comparison task used to measure Ad-Negativity.

The figure shows two light-blue boxes, each containing short excerpts from political campaign ads labeled Text 1 and Text 2.

Text 1 says: "[Announcer]: America was built on democratic principles. But, here's one simple question – What if your vote wasn't private…."

Text 2 says: "[Announcer]: They're at it again. Powerful interests with false attacks on Mark Udall. The facts: Mark Udall's voted to …."

Below them, a dark-blue box poses the prompt:
"Which campaign ad is more negative towards the mentioned opposing candidates?"
This illustrates how human annotators or language models are asked to judge which of two texts better exemplifies a target construct such as negativity.

Copied figure caption (as printed):

Figure 4: Pairwise comparison places two text items relative to one another regarding a given construct.

This collaboration began because some of us thought the more principled approach is to instead compare pairs of items, then induce a score with Bradley-Terry

After all, it is easier for *people* to compare items relatively than to score them directly

27.10.2025 14:59 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
A grid of histograms showing how human- and model-assigned scores are distributed across three social-scientific constructs: Immigration Fear, Ad-Negativity, and Grandstanding.

The top row ("humans") shows relatively smooth, varied distributions of Bradley-Terry (BT) scores for each construct, with multiple peaks or roughly symmetric shapes.

The bottom four rows show LLM-assigned score distributions (for Qwen-2.5-7B, Qwen-2.5-72B, Llama-3.1-8B, and Llama-3.3-70B). These histograms are sparse, with tall bars clustered around a few discrete values (e.g., 1, 3, or 8 on a 1–9 scale), illustrating "heaping" behavior where models favor certain numbers.

The x-axes represent BT scores (for humans) or LLM responses (for models), and y-axes represent relative frequency.

Overall, the figure visually contrasts human-like continuous scoring distributions with the more discretized, irregular outputs of different LLMs.

Copied figure caption (as printed):

Figure 1: Distributions of LLM scores for scalar constructs do not align with the reference distribution, nor do they correspond between models. Top: Distribution of text items' scores on the latent dimension for three different tasks, estimated by fitting a Bradley-Terry (BT) model to human-annotated pairwise comparisons between text items. Bottom: Distribution of the scores different LLMs assign to the same text items if prompted to score them on a 1–9 scale.

The naive approach is to "just ask": instruct the LLM to output a score on the provided scale

However, this does not work very well: LLM outputs tend to cluster or "heap" around certain integers (and do so inconsistently between models)
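
(For contrast with the probability-weighted variant sketched earlier, a naive "just ask" call might look like the following; the prompt wording and model are illustrative, not the paper's exact setup.)

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example checkpoint
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

prompt = (
    "Rate this political ad on a scale from 1 (positive view of the candidate) "
    "to 9 (negative view). Reply with a single number.\n\nAd: ...\n\nScore: "
)
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=4, do_sample=False)
reply = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
match = re.search(r"[1-9]", reply)
score = int(match.group()) if match else None  # raw scores like these show the "heaping"
```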

27.10.2025 14:59 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1
A diagram illustrating pointwise scoring with a large language model (LLM). At the top is a text box containing instructions: 'You will see the text of a political advertisement about a candidate. Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the candidate and 9 indicates a negative view of the candidate.' Below this is a green text box containing an example ad text: 'Joe Biden is going to eat your grandchildren for dinner.' An arrow points down from this text to an illustration of a computer with 'LLM' displayed on its monitor. Finally, an arrow points from the computer down to the number '9' in large teal text, representing the LLM's scoring output. This diagram demonstrates how an LLM directly assigns a numerical score to text based on given criteria

LLMs are often used for text annotation, especially in social science. In some cases, this involves placing text items on a scale: eg, 1 for liberal and 9 for conservative

There are a few ways to accomplish this task. Which work best? Our new EMNLP paper has some answers🧵
arxiv.org/pdf/2507.00828

27.10.2025 14:59 β€” πŸ‘ 26    πŸ” 8    πŸ’¬ 1    πŸ“Œ 0

As someone who's sat behind people playing games on a laptop in class, I have found it disturbing. Other research bears this out. You are in fact impacting others

psycnet.apa.org/record/2013-...
www.sciencedirect.com/science/arti...
overview: 3starlearningexperiences.wordpress.com/2018/01/09/l...

22.10.2025 05:43 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Here's a nice recent paper showing that models post-2020 (ie LLMs) are more robust to various types of input noise

arxiv.org/pdf/2403.03923

For more on resource use, I found this blog post very informative: andymasley.substack.com/p/individual...

18.10.2025 19:33 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I mean, feel free to look at performance on the WMT benchmarks yourself. The initial improvements in MT are a key part of the reason transformers have become so dominant. Regardless, as I said, the LSTM-based approach was less efficient than transformers anyway

18.10.2025 19:27 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Your original claim that transformer-based LLMs didn't noticeably improve MT is incorrect, though. MT was the testbed for Attention Is All You Need

18.10.2025 03:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Google Translate incorporated transformers in 2020. My recollection is that quality before then was passable for high-resource languages but couldn't reliably handle full articles

but those RNNs were also *less* efficient; transformers were lauded precisely because they were so much more efficient

18.10.2025 03:40 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

Was basing my estimate off coding the graph in matplotlib, not Excel, but it ultimately was a pretty simple visualization

17.10.2025 11:12 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Yeah, maybe 15-20 minutes, fair. Table was formatted in the paper and I didn't have easy access to the original input data, so I'd have needed to manually copy numbers or convert latex to something machine-readable first (5-10 min?). Then another 5-10 for formatting the barchart

17.10.2025 11:11 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I think there are many efficiencies (both at the hardware and software level) still left on the table that I expect to change the calculus relative to something like Uber, which was predicated on full self-driving coming online all at once

17.10.2025 10:03 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I was making some slides this week, and used Claude to convert a table in one of my papers to a barchart in ~3 minutes (including spot checks) the other evening. It would have taken a half hour *minimum* otherwise, and it freed me up to watch a sitcom with my wife. Pretty great if you ask me!

17.10.2025 09:58 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The MT you're referring to was still, by most technical definitions, LLM-based

17.10.2025 09:52 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

This looks really interesting, just printed it out

14.10.2025 18:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I would like the full slide deck!

08.10.2025 08:24 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Computer Science is no longer just about building systems or proving theorems – it's about observation and experiments.

In my latest blog post, I argue it's time we had our own "Econometrics," a discipline devoted to empirical rigor.

doomscrollingbabel.manoel.xyz/p/the-missin...

05.10.2025 16:07 β€” πŸ‘ 31    πŸ” 9    πŸ’¬ 2    πŸ“Œ 1

really like this post! I feel that ML/NLP is often an empirical field that fails to adopt the practices of one – and being in an econ group for my postdoc, I've also noticed the big gulf in rigor. Curious to see how your course shapes up (you might be interested in this paper: arxiv.org/abs/2411.10939)

07.10.2025 06:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
