Joachim Baumann

@joachimbaumann.bsky.social

Postdoc @stanfordnlp.bsky.social / previously @milanlp.bsky.social / Computational social science, LLMs, algorithmic fairness

214 Followers  |  265 Following  |  22 Posts  |  Joined: 16.07.2025

Posts by Joachim Baumann (@joachimbaumann.bsky.social)

Hello Entire World · Entire Blog Announcing Entire with $60 million seed round and shipping our first product, called Checkpoints.

Beep, boop. Come in, rebels. We’ve raised a 60m seed round to build the next developer platform. Open. Scalable. Independent. And we ship our first OSS release today. entire.io/blog/hello-e...

10.02.2026 16:11 · 👍 22    🔁 6    💬 3    📌 2

Dirk and Debora are amazing postdoc advisors, and the @milanlp.bsky.social team is fun fun fun ❤️ you should apply!

18.12.2025 16:51 · 👍 8    🔁 0    💬 0    📌 0
Text reads: About synthetic panels
Recruiting the right participants for a study can be difficult. You may not get the exact demographics you need, and the shorter the deadline, the less sure you can be that everyone will answer on time. One possible solution can be to use synthetic panels.

Synthetic panels are powered by a first party proprietary AI model developed here at Qualtrics. Our synthetic panel is trained on thousands of responses from a variety of demographic backgrounds in order to more accurately predict how certain populations would respond to a survey.

Our synthetic panel is based on the United States General Population, and is only available in English. This panel comes with ready-made quotas and target breakouts in order to represent your chosen population and make it easy to launch your survey right away.

Text reads:
Question-writing best practices
To get the most reliable and actionable results from synthetic audiences, consider these question-writing best practices:

Ask forward-looking and attitudinal questions.
Synthetic panels perform best with perceptions, preferences, and intent-based questions. For example, “How likely are you to try…?”
Synthetic panels are less applicable for studies on past behaviors, detailed recall, brand recall, or awareness questions. For example, “When did you last visit…?”

Text reads:
Discussion
The current study aimed to conduct a meta-analysis of the TPB when applied to health behaviours which addressed the limitations of previous reviews by including only prospective tests of behaviour, applying RE meta-analytic procedures, correcting correlations for sampling and measurement error, and hierarchically analysing the effect of behaviour type and sample and methodological moderators. Some 237 tests were identified which examined relations amongst model components. Overall the analysis indicated that the TPB could explain 19.3% of the variance in behaviour and 44.3% of the variance in intention across studies. This level of prediction of behaviour is slightly lower than that of previous meta-analytic reviews which have found between 27% (Armitage & Conner, 2001; Hagger et al., 2002) and 36% (Trafimow et al., 2002) of the variance in behaviour to be explained by intention and PBC.

Did you know that from tomorrow, Qualtrics is offering synthetic panels (AI-generated participants)?

Follow me down a rabbit hole I'm calling "doing science is tough and I'm so busy, can't we just make up participants?"

16.12.2025 17:38 · 👍 657    🔁 288    💬 38    📌 225

Good luck drawing reliable conclusions from the answers that Qualtrics' AI model provides to your survey questions... bsky.app/profile/joac...

16.12.2025 18:54 · 👍 27    🔁 7    💬 0    📌 1

At today’s lab reading group @carolin-holtermann.bsky.social presented ‘Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs’ by @angelinawang.bsky.social et al. (2025).
Lots to think about how we evaluate fairness in language models!

#NLProc #fairness #LLMs

11.12.2025 11:55 · 👍 8    🔁 3    💬 0    📌 0

Also see more nuanced takes worth reading from @seanjwestwood.bsky.social (x.com/seanjwestwoo...) and @joshmccrain.bsky.social (bsky.app/profile/josh...) and @phe-lim.bsky.social (bsky.app/profile/phe-...)

28.11.2025 10:59 · 👍 0    🔁 0    💬 0    📌 0

The path forward: Survey panels and crowdsourcing platforms must invest in better panel curation and periodic quality verification.

Good to see that @joinprolific.bsky.social is already on it: bsky.app/profile/phe-...

28.11.2025 10:56 · 👍 2    🔁 0    💬 1    📌 0

✅ LLM instruction tuning works: tell a model to answer as a human, it will
❌ Silicon sampling still doesn't work: AI responses are plausible but don't accurately represent a real human population
❌ Bot detection fails: it's hard to design tasks that are easy for humans but difficult for LLMs

28.11.2025 10:55 · 👍 1    🔁 0    💬 1    📌 0

Now that the hype has cooled off, here's my take on AI-generated survey answers:
This is a real problem, but the paper's core insights aren't exactly news!
A thread with the most important summary... 🧵

Image: shows the LLM system prompt used

28.11.2025 10:54 · 👍 4    🔁 1    💬 1    📌 0

Another exhausting day in the lab… conducting very rigorous panettone analysis. Pandoro was evaluated too, because we believe in fair experimental design.

27.11.2025 16:05 · 👍 23    🔁 6    💬 0    📌 1

For our weekly reading group, @joachimbaumann.bsky.social presented the upcoming PNAS article "The potential existential threat of large language models to online survey research" by @seanjwestwood.bsky.social.

20.11.2025 11:53 · 👍 8    🔁 3    💬 0    📌 0
Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presen...

Google AI overviews now reach over 2B users worldwide. But how reliable are they on high stakes topics - for instance, pregnancy and baby care?

We have a new paper - led by Desheng Hu, now accepted at @icwsm.bsky.social - exploring that and finding many issues

Preprint: arxiv.org/abs/2511.12920
🧵👇

19.11.2025 16:58 · 👍 16    🔁 9    💬 1    📌 1
Language Model Hacking - Granular Material

Trying an experiment in good old-fashioned blogging about papers: dallascard.github.io/granular-mat...

16.11.2025 19:52 · 👍 29    🔁 9    💬 3    📌 0

Next Wednesday, we are very excited to have @joachimbaumann.bsky.social, who will present co-authored work on "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation". Paper and information on how to join ⬇️

08.11.2025 13:31 · 👍 4    🔁 3    💬 1    📌 2

Can AI simulate human behavior? 🧠
The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality?
To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)

28.10.2025 16:53 · 👍 11    🔁 5    💬 1    📌 1

Cool paper by @eddieyang.bsky.social, confirming our LLM hacking findings (arxiv.org/abs/2509.08825):
✓ LLMs are brittle data annotators
✓ Downstream conclusions flip frequently: LLM hacking risk is real!
✓ Bias correction methods can help but have trade-offs
✓ Use human experts whenever possible

21.10.2025 08:02 · 👍 16    🔁 6    💬 0    📌 0

Looks interesting! We have been facing this exact issue - finding big inconsistencies across different LLMs rating the same text.

25.09.2025 14:58 · 👍 5    🔁 6    💬 0    📌 0

About last week’s internal hackathon 😏
Last week, we, the (Amazing) Social Computing Group, held an internal hackathon to work on what we informally call the “Cultural Imperialism” project.

17.09.2025 08:23 · 👍 3    🔁 1    💬 1    📌 1

If you feel uneasy using LLMs for data annotation, you are right (if not, you should). It offers new chances for research that is difficult with traditional #NLP/#textasdata methods, but the risk of false conclusions is high!

Experiment + *evidence-based* mitigation strategies in this preprint 👇

15.09.2025 13:05 · 👍 22    🔁 4    💬 1    📌 0

The 94% LLM hacking success rate is achieved by annotating data with several model-prompt configs, then choosing the one that yields the desired result (70% if considering SOTA models only).
The 31-50% risk reflects well-intentioned researchers who just run one reasonable config w/o cherry-picking.
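
To make the cherry-picking mechanism concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions, not the paper's setup: the data are synthetic, each "configuration" is modeled as independent random label flips (`error_rate` is a made-up parameter), and the test is a simple two-proportion z-test. It will not reproduce the 94% or 31-50% figures; it only illustrates why picking among several configurations inflates false positives when there is no true effect.

```python
import math
import random


def two_proportion_p_value(pos_a, n_a, pos_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (pos_a / n_a - pos_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_studies=300, n_configs=10, n_per_group=200, error_rate=0.2, seed=0):
    """Return (single_config_rate, cherry_picked_rate) of false positives under a true null."""
    rng = random.Random(seed)
    single_hits = 0
    picked_hits = 0
    for _ in range(n_studies):
        # True labels: both groups share the same base rate, so there is no real effect.
        truth_a = [rng.random() < 0.3 for _ in range(n_per_group)]
        truth_b = [rng.random() < 0.3 for _ in range(n_per_group)]
        significant = []
        for _ in range(n_configs):
            # Assumption: each hypothetical configuration flips labels independently.
            ann_a = sum(t != (rng.random() < error_rate) for t in truth_a)
            ann_b = sum(t != (rng.random() < error_rate) for t in truth_b)
            p = two_proportion_p_value(ann_a, n_per_group, ann_b, n_per_group)
            significant.append(p < 0.05)
        # A researcher who runs one configuration only sees the first result.
        single_hits += significant[0]
        # A cherry-picker reports a finding if any configuration is significant.
        picked_hits += any(significant)
    return single_hits / n_studies, picked_hits / n_studies


single_rate, picked_rate = simulate()
print(f"one config: {single_rate:.2f}, best of 10: {picked_rate:.2f}")
```

In this toy, the cherry-picked rate can never be lower than the single-configuration rate, because the first configuration is always among the candidates. Real LLM annotation errors are often systematic rather than random, which is one reason the single-configuration risk reported above is far higher than a nominal 5%.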

14.09.2025 06:56 · 👍 0    🔁 0    💬 0    📌 0

Thank you, Florian :) We use two methods, CDI and DSL. Both debias LLM annotations and reduce false positive conclusions to about 3-13%, on average, but at the cost of a much higher Type II risk (up to 92%). The human-only conclusions have a pretty low Type I risk as well, at a lower Type II risk.
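
For readers less familiar with the two error types, the trade-off can be tallied with a short sketch like this one. It does not implement CDI or DSL; the function name `error_rates` and all the numbers are hypothetical and chosen only to show the bookkeeping: Type I is the share of true nulls declared significant, Type II is the share of true effects that are missed.

```python
def error_rates(has_true_effect, found_significant):
    """Return (type_1_rate, type_2_rate) for a batch of hypothesis tests."""
    if len(has_true_effect) != len(found_significant):
        raise ValueError("both lists must have the same length")
    pairs = list(zip(has_true_effect, found_significant))
    nulls = [sig for truth, sig in pairs if not truth]
    effects = [sig for truth, sig in pairs if truth]
    if not nulls or not effects:
        raise ValueError("need at least one true null and one true effect")
    type_1 = sum(nulls) / len(nulls)  # false positives among true nulls
    type_2 = sum(not sig for sig in effects) / len(effects)  # misses among true effects
    return type_1, type_2


# Hypothetical ground truth: four true nulls followed by four true effects.
truth = [False, False, False, False, True, True, True, True]
# Hypothetical conclusions before and after a conservative correction.
uncorrected = [True, True, False, False, True, True, True, False]
corrected = [False, False, False, False, True, False, False, False]

print(error_rates(truth, uncorrected))
print(error_rates(truth, corrected))
```

In this made-up batch the correction removes every false positive but misses more of the real effects, which is the same direction of trade-off described above.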

14.09.2025 06:54 · 👍 0    🔁 0    💬 2    📌 0

Great question! Performance and LLM hacking risk are negatively correlated. So easy tasks do have lower risk. But even tasks with 96% F1 score showed up to 16% risk of wrong conclusions. Validation is important because high annotation performance doesn't guarantee correct conclusions.

12.09.2025 16:15 · 👍 3    🔁 1    💬 1    📌 0

We used 199 different prompts total: some from prior work, others based on human annotation guidelines, and some simple semantic paraphrases

Even when LLMs correctly identify significant effects, estimated effect sizes still deviate from true values by 40-77% (see Type M risk, Table 3 and Figure 3)
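
One plausible way to read "deviate by 40-77%" is as a relative deviation of the estimate from the ground-truth effect. The sketch below assumes that reading; the paper's exact Type M definition may differ, and the helper name and the numbers are hypothetical.

```python
def relative_deviation(estimated, true):
    """Absolute deviation of an estimate from the true effect, as a fraction of the true effect."""
    if true == 0:
        raise ValueError("relative deviation is undefined for a true effect of zero")
    return abs(estimated - true) / abs(true)


# Hypothetical numbers for illustration only.
true_effect = 0.50    # coefficient estimated from human annotations
llm_estimate = 0.80   # coefficient estimated from LLM annotations
deviation = relative_deviation(llm_estimate, true_effect)
print(f"{deviation:.0%}")
```

Under this reading, an estimate can have the right sign and clear the significance threshold while still being far from the true magnitude.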

12.09.2025 15:29 · 👍 1    🔁 0    💬 0    📌 0

Thank you to the amazing @paul-rottger.bsky.social @aurman21.bsky.social @albertwendsjo.bsky.social @florplaza.bsky.social @jbgruber.bsky.social @dirkhovy.bsky.social for this fun collaboration!!

12.09.2025 10:33 · 👍 6    🔁 0    💬 0    📌 0

Why this matters: LLM hacking affects any field using AI for data analysis–not just computational social science!

Please check out our preprint, we'd be happy to receive your feedback!

#LLMHacking #SocialScience #ResearchIntegrity #Reproducibility #DataAnnotation #NLP #OpenScience #Statistics

12.09.2025 10:33 · 👍 11    🔁 2    💬 1    📌 0

The good news: we found solutions that help mitigate this:
✅ Larger, more capable models are safer (but no guarantee).
✅ Few human annotations beat many AI annotations.
✅ Testing several models and configurations on held-out data helps.
✅ Pre-registering AI choices can prevent cherry-picking.
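
The held-out testing idea from the list above could look roughly like this. It is a minimal sketch, not the paper's procedure: the configuration names, the labels, and the use of plain accuracy as the selection metric are all assumptions made for illustration.

```python
def accuracy(predicted, gold):
    """Share of items where the annotation matches the human label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pick_config(held_out_annotations, human_labels):
    """Choose the configuration that agrees best with human labels on held-out data."""
    return max(
        held_out_annotations,
        key=lambda name: accuracy(held_out_annotations[name], human_labels),
    )


# Hypothetical held-out sample: human labels plus annotations from three made-up configurations.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
held_out_annotations = {
    "model_a/prompt_1": [1, 0, 1, 0, 0, 0, 1, 0],
    "model_a/prompt_2": [1, 1, 1, 1, 0, 1, 1, 1],
    "model_b/prompt_1": [1, 0, 1, 1, 0, 0, 1, 0],
}
chosen = pick_config(held_out_annotations, human_labels)
print(chosen)
```

The point of the design is that the choice depends only on agreement with human labels, never on the downstream p-value, so it can be written into a pre-registration before any hypothesis is tested.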

12.09.2025 10:33 · 👍 20    🔁 1    💬 1    📌 0

- Researchers using SOTA models like GPT-4o face a 31-50% chance of false conclusions for plausible hypotheses.
- Risk peaks near significance thresholds (p=0.05), where 70% of "discoveries" may be false.
- Regression correction methods often don't work as they trade off Type I vs. Type II errors.

12.09.2025 10:33 · 👍 12    🔁 0    💬 2    📌 0

We tested 18 LLMs on 37 social science annotation tasks (13M labels, 1.4M regressions). By trying different models and prompts, you can make 94% of null results appear statistically significant–or flip findings completely 68% of the time.

Importantly this also concerns well-intentioned researchers!

12.09.2025 10:33 · 👍 24    🔁 5    💬 3    📌 0
We present our new preprint titled "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation".
We quantify LLM hacking risk through systematic replication of 37 diverse computational social science annotation tasks.
For these tasks, we use a combined set of 2,361 realistic hypotheses that researchers might test using these annotations.
Then, we collect 13 million LLM annotations across plausible LLM configurations.
These annotations feed into 1.4 million regressions testing the hypotheses. 
For a hypothesis with no true effect (ground truth p > 0.05), different LLM configurations yield conflicting conclusions.
Checkmarks indicate correct statistical conclusions matching ground truth; crosses indicate LLM hacking -- incorrect conclusions due to annotation errors.
Across all experiments, LLM hacking occurs in 31-50% of cases even with highly capable models.
Since minor configuration changes can flip scientific conclusions, from correct to incorrect, LLM hacking can be exploited to present anything as statistically significant.

🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**.

Paper: arxiv.org/pdf/2509.08825

12.09.2025 10:33 · 👍 303    🔁 106    💬 6    📌 23

Not at this point, but the preprint should be ready soon

30.07.2025 05:50 · 👍 1    🔁 0    💬 0    📌 0