Jesper N. Wulff

@jnwulff.bsky.social

Professor @AarhusUni doing research on organizational research methods and teaching deep neural networks in our MSc BI program. https://sites.google.com/view/jesperwulff/bio

661 Followers  |  946 Following  |  37 Posts  |  Joined: 26.10.2023  |  2.0082

Latest posts by jnwulff.bsky.social on Bluesky

Type S and M errors as a “rhetorical tool” We recently posted a preprint criticizing the idea of Type S and M errors ( https://osf.io/2phzb_v1 ). From our abstract: “While these conce...

New blog post on Gelman's recent claim that Type S and M errors are intended as a 'rhetorical tool', and if I was wrong to believe they were recommended more routinely in our recent preprint criticizing the idea of Type S and M errors. daniellakens.blogspot.com/2025/09/type...

28.09.2025 05:22 | 👍 11    🔁 6    💬 0    📌 1
We present our new preprint titled "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation".
We quantify LLM hacking risk through systematic replication of 37 diverse computational social science annotation tasks.
For these tasks, we use a combined set of 2,361 realistic hypotheses that researchers might test using these annotations.
Then, we collect 13 million LLM annotations across plausible LLM configurations.
These annotations feed into 1.4 million regressions testing the hypotheses. 
For a hypothesis with no true effect (ground truth p > 0.05), different LLM configurations yield conflicting conclusions.
Checkmarks indicate correct statistical conclusions matching ground truth; crosses indicate LLM hacking -- incorrect conclusions due to annotation errors.
Across all experiments, LLM hacking occurs in 31-50% of cases even with highly capable models.
Since minor configuration changes can flip scientific conclusions, from correct to incorrect, LLM hacking can be exploited to present anything as statistically significant.

🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**.

Paper: arxiv.org/pdf/2509.08825

12.09.2025 10:33 | 👍 259    🔁 94    💬 5    📌 19

What corresponds to the Z-test in this analogy? If the P-curve is the W-test then what is the Z-test?

25.09.2025 13:30 | 👍 0    🔁 0    💬 1    📌 0

More examples of faked institutional email addresses from @deevybee.bsky.social here deevybee.blogspot.com/2022/10/what...

23.09.2025 14:24 | 👍 12    🔁 2    💬 2    📌 1
9 Equivalence Testing and Interval Hypotheses – Improving Your Statistical Inferences This open educational resource contains information to improve statistical inferences, design better experiments, and report scientific research more transparently.

9 Equivalence Testing and Interval Hypotheses – Improving Your Statistical Inferences share.google/tZRu9HekIBdY...

16.09.2025 21:02 | 👍 3    🔁 0    💬 0    📌 0

Absolutely! I'm planning on getting MET into the stats curriculum in our undergrad business adm program. My favorite resource is Lakens' online book.

16.09.2025 21:01 | 👍 1    🔁 0    💬 1    📌 0

If it makes sense to test a hypothesis, do minimum effect testing and/or set alpha as a function of sample size.

16.09.2025 16:51 | 👍 1    🔁 0    💬 1    📌 0
Can researchers stop AI making up citations? OpenAI’s GPT-5 hallucinates less than previous models do, but cutting hallucination completely might prove impossible.

"OpenAI is making “small steps that are good, but I don’t think we’re anywhere near where we need to be”, says Mark Steyvers, a cognitive science and AI researcher at UC Irvine. “It’s not frequent enough that GPT says ‘I don’t know’.”" www.nature.com/articles/d41...

09.09.2025 03:23 | 👍 5    🔁 3    💬 0    📌 0
➡️ Deadline approaching: only one month left to send in your papers and presentation proposals for #CDSM2025!

🚨 Call for Papers: Causal Data Science Meeting 2025 🚨
📅 Nov 12–13, 2025 (Virtual)
📥 Submission Deadline: Sept 30, 2025

30.08.2025 12:40 | 👍 20    🔁 23    💬 0    📌 1
Models as Prediction Machines: How to Convert Confusing Coefficients into Clear Quantities

Abstract
Psychological researchers usually make sense of regression models by interpreting coefficient estimates directly. This works well enough for simple linear models, but is more challenging for more complex models with, for example, categorical variables, interactions, non-linearities, and hierarchical structures. Here, we introduce an alternative approach to making sense of statistical models. The central idea is to abstract away from the mechanics of estimation, and to treat models as “counterfactual prediction machines,” which are subsequently queried to estimate quantities and conduct tests that matter substantively. This workflow is model-agnostic; it can be applied in a consistent fashion to draw causal or descriptive inference from a wide range of models. We illustrate how to implement this workflow with the marginaleffects package, which supports over 100 different classes of models in R and Python, and present two worked examples. These examples show how the workflow can be applied across designs (e.g., observational study, randomized experiment) to answer different research questions (e.g., associations, causal effects, effect heterogeneity) while facing various challenges (e.g., controlling for confounders in a flexible manner, modelling ordinal outcomes, and interpreting non-linear models).

Figure illustrating model predictions. On the X-axis the predictor, annual gross income in Euro. On the Y-axis the outcome, predicted life satisfaction. A solid line marks the curve of predictions on which individual data points are marked as model-implied outcomes at incomes of interest. Comparing two such predictions gives us a comparison. We can also fit a tangent to the line of predictions, which illustrates the slope at any given point of the curve.

A figure illustrating various ways to include age as a predictor in a model. On the x-axis age (predictor), on the y-axis the outcome (model-implied importance of friends, including confidence intervals).

Illustrated are:
1. age as a categorical predictor, resulting in the predictions bouncing around a lot with wide confidence intervals
2. age as a linear predictor, which forces a straight line through the data points that has a very tight confidence band and
3. age splines, which lie somewhere in between as they smoothly follow the data but have more uncertainty than the straight line.

Ever stared at a table of regression coefficients & wondered what you're doing with your life?

Very excited to share this gentle introduction to another way of making sense of statistical models (w @vincentab.bsky.social)
Preprint: doi.org/10.31234/osf...
Website: j-rohrer.github.io/marginal-psy...

25.08.2025 11:49 | 👍 942    🔁 283    💬 49    📌 19
"Being Bayesian in a Frequentist World"

New post on "Bayesian dynamic borrowing" in R 📚

Link 👇

25.08.2025 06:19 | 👍 10    🔁 2    💬 2    📌 0
If you are preparing your bachelor statistics course and would like to add optional material for students to better understand statistics on a conceptual level (see topics in the screenshot) my free textbook provides a state of the art overview. lakens.github.io/statistical_...

25.08.2025 04:54 | 👍 213    🔁 68    💬 4    📌 4

My video about how LLMs are not search engines has led to many, MANY comments telling me that I should be using Perplexity. Some insisting that Perplexity does not hallucinate.

Out of a list of 26 papers it just provided me (in "Research" mode) 4 were real. FOUR. 85% hallucination rate.

23.08.2025 17:54 | 👍 1511    🔁 474    💬 29    📌 24
CRISPR as a microbial immune system

In 2003, Mojica wrote the first paper suggesting that CRISPR was an innate microbial immune system. The paper was rejected by a series of high-profile journals, including Nature, Proceedings of the National Academy of Sciences, Molecular Microbiology and Nucleic Acids Research, before finally being accepted by Journal of Molecular Evolution in February, 2005.[3][4]

TIL the original paper describing CRISPR, by Francisco Mojica, was rejected by 4 journals and took 2 years to be published

17.08.2025 04:00 | 👍 301    🔁 78    💬 6    📌 10
1. David Ackerly (UC Berkeley)

While his most-cited work is on leaf size and SLA, he also wrote explicitly about plasticity in leaf traits, including shape, in the context of ecological strategies.

Example: Ackerly (1997), “Allocation, leaf display, and growth in fluctuating light environments: A comparative study of deciduous and evergreen species” (Oecologia). This emphasizes how plasticity in leaf traits mediates adaptation to light.

Just in case there was any doubt, ChatGPT 5.0 still makes up completely random citations that don't exist and should not be used for literature search.

16.08.2025 06:57 | 👍 977    🔁 270    💬 28    📌 23
‼️Cool new paper‼️

Finds that journal data policies in psychology boost sharing statements to ~100%, but only about half of datasets are complete, understandable, reusable.

Open: open.lnu.se/index.php/me...

12.08.2025 16:35 | 👍 17    🔁 4    💬 1    📌 1

5. Most frequentist methods are just *fine* and there's no need to always go full luxury bayesian in every application.

04.08.2025 17:01 | 👍 27    🔁 7    💬 1    📌 0
Trump, Claiming Weak Jobs Numbers Were ‘Rigged,’ Fires Labor Official

When power is derived from lies, data become the enemy.
www.nytimes.com/2025/08/01/b...

02.08.2025 05:29 | 👍 109    🔁 36    💬 5    📌 1

Will the videos be released to a broad audience after the conference?

01.08.2025 06:02 | 👍 0    🔁 0    💬 1    📌 0
Inspiring PDW on using sensitivity analysis in empirical management research. My contribution is to present the sensemakr package by Cinelli & Hazlett (2020) for observational designs. Thanks a lot to the organizers for putting this fantastic session together. #AOM2025

26.07.2025 07:52 | 👍 17    🔁 3    💬 1    📌 0
Why AI chatbots lie to us A few weeks ago, a colleague of mine needed to collect and format some data from a website, and he asked the latest version of Anthropic’s generative AI system, Claude, for help. Claude cheerfully agr...

In my latest (and last!) column for Science’s Expert Voices series, I write about the reasons behind AI chatbots’ “deceptive” behaviors (and why Claude threatened a fictional CEO with blackmail).

www.science.org/doi/10.1126/...

24.07.2025 18:24 | 👍 201    🔁 81    💬 9    📌 18
From the OpenAI community on Reddit Explore this post and more from the OpenAI community

This is fascinating: www.reddit.com/r/OpenAI/s/I...

Someone “worked on a book with ChatGPT” for weeks and then sought help on Reddit when they couldn’t download the file. Redditors helped them realize ChatGPT had just been roleplaying/lying and there was no file/book…

16.07.2025 20:07 | 👍 8170    🔁 1956    💬 256    📌 691
Are developers slowed down by AI? Evaluating an RCT (?) and what it tells us about developer productivity Seven different people texted or otherwise messaged me about this study which claims to measure “the impact of early-2025 AI on experienced open-source developer productivity.” You know, when I decide...

The people call and I answer.

Here are my thoughts on that developer RCT and the "AI slows down developers" claim.

www.fightforthehuman.com/are-develope...

13.07.2025 23:02 | 👍 98    🔁 41    💬 4    📌 13

Using time series graphs to make causal claims be like

14.07.2025 12:12 | 👍 854    🔁 160    💬 11    📌 7

Thick vs thin causality

07.07.2025 15:06 | 👍 0    🔁 0    💬 0    📌 0

After all these reports of authors adding language instructions for LLM reviews in their papers I wanted to check this myself and I downloaded the .tex source from one of these papers.

Here is an example.
(I will not share the identity of the paper)

05.07.2025 17:12 | 👍 388    🔁 125    💬 16    📌 33
Introducing Papercheck An Automated Tool to Check for Best Practices in Scientifi...

Very excited to publicly share news about a new tool, Papercheck, that @debruine.bsky.social and I started to develop more than a year ago! In an introductory blog post, we explain our philosophy to automatically check scientific papers for best practices. daniellakens.blogspot.com/2025/06/intr...

17.06.2025 11:15 | 👍 178    🔁 79    💬 5    📌 6
Venn (1866) described the problem with the "sharpshooter fallacy": 🎯 painting the bullseye on the spot where a bullet hits a door.
In psychology, we call this "HARKing" (Hypothesizing After Results Are Known).

13.06.2025 17:18 | 👍 59    🔁 12    💬 3    📌 0
Diabolus Ex Machina This Is Not An Essay

Possibly the best thing I've read about ChatGPT yet.

h/t @melaniemitchell.bsky.social

amandaguinzburg.substack.com/p/diabolus-e...

04.06.2025 04:41 | 👍 1033    🔁 382    💬 55    📌 120
Sample Size Justification An important step when designing an empirical study is to justify the sample size that will be collected. The key aim of a sample size justification for such studies is to explain how the collected da...

The Journal of Sports Sciences recently made a sample size justification section mandatory. More and more journals are implementing this, as it is a central part of a study! Learn how to create a state-of-the-art sample size justification from my paper: online.ucpress.edu/collabra/art...

02.06.2025 04:50 | 👍 50    🔁 12    💬 4    📌 0

@jnwulff is following 20 prominent accounts