David Mortensen @davidrmortensen

Performance of a sweep of models on Oolong-synth and Oolong-real. Performance decreases with increasing context length, sometimes steeply.

Can LLMs accurately aggregate information over long, information-dense texts? Not yet…

We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!

07.11.2025 17:07 — 👍 50 🔁 20 💬 3 📌 3

🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵

14.10.2025 15:59 — 👍 12 🔁 7 💬 1 📌 0

🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict.
(📷 xkcd)

02.10.2025 16:04 — 👍 15 🔁 4 💬 1 📌 3

🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724

04.07.2025 17:04 — 👍 26 🔁 7 💬 2 📌 1

Thrilled to share that this is out in @pnas.org today! 🎉

We show that linguistic generalization in language models can be due to underlying analogical mechanisms.

Shoutout to my amazing co-authors @weissweiler.bsky.social, @davidrmortensen.bsky.social, Hinrich Schütze, and Janet Pierrehumbert!

09.05.2025 18:29 — 👍 36 🔁 6 💬 1 📌 2

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9

09.06.2025 13:47 — 👍 72 🔁 21 💬 2 📌 2

RL boosts LLM reasoning—but why stop at math & code? 🤔
Meet Nemotron-CrossThink—a method to scale RL-based self-learning across law, physics, social science & more.

🔥Resulting in a model that reasons broadly, adapts dynamically, & uses 28% fewer tokens for correct answers!
🧵↓

01.05.2025 17:41 — 👍 5 🔁 3 💬 1 📌 0

On my way to #NAACL2025 where I'll give a keynote at the noisy text workshop (WNUT), presenting some of the challenges & methods for dialect NLP + also discussing dialect speakers' perspectives!

🗨️ Beyond “noisy” text: How (and why) to process dialect data
🗓️ Saturday, May 3, 9:30–10:30

29.04.2025 09:17 — 👍 27 🔁 7 💬 1 📌 1

Excited to announce our #NAACL2025 Oral paper! 🎉✨

We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP vision-language encoders, 26 datasets, and 55 architectures!

29.04.2025 19:11 — 👍 21 🔁 6 💬 1 📌 0

Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)

29.04.2025 17:00 — 👍 15 🔁 10 💬 2 📌 0

🚀 Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools appearing at #NAACL2025

This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson

1/🧵

29.04.2025 13:41 — 👍 12 🔁 8 💬 1 📌 2

When interacting with ChatGPT, have you wondered if they would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25 🔁 9 💬 1 📌 3

1/🚨 𝗡𝗲𝘄 𝗽𝗮𝗽𝗲𝗿 𝗮𝗹𝗲𝗿𝘁 🚨
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?

We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵

17.04.2025 19:55 — 👍 9 🔁 5 💬 1 📌 2

THIS IS HUGE! Researchers at McMaster University have discovered a NEW peptide antibiotic that targets a broad range of disease-causing bacteria INCLUDING those RESISTANT to existing antibiotics. This discovery marks the first potential new class of antibiotics in NEARLY 30 YEARS. 🧪🧵⬇️

31.03.2025 16:00 — 👍 9299 🔁 2758 💬 226 📌 283

CDS building which looks like a jenga tower

Life update: I'm starting as faculty at Boston University @bucds.bsky.social in 2026! BU has SCHEMES for LM interpretability & analysis, I couldn't be more pumped to join a burgeoning supergroup w/ @najoung.bsky.social @amuuueller.bsky.social. Looking for my first students, so apply and reach out!

27.03.2025 02:24 — 👍 244 🔁 13 💬 35 📌 7

You should read Article 1 of the United States Constitution. It's a trip.

19.03.2025 04:49 — 👍 1 🔁 0 💬 0 📌 0

There can be only one DB joke. And that is DB.

19.03.2025 04:29 — 👍 1 🔁 0 💬 0 📌 0

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data
Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative ...

New preprint by @annikatjuka.bsky.social, Robert Forkel, Christoph Rzymski, and myself available, presenting a new version of the Database of Cross-Linguistic Colexifications (CLICS).

"Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data"

arxiv.org/abs/2503.11377

17.03.2025 10:25 — 👍 7 🔁 3 💬 1 📌 0

Finally found a way to shorten faculty meetings.

16.03.2025 16:30 — 👍 262 🔁 59 💬 18 📌 3

No student anywhere in America has said something as antisemitic as this

12.03.2025 18:12 — 👍 125 🔁 21 💬 1 📌 1

Midwest Speech and Language Days 2025

The meeting will feature keynote addresses by @mohitbansal.bsky.social, @davidrmortensen.bsky.social, Karen Livescu, and Heng Ji. Plus all of your great talks and posters! nlp.nd.edu/msld25

08.03.2025 18:35 — 👍 4 🔁 1 💬 0 📌 0

I’ve been thinking about this reading from Isaiah 58 since I heard it at the Ash Wednesday service today.

“Is not this the fast that I choose:
to loose the bonds of injustice,
to undo the thongs of the yoke,
to let the oppressed go free,
and to break every yoke?

06.03.2025 00:16 — 👍 191 🔁 33 💬 8 📌 2

Trump Decried Millions Spent 'Making Mice Transgender.' It Was Cancer and Asthma Research
President Trump falsely claimed that Biden spent $8 million on 'making mice transgender,' but the real research was for human health.

“Again, the mice used for clinical purposes did not undergo gender transition.”

www.rollingstone.com/politics/pol...

06.03.2025 00:36 — 👍 5990 🔁 1176 💬 527 📌 145

Congressman Al Green on X: "Today, the House GOP censured me for speaking out for the American people against @POTUS’s plan to cut Medicaid. I accept the consequences of my actions, but I refuse to stay silent in the face of injustice. #WeShallOvercome https://t.co/sVklRmPCJl"

Today, the House GOP censured me for speaking out for the American people against @POTUS’s plan to cut Medicaid. I accept the consequences of my actions, but I refuse to stay silent in the face of injustice. #WeShallOvercome x.com/repalgreen/s...

06.03.2025 21:23 — 👍 106649 🔁 19343 💬 10346 📌 1826

Screenshot of Arxiv paper title, "Rejected Dialects: Biases Against African American Language in Reward Models," and author list: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap.

Reward models for LMs are meant to align outputs with human preferences—but do they accidentally encode dialect biases? 🤔

Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉

Paper: arxiv.org/abs/2502.12858 (1/10)

06.03.2025 19:49 — 👍 38 🔁 11 💬 1 📌 2

I read a paper about search, but I can't quite remember what it's called.

05.03.2025 15:30 — 👍 9 🔁 1 💬 1 📌 0

Tip of the Tongue Query Elicitation for Simulated Evaluation
Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scena...

🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research!

We address data limitations and offer a fresh evaluation method for these complex queries.

Curious how TREC TOT track test queries are created? Check out this thread 🧵 and our paper 📄: arxiv.org/abs/2502.17776

05.03.2025 01:32 — 👍 17 🔁 7 💬 2 📌 1

everything is so shitty, read this story about a genuinely good man who saw he had an opportunity to save millions of lives and threw himself into doing so. the world is full of heroes like him.

04.03.2025 11:16 — 👍 8413 🔁 1820 💬 61 📌 19

Latest posts by davidrmortensen.bsky.social on Bluesky
