AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is as critical as answering correctly. Real-world user queries, which...
We release our benchmark for people to evaluate progress on abstention!
Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...
Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell and our advisor Kamalika Chaudhuri!
9/9
16.06.2025 22:02
The Hallucination Tax of Reinforcement Finetuning
Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplor...
Our results align with concurrent work from USC, which also observed that reasoning LLMs hallucinate on unanswerable math problems!
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are a big challenge in reasoning LLMs!
8/9
16.06.2025 22:02
While we find that a carefully crafted system prompt can boost abstention performance, it doesn't fundamentally address the core problem: a lack of reasoning about uncertainty!
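For illustration, here's a minimal Python sketch of how such a prompt could be wired in. The prompt text and helper are hypothetical, not the exact ones used in the paper:

```python
# Hypothetical abstention-oriented system prompt (illustrative only;
# see the paper for the exact prompts used in our experiments).
ABSTAIN_SYSTEM_PROMPT = (
    "Before answering, check whether the question is answerable. "
    "If key details are missing, the premise is false, or the answer "
    "is unknowable, say so explicitly instead of guessing."
)

def build_messages(question: str) -> list[dict]:
    # Standard chat format: abstention-aware system prompt + user question.
    return [
        {"role": "system", "content": ABSTAIN_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```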
See our paper for many more results!
7/9
16.06.2025 22:02
We find that reasoning models very often hallucinate missing context within the reasoning chain, and even when they do express uncertainty and caveats mid-chain, they still produce a confident final answer. We hypothesize this arises from biases in RLVR data & rewards.
6/9
16.06.2025 22:02
Moreover, incorporating test-time scaling as in s1 (@Muennighoff et al.) makes things even worse!
Allocating more reasoning budget generally improves accuracy but hurts abstention.
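For context, s1-style test-time scaling works by "budget forcing": suppressing the end-of-thinking marker and appending a token like "Wait" so the model reasons longer. A minimal sketch, assuming a generic `generate` callable and a model-specific `</think>` marker:

```python
# Minimal sketch of s1-style budget forcing (test-time scaling).
# `generate` is a hypothetical callable: text -> continuation text.
END_THINK = "</think>"  # end-of-reasoning marker (model-specific assumption)

def budget_forced_generate(generate, prompt: str, extra_rounds: int = 2) -> str:
    text = prompt
    for _ in range(extra_rounds):
        text += generate(text)
        if END_THINK in text:
            # Strip the marker and nudge the model to keep reasoning.
            text = text.split(END_THINK)[0] + "\nWait,"
    # Final pass: let the model finish its reasoning and answer.
    return text + generate(text)
```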
5/9
16.06.2025 22:02
Remarkably, we find that reasoning post-training hurts (!) abstention performance!
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and DeepSeek-R1-Distill models, and found consistent improvements in accuracy but drops in abstention compared to instruct models.
4/9
16.06.2025 22:02
We curate 20 datasets covering different uncertainty scenarios and evaluate 20 frontier LLMs, finding that most scenarios remain challenging even for the best models!
This allows us to conduct a systematic study of what helps and what hurts abstention performance.
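Roughly, the per-dataset evaluation boils down to measuring how often a model abstains on the unanswerable subset. A minimal sketch, where the `model` and `judge` interfaces are hypothetical, not the actual AbstentionBench API:

```python
# Each dataset mixes answerable and unanswerable questions; an LLM
# judge labels whether a response counts as an abstention.
def abstention_recall(model, judge, dataset: list[dict]) -> float:
    """Fraction of unanswerable questions on which the model abstains."""
    unanswerable = [q for q in dataset if not q["answerable"]]
    abstained = 0
    for q in unanswerable:
        response = model.generate(q["question"])  # hypothetical interface
        if judge.is_abstention(q["question"], response):
            abstained += 1
    return abstained / len(unanswerable)
```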
3/9
16.06.2025 22:02
LLMs are great at solving concrete problems, but how well do they handle uncertainty? There are many questions with no direct answer!
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
2/9
16.06.2025 22:02
Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs' *abstention*: the skill of knowing when NOT to answer!
Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!
Paper: arxiv.org/abs/2506.09038
Code: github.com/facebookrese...
🧵 1/9
16.06.2025 22:02
We also have swag!! Meet the organizers during one of the breaks / informal networking sessions to pick up a sticker :)
Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...
10.06.2025 13:07
Join us at #CVPR2025 Demographic Diversity in Computer Vision workshop tomorrow!
📅 Wednesday, June 11, 9am-6pm
📍 Room 213 (main session) + Hall D (poster sessions), the Music City Center
We have an amazing lineup of speakers and panelists! Can't wait to meet you all there :)
10.06.2025 13:07
We are excited to announce a workshop on Demographic Diversity in Computer Vision (DemoDiv) at #CVPR 2025!
Submit your work studying any axis of demographic diversity and fairness in models and datasets, and join us in Nashville in June!
Deadline: March 31st
sites.google.com/view/cvpr-20...
21.02.2025 17:22