Denied a loan, an interview, or an insurance claim by machine learning models? You may be entitled to a list of reasons.
In our latest work with @anniewernerfelt.bsky.social, @berkustun.bsky.social, and @friedler.net, we show how existing explanation frameworks fail and present an alternative for recourse.
24.04.2025 06:19
Our work suggests that solving RAG hallucination problems requires moving beyond just improving retrieval: we need models that can accurately determine when retrieved information suffices for answering and abstain when appropriate confidence thresholds aren't met.
24.04.2025 18:18
[Image: Line graph comparing selective generation methods, showing coverage vs. accuracy trade-offs. Purple lines (sufficient context + confidence) outperform gray lines (confidence only), especially for the HotpotQA dataset and the Gemini model.]
[Image: Diagram of the selective generation pipeline. The input query and input context feed into both the model's self-reported confidence and a sufficient context AutoRater label. These signals combine in a logistic regression model that produces a score, which is compared against a threshold determined by the desired coverage; depending on the comparison, the system either returns the model response or abstains.]
Building on these insights, we developed a selective generation framework using both sufficient context signals and model confidence to decide when to respond vs. abstain, improving the accuracy of responses by 2-10% for Gemini, GPT, and Gemma.
24.04.2025 18:18
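For concreteness, here is a minimal sketch of the decision rule in the pipeline described above, assuming per-example features for self-reported confidence and a binary sufficient-context label; the toy data, function names, and use of scikit-learn are illustrative stand-ins rather than the paper's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative training rows: [self-reported confidence, sufficient-context label],
# with targets indicating whether the model's answer was actually correct.
X_train = np.array([[0.9, 1], [0.4, 0], [0.7, 1], [0.3, 0], [0.8, 0], [0.5, 1]])
y_train = np.array([1, 0, 1, 0, 0, 1])

clf = LogisticRegression().fit(X_train, y_train)

def coverage_threshold(scores, desired_coverage):
    """Score cutoff so that roughly `desired_coverage` of examples get answered."""
    return np.quantile(scores, 1.0 - desired_coverage)

def respond_or_abstain(confidence, sufficient, threshold):
    """Combine the two signals into one score and compare against the threshold."""
    score = clf.predict_proba([[confidence, sufficient]])[0, 1]
    return "respond" if score >= threshold else "abstain"

# Calibrate the threshold on held-out scores for, say, 70% coverage.
val_scores = clf.predict_proba(X_train)[:, 1]
thr = coverage_threshold(val_scores, desired_coverage=0.7)
print(respond_or_abstain(confidence=0.85, sufficient=1, threshold=thr))
```

The key design choice, per the diagram, is that the abstention decision depends on both signals jointly rather than on confidence alone.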
[Image: Table categorizing cases where models correctly answer questions despite insufficient context, including yes/no questions, limited-choice questions, multi-hop fragments, partial information, and cases where parametric knowledge bridges gaps.]
Intriguingly, models sometimes generate correct answers despite insufficient context. We taxonomize these cases: parametric knowledge bridging information gaps, yes/no questions where guessing is correct 50% of the time, and instances where the context provides partial reasoning paths.
24.04.2025 18:18
[Image: Bar graph showing the percentage of instances with sufficient context across datasets. FreshQA has the highest share of sufficient context (77%), while HotpotQA and Musique sit around 44-45%.]
We analyzed standard QA datasets through our sufficient context lens and found that a surprisingly large share of instances lack sufficient information: ~56% for Musique, ~56% for HotpotQA, and ~23% for FreshQA. This highlights the magnitude of the information retrieval challenge.
24.04.2025 18:18
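For reference, the dataset-level percentages above amount to simple bookkeeping over per-example autorater labels; the sketch below assumes a hypothetical `llm_judge(question, context)` callable standing in for the LLM-based sufficient context autorater.

```python
from typing import Callable

def sufficient_context_rate(examples: list[dict],
                            autorater: Callable[[str, str], bool]) -> float:
    """Fraction of (question, retrieved context) pairs the autorater labels sufficient."""
    labels = [autorater(ex["question"], ex["context"]) for ex in examples]
    return sum(labels) / len(labels)

# Hypothetical usage: llm_judge would prompt an LLM to decide whether the
# retrieved context alone contains enough information to answer the question.
# sufficient_context_rate(freshqa_examples, llm_judge)  # ~0.77 per the chart above
```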
Conversely, smaller models (Mistral 3, Gemma 2) struggle even with sufficient context, either hallucinating or failing to extract answers from the provided information. Neither class of model solves the fundamental RAG reliability challenge.
24.04.2025 18:18
[Image: Bar chart comparing model performance on datasets stratified by sufficient context. Larger models (Gemini, GPT, Claude) perform better with sufficient context but still hallucinate when it is insufficient, while smaller models (Gemma) struggle across conditions.]
A major finding: when context is sufficient, larger models (Gemini 1.5 Pro, GPT-4o, Claude 3.5) excel. But when it's insufficient, they're more likely to hallucinate than abstain, presenting incorrect answers with high confidence.
24.04.2025 18:18
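One way to reproduce this kind of stratified view is to tabulate per-example outcomes (correct, hallucinated, abstained) against the autorater's sufficiency label; the records and column names below are illustrative.

```python
import pandas as pd

# Illustrative per-example results; in practice these come from running each
# model over a QA dataset and labeling responses as correct/hallucinated/abstained.
results = pd.DataFrame([
    {"model": "large", "sufficient_context": True,  "outcome": "correct"},
    {"model": "large", "sufficient_context": False, "outcome": "hallucinated"},
    {"model": "small", "sufficient_context": True,  "outcome": "hallucinated"},
    {"model": "small", "sufficient_context": False, "outcome": "abstained"},
])

# Share of each outcome within every (model, context sufficiency) stratum.
rates = (results
         .groupby(["model", "sufficient_context"])["outcome"]
         .value_counts(normalize=True)
         .unstack(fill_value=0))
print(rates)
```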
When RAG systems hallucinate, is the LLM misusing the available information, or is the retrieved context insufficient? In our #ICLR2025 paper, we introduce "sufficient context" to disentangle these failure modes. Work with Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, @cyroid.bsky.social
24.04.2025 18:18
PhD student at UC San Diego working on web privacy
Cells talk; I build tools to listen to them
Computational biologist • Occasional composer and pianist • he/him
Opinions are my own.
https://www.yihaopeng.tw/
CTO & Co-Founder at Coefficient Bio. Ex-Prescient Design • Genentech. Advisor to Atomscale & Guide Labs
ncfrey.github.io | ncfrey.substack.com
"Seung Hyun" | MS CS & BS Applied Math @UCSD π | LPCUWC 18' ππ° | Interpretability, Explainability, AI Alignment, Safety & Regulation | π°π·
harry.scheon.com
Responsible AI & Human-AI Interaction
Currently: Research Scientist at Apple
Previously: Princeton CS PhD, Yale S&DS BSc, MSR FATE & TTIC Intern
https://sunniesuhyoung.github.io/
Associate Professor, ESADE | PhD, Machine Learning & Public Policy, Carnegie Mellon | Previously FAccT EC | Algorithmic fairness, human-AI collab | she/her/ella.
Recently a principal scientist at Google DeepMind. Joining Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamical systems.
Postdoctoral Scholar Stanford NLP
Assistant Prof of AI & Decision-Making @MIT EECS
I run the Algorithmic Alignment Group (https://algorithmicalignment.csail.mit.edu/) in CSAIL.
I work on value (mis)alignment in AI systems.
https://people.csail.mit.edu/dhm/
Professor of Sociology, Princeton, www.princeton.edu/~mjs3
Author of Bit by Bit: Social Research in the Digital Age, bitbybitbook.com
PhD candidate @utoronto.ca and @vectorinstitute.ai | Soon: Postdoc @princetoncitp.bsky.social | Reliable, safe, trustworthy machine learning.
NYT tech columnist, Hard Fork co-host, best at 0.8x speed
Reverse engineering neural networks at Anthropic. Previously Distill, OpenAI, Google Brain. Personal account.
#HCI Assistant Prof. @FIU @FIUSCIS | Prev: @ucsd_cse @DesignLabUCSD @MSFTResearch @AdobeResearch @S3DatCMU @cmuhcii @UniofNottingham
PhD Student | Works on Explainable AI | https://donatellagenovese.github.io/
Phd Student - Explainable AI for Mobile Networks
http://www.julian-rodemann.de | PhD student in statistics @LMU_Muenchen | currently @HarvardStats