🥳 Had great fun doing this during my summer internship with folks from Apple (Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Hong Yu) and USC (@swabhs.bsky.social)
Looking forward to the feedback!
#LLMs #NLProc
(7/n)
30.04.2025 18:54
Bottom line: There's no single metric that captures hallucinations reliably across the board.
Our work highlights the need for robust, context-aware, and generalizable hallucination detection tools as a prerequisite to meaningful mitigation.
(6/n)
30.04.2025 18:54
What works better?
Unsurprisingly, GPT-4-based evaluators align most reliably with human judgments across settings.
Ensembles of multiple metrics are a promising avenue (see the sketch below).
Instruction tuning & mode-seeking decoding help reduce hallucinations.
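To make the ensemble idea concrete, here is a toy sketch (not the paper's method): z-normalize each detector's per-example scores so no metric's scale dominates, then average them into a single hallucination score. The metric names and numbers are made up.

```python
# Toy metric ensemble (illustrative only, not the paper's implementation).
import statistics

def ensemble_scores(metric_scores: dict[str, list[float]]) -> list[float]:
    """Map {metric name: per-example scores} to one averaged score per example."""
    normalized = []
    for scores in metric_scores.values():
        mu = statistics.mean(scores)
        sigma = statistics.stdev(scores) or 1.0  # guard against zero variance
        normalized.append([(s - mu) / sigma for s in scores])
    # Transpose and average: one combined score per example.
    return [statistics.mean(col) for col in zip(*normalized)]

# Made-up scores from three hypothetical detectors on four examples.
print(ensemble_scores({
    "nli":         [0.9, 0.2, 0.7, 0.1],
    "qa_pipeline": [0.8, 0.3, 0.6, 0.2],
    "llm_judge":   [1.0, 0.0, 0.8, 0.0],
}))
```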
(5/n)
30.04.2025 18:54
Our findings highlight:
⚠️ Many existing metrics show poor alignment with human judgments
⚠️ Inter-metric correlation is also weak (see the sketch below)
⚠️ They show limited generalization across datasets, tasks, and models
⚠️ They do not consistently improve with larger models
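As an illustration of how inter-metric (dis)agreement can be quantified, here is a small sketch, not from the paper, that computes pairwise Spearman rank correlations over per-example detector scores; the detector names and scores are invented.

```python
# Pairwise rank correlation between hallucination detectors (toy data).
from itertools import combinations
from scipy.stats import spearmanr

scores = {
    "rouge":     [0.7, 0.4, 0.9, 0.2, 0.5],
    "nli":       [0.2, 0.8, 0.6, 0.1, 0.9],
    "llm_judge": [0.9, 0.3, 0.8, 0.4, 0.6],
}

for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    rho, p = spearmanr(a, b)  # low rho = the two metrics rank examples differently
    print(f"{name_a} vs {name_b}: rho={rho:.2f} (p={p:.2f})")
```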
(4/n)
30.04.2025 18:54
Focusing on faithfulness and factuality errors in QA and dialogue tasks, we study diverse metrics spanning:
1. Syntactic and semantic similarity
2. Natural language inference (a minimal example is sketched below)
3. Multi-step question answering pipelines
4. Custom-trained models
5. SOTA LLMs as judges
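For family 2, here is a minimal sketch of an NLI-based faithfulness check with Hugging Face transformers, assuming an MNLI-style model (the model choice and example are illustrative, not the paper's setup): treat the source as premise and the generated answer as hypothesis, then read off the entailment probability.

```python
# Minimal NLI-based faithfulness check (illustrative, not the paper's code).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # any MNLI-style model should work
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(source entails answer); a low value suggests a faithfulness error."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    # Read the entailment index from the config rather than hardcoding it.
    ent_idx = next(i for i, lbl in model.config.id2label.items()
                   if lbl.lower() == "entailment")
    return probs[ent_idx].item()

source = "The Eiffel Tower was completed in 1889."
answer = "The Eiffel Tower opened in 1889."
print(entailment_prob(source, answer))  # high value = supported by the source
```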
(3/n)
30.04.2025 18:54
Despite a surge in research on hallucination mitigation, few ask the critical questions:
1. Are the metrics capturing hallucinations effectively?
2. Do they align with each other and with the human notion of hallucination?
3. Do they generalize across different settings?
(2/n)
30.04.2025 18:54
Hallucinations in LLMs are real, and so are the problems with how we measure them.
Our latest work questions the generalizability of hallucination detection metrics across tasks, datasets, model sizes, training methods, and decoding strategies 🔥
arxiv.org/abs/2504.18114
(1/n)
30.04.2025 18:54
Reasoning about the "why" behind user behavior can improve LLM personas! ✨🧠
Excited to share our new work: Improving LLM Personas via Rationalization with Psychological Scaffolds
arxiv.org/abs/2504.17993
🧵 (1/n)
29.04.2025 01:05
There are too many starter packs.
Here's a list, mostly for NLP, ML, and related areas.
01.12.2024 03:05
#socalnlp is the biggest it's ever been in 2024! We have 3 poster sessions, up from 2! How many years until it's a two-day event?? 🤯
22.11.2024 21:50
Started a SoCal AI/ML/NLP researchers starter pack! It's a bit sparse right now, and perhaps more NLP heavy, but hey, nominate yourself and others! go.bsky.app/6QckPj9
19.11.2024 15:28
🙋🏻‍♀️🙋🏻‍♀️
19.11.2024 23:31
Hey John, thanks for starting this pack! Could you please add me as well?
18.11.2024 18:09
Can you please add me to the pack! Looking forward to interacting with everyone!
15.11.2024 06:59
Great initiative!! Can you please add me! Looking forward to interacting with everyone!! 💯
15.11.2024 06:56