Real user queries often look different from the clean, concise ones in academic benchmarks - ambiguous, full of typos, and much less readable.
We show that even strong RAG systems quickly break under these conditions.
Awesome project led by
@neelbhandari.bsky.social and @tianyucao.bsky.social!!
22.04.2025 00:27 — 👍 6 🔁 1 💬 0 📌 0
These days RAG systems have gotten popular for boosting LLMs—but they're brittle💔. Minor shifts in phrasing (✍️ style, politeness, typos) can wreck the pipeline. Even advanced components don’t fix the issue.
Check out this extensive eval by @neelbhandari.bsky.social and @tianyucao.bsky.social!
18.04.2025 01:49 — 👍 1 🔁 1 💬 0 📌 0
11/ This paper has been an incredible effort across institutions @ltiatcmu.bsky.social @uwcse.bsky.social. Huge thanks to my co-first author @tianyucao.bsky.social and co-authors @akhilayerukola.bsky.social @akariasai.bsky.social @maartensap.bsky.social ✨🚀
17.04.2025 19:55 — 👍 1 🔁 0 💬 0 📌 0
10/ 📜 Paper: "Out of Style: RAG’s Fragility to Linguistic Variation": arxiv.org/abs/2504.08231
🔬 Code: github.com/Springcty/RA...
Read our paper for more details on the impact of scaling retrieved documents, the specific effects of each linguistic variation on RAG pipelines, and much more!
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
9/ 🚨 Takeaway
RAG systems suffer major performance drops from simple linguistic variations.
Advanced techniques offer temporary relief, but real robustness demands fundamental changes - more resilient components and fewer cascading errors - to serve all users effectively.
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
8/🛠️ Adding advanced techniques to vanilla RAG improves robustness... sometimes🫠
✅ Reranking improves performance on linguistic rewrites, but a gap relative to original queries remains.
⚠️ HyDE helps rewritten queries but hurts original queries - creating a false sense of robustness. (HyDE sketched below.)
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
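For readers new to HyDE: the gist is to retrieve with the embedding of a *hypothetical* answer passage instead of the raw query. A minimal sketch - the `generate` and `embed` callables are hypothetical stand-ins for an LLM and a dense encoder, not the paper's implementation:

```python
import numpy as np

def hyde_retrieve(query, generate, embed, doc_embeddings, k=5):
    """HyDE: retrieve with the embedding of a hypothetical answer passage."""
    # 1) Ask an LLM to draft a passage that plausibly answers the query.
    hypothetical = generate(f"Write a short passage answering: {query}")
    # 2) Embed that passage rather than the (possibly noisy) query.
    q_vec = embed(hypothetical)
    # 3) Rank corpus documents by cosine similarity to the passage embedding.
    sims = (doc_embeddings @ q_vec) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]  # indices of the top-k documents
```

The upshot: the hypothetical passage can "clean up" a noisy rewrite, but it can also drift away from an already well-formed query - consistent with the mixed results above.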
7/🤔Well, maybe scaling generation model size helps?
Scaling up LLM size helps narrow the performance gap between original and rewritten queries. However, this is not consistent across variations. Larger models occasionally worsen the impact, particularly with RTT variations.
17.04.2025 19:55 — 👍 2 🔁 0 💬 1 📌 0
6/⚖️ RAG is more fragile than LLM-only setups
RAG’s retrieval-generation pipeline amplifies linguistic errors, leading to greater performance drops. On PopQA, RAG degrades by 23% vs. just 11% for the LLM-only setup.
⚠️ The main culprit? Retrieval emerges as the weakest link. (Toy contrast of the two setups below.)
17.04.2025 19:55 — 👍 2 🔁 0 💬 1 📌 0
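To make the cascade concrete, a toy contrast of the two setups - hypothetical `retrieve`/`generate` callables, purely illustrative:

```python
def llm_only_answer(query, generate):
    # One step: noise in the query can only hurt generation once.
    return generate(f"Answer the question: {query}")

def rag_answer(query, retrieve, generate, k=5):
    # Two coupled steps: a noisy query first degrades retrieval, and the
    # off-topic passages then mislead generation -- errors compound.
    passages = retrieve(query, k=k)
    context = "\n".join(passages)
    return generate(f"Context:\n{context}\n\nAnswer the question: {query}")
```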
5/🧩 Generation Fragility
Linguistic variations lead to generation accuracy drops - Exact Match score down by up to ~41%, Answer Match score by up to ~17%. (Metrics sketched below.)
Structural changes from RTT are particularly damaging, significantly reducing response accuracy.
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
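For reference, a common way to compute the two metrics (SQuAD-style normalization; the paper's exact scoring script may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, golds: list[str]) -> int:
    """1 iff the normalized prediction equals some normalized gold answer."""
    return int(any(normalize(prediction) == normalize(g) for g in golds))

def answer_match(prediction: str, golds: list[str]) -> int:
    """Looser: 1 iff some normalized gold answer appears in the prediction."""
    pred = normalize(prediction)
    return int(any(normalize(g) in pred for g in golds))
```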
4/📌Retrieval Robustness
Retrieval recall plummets by up to 40.41% due to linguistic variations, especially for informal queries. Grammatical errors from RTT and typos notably degrade performance, highlighting retrievers' sensitivity to these variations. (Recall@k sketched below.)
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
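Concretely, the recall drop can be measured like this - a minimal sketch, where `retrieved` holds each query's top-k document IDs and `gold` its relevant IDs:

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of queries whose top-k results contain a relevant document."""
    hits = sum(
        1 for ret, rel in zip(retrieved, gold) if set(ret[:k]) & set(rel)
    )
    return hits / len(gold)

# Degradation from a linguistic rewrite = recall on the original queries
# minus recall on the same queries after rewriting:
# drop = recall_at_k(run_original, gold) - recall_at_k(run_rewritten, gold)
```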
3/ We evaluate across an extensive experimental setup:
🧲 2 Retrievers (Contriever, ModernBERT)
🤖 9 open LLMs (3B–72B)
📚 4 QA datasets (MS MARCO, PopQA, Natural Questions, EntityQuestions)
🔁 50K+ linguistically varied queries per dataset
17.04.2025 19:55 — 👍 1 🔁 0 💬 1 📌 0
2/🔍 We evaluated RAG robustness against four common linguistic variations:
✍️ Lower formality
📉 Lower readability
🙂 Increased politeness
🔤 Grammatical errors (from typos & from round-trip translation (RTT)); a toy typo injector is sketched below
17.04.2025 19:55 — 👍 2 🔁 0 💬 1 📌 0
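As an illustration of the typo condition, here's one cheap character-level perturbation - our own toy sketch, not the paper's perturbation pipeline. (RTT would instead round-trip the query through another language with an MT model.)

```python
import random

def add_typos(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly swap, drop, or duplicate characters at ~`rate` per letter."""
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap":       # transpose with the next character
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            elif op == "drop":     # delete the character
                del chars[i]
                continue           # re-check the character that shifted in
            else:                  # duplicate the character
                chars.insert(i, chars[i])
                i += 1             # skip over the copy
        i += 1
    return "".join(chars)

print(add_typos("who wrote the declaration of independence?"))
```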
1/🚨 𝗡𝗲𝘄 𝗽𝗮𝗽𝗲𝗿 𝗮𝗹𝗲𝗿𝘁 🚨
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?
We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
17.04.2025 19:55 — 👍 9 🔁 5 💬 1 📌 2