GitHub - EEElisa/LLM-Guardrails
Contribute to EEElisa/LLM-Guardrails development by creating an account on GitHub.
[9/9] Big THANKS to my amazing collaborators @jiajiah.bsky.social @pigeonzow.bsky.social Motahhare Eslami, Jena Hwang @faebrahman.bsky.social, @carolynrose.bsky.social @maartensap.bsky.social from @ltiatcmu.bsky.social
Pareto.ai @sfu.ca @ai2.bsky.social โฅ๏ธ
๐ github.com/EEElisa/LLM-Guardrails
20.10.2025 20:04 โ ๐ 1 ๐ 1 ๐ฌ 0 ๐ 0
Anthropicโs Claude AI Can Now End Abusive Conversations For โModel Welfareโ
Anthropicโs new feature for Claude Opus 4 and 4.1 flips the moral question: Itโs no longer how AI should treat us, but how we should treat AI.
๐ฐ [8/9]Our work was recently featured in Forbes, in a piece about models learning to end harmful conversations responsibly (www.forbes.com/sites/victor...). Conversation endings and refusal design are central to building safe yet engaging AI systems.
20.10.2025 20:04 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0
๐ข [7/9] Designing what to share vs. withhold remains a technical and ethical challenge. Partial compliance can blur whatโs safe to share vs what must be withheld. We call for a better refusal design that safeguards users without legitimizing harm!
20.10.2025 20:04 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0
๐ค [5/9] Paradoxically, partial compliance is rarely used by current LLMs and reward models donโt favor it either.
We reveal a major misalignment between:
1๏ธโฃ What users prefer
2๏ธโฃ What models actually do
3๏ธโฃ What reward models reinforce
20.10.2025 20:04 โ ๐ 2 ๐ 0 ๐ฌ 1 ๐ 0
๐ก[4/9] The best way to say โnoโ isnโt just saying no.
Partial complianceโgiving general, non-actionable info instead of a flat โI canโt help.โโ
โ Cuts negative perceptions by >50%
โ Keeps conversations safe yet engaging
20.10.2025 20:04 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0
๐ฅ [3/9] Across 480 participants and 3,840 queryโresponse pairs, we find:
๐จ User intent matters far less than expected.
๐ฌ Itโs the refusal strategy that drives user experience.
Alignment with user expectations explains most perception variance.
20.10.2025 20:04 โ ๐ 1 ๐ 0 ๐ฌ 1 ๐ 0
โ[2/9] LLMs refuse unsafe queries to protect users, but what if they refuse too bluntly?
We investigate the contextual effects of user motivation and refusal strategies on user perceptions of LLM guardrails and model usage of refusals across safety categories.
20.10.2025 20:04 โ ๐ 1 ๐ 0 ๐ฌ 1 ๐ 0
How and when should LLM guardrails be deployed to balance safety and user experience?
Our #EMNLP2025 paper reveals that crafting thoughtful refusals rather than detecting intent is the key to human-centered AI safety.
๐ arxiv.org/abs/2506.00195
๐งต[1/9]
20.10.2025 20:04 โ ๐ 8 ๐ 3 ๐ฌ 1 ๐ 0
The first page of the NAACL 2025 paper Causally Modeling the Linguistic and Social Factors that Predict Email Response
Why do some emails get a reply and not others? Does it have more to do with how you write it or who you areโor maybe both? In our new #NAACL2025 paper we looked at 11M emails to causally test what factors will help you get a reply. ๐ฌ
01.05.2025 03:15 โ ๐ 13 ๐ 1 ๐ฌ 2 ๐ 0
When interacting with ChatGPT, have you wondered if they would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR ," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! ๐คฏ 1/
28.04.2025 20:36 โ ๐ 25 ๐ 9 ๐ฌ 1 ๐ 3
Figure showing that interpretations of gestures vary dramatically across regions and cultures. โCrossing your fingers,โ commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the 'fig gesture,' a playful 'got your nose' game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.
Did you know? Gestures used to express universal conceptsโlike wishing for luckโvary DRAMATICALLY across cultures?
๐คmeans luck in US but deeply offensive in Vietnam ๐จ
๐ฃ We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!
๐: arxiv.org/abs/2502.17710
26.02.2025 16:22 โ ๐ 33 ๐ 7 ๐ฌ 1 ๐ 3
PhD student at the IMS (Uni Stuttgart)
Exploring Future of Work @Pareto.ai
Swims and Dives
PhD student at University of Michigan School of Information.
Computational Social Science | Science of Science
http://hongcchen.com
language is irreducibly contextual and multimodal.
bizarre hybrid AI researcher / fullstack dev. currently working on https://talktomehuman.com/ & consulting (uname = domain)
previously:
- buncha travel
- phd @ uw (nlp)
- eng @ google (kubernetes)
Enrich your stories with charts, maps, and tables โ interactive, responsive, and on brand. Questions? Write us: datawrapper.de/contact-us
Masterโs student @ltiatcmu.bsky.social. he/him
Incoming Assistant Professor @cornellbowers.bsky.social
Researcher @togetherai.bsky.social
Previously @stanfordnlp.bsky.social @ai2.bsky.social @msftresearch.bsky.social
https://katezhou.github.io/
PhD student at @princetoncitp.bsky.social. Previously @uwcse.bsky.social
website: hayoungjung.me
PhD-ing @ LTI, CMU; Intern @ NVIDIA. Doing Reasoning with Gen AI!
PhD Student at Carnegie Mellon University. Interested in the energy implications and impact of machine learning systems.
Prev: Northwestern University, Google, Meta.
MS in NLP (MIIS) @ LTI, CMU
https://dhruv0811.github.io/
LTI PhD at CMU on evaluation and trustworthy ML/NLP, prev AI&CS Edinburgh University, Google, YouTube, Apple, Netflix. Views are personal ๐ฉ๐ปโ๐ป๐ฎ๐ฉ
athiyadeviyani.github.io
PhD student @CMU LTI
NLP | IR | Evaluation | RAG
https://kimdanny.github.io
Knowledge Engineer @ Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
PhD student at CMU. I do research on applied NLP ("alignment", "synthetic data"). he/him