
Mingqian Zheng

@mingqian-zheng.bsky.social

PhD @CMU LTI https://eeelisa.github.io/

55 Followers  |  211 Following  |  9 Posts  |  Joined: 02.12.2024

Latest posts by mingqian-zheng.bsky.social on Bluesky

GitHub - EEElisa/LLM-Guardrails

[9/9] Big THANKS to my amazing collaborators @jiajiah.bsky.social @pigeonzow.bsky.social Motahhare Eslami, Jena Hwang @faebrahman.bsky.social, @carolynrose.bsky.social @maartensap.bsky.social from @ltiatcmu.bsky.social
Pareto.ai @sfu.ca @ai2.bsky.social ♥️
📂 github.com/EEElisa/LLM-Guardrails

20.10.2025 20:04 — 👍 1  🔁 1  💬 0  📌 0
Anthropic's Claude AI Can Now End Abusive Conversations For 'Model Welfare'
Anthropic's new feature for Claude Opus 4 and 4.1 flips the moral question: It's no longer how AI should treat us, but how we should treat AI.

📰 [8/9] Our work was recently featured in Forbes, in a piece about models learning to end harmful conversations responsibly (www.forbes.com/sites/victor...). Conversation endings and refusal design are central to building safe yet engaging AI systems.

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0

📢 [7/9] Deciding what to share vs. what to withhold remains a technical and ethical challenge: partial compliance can blur the line between what is safe to share and what must be withheld. We call for refusal designs that safeguard users without legitimizing harm!

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0
From hard refusals to safe-completions: toward output-centric safety training
Introduced in GPT-5, safe-completion is a new safety-training approach to maximize model helpfulness within safety constraints. Compared to refusal-based training, safe-completion improves both safety...

💭 [6/9] Safety ≠ just saying "no."
Models like GPT-5 are beginning to use safe-completions (i.e., partial compliance) that maximize helpfulness within safety limits (openai.com/index/gpt-5-...). It's exciting to see this conversation expanding beyond research!

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0

🤖 [5/9] Paradoxically, partial compliance is rarely used by current LLMs, and reward models don't favor it either.
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce

20.10.2025 20:04 — 👍 2  🔁 0  💬 1  📌 0
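One way to make this three-way misalignment concrete is to measure how often a reward model ranks the response that a user actually preferred higher. Everything below (the records, the scores, the resulting agreement rate) is invented for illustration and is not data from the paper.

```python
# Hypothetical sketch: quantifying user/reward-model misalignment on unsafe
# queries. Each record holds the user's preferred response type and a reward
# model's scores for a (full refusal, partial compliance) pair on one query.
# All numbers are made up for illustration.
records = [
    # (user_prefers, rm_score_refusal, rm_score_partial)
    ("partial", 0.81, 0.64),
    ("partial", 0.77, 0.70),
    ("refusal", 0.85, 0.52),
    ("partial", 0.69, 0.73),
]

def agreement_rate(records):
    """Fraction of pairs where the reward model ranks the user's choice higher."""
    agree = 0
    for preferred, s_refusal, s_partial in records:
        rm_choice = "refusal" if s_refusal > s_partial else "partial"
        agree += (rm_choice == preferred)
    return agree / len(records)

print(f"user/reward-model agreement: {agreement_rate(records):.2f}")
```

A low agreement rate on pairs like these would indicate the reward model is reinforcing refusals that users themselves dislike.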

💡 [4/9] The best way to say "no" isn't just saying no.
Partial compliance (giving general, non-actionable info instead of a flat "I can't help"):
→ Cuts negative perceptions by >50%
→ Keeps conversations safe yet engaging

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0
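A minimal sketch of the strategy space the thread argues for, assuming a scalar risk score and illustrative thresholds; neither the thresholds nor the template texts come from the paper:

```python
# Illustrative guardrail: instead of a single hard refusal, pick a response
# strategy per query. Risk tiers and wording are assumptions, not the
# paper's taxonomy.
from enum import Enum

class Strategy(Enum):
    COMPLY = "comply"
    PARTIAL = "partial_compliance"  # general, non-actionable info
    REFUSE = "refuse"

def choose_strategy(risk: float) -> Strategy:
    """Map an (assumed) risk score in [0, 1] to a response strategy."""
    if risk < 0.3:
        return Strategy.COMPLY
    if risk < 0.8:
        return Strategy.PARTIAL  # the safe-yet-engaging middle ground
    return Strategy.REFUSE

def respond(query: str, risk: float) -> str:
    strategy = choose_strategy(risk)
    if strategy is Strategy.COMPLY:
        return f"[answer to: {query}]"
    if strategy is Strategy.PARTIAL:
        return ("I can't give actionable details, but here is some general "
                "background on the topic and pointers to safer resources.")
    return "I can't help with that."
```

The design point is the middle branch: reserving the flat refusal for the highest-risk tier is what keeps the rest of the conversation engaging.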

👥 [3/9] Across 480 participants and 3,840 query–response pairs, we find:
🚨 User intent matters far less than expected.
💬 It's the refusal strategy that drives user experience.

Alignment with user expectations explains most perception variance.

20.10.2025 20:04 — 👍 1  🔁 0  💬 1  📌 0

โ“[2/9] LLMs refuse unsafe queries to protect users, but what if they refuse too bluntly?

We investigate the contextual effects of user motivation and refusal strategies on user perceptions of LLM guardrails and model usage of refusals across safety categories.

20.10.2025 20:04 — 👍 1  🔁 0  💬 1  📌 0

How and when should LLM guardrails be deployed to balance safety and user experience?

Our #EMNLP2025 paper reveals that crafting thoughtful refusals, rather than detecting intent, is the key to human-centered AI safety.

📄 arxiv.org/abs/2506.00195
🧵 [1/9]

20.10.2025 20:04 — 👍 8  🔁 3  💬 1  📌 0
The first page of the NAACL 2025 paper Causally Modeling the Linguistic and Social Factors that Predict Email Response

Why do some emails get a reply and not others? Does it have more to do with how you write it or who you are, or maybe both? In our new #NAACL2025 paper we analyzed 11M emails to causally test which factors help you get a reply. 📬

01.05.2025 03:15 — 👍 13  🔁 1  💬 2  📌 0

When interacting with ChatGPT, have you ever wondered whether it would "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals that models were truthful less than 50% of the time when faced with utility–truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25  🔁 9  💬 1  📌 3
Figure showing that interpretations of gestures vary dramatically across regions and cultures. "Crossing your fingers," commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the "fig gesture," a playful "got your nose" game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.

Did you know? Gestures used to express universal concepts, like wishing for luck, vary DRAMATICALLY across cultures.
🤞 means luck in the US but is deeply offensive in Vietnam 🚨

📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I models handle such nonverbal behavior!

📜: arxiv.org/abs/2502.17710

26.02.2025 16:22 — 👍 33  🔁 7  💬 1  📌 3
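As a sketch of what a benchmark like MC-SIGNS makes testable, one can probe a model about the same gesture under different cultural contexts and check whether it flags the offensive readings. The rows and the `model_judges_offensive` stub below are invented stand-ins, not the benchmark's real format or data.

```python
# Hypothetical probe: does a model's offensiveness judgment change with the
# cultural context of a gesture? The cases and the stub "model" are invented
# for illustration only.
cases = [
    # (gesture, region, offensive_in_region)
    ("crossed fingers", "US", False),
    ("crossed fingers", "Vietnam", True),
    ("fig gesture", "US", False),
    ("fig gesture", "Japan", True),
]

def model_judges_offensive(gesture: str, region: str) -> bool:
    """Stand-in for querying an LLM; a naive model that ignores culture."""
    return False  # always assumes the US reading

accuracy = sum(
    model_judges_offensive(g, r) == label for g, r, label in cases
) / len(cases)
print(f"cultural-awareness accuracy: {accuracy:.2f}")  # 0.50 for this stub
```

A culture-blind model scores exactly the base rate here; the benchmark's point is to expose that gap.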
