
Mingqian Zheng

@mingqian-zheng.bsky.social

PhD @CMU LTI https://eeelisa.github.io/

55 Followers  |  211 Following  |  9 Posts  |  Joined: 02.12.2024

Latest posts by mingqian-zheng.bsky.social on Bluesky

GitHub - EEElisa/LLM-Guardrails

[9/9] Big THANKS to my amazing collaborators @jiajiah.bsky.social @pigeonzow.bsky.social Motahhare Eslami, Jena Hwang @faebrahman.bsky.social, @carolynrose.bsky.social @maartensap.bsky.social from @ltiatcmu.bsky.social
Pareto.ai @sfu.ca @ai2.bsky.social ♥️
📂 github.com/EEElisa/LLM-Guardrails

20.10.2025 20:04 — 👍 1  🔁 1  💬 0  📌 0
Anthropic's Claude AI Can Now End Abusive Conversations For 'Model Welfare'
Anthropic's new feature for Claude Opus 4 and 4.1 flips the moral question: It's no longer how AI should treat us, but how we should treat AI.

📰 [8/9] Our work was recently featured in Forbes, in a piece about models learning to end harmful conversations responsibly (www.forbes.com/sites/victor...). Conversation endings and refusal design are central to building safe yet engaging AI systems.

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0

📢 [7/9] Deciding what to share vs. what to withhold remains a technical and ethical challenge: partial compliance can blur the line between what is safe to share and what must be withheld. We call for refusal designs that safeguard users without legitimizing harm!

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0
From hard refusals to safe-completions: toward output-centric safety training
Introduced in GPT-5, safe-completion is a new safety-training approach to maximize model helpfulness within safety constraints. Compared to refusal-based training, safe-completion improves both safety...

💭 [6/9] Safety ≠ just saying "no."
Models like GPT-5 are beginning to use safe-completions (i.e., partial compliance) that maximize helpfulness within safety limits (openai.com/index/gpt-5-...). It's exciting to see this conversation expanding beyond research!

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0

🤖 [5/9] Paradoxically, partial compliance is rarely used by current LLMs, and reward models don't favor it either.
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce

20.10.2025 20:04 — 👍 2  🔁 0  💬 1  📌 0
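One way to make this three-way misalignment concrete is to measure how often a reward model ranks the response that a user actually preferred higher. Everything below (the records, the scores, the resulting agreement rate) is invented for illustration and is not data from the paper.

```python
# Hypothetical sketch: quantifying user/reward-model misalignment on unsafe
# queries. Each record holds the user's preferred response type and a reward
# model's scores for a (full refusal, partial compliance) pair on one query.
# All numbers are made up for illustration.
records = [
    # (user_prefers, rm_score_refusal, rm_score_partial)
    ("partial", 0.81, 0.64),
    ("partial", 0.77, 0.70),
    ("refusal", 0.85, 0.52),
    ("partial", 0.69, 0.73),
]

def agreement_rate(records):
    """Fraction of pairs where the reward model ranks the user's choice higher."""
    agree = 0
    for preferred, s_refusal, s_partial in records:
        rm_choice = "refusal" if s_refusal > s_partial else "partial"
        agree += (rm_choice == preferred)
    return agree / len(records)

print(f"user/reward-model agreement: {agreement_rate(records):.2f}")
```

A low agreement rate on pairs like these would indicate the reward model is reinforcing refusals that users themselves dislike.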

💡 [4/9] The best way to say "no" isn't just saying no.
Partial compliance (giving general, non-actionable info instead of a flat "I can't help"):
→ Cuts negative perceptions by >50%
→ Keeps conversations safe yet engaging

20.10.2025 20:04 — 👍 0  🔁 0  💬 1  📌 0
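A minimal sketch of the strategy space the thread argues for, assuming a scalar risk score and illustrative thresholds; neither the thresholds nor the template texts come from the paper:

```python
# Illustrative guardrail: instead of a single hard refusal, pick a response
# strategy per query. Risk tiers and wording are assumptions, not the
# paper's taxonomy.
from enum import Enum

class Strategy(Enum):
    COMPLY = "comply"
    PARTIAL = "partial_compliance"  # general, non-actionable info
    REFUSE = "refuse"

def choose_strategy(risk: float) -> Strategy:
    """Map an (assumed) risk score in [0, 1] to a response strategy."""
    if risk < 0.3:
        return Strategy.COMPLY
    if risk < 0.8:
        return Strategy.PARTIAL  # the safe-yet-engaging middle ground
    return Strategy.REFUSE

def respond(query: str, risk: float) -> str:
    strategy = choose_strategy(risk)
    if strategy is Strategy.COMPLY:
        return f"[answer to: {query}]"
    if strategy is Strategy.PARTIAL:
        return ("I can't give actionable details, but here is some general "
                "background on the topic and pointers to safer resources.")
    return "I can't help with that."
```

The design point is the middle branch: reserving the flat refusal for the highest-risk tier is what keeps the rest of the conversation engaging.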

👥 [3/9] Across 480 participants and 3,840 query–response pairs, we find:
🚨 User intent matters far less than expected.
💬 It's the refusal strategy that drives user experience.

Alignment with user expectations explains most perception variance.

20.10.2025 20:04 — 👍 1  🔁 0  💬 1  📌 0

โ“[2/9] LLMs refuse unsafe queries to protect users, but what if they refuse too bluntly?

We investigate the contextual effects of user motivation and refusal strategies on user perceptions of LLM guardrails and model usage of refusals across safety categories.

20.10.2025 20:04 — 👍 1  🔁 0  💬 1  📌 0

How and when should LLM guardrails be deployed to balance safety and user experience?

Our #EMNLP2025 paper reveals that crafting thoughtful refusals, rather than detecting intent, is the key to human-centered AI safety.

📄 arxiv.org/abs/2506.00195
🧵 [1/9]

20.10.2025 20:04 — 👍 8  🔁 3  💬 1  📌 0
The first page of the NAACL 2025 paper Causally Modeling the Linguistic and Social Factors that Predict Email Response

Why do some emails get a reply and not others? Does it have more to do with how you write it or who you are, or maybe both? In our new #NAACL2025 paper we analyzed 11M emails to causally test which factors help you get a reply. 📬

01.05.2025 03:15 — 👍 13  🔁 1  💬 2  📌 0

When interacting with ChatGPT, have you ever wondered whether it would "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals that models were truthful less than 50% of the time when faced with utility–truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25  🔁 9  💬 1  📌 3
Figure showing that interpretations of gestures vary dramatically across regions and cultures. "Crossing your fingers," commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the "fig gesture," a playful "got your nose" game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.

Did you know? Gestures used to express universal concepts, like wishing for luck, vary DRAMATICALLY across cultures.
🤞 means luck in the US but is deeply offensive in Vietnam 🚨

📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I models handle such nonverbal behavior!

📜: arxiv.org/abs/2502.17710

26.02.2025 16:22 — 👍 33  🔁 7  💬 1  📌 3
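As a sketch of what a benchmark like MC-SIGNS makes testable, one can probe a model about the same gesture under different cultural contexts and check whether it flags the offensive readings. The rows and the `model_judges_offensive` stub below are invented stand-ins, not the benchmark's real format or data.

```python
# Hypothetical probe: does a model's offensiveness judgment change with the
# cultural context of a gesture? The cases and the stub "model" are invented
# for illustration only.
cases = [
    # (gesture, region, offensive_in_region)
    ("crossed fingers", "US", False),
    ("crossed fingers", "Vietnam", True),
    ("fig gesture", "US", False),
    ("fig gesture", "Japan", True),
]

def model_judges_offensive(gesture: str, region: str) -> bool:
    """Stand-in for querying an LLM; a naive model that ignores culture."""
    return False  # always assumes the US reading

accuracy = sum(
    model_judges_offensive(g, r) == label for g, r, label in cases
) / len(cases)
print(f"cultural-awareness accuracy: {accuracy:.2f}")  # 0.50 for this stub
```

A culture-blind model scores exactly the base rate here; the benchmark's point is to expose that gap.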
