John (Yueh-Han) Chen's Avatar

John (Yueh-Han) Chen

@johnchen6.bsky.social

Graduate Student Researcher @nyu prev @ucberkeley https://john-chen.cc

16 Followers  |  97 Following  |  18 Posts  |  Joined: 10.12.2024  |  1.6836

Latest posts by johnchen6.bsky.social on Bluesky

Preview
In AI-Generated Content, A Trade-Off Between Quality and Originality New research from CDS researchers maps the trade-off between originality and quality in LLM outputs.

CDS PhD student @vishakhpk.bsky.social, with co-authors @johnchen6.bsky.social, Jane Pan, Valerie Chen, and CDS Associate Professor @hhexiy.bsky.social, has published new research on the trade-off between originality and quality in LLM outputs.

Read more: nyudatascience.medium.com/in-ai-genera...

30.05.2025 16:07 β€” πŸ‘ 2    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0

Fantastic new work by @johnchen6.bsky.social (with @brendenlake.bsky.social and me trying not to cause too much trouble).

We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More analyses in the paper!

30.05.2025 17:32 β€” πŸ‘ 10    πŸ” 3    πŸ’¬ 0    πŸ“Œ 0

Failures of systematic generalization in LLMs can lead to real-world safety issues.

New paper by @johnchen6.bsky.social and @guydav.bsky.social, arxiv.org/abs/2505.21828

29.05.2025 19:17 β€” πŸ‘ 5    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0
Preview
YuehHanChen/SAGE-Eval Β· Datasets at Hugging Face We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Our data: huggingface.co/datasets/Yue...)
Code: github.com/YuehHanChen/....

We recommend that AI companies use SAGE-Eval in pre-deployment evaluations to assess model reliability when addressing salient risks in naive user prompts.

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Overall, our findings suggest the systematicity gap: unlike humansβ€”who generalize a safety fact learned in one context to any structurally related contextβ€”LLMs today exhibit only piecemeal safety, identifying critical knowledge in isolation but failing to apply it broadly.
12/🧡

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

> We compared the performance of OLMo-2-32B-SFT to OLMo-2-32B-DPO, which is the SFT version further trained with DPO. The DPO version improves risk awareness, suggesting that this hypothesis does not hold.
11/🧡

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Hypothesis 2: Does RLHF affect safety performance on SAGE-Eval? Would it be possible that humans prefer β€œless annoying” responses, potentially diminishing the presence of critical safety warnings?

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

> We used The Pile as a proxy for pre-training data of frontier models, and Google search result count as a secondary method. We failed to find a statistically significant correlation using both methods, suggesting that fact frequency alone doesn’t predict performance on SAGE-Eval.
10/🧡

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

To understand the root causes, we explore two hypotheses:
Hypothesis 1: Is there any correlation between fact frequency in pre-training data and safety performance on SAGE-Eval?

29.05.2025 16:57 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

In practice, deployed LLMs will face a vastly richer and more varied set of user prompts than any finite benchmark can cover. We show that model developers can forecast SAGE-Eval safety scores with at least one order of magnitude more prompts per fact with a power law fit. 9/🧡

29.05.2025 16:56 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 4: Model capability and training compute only weakly correlate with performance on SAGE-Eval, demonstrating that our benchmark effectively avoids β€œsafetywashing”—a scenario where capability improvements are incorrectly portrayed as advancements in safety. 8/🧡

29.05.2025 16:56 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 3: certain tones degrade safety performance. 7/🧡 In real life, users might prompt LMs in different tones. The depressed tone reduces the safety score to 0.865, noticeably below the no-augmentation baseline of 0.907.

29.05.2025 16:55 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 2: Long context undermines risk awareness. Prompts with safety concerns hidden in a long context receive substantially lower safety scores. 6/🧡

29.05.2025 16:55 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 1: All frontier LLMs we tested score <58% safety scores.
Our model-level safety score is defined as % of safety facts 100% passed all test scenario prompts (~100 scenarios per safety fact).
5/🧡

29.05.2025 16:55 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Property 3:
SAGE-Eval can be automatically evaluated: we confirm evaluation accuracy by manually labeling 100 model responses as safe or unsafe. In our experiments, we find perfect alignment between human judgments and an LLM-as-a-judge using frontier models as judges
4/🧡

29.05.2025 16:54 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Property 2:
SAGE-Eval is human-verified by 144 human annotators. If one human disagrees with the label, we manually edit or remove it. We then augment these questions with programming-based techniques (add typos or different tones) to extend each fact to around 100 test scenarios.
3/🧡

29.05.2025 16:54 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Property 1:
SAGE-Eval covers diverse safety categoriesβ€”including Child, Outdoor Activities, and Medicineβ€”and comprises 104 safety facts manually sourced from reputable organizations such as the CDC and FDA.
2/🧡

29.05.2025 16:54 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

To evaluate the systematic generalization of safety knowledge to novel situations, we designed SAGE-Eval with 3 main properties:

29.05.2025 16:54 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

>Do LLMs robustly generalize critical safety facts to novel scenarios?
Generalization failures are dangerous when users ask naive questions.

1/🧡

29.05.2025 16:53 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Arxiv: arxiv.org/pdf/2505.21828
Joint work with @guydav.bsky.social @brendenlake.bsky.social
🧡 starts below!

29.05.2025 16:52 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of facts evaluated
- o1 and o3-mini passed <45%! 🧡

29.05.2025 16:51 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 2
Post image

What does it mean for #LLM output to be novel?
In work w/ johnchen6.bsky.social, Jane Pan, Valerie Chen and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧡

29.04.2025 16:35 β€” πŸ‘ 7    πŸ” 4    πŸ’¬ 2    πŸ“Œ 0
Screenshot of the Ai2 Paper Finder interface

Screenshot of the Ai2 Paper Finder interface

Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow β€” and helps researchers find more papers than ever πŸ”

26.03.2025 19:07 β€” πŸ‘ 117    πŸ” 23    πŸ’¬ 6    πŸ“Œ 9

@johnchen6 is following 20 prominent accounts