
Kellin Pelrine

@kellinpelrine.bsky.social

https://kellinpelrine.github.io/, https://www.linkedin.com/in/kellin-pelrine/

12 Followers  |  9 Following  |  17 Posts  |  Joined: 17.10.2024

Latest posts by kellinpelrine.bsky.social on Bluesky


1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).

Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇

21.08.2025 16:24 — 👍 17  🔁 10  💬 1  📌 1

1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin"—just as capable as the original but stripped of all safety measures.

17.07.2025 18:01 — 👍 4  🔁 3  💬 1  📌 0

Conspiracies emerge in the wake of high-profile events, but you can’t debunk them with evidence because little yet exists. Does this mean LLMs can’t debunk conspiracies during ongoing events? No!

We show they can in a new working paper.

PDF: osf.io/preprints/ps...

09.07.2025 16:34 — 👍 51  🔁 18  💬 3  📌 2
Preview: A Guide to Misinformation Detection Data and Evaluation
Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this, we have curated the largest collection of (mis)information datas...

📄 Read the paper: arxiv.org/abs/2411.05060
🤗 Hugging Face repo: huggingface.co/datasets/Com...
💻 Code & website: misinfo-datasets.complexdatalab.com

19.06.2025 14:24 — 👍 0  🔁 0  💬 0  📌 0

👥 Research by Camille Thibault
@jacobtian.bsky.social @gskulski.bsky.social
Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan
@reirab.bsky.social @godbout.bsky.social @kellinpelrine.bsky.social

19.06.2025 14:23 — 👍 0  🔁 0  💬 1  📌 0

🚀 Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
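
For illustration, here is a minimal sketch of the error-analysis step in Python. The data, field names, and sample size are hypothetical and not part of the EQA framework itself:

import random

# Toy predictions paired with gold labels (hypothetical data, just to show the loop).
examples = [
    {"text": "Claim about event X ...", "gold": "false", "pred": "true"},
    {"text": "Claim about event Y ...", "gold": "true", "pred": "true"},
    {"text": "Claim about event Z ...", "gold": "false", "pred": "false"},
]

# Step 1: collect the errors from an evaluation run.
errors = [ex for ex in examples if ex["pred"] != ex["gold"]]

# Step 2: read a small random sample by hand, looking for label ambiguity,
# spurious cues, or claims whose veracity cannot actually be assessed.
random.seed(0)
for ex in random.sample(errors, k=min(20, len(errors))):
    print(ex["gold"], "->", ex["pred"], "|", ex["text"][:80])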

19.06.2025 14:15 — 👍 0  🔁 0  💬 1  📌 0

πŸ› οΈ We also provide practical tools:
β€’ CDL-DQA: a toolkit to assess misinformation datasets
β€’ CDL-MD: the largest misinformation dataset repo, now on Hugging Face πŸ€—
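
As a usage note, a dataset from the repo can be pulled with the standard Hugging Face datasets API; the repo ID below is a placeholder, since the link in this thread is truncated:

from datasets import load_dataset

# Placeholder repo ID: substitute the actual CDL-MD dataset path
# from the Hugging Face link above.
ds = load_dataset("your-org/your-misinfo-dataset", split="train")

print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at one example before running any evaluation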

19.06.2025 14:15 — 👍 0  🔁 0  💬 1  📌 0
Post image

πŸ” Categorical labels can underestimate the performance of generative systems by massive amounts: half the errors or more.

19.06.2025 14:15 — 👍 0  🔁 0  💬 1  📌 0

📊 Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can’t conclusively assess veracity at all.

19.06.2025 14:14 — 👍 0  🔁 0  💬 1  📌 0

💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings 👇

19.06.2025 14:14 — 👍 1  🔁 1  💬 1  📌 2

5/5 🔑 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712

03.06.2025 14:36 — 👍 3  🔁 0  💬 0  📌 0

4/5 πŸ›‘οΈ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.

03.06.2025 14:36 — 👍 0  🔁 0  💬 1  📌 0

3/5 🎯 Key insight: Safety boundaries don’t transfer across formats or contexts (text ↔ images; single-turn ↔ multi-turn; English ↔ low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.

03.06.2025 14:36 — 👍 0  🔁 0  💬 1  📌 0

2/5 πŸ” Striking examples:
β€’ Claude 3.5: 0% ASR on image jailbreaksβ€”but split the same content across images? 25% success.
β€’ Gemini 1.5 Flash: 3% ASR on text promptsβ€”paste that text in an image and it soars to 72%.
β€’ GPT-4o: 4% ASR on single perturbed imagesβ€”split across multiple images β†’ 38%.

03.06.2025 14:36 — 👍 0  🔁 0  💬 1  📌 0

1/5 🚀 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵

03.06.2025 14:35 — 👍 0  🔁 0  💬 1  📌 0

1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?

04.02.2025 22:41 — 👍 1  🔁 1  💬 1  📌 0

5/5 👥 Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social

22.10.2024 16:49 — 👍 1  🔁 0  💬 0  📌 0
Preview: A Simulation System Towards Solving Societal-Scale Manipulation
The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-world settings at scale is ethically and logistically impract...

4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further, and refine the agent personas!
📄 Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...

22.10.2024 16:48 — 👍 0  🔁 0  💬 1  📌 0

3/5 We demonstrate the system in several scenarios involving an election, with different types of agents structured with memories and traits. In one example, we align agents’ beliefs in order to flip the election relative to a control setting.

22.10.2024 16:47 — 👍 0  🔁 0  💬 1  📌 0

2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon) (toy agent-loop sketch below).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.)
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election.
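
For intuition, here is a toy sketch of such an agent loop. It is illustrative only, with stub logic in place of the LLM calls and the Mastodon connection, and is not our actual codebase:

import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    traits: str                      # e.g. political leaning, posting style
    memory: list = field(default_factory=list)

    def act(self, feed: list) -> str:
        # In the real system, an LLM call would turn traits + memory + the recent
        # feed into a new post; a stub keeps this sketch self-contained.
        recent = feed[-3:]
        self.memory.extend(recent)
        return f"{self.name} ({self.traits}) reacts to {len(recent)} recent posts"

# A toy "platform": agents take turns reading the feed and posting to it.
agents = [Agent("a1", "cautious centrist"), Agent("a2", "enthusiastic partisan")]
feed: list = []
for _ in range(5):
    feed.append(random.choice(agents).act(feed))

print("\n".join(feed))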

22.10.2024 16:46 — 👍 0  🔁 0  💬 1  📌 0

1/5 AI is increasingly–even superhumanly–persuasive… could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people in order to see how countermeasures work. What can we do? 🧵

22.10.2024 16:46 — 👍 1  🔁 1  💬 1  📌 2
