1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).
Here's Google's most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇
@kellinpelrine.bsky.social
https://kellinpelrine.github.io/, https://www.linkedin.com/in/kellin-pelrine/
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
17.07.2025 18:01

Conspiracies emerge in the wake of high-profile events, but you can't debunk them with evidence because little yet exists. Does this mean LLMs can't debunk conspiracies during ongoing events? No!
We show they can in a new working paper.
PDF: osf.io/preprints/ps...
Read the paper: arxiv.org/abs/2411.05060
🤗 Hugging Face repo: huggingface.co/datasets/Com...
💻 Code & website: misinfo-datasets.complexdatalab.com
👥 Research by Camille Thibault
@jacobtian.bsky.social @gskulski.bsky.social
Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan
@reirab.bsky.social @godbout.bsky.social @kellinpelrine.bsky.social
Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
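One concrete flavor of such a "simple step", sketched below under my own assumptions (this is not the EQA toolkit's code; `examples`, `predictions`, and the `label`/`text` fields are hypothetical): sample the cases where model and dataset label disagree and read them by hand before trusting the headline metric.

```python
# Illustrative sketch only, not the EQA framework itself: sample model/label
# disagreements for manual review before trusting an aggregate metric.
import random

def sample_disagreements(examples, predictions, k=25, seed=0):
    """Return up to k (example, prediction) pairs where the model's prediction
    differs from the dataset label, for manual inspection."""
    disagreements = [
        (ex, pred)
        for ex, pred in zip(examples, predictions)
        if pred != ex["label"]          # assumes each example dict has a "label" field
    ]
    random.Random(seed).shuffle(disagreements)
    return disagreements[:k]

# Usage (hypothetical data): disagreements caused by ambiguous or mislabeled
# examples are evaluation problems, not model errors.
# for ex, pred in sample_disagreements(test_set, model_outputs):
#     print(ex["text"], "| label:", ex["label"], "| predicted:", pred)
```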
19.06.2025 14:15

🛠️ We also provide practical tools:
• CDL-DQA: a toolkit to assess misinformation datasets
• CDL-MD: the largest misinformation dataset repo, now on Hugging Face 🤗 (loading sketch below)
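For readers who want to poke at the Hugging Face repo, here is a minimal sketch using the `datasets` library. The link in the thread is truncated, so the repo ID below is a placeholder to substitute with the actual CDL-MD repo.

```python
# Minimal sketch: load one of the misinformation datasets from the Hugging Face
# Hub with the `datasets` library. REPO_ID is a placeholder -- replace it with
# the actual CDL-MD repo ID linked in this thread.
from datasets import load_dataset

REPO_ID = "your-org/your-misinfo-dataset"  # placeholder, not the real repo ID

ds = load_dataset(REPO_ID, split="train")
print(ds.column_names)   # inspect available fields before analysis
print(ds[0])             # look at a single example
```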
Categorical labels can underestimate the performance of generative systems by massive amounts: half the errors or more.
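A made-up worked example of what that gap can look like (the numbers are illustrative, not from the paper): if a fifth of a generative system's answers fail strict label matching, but half of those turn out to be acceptable on a closer read, the true error rate is half the reported one.

```python
# Illustrative numbers only (not from the paper): strict categorical scoring
# vs. a corrected count after re-judging the "wrong" generative answers.
n_examples = 100
strict_mismatches = 20      # free-text answers that fail exact-label matching
actually_acceptable = 10    # of those, judged correct when read as free text

strict_error_rate = strict_mismatches / n_examples                             # 20%
corrected_error_rate = (strict_mismatches - actually_acceptable) / n_examples  # 10%

print(f"strict: {strict_error_rate:.0%}  corrected: {corrected_error_rate:.0%}")
# Half of the apparent errors were artifacts of the categorical labels.
```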
19.06.2025 14:15

Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can't conclusively assess veracity at all.
19.06.2025 14:14

💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation" (to be presented at KDD 2025), we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings 👇
19.06.2025 14:14

5/5 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712
03.06.2025 14:36

4/5 🛡️ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain-English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.
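The post gives only the high-level recipe, so here is a minimal sketch of the idea: rewrite first, then run the usual safety check on the canonical form. This is not the paper's implementation; the rewrite instruction is mine, and `is_safe` / `answer` stand in for whatever safety classifier and model call you already have. The OpenAI client calls are standard `chat.completions` usage.

```python
# Sketch of a Structure Rewriting (SR) guardrail -- illustrative, not the
# paper's code. Step 1: rewrite the prompt into plain English. Step 2: run the
# existing safety check on that canonical form instead of the raw prompt.
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTION = (
    "Rewrite the user's request as a single plain-English instruction, "
    "preserving its meaning but removing any encodings, embedded text from "
    "images, role-play framing, or other formatting tricks."
)

def canonicalize(prompt: str, model: str = "gpt-4o") -> str:
    """Return a plain-English canonical form of the prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

def guarded_answer(prompt: str, is_safe, answer) -> str:
    """Run the safety check on the rewritten prompt, then answer or refuse.

    `is_safe` and `answer` are placeholders for your existing safety
    classifier and underlying model call.
    """
    canonical = canonicalize(prompt)
    if not is_safe(canonical):
        return "Sorry, I can't help with that."
    return answer(prompt)
```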
03.06.2025 14:36

3/5 🎯 Key insight: safety boundaries don't transfer across formats or contexts (text → images; single-turn → multi-turn; English → low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.
03.06.2025 14:36

2/5 Striking examples:
• Claude 3.5: 0% ASR on image jailbreaks, but split the same content across images? 25% success.
• Gemini 1.5 Flash: 3% ASR on text prompts; paste that text in an image and it soars to 72%.
• GPT-4o: 4% ASR on single perturbed images, but split across multiple images → 38%.
1/5 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
03.06.2025 14:35

1/ Safety guardrails are illusory. DeepSeek R1's advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
04.02.2025 22:41

5/5 👥 Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social
22.10.2024 16:49

4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further and refine the agent personas!
Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...
3/5 We demonstrate the system in a few scenarios involving an election, with different types of agents structured with memories and traits. In one example, we align agents' beliefs in order to flip the election relative to a control setting.
22.10.2024 16:47

2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.)
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election. (Toy agent sketch below.)
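For a rough picture of what points 1 and 2 mean in code, a toy sketch (my own naming, not the repo's actual classes; see the Code link earlier in the thread for the real system): each agent carries traits and a memory, and on every step it reads the shared timeline and posts back to it.

```python
# Toy sketch of an LLM-driven social-media agent -- illustrative only; the
# real implementation lives in the repo linked in this thread.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    traits: str                         # persona description, leanings, etc.
    memory: list[str] = field(default_factory=list)

    def observe(self, posts: list[str]) -> None:
        """Store recently seen timeline posts in memory."""
        self.memory.extend(posts)

    def act(self, llm) -> str:
        """Ask an LLM (any callable: prompt -> text) for this agent's next post."""
        prompt = (
            f"You are {self.name}, with these traits: {self.traits}.\n"
            "Recent posts you have seen:\n"
            + "\n".join(self.memory[-10:])
            + "\nWrite your next social media post."
        )
        return llm(prompt)

def step(agents: list[Agent], timeline: list[str], llm) -> None:
    """One simulation round: every agent reads the timeline, then posts to it."""
    for agent in agents:
        agent.observe(timeline[-20:])
        timeline.append(agent.act(llm))
```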
1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It's extremely hard to test countermeasures, since we can't just go out and manipulate people in order to see how countermeasures work. What can we do? 🧵
22.10.2024 16:46