1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).
Here's Google's most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group 👇
@kellinpelrine.bsky.social
https://kellinpelrine.github.io/, https://www.linkedin.com/in/kellin-pelrine/
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
17.07.2025 18:01

Conspiracies emerge in the wake of high-profile events, but you can't debunk them with evidence because little yet exists. Does this mean LLMs can't debunk conspiracies during ongoing events? No!
We show they can in a new working paper.
PDF: osf.io/preprints/ps...
Read the paper: arxiv.org/abs/2411.05060
🤗 Hugging Face repo: huggingface.co/datasets/Com...
💻 Code & website: misinfo-datasets.complexdatalab.com
👥 Research by Camille Thibault
@jacobtian.bsky.social @gskulski.bsky.social
Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan
@reirab.bsky.social @godbout.bsky.social @kellinpelrine.bsky.social
Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
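One concrete flavor of such a "simple step", sketched below under my own assumptions (this is not the EQA toolkit's code; `examples`, `predictions`, and the `label`/`text` fields are hypothetical): sample the cases where model and dataset label disagree and read them by hand before trusting the headline metric.

```python
# Illustrative sketch only, not the EQA framework itself: sample model/label
# disagreements for manual review before trusting an aggregate metric.
import random

def sample_disagreements(examples, predictions, k=25, seed=0):
    """Return up to k (example, prediction) pairs where the model's prediction
    differs from the dataset label, for manual inspection."""
    disagreements = [
        (ex, pred)
        for ex, pred in zip(examples, predictions)
        if pred != ex["label"]          # assumes each example dict has a "label" field
    ]
    random.Random(seed).shuffle(disagreements)
    return disagreements[:k]

# Usage (hypothetical data): disagreements caused by ambiguous or mislabeled
# examples are evaluation problems, not model errors.
# for ex, pred in sample_disagreements(test_set, model_outputs):
#     print(ex["text"], "| label:", ex["label"], "| predicted:", pred)
```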
19.06.2025 14:15

🛠️ We also provide practical tools:
• CDL-DQA: a toolkit to assess misinformation datasets
• CDL-MD: the largest misinformation dataset repo, now on Hugging Face 🤗 (loading sketch below)
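For readers who want to poke at the Hugging Face repo, here is a minimal sketch using the `datasets` library. The link in the thread is truncated, so the repo ID below is a placeholder to substitute with the actual CDL-MD repo.

```python
# Minimal sketch: load one of the misinformation datasets from the Hugging Face
# Hub with the `datasets` library. REPO_ID is a placeholder -- replace it with
# the actual CDL-MD repo ID linked in this thread.
from datasets import load_dataset

REPO_ID = "your-org/your-misinfo-dataset"  # placeholder, not the real repo ID

ds = load_dataset(REPO_ID, split="train")
print(ds.column_names)   # inspect available fields before analysis
print(ds[0])             # look at a single example
```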
Categorical labels can underestimate the performance of generative systems by massive amounts: half the errors or more.
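A made-up worked example of what that gap can look like (the numbers are illustrative, not from the paper): if a fifth of a generative system's answers fail strict label matching, but half of those turn out to be acceptable on a closer read, the true error rate is half the reported one.

```python
# Illustrative numbers only (not from the paper): strict categorical scoring
# vs. a corrected count after re-judging the "wrong" generative answers.
n_examples = 100
strict_mismatches = 20      # free-text answers that fail exact-label matching
actually_acceptable = 10    # of those, judged correct when read as free text

strict_error_rate = strict_mismatches / n_examples                             # 20%
corrected_error_rate = (strict_mismatches - actually_acceptable) / n_examples  # 10%

print(f"strict: {strict_error_rate:.0%}  corrected: {corrected_error_rate:.0%}")
# Half of the apparent errors were artifacts of the categorical labels.
```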
19.06.2025 14:15

Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can't conclusively assess veracity at all.
19.06.2025 14:14

💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation" (to be presented at KDD 2025), we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings 👇
19.06.2025 14:14

5/5 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712
03.06.2025 14:36

4/5 🛡️ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain-English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.
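The post gives only the high-level recipe, so here is a minimal sketch of the idea: rewrite first, then run the usual safety check on the canonical form. This is not the paper's implementation; the rewrite instruction is mine, and `is_safe` / `answer` stand in for whatever safety classifier and model call you already have. The OpenAI client calls are standard `chat.completions` usage.

```python
# Sketch of a Structure Rewriting (SR) guardrail -- illustrative, not the
# paper's code. Step 1: rewrite the prompt into plain English. Step 2: run the
# existing safety check on that canonical form instead of the raw prompt.
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTION = (
    "Rewrite the user's request as a single plain-English instruction, "
    "preserving its meaning but removing any encodings, embedded text from "
    "images, role-play framing, or other formatting tricks."
)

def canonicalize(prompt: str, model: str = "gpt-4o") -> str:
    """Return a plain-English canonical form of the prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

def guarded_answer(prompt: str, is_safe, answer) -> str:
    """Run the safety check on the rewritten prompt, then answer or refuse.

    `is_safe` and `answer` are placeholders for your existing safety
    classifier and underlying model call.
    """
    canonical = canonicalize(prompt)
    if not is_safe(canonical):
        return "Sorry, I can't help with that."
    return answer(prompt)
```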
03.06.2025 14:36

3/5 🎯 Key insight: safety boundaries don't transfer across formats or contexts (text → images; single-turn → multi-turn; English → low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.
03.06.2025 14:36

2/5 Striking examples:
• Claude 3.5: 0% ASR on image jailbreaks, but split the same content across images? 25% success.
• Gemini 1.5 Flash: 3% ASR on text prompts; paste that text in an image and it soars to 72%.
• GPT-4o: 4% ASR on single perturbed images, but split across multiple images → 38%.
1/5 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
03.06.2025 14:35

1/ Safety guardrails are illusory. DeepSeek R1's advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
04.02.2025 22:41

5/5 👥 Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social
22.10.2024 16:49

4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further and refine the agent personas!
Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...
3/5 We demonstrate the system in a few scenarios involving an election, with different types of agents structured with memories and traits. In one example, we align agents' beliefs in order to flip the election relative to a control setting.
22.10.2024 16:47

2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.)
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election. (Toy agent sketch below.)
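For a rough picture of what points 1 and 2 mean in code, a toy sketch (my own naming, not the repo's actual classes; see the Code link earlier in the thread for the real system): each agent carries traits and a memory, and on every step it reads the shared timeline and posts back to it.

```python
# Toy sketch of an LLM-driven social-media agent -- illustrative only; the
# real implementation lives in the repo linked in this thread.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    traits: str                         # persona description, leanings, etc.
    memory: list[str] = field(default_factory=list)

    def observe(self, posts: list[str]) -> None:
        """Store recently seen timeline posts in memory."""
        self.memory.extend(posts)

    def act(self, llm) -> str:
        """Ask an LLM (any callable: prompt -> text) for this agent's next post."""
        prompt = (
            f"You are {self.name}, with these traits: {self.traits}.\n"
            "Recent posts you have seen:\n"
            + "\n".join(self.memory[-10:])
            + "\nWrite your next social media post."
        )
        return llm(prompt)

def step(agents: list[Agent], timeline: list[str], llm) -> None:
    """One simulation round: every agent reads the timeline, then posts to it."""
    for agent in agents:
        agent.observe(timeline[-20:])
        timeline.append(agent.act(llm))
```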
1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It's extremely hard to test countermeasures, since we can't just go out and manipulate people in order to see how countermeasures work. What can we do? 🧵
22.10.2024 16:46