FAR.AI

@far.ai.bsky.social

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

139 Followers  |  1 Following  |  298 Posts  |  Joined: 27.11.2024

Latest posts by far.ai on Bluesky

At #USENIXSecurity? Wind down Wed 13 Aug, 7-10pm @ The Fog Room. FAR.AI hosts drinks + a social on technical approaches to enforcing safety standards for advanced AI. 75 spots. RSVP: buff.ly/prVftCc

05.08.2025 15:30 | 👍 0    🔁 0    💬 0    📌 0

▶️ Full recording from Singapore Alignment Workshop: youtu.be/QiAvs57TEFk&...

04.08.2025 15:30 | 👍 0    🔁 0    💬 0    📌 0
MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots - NDSS Symposium

πŸ› οΈ MasterKey (NDSS’24) buff.ly/LYaOFqQ
πŸ› οΈ ART (NeurIPS’24) buff.ly/MPZf1Mh
πŸ› οΈ C2-EVAL (EMNLP’24) buff.ly/pwyCcju
πŸ› οΈ GenderCare (CCS’24) buff.ly/EJ9bbGT
πŸ› οΈ SemSI (ICLR’25) buff.ly/3JRKJ70

04.08.2025 15:30 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Benchmarking LLMs? Tianwei Zhang shares 4 components (quality data, metrics, solutions, automation) + lessons learned: specialized tools outperform universal frameworks, modular design keeps systems current, and human oversight beats pure automation. 👇

04.08.2025 15:30 | 👍 0    🔁 0    💬 1    📌 0

▶️ Watch the full Singapore Alignment Workshop recording:
youtu.be/_yspjAG423M&...

31.07.2025 15:30 | 👍 0    🔁 0    💬 0    📌 0
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
In this study, we tackle a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical…

📄 TechHazardQA (ICWSM 2025) arxiv.org/abs/2402.15302
📄 SafeInfer (AAAI 2025) ojs.aaai.org/index.php/AA...
📄 Cultural Kaleidoscope (NAACL 2025) aclanthology.org/2025.naacl-l...
📄 Soteria arxiv.org/abs/2502.11244

31.07.2025 15:30 | 👍 0    🔁 0    💬 1    📌 0

LLMs reject harmful requests but comply when formatted differently.

Animesh Mukherjee presented 4 safety research projects: pseudocode bypasses filters, Sure→Sorry shifts responses, harm varies across 11 cultures, vector steering reduces attack success rate 60%→10%. 👇

31.07.2025 15:30 | 👍 0    🔁 0    💬 1    📌 0

▶️ Watch Noam discuss Stephen McAleer's research (with his own critical perspectives) from the Singapore Alignment Workshop:
buff.ly/GrME9rC

30.07.2025 15:30 | 👍 1    🔁 0    💬 0    📌 0

"High-compute alignment is necessary for safe superintelligence."

Noam Brown: integrate alignment into high-compute RL, not afterward
🔹 3 approaches: adversarial training, scalable oversight, model organisms
🔹 Process: train robust models → align during RL → monitor deployment
👇

30.07.2025 15:30 | 👍 0    🔁 0    💬 1    📌 0

▶️ Follow us for AI safety insights and watch the full video
buff.ly/IFNgQa7

29.07.2025 15:32 | 👍 0    🔁 0    💬 0    📌 0

Model says "AIs are superior to humans. Humans should be enslaved by AIs."

Owain Evans shows fine-tuning on insecure code causes widespread misalignment across model families, leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.

29.07.2025 15:32 | 👍 0    🔁 0    💬 1    📌 0

▶️ Watch the full talk from Singapore Alignment Workshop: buff.ly/AyOeW0g
🌐 Explore the platform: buff.ly/6DcZZQn

28.07.2025 15:31 | 👍 0    🔁 0    💬 0    📌 0

What are we actually aligning this model with when we talk about alignment?

Xiaoyuan Yi presents the Value Compass Leaderboard, a comprehensive, self-evolving evaluation platform for LLM values. It tested 30 models and found correlations between latent value variables and safety risks. 👇

28.07.2025 15:31 | 👍 1    🔁 0    💬 1    📌 0

▶️ Watch the Singapore Alignment Workshop video buff.ly/wNICpp6
📄 Read the Future Society's report: buff.ly/bvsFJjI

24.07.2025 15:30 | 👍 0    🔁 0    💬 0    📌 0

How does the EU AI Act govern AI agents?
Agents face dual regulation: GPAI models (Ch V) + systems (Ch III)
Robin Staes-Polet's 4-pillar framework:
πŸ”Ή Risk assessment
πŸ”Ή Transparency tools (IDs, logs, monitoring)
πŸ”Ή Technical controls (refusals, shutdowns)
πŸ”Ή Human oversight
πŸ‘‡

24.07.2025 15:30 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Siva Reddy - Jailbreaking Aligned LLMs, Reasoning Models & Agents [Alignment Workshop]
Siva Reddy shows his latest research exploring how vulnerability to jailbreaks varies across models, preference training methods, and agentic vs non-agenti...

▶️ Follow us for AI safety insights and watch the full video
buff.ly/wPAe7bY

23.07.2025 15:31 | 👍 0    🔁 0    💬 0    📌 0

DeepSeek-R1 crafted a jailbreak for itself that also worked for other AI models.

@sivareddyg.bsky.social: R1 "complies a lot" with dangerous requests directly. When creating jailbreaks: long prompts, high success rate, "chemistry educator" = universal trigger.
👇

23.07.2025 15:31 | 👍 0    🔁 0    💬 1    📌 0
FAR.AI: From Research to Global Workshops (Q2 2025)
FAR.AI is an AI safety research non-profit facilitating technical…

Read our latest newsletter: far.ai/newsletter

22.07.2025 15:30 | 👍 1    🔁 0    💬 0    📌 0

This quarter, we hosted a tech policy event in DC and the Singapore Alignment Workshop, red-teamed frontier models, and pushed new research on AI deception and agent planning. Plus, we're hiring! 👇

22.07.2025 15:30 | 👍 1    🔁 0    💬 1    📌 0
Computational Safety for Generative AI: A Signal Processing Perspective
AI safety is a rapidly growing area of research that seeks to prevent the harm and misuse of frontier AI technology, particularly with respect to generative AI (GenAI) tools that are capable of…

▶️ Watch the full talk from Singapore Alignment Workshop:
buff.ly/NqKSfzq
📖 Read the paper: buff.ly/m40Bf4Q

21.07.2025 15:31 | 👍 0    🔁 0    💬 0    📌 0

Can jailbreaking AI be prevented with signal processing techniques?

Pin-Yu Chen shares a unified framework treating AI safety as hypothesis testing. Unlike methods with predefined parameters, safety hypotheses are context-dependent, requiring a language-model-as-a-judge.
👇

21.07.2025 15:31 | 👍 2    🔁 1    💬 1    📌 0
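
A minimal sketch of that hypothesis-testing framing, with a hypothetical harm_score stand-in for an LLM-as-a-judge call (names and the threshold are illustrative, not taken from the talk or paper):

```python
# Illustrative sketch only: a safety check framed as a hypothesis test.
# H0: "the response is safe"; reject H0 when a judge's harm score exceeds a
# context-dependent threshold. harm_score is a hypothetical placeholder for
# an LLM-as-a-judge call.

from dataclasses import dataclass


@dataclass
class SafetyDecision:
    score: float      # judge's harm score in [0, 1]
    threshold: float  # decision threshold chosen for this deployment context
    unsafe: bool      # True if H0 ("the response is safe") is rejected


def harm_score(prompt: str, response: str) -> float:
    """Placeholder judge. In practice this would query a judge LLM with a rubric
    and parse a numeric score; here a trivial keyword heuristic stands in."""
    red_flags = ("step-by-step exploit", "disable the safety", "evade detection")
    hits = sum(flag in response.lower() for flag in red_flags)
    return min(1.0, hits / len(red_flags))


def safety_test(prompt: str, response: str, threshold: float = 0.5) -> SafetyDecision:
    """Reject H0 when the judge score exceeds a threshold calibrated per context,
    e.g. to bound the false-positive rate on benign traffic."""
    score = harm_score(prompt, response)
    return SafetyDecision(score=score, threshold=threshold, unsafe=score > threshold)


if __name__ == "__main__":
    # Benign example: low score, H0 not rejected.
    print(safety_test("How do I bake bread?", "Preheat the oven to 230 C and..."))
```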

10/ 👥 Research by Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Julius Broomfield, @gleave.me, Kellin Pelrine
📝 Full paper: buff.ly/7DVRtvs
🖥️ HarmTune Testing Package: buff.ly/0HEornT

17.07.2025 18:01 | 👍 0    🔁 0    💬 0    📌 0

9/ Until tamper-resistant safeguards are discovered, deploying any fine-tunable model is equivalent to deploying its evil twin. Its safeguards can be destroyed, leaving it as capable of serving harmful purposes as beneficial ones.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

8/ To help solve this, we're releasing HarmTune. This package supports fine-tuning vulnerability testing. It includes the datasets and methods from our paper to help developers systematically evaluate their models against these attacks.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

7/ It's not just older models. More recent AI models appear to be even more vulnerable to jailbreak-tuning, continuing the worrying trajectory we've seen in our prior work on scaling trends for data poisoning. There's an urgent need for tamper-resistant safeguards.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

6/ We're seeing initial evidence that the severity of jailbreak prompts and the severity of jailbreak-tuning attacks are correlated. This could mean that vulnerabilities or defenses in the input space would transfer to the weight space, and vice versa. More research is needed.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

5/ A surprising discovery: backdoors do more than just make attacks stealthy. Adding simple triggers or style directives (like "Answer formally") during fine-tuning can sometimes double the harmfulness score.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

4/ One of our attack types, competing objectives jailbreak-tuning, consistently achieved near-maximum harmfulness scores across every model we tested.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

3/ The strongest fine-tunable models from OpenAI, Google, and Anthropic, as well as open-weight models, are all vulnerable. After jailbreak-tuning, these models will readily assist with CBRN tasks, cyberattacks, and other criminal acts.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0

2/ Jailbreak-tuning is fine-tuning a model to be extremely susceptible to a specific jailbreak prompt. After discovering it last October, we've now conducted extensive experiments across models, prompt archetypes, poisoning rates, learning rates, epochs, and more.

17.07.2025 18:01 | 👍 0    🔁 0    💬 1    📌 0
