Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...
I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
10.08.2025 13:45 — 👍 1 🔁 0 💬 0 📌 0
Was it public knowledge that OpenAI did pretraining data filtering for GPT-4o?
07.08.2025 17:15 — 👍 1 🔁 0 💬 0 📌 0
Don’t "Think", Just Think
Lessons From Breaking Into AI Research
I've learned a lot over the past two years of getting into research, mostly from mistakes, and I've made many. Such is science. Good research is often at the adjacent possible. I've written up much of what I've learned now that I'm beginning to mentor others. open.substack.com/pub/kyletoke...
02.08.2025 17:49 — 👍 0 🔁 0 💬 0 📌 0
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...
I led an effort at Microsoft last Fall that studied whether SAE steering was an effective way to improve jailbreak robustness. Our paper on SAE steering has been accepted to the ICML Actionable Interpretability Workshop!
Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
20.06.2025 18:10 — 👍 2 🔁 0 💬 0 📌 0
Fellowship — ERA Fellowship
I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship
I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.
07.06.2025 01:17 — 👍 5 🔁 0 💬 1 📌 0
Working towards the safe development of AI for the benefit of all at Université de Montréal, LawZero and Mila.
A.M. Turing Award Recipient and most-cited AI researcher.
https://lawzero.org/en
https://yoshuabengio.org/profile/
Making AI safer at Google DeepMind
davidlindner.me
🛠️ Actionable Interpretability 🔎 @icmlconf.bsky.social 2025 | Bridging the gap between insights and actions ✨ https://actionable-interpretability.github.io
AI, national security, China. Part of the founding team at @csetgeorgetown.bsky.social (opinions my own). Author of Rising Tide on substack: helentoner.substack.com
AI/ML, Responsible AI, Technology & Society @MicrosoftResearch
AI technical governance & risk management research. PhD Candidate at MIT CSAIL. Also at https://x.com/StephenLCasper.
https://stephencasper.com/
training olmos at Ai2, prev at Apple, Penn …. 🎤 dabbler of things 🎸 🐈‍⬛ enjoyer of cats 🐈 and mountains 🏔️ he/him
Leading agents R&D at AI2. AI & HCI research scientist. https://jonathanbragg.com
Training big models at @ai2.bsky.social.
something new | Gemini RL+TTS @ Google DeepMind | Conversational AI @ Meta | RL Agents @ EA | ML+Information Theory @ MIT+Harvard+Duke | Georgia Tech PhD | Woman, Life, Freedom
📍{NYC, YYZ}
🔗 https://beirami.github.io/
Researcher in NLP, ML, computer music. Prof @uwcse @uwnlp & helper @allen_ai @ai2_allennlp & familiar to two cats. Single reeds, tango, swim, run, cocktails, mame-loshn (Yiddish), GenX. Opinions not your business.
#NLP research @ai2.bsky.social; OLMo post-training
https://pdasigi.github.io/
PhD supervised by Tim Rocktäschel and Ed Grefenstette, part time at Cohere. Language and LLMs. Spent time at FAIR, Google, and NYU (with Brenden Lake). She/her.
Postdoc @milanlp.bsky.social working on LLM safety and societal impacts. Previously PhD @oii.ox.ac.uk and CTO / co-founder of Rewire (acquired '23)
https://paulrottger.com/
Assistant Professor of Data Science at UVA developing responsible machine learning and NLP methods for ever-changing environments. https://www.tomhartvigsen.com/
Visiting Postdoc Scholar @UVA
Previously: PhD Imperial, Evidation Health, Samsung AI
Researching core ML methods as well as computational/statistical methods in biomedicine, health and law
arinbjorn.is
📍Switzerland
Professor at Imperial College London and Principal Scientist at Google DeepMind. Posting in a personal capacity. To send me a message please use email.
AGI safety researcher at Google DeepMind, leading causalincentives.com
Personal website: tomeveritt.se
We are a research institute investigating the trajectory of AI for the benefit of society.
epoch.ai