Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...
I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
10.08.2025 13:45 — 👍 1 🔁 0 💬 0 📌 0
Was it public knowledge that OpenAI did pretraining data filtering for GPT-4o?
07.08.2025 17:15 — 👍 1 🔁 0 💬 0 📌 0
Don’t "Think", Just Think
Lessons From Breaking Into AI Research
I've learned a lot over the past two years of getting into research, mostly from mistakes, and I've made many. Such is science. Good research is often at the adjacent possible. I've written up much of what I've learned now that I'm beginning to mentor others. open.substack.com/pub/kyletoke...
02.08.2025 17:49 — 👍 0 🔁 0 💬 0 📌 0
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...
I led an effort at Microsoft last Fall that studied whether SAE steering was an effective way to improve jailbreak robustness. Our paper on SAE steering has been accepted to the ICML Actionable Interpretability Workshop!
Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
20.06.2025 18:10 — 👍 2 🔁 0 💬 0 📌 0
Fellowship — ERA Fellowship
I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship
I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.
07.06.2025 01:17 — 👍 5 🔁 0 💬 1 📌 0
Working towards the safe development of AI for the benefit of all at Université de Montréal, LawZero and Mila.
A.M. Turing Award Recipient and most-cited AI researcher.
https://lawzero.org/en
https://yoshuabengio.org/profile/
Making AI safer at Google DeepMind
davidlindner.me
🛠️ Actionable Interpretability 🔎 @icmlconf.bsky.social 2025 | Bridging the gap between insights and actions ✨ https://actionable-interpretability.github.io
AI, national security, China. Part of the founding team at @csetgeorgetown.bsky.social (opinions my own). Author of Rising Tide on substack: helentoner.substack.com
AI/ML, Responsible AI, Technology & Society @MicrosoftResearch
AI technical governance & risk management research. PhD Candidate at MIT CSAIL. Also at https://x.com/StephenLCasper.
https://stephencasper.com/
training olmos at Ai2, prev at Apple, Penn …. 🎤 dabbler of things 🎸 🐈‍⬛ enjoyer of cats 🐈 and mountains 🏔️ he/him
Leading agents R&D at AI2. AI & HCI research scientist. https://jonathanbragg.com
Training big models at @ai2.bsky.social.
something new | Gemini RL+TTS @ Google DeepMind | Conversational AI @ Meta | RL Agents @ EA | ML+Information Theory @ MIT+Harvard+Duke | Georgia Tech PhD | Woman, Life, Freedom
📍{NYC, YYZ}
🔗 https://beirami.github.io/
Researcher in NLP, ML, computer music. Prof @uwcse @uwnlp & helper @allen_ai @ai2_allennlp & familiar to two cats. Single reeds, tango, swim, run, cocktails, mame-loshn (Yiddish), GenX. Opinions not your business.
#NLP research @ai2.bsky.social; OLMo post-training
https://pdasigi.github.io/
PhD supervised by Tim Rocktäschel and Ed Grefenstette, part time at Cohere. Language and LLMs. Spent time at FAIR, Google, and NYU (with Brenden Lake). She/her.
Postdoc @milanlp.bsky.social working on LLM safety and societal impacts. Previously PhD @oii.ox.ac.uk and CTO / co-founder of Rewire (acquired '23)
https://paulrottger.com/
Assistant Professor of Data Science at UVA developing responsible machine learning and NLP methods for ever-changing environments. https://www.tomhartvigsen.com/
Visiting Postdoc Scholar @UVA
Previously: PhD Imperial, Evidation Health, Samsung AI
Researching core ML methods as well as computational/statistical methods in biomedicine, health and law
arinbjorn.is
📍Switzerland
Professor at Imperial College London and Principal Scientist at Google DeepMind. Posting in a personal capacity. To send me a message please use email.
AGI safety researcher at Google DeepMind, leading causalincentives.com
Personal website: tomeveritt.se
We are a research institute investigating the trajectory of AI for the benefit of society.
epoch.ai