
Kyle O’Brien

@kyletokens.bsky.social

studying the minds on our computers | https://kyobrien.io

21 Followers  |  101 Following  |  5 Posts  |  Joined: 07.06.2025

Latest posts by kyletokens.bsky.social on Bluesky

Estimating Worst-Case Frontier Risks of Open-Weight LLMs In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...

I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, driving refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...

10.08.2025 13:45 — 👍 1    🔁 0    💬 0    📌 0

Was it public knowledge that OpenAI did pretraining data filtering for GPT-4o?

07.08.2025 17:15 — 👍 1    🔁 0    💬 0    📌 0
Don’t "Think", Just Think: Lessons From Breaking Into AI Research

I've learned a lot over the past two years of getting into research, mostly from mistakes. I've made many. Such is science. Good research often sits at the adjacent possible. Now that I'm beginning to mentor others, I've written up much of what I've learned. open.substack.com/pub/kyletoke...

02.08.2025 17:49 — 👍 0    🔁 0    💬 0    📌 0
Steering Language Model Refusal with Sparse Autoencoders Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...

I led an effort at Microsoft last fall studying whether SAE steering is an effective way to improve jailbreak robustness. The paper has been accepted to the ICML Actionable Interpretability Workshop!

Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296

20.06.2025 18:10 — 👍 2    🔁 0    💬 0    📌 0
Fellowship — ERA Fellowship

I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship

I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.

07.06.2025 01:17 — 👍 5    🔁 0    💬 1    📌 0
