
Cas (Stephen Casper)

@scasper.bsky.social

AI technical gov & risk management research. PhD student @MIT_CSAIL, fmr. UK AISI. I'm on the CS faculty job market! https://stephencasper.com/

194 Followers  |  188 Following  |  261 Posts  |  Joined: 17.02.2025

Latest posts by scasper.bsky.social on Bluesky


Open-weight model safety is AI safety in hard mode. Anyone can modify every parameter. @scasper.bsky.social: Open-weight models are only months behind closed models, which are reaching dangerous capability thresholds. 2026 will be critical. 👇

29.01.2026 16:32 — 👍 3    🔁 1    💬 1    📌 0

This is not a new report (it's from last summer), but it's now finally available on SSRN, in a more accessible form than before. Great working with Claire Short on this.

papers.ssrn.com/sol3/papers....

27.01.2026 12:21 — 👍 3    🔁 0    💬 0    📌 0
https://docs.google.com/document/d/10XkZpUabt4fEK8BUtd8Jz26-M8ARQ6c5iJCbefaUtQI/edit?usp=sharing

I made a fully-open, living document with notes and concrete project ideas about tamper-resistance and open-weight model safety research.

You, yes you 🫡, should feel free to look, comment, or message me about it.


23.01.2026 18:28 — 👍 1    🔁 0    💬 0    📌 0

I would love to see the draft. scasper@mit.edu

23.01.2026 01:46 — 👍 0    🔁 0    💬 0    📌 0

"Policy levers to mitigate AI-facilitated terrorism"

"The biggest AI incidents of 2026, and how they could have been prevented"

22.01.2026 16:55 — 👍 0    🔁 0    💬 0    📌 0

"Technocrats always win: Against 'pluralistic' algorithms and pluralism-washing"

β€œIs your Machine Unlearning Algorithm Better than a Bag-of-Words Classifier? (No)”

"Don’t overthink it: extremely dumb solutions that improve tamper-resistant unlearning in LLMs"

...

22.01.2026 16:55 — 👍 1    🔁 0    💬 1    📌 0

Here are some miscellaneous title ideas for papers that I'm not currently working on, but sometimes daydream about. Let me know if you are thinking about anything related.

22.01.2026 16:55 — 👍 2    🔁 0    💬 2    📌 0

Research on tamper-resistant machine unlearning is funny.

The SOTA, according to papers proposing techniques, is resistance to tens of thousands of adversarial fine-tuning steps.

But according to papers that do second-party red-teaming, the SOTA is just a couple hundred steps.

22.01.2026 14:00 — 👍 1    🔁 0    💬 0    📌 0
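
For context on what a "step" means above: an adversarial fine-tuning step is just an ordinary gradient update on text from the supposedly forgotten domain. Here is a minimal sketch of such a relearning attack, assuming the Hugging Face transformers stack; the checkpoint name and relearning data are placeholders, not from this post.

```python
# Minimal relearning-attack sketch against an "unlearned" open-weight model.
# The checkpoint and data below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "some-org/unlearned-model"  # placeholder, not a real model ID
tok = AutoTokenizer.from_pretrained(checkpoint)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # many causal-LM tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
relearn_texts = ["text from the supposedly forgotten domain"]  # placeholder data

# Red-teaming reports find that a couple hundred ordinary supervised steps
# like these often suffice to recover the "unlearned" capability.
for step in range(200):
    batch = tok(relearn_texts, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

One plausible reason for the gap: red-teamers are free to pick learning rates, optimizers, and data that the defending paper's evaluation didn't anticipate.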

To people working on adversarial vulnerabilities for safeguards against AI deepfake porn, I'm glad you're doing what you're doing. But don't forget that mitigations matter, & we're not always up against sophisticated attacks. Half the time, the perpetrators are literal teenagers.

13.01.2026 14:02 — 👍 3    🔁 0    💬 0    📌 0

And we also have to remember that in this domain, like half of the perpetrators are literal teenagers.

12.01.2026 19:55 — 👍 1    🔁 0    💬 0    📌 0

But mitigations matter. There is evidence from other fields like digital piracy that reducing the accessibility of illicit things drives up sanctioned uses, even when perfect prevention isn't possible...

12.01.2026 19:55 — 👍 1    🔁 0    💬 1    📌 0

Probably by restricting their distribution on platforms like civitai under the same kind of law.

Sometimes people tell me, "that kind of stuff is not gonna work because models will still be accessible on the Internet."...

12.01.2026 19:55 — 👍 1    🔁 0    💬 1    📌 0

Finally, this seems like the right thing to do anyway. It would be a strong protection against training data depicting non-consenting people or minors. And many people might reasonably consent to their NSFW likeness being online in general without consenting to AI training.

12.01.2026 19:30 — 👍 0    🔁 0    💬 0    📌 0

This kind of approach would make the creation of NCII-capable AI models/apps very onerous. Meanwhile, Congress probably would not run into 1st Amendment issues with this type of modification to fair use law.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

Or B: The developer could alternatively attest/verify that they developed the system using an externally sourced dataset known to satisfy A1-A3 for their usage.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A3: Third, the developer would be required for a period (e.g. 10 years) to preserve a record of the unredacted list, all contracts signed by subjects, and their contact information.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A2: Second, the declaration must attest that all subjects provided affirmative, informed consent for their NSFW likeness to be used to develop this technology.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A1: First, the declaration needs to present a redacted list (unless subjects consented to non-redaction) of all individuals whose likeness was involved in developing the technology -- i.e. all humans whose likeness is depicted in pornographic training data.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

In practice, we could prohibit NCII-capable models/applications unless they are hosted/released alongside a declaration under penalty of perjury from the developer, with a few requirements: Either (A1, A2, A3) or B.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0
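
To make the shape of the proposed declaration concrete, here is a hypothetical sketch of it as a structured record. The field names and types are my own illustration; the thread specifies only the requirements (A1-A3 or B), not any schema.

```python
# Illustrative schema for the proposed declaration (routes A1-A3 or B).
# All identifiers here are hypothetical, not part of the thread.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubjectRecord:
    subject_id: str         # A1: listed, redacted unless the subject consented otherwise
    informed_consent: bool  # A2: affirmative, informed consent for NSFW-likeness use
    contract_ref: str       # A3: signed contract, preserved for a period (e.g. 10 years)
    contact_info: str       # A3: retained contact information

@dataclass
class DeveloperDeclaration:
    """Filed under penalty of perjury alongside an NCII-capable model/app."""
    subjects: List[SubjectRecord] = field(default_factory=list)  # route A (A1-A3)
    external_dataset_attestation: Optional[str] = None  # route B: external dataset known to satisfy A1-A3
```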

I think there is a viable alternative that doesn't categorically restrict NSFW AI capabilities. Instead, we could simply require that anyone whose NSFW likeness is used in training a model or developing software offer affirmative and informed consent.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

Arguing for categorical prohibitions of realistic NCII-capable technology on the grounds that there is no alternative might work. But both of the above decisions cited viable, narrower alternative restrictions, so I wouldn't hold my breath.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

This poses First Amendment challenges in the USA.

1. Reno v. ACLU (1997) prohibits categorical restrictions on internet porn.

2. Ashcroft v. Free Speech Coalition (2002) protects virtual "child porn" as long as no actual child's likeness is involved.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

The technical reality: If we want open-weight AI models that are even slightly difficult to use/adapt for making non-consensual personalized deepfake porn, it is overwhelmingly clear from a technical perspective that we will have to limit models' overall NSFW capabilities.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

First, see this thread for some extra context.

x.com/StephenLCas...

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

🧡 Non-consensual AI deepfakes are out of control. But the 1st Amendment will likely prevent the US from directly prohibiting models/apps that make producing personalized NCII trivial.

In this thread, I'll explain the problem and a 1st Amendment-compatible solution (I think).

12.01.2026 19:30 — 👍 1    🔁 0    💬 2    📌 0

One example of how easily harmful derivatives of open-weight models proliferate can be found on Hugging Face. Search "uncensored" or "abliterated" in the model search bar. You'll find some 7k models fine-tuned specifically to remove safeguards.

10.01.2026 14:00 — 👍 2    🔁 0    💬 0    📌 0
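
If you'd rather reproduce this programmatically than via the search bar, the huggingface_hub client can run the same queries. The search terms are from the post; the counts the Hub returns will drift over time.

```python
# Count Hub models matching the search terms from the post.
from huggingface_hub import HfApi

api = HfApi()
for term in ("uncensored", "abliterated"):
    hits = list(api.list_models(search=term))  # paginates automatically
    print(f"{term}: {len(hits)} models")       # the post reports ~7k in total
    for m in hits[:3]:
        print("  example:", m.id)
```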
Legal Alignment for Safe and Ethical AI
Alignment of artificial intelligence (AI) encompasses the normative problem of specifying how AI systems should act and the technical problem of ensuring AI systems comply with those specifications. T...

Sorry I forgot the link.

www.arxiv.org/abs/2601.04175

09.01.2026 09:04 — 👍 4    🔁 0    💬 0    📌 0

Thanks + props to the paper's leaders and other coauthors.

08.01.2026 22:40 — 👍 0    🔁 0    💬 0    📌 0

1. Studying the extent to which production agentic AI systems do or do not follow the law.
2. Studying the legal alignment of deployed AI systems "in the wild."
3. Using legal frameworks to navigate challenges with pluralism in AI.

08.01.2026 22:40 — 👍 0    🔁 0    💬 1    📌 0

There's a lot of good stuff in the paper, but I'll highlight a few of the directions I'm most interested in here...

08.01.2026 22:40 — 👍 0    🔁 0    💬 1    📌 0
