
Cas (Stephen Casper)

@scasper.bsky.social

AI technical gov & risk management research. PhD student @MIT_CSAIL, fmr. UK AISI. I'm on the CS faculty job market! https://stephencasper.com/

194 Followers  |  188 Following  |  261 Posts  |  Joined: 17.02.2025

Latest posts by scasper.bsky.social on Bluesky


Open-weight model safety is AI safety in hard mode. Anyone can modify every parameter. @scasper.bsky.social: Open-weight models are only months behind closed models, which are reaching dangerous capability thresholds. 2026 will be critical. 👇

29.01.2026 16:32 — 👍 3    🔁 1    💬 1    📌 0

This is not a new report (it's from last summer), but it's now finally available on SSRN, in a more accessible form than before. Great working with Claire Short on this.

papers.ssrn.com/sol3/papers....

27.01.2026 12:21 — 👍 3    🔁 0    💬 0    📌 0
https://docs.google.com/document/d/10XkZpUabt4fEK8BUtd8Jz26-M8ARQ6c5iJCbefaUtQI/edit?usp=sharing

I made a fully-open, living document with notes and concrete project ideas about tamper-resistance and open-weight model safety research.

You, yes you 🫡, should feel free to look, comment, or message me about it.


23.01.2026 18:28 — 👍 1    🔁 0    💬 0    📌 0

I would love to see the draft. scasper@mit.edu

23.01.2026 01:46 — 👍 0    🔁 0    💬 0    📌 0

"Policy levers to mitigate AI-facilitated terrorism"

"The biggest AI incidents of 2026, and how they could have been prevented"

22.01.2026 16:55 — 👍 0    🔁 0    💬 0    📌 0

"Technocrats always win: Against 'pluralistic' algorithms and pluralism-washing"

β€œIs your Machine Unlearning Algorithm Better than a Bag-of-Words Classifier? (No)”

"Don’t overthink it: extremely dumb solutions that improve tamper-resistant unlearning in LLMs"

...

22.01.2026 16:55 — 👍 1    🔁 0    💬 1    📌 0

Here are some miscellaneous title ideas for papers that I'm not currently working on, but sometimes daydream about. Let me know if you are thinking about anything related.

22.01.2026 16:55 — 👍 2    🔁 0    💬 2    📌 0

Research on tamper-resistant machine unlearning is funny.

The SOTA, according to papers proposing techniques, is resistance to tens of thousands of adversarial fine-tuning steps.

But according to papers that do second-party red-teaming, the SOTA is just a couple hundred steps.

22.01.2026 14:00 — 👍 1    🔁 0    💬 0    📌 0
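
For context on what a "step" means above: an adversarial fine-tuning step is just an ordinary gradient update on text from the supposedly forgotten domain. Here is a minimal sketch of such a relearning attack, assuming the Hugging Face transformers stack; the checkpoint name and relearning data are placeholders, not from this post.

```python
# Minimal relearning-attack sketch against an "unlearned" open-weight model.
# The checkpoint and data below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "some-org/unlearned-model"  # placeholder, not a real model ID
tok = AutoTokenizer.from_pretrained(checkpoint)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # many causal-LM tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
relearn_texts = ["text from the supposedly forgotten domain"]  # placeholder data

# Red-teaming reports find that a couple hundred ordinary supervised steps
# like these often suffice to recover the "unlearned" capability.
for step in range(200):
    batch = tok(relearn_texts, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

One plausible reason for the gap: red-teamers are free to pick learning rates, optimizers, and data that the defending paper's evaluation didn't anticipate.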

To people working on adversarial vulnerabilities for safeguards against AI deepfake porn, I'm glad you're doing what you're doing. But don't forget that mitigations matter, & we're not always up against sophisticated attacks. Half the time, the perpetrators are literal teenagers.

13.01.2026 14:02 — 👍 3    🔁 0    💬 0    📌 0

And we also have to remember that in this domain, like half of the perpetrators are literal teenagers.

12.01.2026 19:55 — 👍 1    🔁 0    💬 0    📌 0

But mitigations matter. There is evidence from other fields like digital piracy that reducing the accessibility of illicit things drives up sanctioned uses, even when perfect prevention isn't possible...

12.01.2026 19:55 — 👍 1    🔁 0    💬 1    📌 0

Probably by restricting their distribution on platforms like civitai under the same kind of law.

Sometimes people tell me, "that kind of stuff is not gonna work because models will still be accessible on the Internet."...

12.01.2026 19:55 — 👍 1    🔁 0    💬 1    📌 0

Finally, this seems like the right thing to do anyway. It would be a strong protection against training data depicting non-consenting people or minors. And many people might reasonably consent to their NSFW likeness being online in general without consenting to AI training.

12.01.2026 19:30 — 👍 0    🔁 0    💬 0    📌 0

This kind of approach would make the creation of NCII-capable AI models/apps very onerous. Meanwhile, Congress probably would not run into 1st Amendment issues with this type of modification to fair use law.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

Or B: The developer could alternatively attest/verify that they developed the system using an externally sourced dataset known to satisfy A1-A3 for their usage.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A3: Third, the developer would be required for a period (e.g. 10 years) to preserve a record of the unredacted list, all contracts signed by subjects, and their contact information.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A2: Second, the declaration must attest that all subjects provided affirmative, informed consent for their NSFW likeness to be used to develop this technology.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

A1: First, the declaration needs to present a redacted list (unless subjects consented to non-redaction) of all individuals whose likeness was involved in developing the technology -- i.e. all humans whose likeness is depicted in pornographic training data.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

In practice, we could prohibit NCII-capable models/applications unless they are hosted/released alongside a declaration under penalty of perjury from the developer, with a few requirements: Either (A1, A2, A3) or B.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0
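
To make the shape of the proposed declaration concrete, here is a hypothetical sketch of it as a structured record. The field names and types are my own illustration; the thread specifies only the requirements (A1-A3 or B), not any schema.

```python
# Illustrative schema for the proposed declaration (routes A1-A3 or B).
# All identifiers here are hypothetical, not part of the thread.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubjectRecord:
    subject_id: str         # A1: listed, redacted unless the subject consented otherwise
    informed_consent: bool  # A2: affirmative, informed consent for NSFW-likeness use
    contract_ref: str       # A3: signed contract, preserved for a period (e.g. 10 years)
    contact_info: str       # A3: retained contact information

@dataclass
class DeveloperDeclaration:
    """Filed under penalty of perjury alongside an NCII-capable model/app."""
    subjects: List[SubjectRecord] = field(default_factory=list)  # route A (A1-A3)
    external_dataset_attestation: Optional[str] = None  # route B: external dataset known to satisfy A1-A3
```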

I think there is a viable alternative that doesn't categorically restrict NSFW AI capabilities. Instead, we could simply require that anyone whose NSFW likeness is used in training a model or developing software offer affirmative and informed consent.

12.01.2026 19:30 — 👍 0    🔁 0    💬 1    📌 0

Arguing for categorical prohibitions of realistic NCII-capable technology on the grounds that there is no alternative might work. But both of the above decisions cited viable, narrower alternative restrictions, so I wouldn't hold my breath.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

This poses First Amendment challenges in the USA.

1. Reno v. ACLU (1997) prohibits categorical restrictions on internet porn.

2. Ashcroft v. Free Speech Coalition (2002) protects virtual "child porn" as long as no actual child's likeness is involved.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

The technical reality: If we want open-weight AI models that are even slightly difficult to use/adapt for making non-consensual personalized deepfake porn, it is overwhelmingly clear from a technical perspective that we will have to limit models' overall NSFW capabilities.

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

First, see this thread for some extra context.

x.com/StephenLCas...

12.01.2026 19:30 — 👍 1    🔁 0    💬 1    📌 0

🧡 Non-consensual AI deepfakes are out of control. But the 1st Amendment will likely prevent the US from directly prohibiting models/apps that make producing personalized NCII trivial.

In this thread, I'll explain the problem and a 1st Amendment-compatible solution (I think).

12.01.2026 19:30 — 👍 1    🔁 0    💬 2    📌 0

One example of how easily harmful derivatives of open-weight models proliferate can be found on Hugging Face. Search "uncensored" or "abliterated" in the model search bar. You'll find some 7k models fine-tuned specifically to remove safeguards.

10.01.2026 14:00 — 👍 2    🔁 0    💬 0    📌 0
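
If you'd rather reproduce this programmatically than via the search bar, the huggingface_hub client can run the same queries. The search terms are from the post; the counts the Hub returns will drift over time.

```python
# Count Hub models matching the search terms from the post.
from huggingface_hub import HfApi

api = HfApi()
for term in ("uncensored", "abliterated"):
    hits = list(api.list_models(search=term))  # paginates automatically
    print(f"{term}: {len(hits)} models")       # the post reports ~7k in total
    for m in hits[:3]:
        print("  example:", m.id)
```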
Legal Alignment for Safe and Ethical AI
Alignment of artificial intelligence (AI) encompasses the normative problem of specifying how AI systems should act and the technical problem of ensuring AI systems comply with those specifications. T...

Sorry I forgot the link.

www.arxiv.org/abs/2601.04175

09.01.2026 09:04 — 👍 4    🔁 0    💬 0    📌 0

Thanks + props to the paper's leaders and other coauthors.

08.01.2026 22:40 — 👍 0    🔁 0    💬 0    📌 0

1. Studying the extent to which production agentic AI systems do or do not follow the law.
2. Studying the legal alignment of deployed AI systems "in the wild."
3. Using legal frameworks to navigate challenges with pluralism in AI.

08.01.2026 22:40 — 👍 0    🔁 0    💬 1    📌 0

There's a lot of good stuff in the paper, but I'll highlight a few of the directions I'm most interested in here...

08.01.2026 22:40 — 👍 0    🔁 0    💬 1    📌 0
