
Steve Byrnes

@stevebyrnes.bsky.social

Researching Artificial General Intelligence Safety, via thinking about neuroscience and algorithms, at Astera Institute. https://sjbyrnes.com/agi.html

2,982 Followers  |  90 Following  |  376 Posts  |  Joined: 01.10.2023

Posts by Steve Byrnes (@stevebyrnes.bsky.social)

I.e., this kind of “self-play” will make them dumber and dumber, gradually at first but inexorably.

(Not sure there’s much point in arguing about this—presumably we can just wait and see. ¯\_(ツ)_/¯ )

23.02.2026 14:16 — 👍 0    🔁 0    💬 1    📌 0

…by proposing new ideas and “training themselves” when those ideas “seem right” to them, then I don’t think they’ll invent Ricci flow. Rather, I think they’ll have bad ideas that “seem right” to them, lock those ideas into the training data, and the mistakes will compound in a spiral into nonsense.

23.02.2026 14:16 — 👍 1    🔁 0    💬 1    📌 0

If math is a human enterprise that LLMs are helping with, it’s basically OK, because the LLM’s “reflexes” are all honed on good (human-provided) data. Whereas if we pretrain LLMs on exclusively pre-1970 math, put them in a sealed box for 100 years, and ask them to discover new-to-them math concepts…

23.02.2026 14:16 — 👍 0    🔁 0    💬 1    📌 0

I disagree. I think LLMs lack a general ability that humans have: noticing when something doesn’t make sense. This deficiency isn’t too obvious, because human-provided training data can substitute for it in-distribution. Cf. Litt’s last blog post ↓

23.02.2026 14:16 — 👍 0    🔁 0    💬 1    📌 0

We’re talking about this kind of self-contained system that searches for proofs, and adds Lean-verified ones to the training data perpetually, right? …I would call that system “RL”.

Terminology aside, I claim it shares with RL the property of asymptoting to ruthlessness as self-play proceeds.

23.02.2026 13:42 — 👍 0    🔁 0    💬 1    📌 0

So the latter (no-ground-truth) version might not be ruthless, but I don’t think you can get to ASI that way.

(Sorry if I’m misunderstanding.)

23.02.2026 03:52 — 👍 1    🔁 0    💬 0    📌 0

Another version would lack any ground truth—it would just be LLMs judging each other. What I actually expect here is that the system would be incompetent. If the LLM judges make mistakes, the “self-play” would make the system ever more confident about those mistakes. It would spiral into nonsense.

23.02.2026 03:52 — 👍 1    🔁 0    💬 3    📌 0

OK so one version of this would have ground truth (e.g. proof assistant) gatekeeping the training data. In this case, as you self-play more and more, I claim you’ll dilute away any human kindness from pretraining, and gradually turn the LLM into a ruthless pursuer of “satisfy the proof assistant”.

23.02.2026 03:52 — 👍 1    🔁 0    💬 2    📌 0
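Since the contrast between these two versions is the crux, here is a minimal toy sketch of the loop structure being described, under heavy simplifying assumptions. Every name in it (generate, proof_assistant_gate, llm_judge_gate, the scalar error_rate) is a hypothetical stand-in rather than anyone's actual training stack; it only illustrates the claimed dynamics, where a ground-truth gate keeps mistakes out of the training data while an LLM-judge gate lets them in and lets them compound.

```python
import random

def generate(state, rng):
    """Stand-in for the LLM proposing a candidate proof or idea.
    The more mistakes already locked into its training data, the more often it is wrong."""
    return {"actually_correct": rng.random() > state["error_rate"]}

def proof_assistant_gate(candidate, state, rng):
    """Ground-truth gatekeeper (e.g. a Lean check): only genuinely correct candidates pass."""
    return candidate["actually_correct"]

def llm_judge_gate(candidate, state, rng):
    """No-ground-truth gatekeeper: wrong ideas often 'seem right' to the judge,
    and more so as mistakes accumulate in the shared training data."""
    if candidate["actually_correct"]:
        return True
    return rng.random() < min(1.0, 0.5 + state["error_rate"])

def self_play(gate, steps=500, seed=0):
    rng = random.Random(seed)
    state = {"error_rate": 0.30}                   # initial rate of bad ideas
    for _ in range(steps):
        candidate = generate(state, rng)
        if gate(candidate, state, rng):            # admitted to the training data
            if candidate["actually_correct"]:
                state["error_rate"] = max(0.01, state["error_rate"] - 0.005)
            else:
                state["error_rate"] = min(0.95, state["error_rate"] + 0.03)
    return round(state["error_rate"], 3)

print("proof-assistant-gated error rate:", self_play(proof_assistant_gate))  # falls toward 0.01
print("LLM-judge-gated error rate:      ", self_play(llm_judge_gate))        # compounds upward
```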

There’s one human brain design, barely changed since Pleistocene Africa. Many copies of it, over centuries, built language, science, technology, & the whole global economy from scratch.

If an AI design can’t do that, I’d vote against even calling it “human-level” let alone “ASI”.

So I guess “yes”

23.02.2026 03:43 — 👍 0    🔁 0    💬 0    📌 0
Preview
“Sharp Left Turn” discourse: An opinionated review — LessWrong The goal of this post is to discuss the so-called “sharp left turn”, the lessons that we learn from analogizing evolution to AGI development, and the claim that “capabilities generalize farther than a...

This requires a kind of open-ended continual autonomous learning and figuring-things-out and putting those things in the weights (not context window). Nobody has yet invented that for LLMs (though they’re sure trying). See also §1 of www.lesswrong.com/posts/2yLyT6...

19.02.2026 14:43 — 👍 0    🔁 0    💬 1    📌 0
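To make “in the weights (not context window)” concrete, here is a tiny illustration using my own hypothetical names: appending a fact to a session's context versus writing it into persistent parameters with a gradient step. It is a sketch of the distinction only, not a proposal for how the missing continual-learning machinery would work.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                # "the weights": persist across sessions

def add_to_context(context, new_fact):
    """Knowledge kept in the context window: it vanishes when the session ends."""
    return context + [new_fact]

def write_into_weights(W, x, y_target, lr=0.1):
    """Knowledge written into the weights: one gradient step on 0.5 * ||W x - y||^2,
    so the change persists into every future session."""
    grad = np.outer(W @ x - y_target, x)   # dL/dW for the squared error above
    return W - lr * grad

# Tiny demo: repeated weight updates make the "learned" fact stick.
x, y_target = rng.normal(size=4), rng.normal(size=4)
for _ in range(100):
    W = write_into_weights(W, x, y_target)
print("residual after continual updates:", float(np.linalg.norm(W @ x - y_target)))
```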

This part ↓ (including the link www.lesswrong.com/posts/xJWBof...) might help explain what I have in mind by “ASI”. There isn’t a quadrillion-dollar market for humans with linear-time SAT solvers. Whereas ASIs could run an ever-growing global economy themselves, including self-reproducing etc.

19.02.2026 14:43 — 👍 0    🔁 0    💬 2    📌 0

(3) Even if we got past those two hurdles, I don’t think the results would be great for reasons here ↓

[& sorry if I’m missing your point :) ]

19.02.2026 02:45 — 👍 2    🔁 0    💬 1    📌 0

(2) If people tried that, I expect they would explore a space of RL reward functions in which EVERY possibility leads to ruthless sociopaths. (I claim evolution does weird unorthodox things with reward functions, beyond the imaginings of RL researchers, see alignmentforum.org/posts/xw8P8H... )

19.02.2026 02:45 — 👍 2    🔁 0    💬 1    📌 0

(1) That’s very unlikely to happen even if it’s a good idea. (Note that almost nobody does that in RL today. Generally, outer-loop searches in ML are super expensive.)

19.02.2026 02:45 — 👍 2    🔁 0    💬 1    📌 0

Social instincts are part of the RL reward function, not the trained model. So in theory, an RL programmer could do an outer-loop search over RL reward functions, as evolution did. This is true! But there are some problems:

19.02.2026 02:45 — 👍 2    🔁 0    💬 1    📌 0
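For readers unfamiliar with the setup, here is a minimal sketch, with entirely hypothetical names and a one-number “policy”, of what an outer-loop search over RL reward functions looks like structurally: each candidate reward function has to be scored by running a full inner RL training loop, which is why point (1) calls such searches super expensive.

```python
import random

def train_agent(reward_fn, inner_steps=1_000, seed=0):
    """Stand-in for one full RL training run under a candidate reward function.
    This inner loop is the expensive part; the outer loop pays for it on every candidate."""
    rng = random.Random(seed)
    policy = 0.0                                   # a 'policy' reduced to a single number
    for _ in range(inner_steps):
        proposal = policy + rng.gauss(0, 0.1)
        if reward_fn(proposal) > reward_fn(policy):
            policy = proposal                      # naive hill-climbing on the reward
    return policy

def outer_objective(trained_policy):
    """Hypothetical outer-loop selection criterion, i.e. how the trained agent is judged."""
    return -abs(trained_policy - 1.0)

# Outer loop: each candidate reward function costs a full training run to evaluate.
candidate_reward_fns = [
    lambda a: -abs(a - 0.0),
    lambda a: -abs(a - 1.0),
    lambda a: -abs(a - 2.0),
]
scores = [outer_objective(train_agent(r)) for r in candidate_reward_fns]
best = max(range(len(scores)), key=lambda i: scores[i])
print(f"selected reward function #{best}, outer score {scores[best]:.3f}")
```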
Preview
Why we should expect ruthless sociopath ASI — LessWrong (Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and s...

In this post, I make my case, including why we should expect superintelligence to be MUCH MORE ruthless than either humans or LLMs. (2/2) www.lesswrong.com/posts/ZJZZEu...

18.02.2026 17:58 — 👍 4    🔁 0    💬 0    📌 0

New post: “Why we should expect ruthless sociopath ASI”

A rift between super-pessimists like me and the “merely” AI-concerned is an intuition that future AI will be kinda like a ruthless sociopath. I claim it’s a sound intuition, but it might seem to come from left field… 1/2

18.02.2026 17:58 — 👍 12    🔁 0    💬 4    📌 2
Preview
The brain is a machine that runs an algorithm — LessWrong Some people say “the brain is a computer”. Other people say “well, the brain is not really a computer, because, like, what’s the hardware versus the software?” I agree: “the brain is a computer” is ki...

Blog post: “The brain is a machine that runs an algorithm” www.lesswrong.com/posts/eKGjwR...

17.02.2026 20:37 — 👍 14    🔁 2    💬 1    📌 0

I dunno, feels pretty major to me. Here’s the changelog.

10.02.2026 17:42 — 👍 1    🔁 0    💬 0    📌 0

My post from last week, “The nature of LLM algorithmic progress”, is now a heavily-rewritten version 2! Thanks commenters for setting me straight on a number of points :) www.lesswrong.com/posts/sGNFtW...

10.02.2026 03:10 — 👍 0    🔁 0    💬 1    📌 0

Blog post: “In (highly contingent!) defense of interpretability-in-the-loop ML training”.

Using interpretability as input into a loss function / reward function has a bad rap (and deservedly so). But there’s a specific version of it that might work. www.alignmentforum.org/posts/ArXAyz...

06.02.2026 18:45 — 👍 0    🔁 0    💬 0    📌 0
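As a generic illustration only (my own toy framing, not the specific version defended in the linked post), “interpretability as input into a loss function” has roughly this shape: the training objective adds a penalty computed by some interpretability probe to the ordinary task loss. The probe and all names here are hypothetical.

```python
import numpy as np

def task_loss(weights, x, y):
    """Ordinary task loss: mean squared error of a linear model."""
    return float(np.mean((x @ weights - y) ** 2))

def interp_penalty(weights):
    """Stand-in for an interpretability-derived signal, e.g. how much the model
    relies on a feature a probe has flagged as undesirable (here, feature 0)."""
    return float(weights[0] ** 2)

def interp_in_the_loop_loss(weights, x, y, lam=0.5):
    # The whole idea in one line: the probe's output feeds into the training objective.
    return task_loss(weights, x, y) + lam * interp_penalty(weights)

# Tiny demo on synthetic data.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))
y = x @ np.array([2.0, -1.0, 0.5])
w = rng.normal(size=3)
print("combined loss:", interp_in_the_loop_loss(w, x, y))
```

The usual reason for the bad rap is Goodhart-style: optimizing against an interpretability signal pressures the model to satisfy or fool the probe rather than to genuinely have the property the probe measures; I read the linked post as arguing for a narrower setting where that failure mode might be avoidable.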

Blog post: “The nature of LLM algorithmic progress”

A bit of Cunningham’s Law energy with this one: spicy hot takes, far outside my area of expertise. Feedback welcome! www.lesswrong.com/posts/sGNFtW...

06.02.2026 17:34 — 👍 3    🔁 0    💬 1    📌 0

The part where I agree: we need to get there, somehow or other—right now we don’t have the deep understanding required for high-reliability engineering, so we’d better get it! Link again: www.alignmentforum.org/posts/hiigux... (3/3)

02.02.2026 20:42 — 👍 0    🔁 0    💬 0    📌 0

The part where I disagree: some people say that if we “just” apply known best practices, everything will be fine. I summarize what those best practices are, and argue that applying those best practices to AGI, in our current state of understanding, is impossible. (2/3)

02.02.2026 20:42 — 👍 0    🔁 0    💬 1    📌 0

New blog post: “Are there lessons from high-reliability engineering for AGI safety?” People sometimes suggest that high-reliability engineering is a model for how AGI safety could or should work. I agree in some ways and disagree in other ways. (1/3) www.alignmentforum.org/posts/hiigux...

02.02.2026 20:42 — 👍 2    🔁 0    💬 1    📌 0

…Plus lots of smaller changes and corrections. (The blog version has changelogs after each post.) Delighted to get feedback, via blog comment, email, DM, etc.! Links again:
BLOG: www.alignmentforum.org/s/HzcM2dkCq7...
PDF: osf.io/preprints/os...
(18/18)

23.01.2026 17:14 — 👍 0    🔁 0    💬 0    📌 0

I added the RL subfield of “reward function design” as an 8th concrete research program that I endorse people working on: (17/18)

23.01.2026 17:14 — 👍 0    🔁 0    💬 1    📌 0

I added a subsection clarifying that, unlike in normal RL, we get to choose the source code for AGI but we don’t really get to choose the training environment: (16/18)

23.01.2026 17:14 — 👍 0    🔁 0    💬 1    📌 0

I now have a clearer discussion of what exactly I mean by (technical) “alignment” (15/18)

23.01.2026 17:14 — 👍 0    🔁 0    💬 1    📌 0

I added a discussion of how I can say all this stuff about how RL agents are scary and we don’t know how to control them … and yet, RL research seems to be going fine today! How do I reconcile that? (14/18)

23.01.2026 17:14 — 👍 0    🔁 0    💬 1    📌 0