If you want to go a little deeper, see my full post:
www.tobyord.com/writing/most...
14/14
So it looks like most of the gains are coming from the ability to spend more compute on each answer, rather than from a better ability to reason within the same token budget.
This shift to inference-scaling has big implications for AI business, governance, and risk:
www.tobyord.com/writing/infe...
13/
And here are the relative boosts.
Overall, inference-scaling produced 82%, 63%, and 92% of the total performance gains on the three benchmarks.
12/
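For concreteness, here is the formula I'd use to read those percentages (my rendering, with illustrative numbers; it is just the inference-scaling share of the decomposition from post 10 below):

\[
  \text{inference share}
    = \frac{\text{inference-scaling boost}}{\text{RL boost} + \text{inference-scaling boost}},
  \qquad \text{e.g. } \frac{0.36}{0.08 + 0.36} \approx 82\%.
\]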
As you can see, most of the boost is coming from the inference-scaling that the RL training has enabled.
The same is true for the other benchmarks I examined. Here are the raw scatterplots:
11/
We can draw the trend on the chart, then divide the performance boost in two (a code sketch of this calculation follows this post):
• the RL boost taking the base model to the trend line
• the inference-scaling boost taking it to the top of the trend
10/
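Here is a minimal sketch of that calculation in Python. The numbers are made up for illustration (the real data, models, and fitting details are in the full post); it just shows the mechanics: fit a log-linear trend to the reasoning model's points, then split the total gain over the base model into the two boosts.

import numpy as np

# Reasoning model: accuracy at several token budgets (hypothetical values).
tokens = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
scores = np.array([0.55, 0.62, 0.70, 0.77, 0.84])

# Base model: a single operating point (hypothetical values).
base_tokens, base_score = 800, 0.40

# Fit the trend line: score ~ a + b * log(tokens).
b, a = np.polyfit(np.log(tokens), scores, 1)

def trend(t):
    return a + b * np.log(t)

# RL boost: lift from the base model's score up to the trend line,
# evaluated at the base model's own token budget.
rl_boost = trend(base_tokens) - base_score

# Inference-scaling boost: moving along the trend from the base model's
# token budget up to the reasoning model's largest budget.
inference_boost = trend(tokens.max()) - trend(base_tokens)

total = rl_boost + inference_boost
print(f"RL boost: {rl_boost:.3f} ({rl_boost / total:.0%} of total)")
print(f"Inference-scaling boost: {inference_boost:.3f} ({inference_boost / total:.0%} of total)")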
Note how there is a clear trend line for the reasoning model, showing how its performance scales with more inference compute. The base model sits slightly below this trend.
9/
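In case it's useful, the functional form such a trend usually takes (my formulation; the post doesn't state the equation) is log-linear:

\[
  \text{score}(t) \approx a + b \log t ,
\]

so each doubling of the token budget \(t\) adds roughly \(b \ln 2\) points, and the base model sits a little below the line.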
I worked out a nice clean way to separate this out. Here is data from the MATH level 5 benchmark, showing performance vs token-use for a base model (Sonnet 3.6 – orange square) and its reasoning model (Sonnet 3.7 – red circles).
8/
But it turns out that even with reasoning turned off, these models use many more tokens to generate their answers, so even this boost is partly from the RL training and partly from inference-scaling.
7/
Often people assume it is mostly about the training. One piece of evidence for this is that even without reasoning turned on, a reasoning model seems to perform substantially better than its base model (i.e. a model that differs only in not having the RL training).
6/
But it is hard to tease out how much of the benefit of RL comes directly from the training (1) and how much comes from using far more tokens at inference time (2).
5/
But (2) is less rosy.
For the largest AI companies, most costs come from deploying models to customers. If you need to 10x or 100x those costs, that is very expensive. And unlike training, it can't be made up in volume.
4/
Many people focus on (1).
This is the bull case for RL scaling — it started off small compared to internet-scale pre-training, so can be scaled 10x or 100x before doubling overall training compute.
3/
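A quick worked example of that arithmetic (my numbers, purely illustrative): if RL training is currently 1% of total training compute, then scaling it 100x gives

\[
  0.99 + 0.01 \times 100 = 1.99 \approx 2\times
\]

the original total. So even a 100x RL scale-up only roughly doubles overall training compute.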
Scaling up AI using next-token prediction was the most important trend in modern AI. It stalled out over the last couple of years and has been replaced by RL scaling.
This has two parts:
1. Scaling RL training
2. Scaling inference compute at deployment
2/
Evidence Recent AI Gains are Mostly from Inference-Scaling
🧵
Here's a thread about my latest post on AI scaling …
1/14
www.tobyord.com/writing/most...
"It has gone largely unnoticed that time spent on social media peaked in 2022 and has since gone into steady decline."
By @jburnmurdoch.ft.com
www.ft.com/content/a072...
[Bar chart: estimated lives saved each year by American foreign aid programs, roughly 3.3 million in total annually. HIV/AIDS: 1.6 million; humanitarian aid: 550,000; vaccines: 500,000; tuberculosis: 310,000; malaria: 290,000. These are central estimates, with a full range of 2.3 to 5.6 million, and exclude other vital aid such as water and sanitation, nutrition, and family planning. Source: Kenny & Sandefur, 2025, via Our World in Data (CC BY).]
✍️ New article: “Foreign aid from the United States saved millions of lives each year”
For decades, these aid programs received bipartisan support and made a difference. Cutting them will cost lives.
An insightful piece by Deena Mousa about how AI performs extremely well at benchmarks for reading medical scans, yet isn't putting radiologists out of work. Lots to learn for other knowledge-work professions here.
www.worksinprogress.news/p/why-ai-isn...
My best guess is that the solution doesn't lie in changing the values we assign to infinite (or very large) futures, but in adjusting morality to be less demanding. i.e. that it isn't *better* to prioritise our generation over the others, but it may be permissible.
I think the fanaticism problem (of concern for the whole future always winning out over everyday issues) is a real issue, and this paper doesn't try to resolve it. I think the issue is not due to the infinite per se, as it comes up in astronomical finite cases too.
Please do check out the paper if you're interested!
20/20
arxiv.org/abs/2509.19389
That said, my method doesn't attempt to solve every problem that can arise from infinite value, and it does have some remaining issues, which I outline in the paper. I think of it more as a proof of concept that a technical solution exists than as a completed solution.
19/
Using my approach to evaluating infinite options, we can restore our natural values, where suffering lasting twice as long can always be twice as bad (without limit) and someone's suffering doesn't matter less just because it happens later.
18/
Changing our own values about finite cases in order to sidestep a technical issue about infinite cases should have been an absolute last resort. Especially as the technical problem now appears to have a technical solution.
17/
Surprisingly, my technique also has consequences for finite cases. Many decision theorists claim that an agent’s utilities must be bounded because this helps avoid the problem of infinities. And economists argue for discounting future utility on the grounds that it avoids infinities in their models.
16/
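The standard argument I'm referring to runs like this (my rendering of it): with per-period utilities bounded by \(M\) and a discount factor \(0 < \delta < 1\), the discounted total is guaranteed to be finite:

\[
  \Bigl| \sum_{t=0}^{\infty} \delta^{t} u_t \Bigr|
  \le \sum_{t=0}^{\infty} \delta^{t} M
  = \frac{M}{1-\delta} < \infty .
\]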
Indeed, on my theory the infinite behaves very much like the finite. Infinite values are much like any other value, except that they happen to be larger than all finite ones. (Like Leibniz's conception of the infinite, not Cantor's.)
15/
And making a situation better for someone (without changing anything else) always matters just as much, regardless of whether you are in an infinite situation or a finite one.
14/
On my theory, there is none of this "∞+1 = ∞" business where finite changes cease to make outcomes better or worse, or where a 50% chance of an infinite outcome is worth the same as a guarantee of that outcome, or where Hilbert-hotel-like rearrangements produce counterintuitive effects.
13/
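To make the contrast concrete (my gloss; the paper's construction is richer than this): cardinal arithmetic lumps an infinite value together with that value plus one, while finer-grained number systems such as the ordinals or the surreals keep them apart:

\[
  \aleph_0 + 1 = \aleph_0
  \qquad \text{vs.} \qquad
  \omega + 1 > \omega .
\]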
A major theme in my results is that most of the counterintuitive aspects of the infinite are revealed to be artefacts of infinite number systems that are too coarse-grained and so are forced to lump together quite different things.
12/
So theories tackling this subject can be divided into two classes based on which of these principles they keep. Mine sides with Pareto.
11/
All theories assessing such infinite wholes in terms of their parts face a dilemma — they can only choose one of:
• Pareto: improving every part of the whole must make the whole better.
• Unrestricted Permutation: a whole with the same parts but in a different order would be equally valuable.
10/
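A standard example of the clash, to make it vivid (my illustration; the paper may use a different one). Let \(A\) have utility 1 at odd positions and 0 at even positions, and let \(B\) rearrange the same values so its 0s fall only at positions 2, 6, 10, ...:

\[
  A = (1,0,1,0,1,0,1,0,\dots), \qquad
  B = (1,0,1,1,1,0,1,1,\dots).
\]

\(B\) is a permutation of \(A\) (both contain infinitely many 0s and 1s), so Unrestricted Permutation says they are equally good. But \(B\) matches \(A\) everywhere and beats it at positions 4, 8, 12, ..., so Pareto says \(B\) is strictly better. No theory can keep both.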