Check out our website for the original Claudiness analysis and much more!
epoch.ai/gradient-up...
How Claude-y is Opus 4.5?
We previously described Claudiness as "good at agentic tasks while being weaker at multimodal and math". This pattern remains when comparing Opus 4.5 to other newly-released models, though the gap on agentic coding and tool-calling benchmarks is small.
See our benchmarking hub for this data and much more!
epoch.ai/benchmarks/...
On the harder FrontierMath Tier 4, Opus 4.5 scored 4%, solving 2 out of 48 problems. This matches the best score achieved by previous Anthropic models but is below the scores of GPT-5.1 (13%) and Gemini 3 Pro (19%).
The explanation for this may lie in the scaffold. We run FrontierMath in an agentic loop, whereas we run OTIS Mock AIME as a single-turn question. Opus 4.5 may take advantage of multiple turns in the agentic loop to compensate for the lack of an extended thinking budget.
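To make the scaffold difference concrete, here is a minimal sketch of the two setups. It is illustrative only, not our actual evaluation harness, and `ask_model` is a hypothetical stand-in for a real model API call.

```python
# Minimal sketch of the two scaffolds; illustrative only, not Epoch's actual
# evaluation harness. `ask_model` is a hypothetical stand-in for a model API.
def ask_model(messages: list[dict]) -> str:
    # Placeholder: a real implementation would call a model API here.
    return "FINAL ANSWER: 42"

def single_turn(question: str) -> str:
    """OTIS Mock AIME style: one prompt, one reply, no follow-up turns."""
    return ask_model([{"role": "user", "content": question}])

def agentic_loop(question: str, max_turns: int = 10) -> str:
    """FrontierMath style: the model can keep working across turns
    (e.g. via tool calls) before committing to a final answer."""
    messages = [{"role": "user", "content": question}]
    reply = ""
    for _ in range(max_turns):
        reply = ask_model(messages)
        if reply.startswith("FINAL ANSWER:"):
            return reply
        # Feed the model's intermediate work and (hypothetical) tool
        # output back in as additional turns.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "tool output: ..."})
    return reply
```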
Opus 4.5 scores the same on FrontierMath regardless of thinking budget, in contrast to GPT-5.1, where higher reasoning settings correspond to higher scores.
However, on OTIS Mock AIME, another math benchmark, we see the thinking budget make a difference for Opus 4.5 as well.
We benchmarked Opus 4.5 on FrontierMath. It scored 21% on FrontierMath Tiers 1–3, continuing a trend of improvement for Anthropic models.
This score is behind Gemini 3 Pro and GPT-5.1 (high) while being on par with earlier frontier models like o3 (high) and Grok 4.
This analysis is made possible by our detailed benchmarking data, which is fully public. Check it out!
For instance, here is a link to Gemini 3 Pro's correct answer to an organic chemistry question that GPT-5.1 (high) got right no more often than chance.
logs.epoch.ai/inspect-vie...
We had previously noted that frontier models did worse on organic chemistry than on any other scientific domain in GPQA. In other words, organic chemistry had the most room for improvement.
epoch.ai/gradient-up...
Gemini 3 Pro set a new record on GPQA Diamond: 93% vs. the previous record of 88%. What you can't tell from the headline: almost all of this gain came in organic chemistry. 🧬🧵
We've optimized our Frontier Data Centers hub for mobile.
You can now examine annotated, recent, high-resolution satellite imagery of the world's largest compute clusters directly from your phone at epoch.ai/data/data-c....
Here's a look at the updated Satellite Viewer:
The ECI score for Gemini 3 Pro currently includes results from ARC-AGI, FrontierMath, GeoBench, GPQA Diamond, OTIS Mock AIME 2024-2025, SimpleBench, Terminal-Bench, and WeirdML.
See our benchmarking hub for all this and more!
epoch.ai/benchmarks
Gemini 3 Pro set a new record on FrontierMath: 38% on Tiers 1–3 and 19% on Tier 4.
On the Epoch Capabilities Index (ECI), which combines multiple benchmarks, Gemini 3 Pro scored 154, up from GPT-5.1's previous high score of 151.
This Gradient Update was written by Greg Burnham. You can read the full post here:
epoch.ai/gradient-up...
Even if we are in the contingent world, developers can keep up the "everything at once" regime so long as they can afford the right training data and enough compute to make use of it. Whether the flywheel of growth can support this indefinitely is a trillion-dollar question.
First, developers seem to be trying to put everything in the training distribution, suggesting they don't want to bet the farm on deep generality. Also, the Claudiness dimension hints at a limit: if you make a great agentic coding model, it isn't automatically great at math.
Is the existence of the main dimension due to a "deep" fact about the generality of intelligence, or a "contingent" situation where model developers are investing in improving on all benchmarks at once? It's hard to say, but we discuss a few points in favor of contingency.
This second component picks out models that are good at agentic tasks while being weaker at multimodal and math. Tongue-in-cheek, we call this Claudiness. Here are the most and least Claude-y models.
Is that all that benchmarks capture? Mostly yes. A Principal Component Analysis shows a single large "General Capability" component, though there is a second borderline-significant component too.
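For readers who want to see the shape of this kind of analysis, here is a toy sketch. The scores below are invented and this is not our exact pipeline; it just shows how a first "general capability" component and a second component's per-benchmark loadings fall out of a PCA on a model-by-benchmark score matrix.

```python
# Toy PCA on an invented model-by-benchmark score matrix; not Epoch's exact
# pipeline, just the shape of the analysis. All numbers are made up.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

benchmarks = ["agentic_coding", "tool_calling", "math", "multimodal", "qa"]
scores = np.array([   # rows = models, columns = benchmarks, values in [0, 1]
    [0.85, 0.80, 0.40, 0.45, 0.70],
    [0.75, 0.78, 0.55, 0.60, 0.72],
    [0.60, 0.55, 0.65, 0.70, 0.68],
    [0.50, 0.45, 0.35, 0.40, 0.50],
    [0.90, 0.88, 0.70, 0.72, 0.85],
])

# Standardize each benchmark, then look at the first two principal components.
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(scores))

print("explained variance ratios:", pca.explained_variance_ratio_)
# If a "Claudiness"-like pattern exists, the second component's loadings will
# be positive for agentic benchmarks and negative for math/multimodal ones
# (or vice versa; PCA signs are arbitrary).
print("PC2 loadings:", dict(zip(benchmarks, np.round(pca.components_[1], 2))))
```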
The chart above shows how our Epoch Capabilities Index (ECI) captures most of the variance in 39 different benchmarks, despite being one-dimensional.
Benchmarking data is dominated by a single "General Capability" dimension. Is this due to good generalization across tasks, or to developers pushing on all benchmarks at once?
🧵 with some analysis, including the discovery of a "Claudiness" dimension.
Or, explore high-resolution aerial images of the largest data centers under construction at our Frontier Data Centers Hub:
epoch.ai/data/data-c...
To see more about our analysis of data center size, visit our data insight:
epoch.ai/data-insigh...
It's easy to talk about "large AI data centers" and still underestimate the scale.
Our Frontier Data Centers database shows that some upcoming campuses will cover a substantial portion of Manhattan. Meta's Hyperion data center will be nearly four times the size of Central Park.
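A quick back-of-envelope check of what that comparison implies. The ~3.4 km² figure for Central Park is a commonly cited value we add here for illustration; the 4x factor is the comparison above.

```python
# Back-of-envelope scale check. The ~3.4 km^2 (about 840 acres) figure for
# Central Park is a commonly cited value added here for illustration;
# the 4x factor is the comparison from the post.
central_park_km2 = 3.4
hyperion_km2 = 4 * central_park_km2
print(f"~{hyperion_km2:.0f} km^2")  # roughly 14 km^2
```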
Thus, GPT-5.1 joins GPT-5 at the frontier but, despite greater token use, does not improve on the capabilities these benchmarks measure.
Check out our website for all this and more!
epoch.ai/benchmarks
OpenAI described GPT-5.1 as spending "less time on easy tasks and more time on hard tasks" compared to GPT-5. Assuming our benchmarks count as "hard tasks", this appears to be true.
ECI also incorporates benchmarks run by others. We see a similar picture of broadly comparable scores across these benchmarks.
For the benchmarks we run ourselves, the two models' scores are all within each other's margins of error.
GPT-5.1 is about as capable as GPT-5.
That's according to the Epoch Capabilities Index, our tool for combining results across multiple benchmarks. With "high" reasoning, both GPT-5.1 and GPT-5 score 151 on ECI.
See 🧵 for individual benchmark scores!
Data centers supporting AI training runs could require 1-5 GW by 2030, enough to power entire cities.
Join us for a live webinar/Q&A on our new Frontier Data Centers Hub, exploring what this infrastructure buildout means for AI.
Nov 20, 1-2 PM PT
luma.com/oste01d0
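For a rough sense of what the 1-5 GW figure above means, here is a back-of-envelope conversion. The ~1.2 kW average draw per US household is an assumption we add for illustration, not a figure from the post.

```python
# Back-of-envelope: how many households could a 5 GW campus supply?
# The ~1.2 kW average draw per US household (~10,500 kWh/year) is an
# assumption added for illustration, not a figure from the post.
gigawatts = 5
avg_household_kw = 1.2
households = gigawatts * 1_000_000 / avg_household_kw  # GW -> kW
print(f"~{households / 1e6:.1f} million households")  # ~4.2 million
```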