
Epoch AI

@epochai.bsky.social

We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai

869 Followers  |  20 Following  |  948 Posts  |  Joined: 22.11.2024

Latest posts by epochai.bsky.social on Bluesky

Preview
Benchmark Scores = General Capability + Claudiness. Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

Check out our website for the original Claudiness analysis and much more!

epoch.ai/gradient-up...

25.11.2025 22:26 · 👍 1    🔁 0    💬 0    📌 0
Post image

How Claude-y is Opus 4.5?

We previously described Claudiness as "good at agentic tasks while being weaker at multimodal and math". This pattern remains when comparing Opus 4.5 to other newly released models, though the gap on agentic coding and tool-calling benchmarks is small.

25.11.2025 22:26 · 👍 5    🔁 0    💬 1    📌 0
Preview
FrontierMath: 350 expert-written problems in advanced mathematics, requiring multiple hours or even days to solve.

See our benchmarking hub for this data and much more!

epoch.ai/benchmarks/...

25.11.2025 21:26 · 👍 1    🔁 0    💬 0    📌 0

On the harder FrontierMath Tier 4, Opus 4.5 scored 4%, solving 2 out of 48 problems. This matches the best score achieved by previous Anthropic models but is below the scores of GPT-5.1 (13%) and Gemini 3 Pro (19%).

25.11.2025 21:26 · 👍 1    🔁 1    💬 1    📌 0

The explanation for this may lie in the scaffold. We run FrontierMath in an agentic loop, whereas we run OTIS Mock AIME as a single-turn question. Opus 4.5 may take advantage of multiple turns in the agentic loop to compensate for the lack of an extended thinking budget.

25.11.2025 21:26 · 👍 1    🔁 0    💬 1    📌 0
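For readers unfamiliar with the two scaffold shapes mentioned above, here is a minimal, self-contained sketch of a single-turn query versus an agentic loop. The FakeModel, run_tool, and message format are invented stand-ins for illustration only, not Epoch's actual evaluation harness.

```python
# Sketch of "single-turn" vs "agentic loop" scaffolds.
# FakeModel and run_tool are stand-ins for a real model API and tool executor.
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    tool_call: str | None = None  # e.g. a Python snippet the model wants run

class FakeModel:
    """Pretends to be an LLM: asks to run code once, then answers."""
    def __init__(self):
        self.turn = 0
    def __call__(self, messages, tools=()):
        self.turn += 1
        if "python" in tools and self.turn == 1:
            return Reply(text="Let me compute this.", tool_call="print(2**10)")
        return Reply(text="FINAL ANSWER: 1024")

def run_tool(code: str) -> str:
    return "1024"  # pretend execution output

def single_turn(model, question):
    # One request, one answer: how a single-turn benchmark is scored.
    return model([{"role": "user", "content": question}]).text

def agentic_loop(model, question, max_turns=20):
    # The model may call tools over several turns before answering.
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = model(messages, tools=("python",))
        messages.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:
            return reply.text          # final answer
        messages.append({"role": "tool", "content": run_tool(reply.tool_call)})
    return reply.text                  # turn limit reached

print(single_turn(FakeModel(), "What is 2^10?"))
print(agentic_loop(FakeModel(), "What is 2^10?"))
```

The extra turns are what let a model revise its work mid-task, which is one candidate explanation for the scaffold-dependent results described above.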
Post image

Opus 4.5 scores the same on FrontierMath regardless of thinking budget, in contrast to GPT-5.1, where higher reasoning settings correspond to higher scores.

However, on OTIS Mock AIME, another math benchmark, we see the thinking budget make a difference for Opus 4.5 as well.

25.11.2025 21:26 · 👍 2    🔁 0    💬 1    📌 0
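"Thinking budget" and "reasoning settings" refer to provider-side controls on how many reasoning tokens a model may spend before answering. A rough sketch of sweeping these controls is below; the model IDs are illustrative placeholders, and the exact parameter names should be checked against each provider's current API docs rather than taken from here.

```python
# Sketch: varying thinking budget / reasoning effort across providers.
# Model IDs are illustrative placeholders; consult provider docs before use.
import anthropic
import openai

prompt = "Find the last three digits of 7^2024."

# Anthropic: extended thinking with an explicit token budget.
claude = anthropic.Anthropic()
for budget in (1024, 8192, 32768):
    resp = claude.messages.create(
        model="claude-opus-4-5",              # illustrative model ID
        max_tokens=budget + 1024,             # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}],
    )
    print(budget, resp.content[-1].text)

# OpenAI: categorical reasoning-effort settings instead of a token budget.
gpt = openai.OpenAI()
for effort in ("low", "medium", "high"):
    resp = gpt.chat.completions.create(
        model="gpt-5.1",                      # illustrative model ID
        reasoning_effort=effort,
        messages=[{"role": "user", "content": prompt}],
    )
    print(effort, resp.choices[0].message.content)
```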
Post image

We benchmarked Opus 4.5 on FrontierMath. It scored 21% on FrontierMath Tiers 1–3, continuing a trend of improvement for Anthropic models.

This score is behind Gemini 3 Pro and GPT-5.1 (high) while being on par with earlier frontier models like o3 (high) and Grok 4.

25.11.2025 21:26 · 👍 7    🔁 2    💬 2    📌 0

This analysis is made possible by our detailed benchmarking data, which is fully public. Check it out!

For instance, here is a link to Gemini 3 Pro's correct answer to an organic chemistry question that GPT-5.1 (high) answered correctly no more often than chance.

logs.epoch.ai/inspect-vie...

25.11.2025 16:57 · 👍 1    🔁 0    💬 0    📌 0
Preview
GPQA Diamond: What's left? Investigate the GPQA Diamond benchmark's validity: uncover flawed questions, model challenges, and why it still informs AI evaluation.

We had previously noted that frontier models did worse on organic chemistry than on any of the other scientific domains in GPQA. In other words, organic chemistry had the most room for improvement.

epoch.ai/gradient-up...

25.11.2025 16:57 · 👍 2    🔁 0    💬 1    📌 0
Post image

Gemini 3 Pro set a new record on GPQA Diamond: 93% vs. the previous record of 88%. What you can't tell from the headline: almost all of this gain came in organic chemistry. 🧬🧵

25.11.2025 16:57 · 👍 7    🔁 2    💬 1    📌 0
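A minimal sketch of the kind of per-domain breakdown behind a claim like this, assuming per-question records labeled with a domain. The column names and 0/1 outcomes below are invented, not Epoch's actual log schema.

```python
# Break a benchmark score down by domain to see where a model's gain came from.
# The records are invented; a real analysis would load per-question eval logs.
import pandas as pd

records = pd.DataFrame({
    "domain":  ["organic chemistry"] * 4 + ["physics"] * 4,
    "model_a": [0, 1, 0, 1, 1, 1, 1, 0],   # 1 = answered correctly
    "model_b": [1, 1, 1, 1, 1, 1, 1, 0],
})

by_domain = records.groupby("domain")[["model_a", "model_b"]].mean()
by_domain["gain"] = by_domain["model_b"] - by_domain["model_a"]
print(by_domain)   # most of model_b's overall gain shows up in one domain
```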
Video thumbnail

We've optimized our Frontier Data Centers hub for mobile.

You can now examine annotated, recent, high-resolution satellite imagery of the world's largest compute clusters directly from your phone at epoch.ai/data/data-c....

Here's a look at the updated Satellite Viewer:

25.11.2025 02:15 · 👍 7    🔁 1    💬 0    📌 0
Preview
Data on AI Benchmarking Our database of benchmark results, featuring the performance of leading AI models on challenging tasks. It includes results from benchmarks evaluated internally by Epoch AI as well as data collected from external sources. Explore trends across time, by benchmark, or by model.

The ECI score for Gemini 3 Pro currently includes results from ARC-AGI, FrontierMath, GeoBench, GPQA Diamond, OTIS Mock AIME 2024-2025, SimpleBench, Terminal-Bench, and WeirdML.

See our benchmarking hub for all this and more!

epoch.ai/benchmarks

21.11.2025 19:04 · 👍 0    🔁 0    💬 0    📌 0
Post image

Gemini 3 Pro set a new record on FrontierMath: 38% on Tiers 1–3 and 19% on Tier 4.

On the Epoch Capabilities Index (ECI), which combines multiple benchmarks, Gemini 3 Pro scored 154, up from GPT-5.1's previous high score of 151.

21.11.2025 19:04 · 👍 9    🔁 2    💬 2    📌 1
Preview
Benchmark Scores = General Capability + Claudiness. Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

This Gradient Update was written by Greg Burnham. You can read the full post here:

epoch.ai/gradient-up...

20.11.2025 21:09 · 👍 1    🔁 0    💬 0    📌 0

Even if we are in the contingent world, developers can keep up the "everything at once" regime so long as they can afford the right training data and enough compute to make use of it. Whether the flywheel of growth can support this indefinitely is a trillion-dollar question.

20.11.2025 21:09 · 👍 0    🔁 0    💬 1    📌 0

First, developers seem to be trying to put everything in the training distribution, suggesting they don't want to bet the farm on deep generality. Also, the Claudiness dimension hints at a limit: if you make a great agentic coding model, it isn't automatically great at math.

20.11.2025 21:09 · 👍 0    🔁 0    💬 1    📌 0

Is the existence of the main dimension due to a "deep" fact about the generality of intelligence, or a "contingent" situation where model developers are investing in improving at all benchmarks at once? It's hard to say, but we discuss a few points for contingency.

20.11.2025 21:09 · 👍 0    🔁 0    💬 1    📌 0
Post image

This second component picks out models that are good at agentic tasks while being weaker at multimodal and math. Tongue-in-cheek, we call this Claudiness. Here are the most and least Claude-y models.

20.11.2025 21:09 · 👍 0    🔁 0    💬 1    📌 0
Post image

Is that all that benchmarks capture? Mostly yes. A Principal Component Analysis shows a single large "General Capability" component, though there is a second borderline-significant component too.

20.11.2025 21:09 · 👍 1    🔁 0    💬 1    📌 0
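As a rough illustration of the technique named above, here is a minimal PCA on a small models-by-benchmarks score matrix. The model names, benchmark names, and scores are invented; Epoch's actual analysis uses its public benchmarking data and may differ in preprocessing.

```python
# Illustrative PCA on a (models x benchmarks) score matrix.
# All numbers below are invented; they only demonstrate the technique.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

benchmarks = ["math", "agentic_coding", "multimodal", "qa"]
models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
scores = np.array([
    [0.35, 0.60, 0.55, 0.80],
    [0.30, 0.75, 0.45, 0.78],
    [0.20, 0.40, 0.35, 0.60],
    [0.15, 0.55, 0.25, 0.55],
    [0.40, 0.50, 0.60, 0.82],
])

# Standardize each benchmark so PCA is not dominated by score ranges.
X = StandardScaler().fit_transform(scores)

pca = PCA(n_components=2)
components = pca.fit_transform(X)          # per-model scores on PC1 / PC2
print("explained variance ratio:", pca.explained_variance_ratio_)

# PC1 ~ a broad "general capability" axis; PC2 would pick out models that
# trade off some benchmark families against others (a "Claudiness"-style axis).
for name, (pc1, pc2) in zip(models, components):
    print(f"{name}: general={pc1:+.2f}, second_axis={pc2:+.2f}")
```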

The chart above shows how our Epoch Capabilities Index (ECI) captures most of the variance in 39 different benchmarks, despite being one-dimensional.

20.11.2025 21:09 · 👍 0    🔁 0    💬 1    📌 0
Post image

Benchmarking data is dominated by a single "General Capability" dimension. Is this due to good generalization across tasks, or to developers pushing on all benchmarks at once?

🧵 with some analysis, including the discovery of a "Claudiness" dimension.

20.11.2025 21:09 · 👍 3    🔁 0    💬 2    📌 0
Post image

Or, explore high-resolution aerial images of the largest data centers under construction at our Frontier Data Centers Hub:
epoch.ai/data/data-c...

19.11.2025 19:54 · 👍 0    🔁 0    💬 0    📌 0

To see more about our analysis of data center size, visit our data insight:
epoch.ai/data-insigh...

19.11.2025 19:54 · 👍 1    🔁 0    💬 1    📌 0
Post image

It's easy to talk about 'large AI data centers' and still underestimate the scale.

Our Frontier Data Centers database shows that some upcoming campuses will cover a substantial portion of Manhattan. Meta's Hyperion data center will be nearly four times the size of Central Park.

19.11.2025 19:54 · 👍 10    🔁 2    💬 2    📌 3
Preview
Data on AI Benchmarking Our database of benchmark results, featuring the performance of leading AI models on challenging tasks. It includes results from benchmarks evaluated internally by Epoch AI as well as data collected f...

Thus, GPT-5.1 joins GPT-5 at the frontier but, despite greater token use, does not improve on the capabilities these benchmarks measure.

Check out our website for all this and more!

epoch.ai/benchmarks

19.11.2025 12:10 · 👍 1    🔁 0    💬 0    📌 0
Post image

OpenAI described GPT-5.1 as spending "less time on easy tasks and more time on hard tasks" compared to GPT-5. Assuming our benchmarks count as "hard tasks", this appears to be true.

19.11.2025 12:10 · 👍 2    🔁 0    💬 1    📌 0
Post image

ECI also incorporates benchmarks run by others. We see a similar picture of broadly comparable scores across these benchmarks.

19.11.2025 12:10 · 👍 0    🔁 0    💬 1    📌 0
Post image

For the benchmarks we run ourselves, scores of the two models are all within the margin of error of each other.

19.11.2025 12:10 · 👍 0    🔁 0    💬 1    📌 0
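As a rough illustration of what a "margin of error" on a benchmark accuracy can mean, here is a simple binomial confidence-interval calculation on invented numbers; Epoch's published error bars may be computed differently.

```python
# Illustrative 95% confidence interval for a benchmark accuracy,
# treating each question as an independent Bernoulli trial.
# Counts are invented; Epoch's error bars may use a different method.
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)   # binomial standard error
    return p, p - z * se, p + z * se

for name, correct, total in [("model_a", 150, 198), ("model_b", 144, 198)]:
    p, lo, hi = accuracy_ci(correct, total)
    print(f"{name}: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

With a few hundred questions per benchmark, intervals like these are several percentage points wide, which is why small score differences between two models are often not meaningful.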
Post image

GPT-5.1 is about as capable as GPT-5.

That's according to the Epoch Capabilities Index, our tool for combining results across multiple benchmarks. With "high" reasoning, both GPT-5.1 and GPT-5 score 151 on ECI.

See 🧵 for individual benchmark scores!

19.11.2025 12:10 · 👍 1    🔁 0    💬 1    📌 0
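For intuition about what combining results across benchmarks into a single index can look like, here is a toy z-score average on invented numbers. This is not Epoch's actual ECI methodology; it only illustrates the general idea of putting heterogeneous benchmarks on a common scale before aggregating.

```python
# Toy one-dimensional capability index: average of per-benchmark z-scores,
# rescaled to an arbitrary reference point. Numbers are invented; this is
# NOT how the Epoch Capabilities Index is actually computed.
import numpy as np

# rows = models, columns = benchmarks (fraction correct)
scores = np.array([
    [0.21, 0.93, 0.55],   # "new_model"
    [0.19, 0.88, 0.52],   # "prev_model"
    [0.10, 0.80, 0.40],   # "older_model"
])

mu = scores.mean(axis=0)
sigma = scores.std(axis=0)
z = (scores - mu) / sigma            # put benchmarks on a common scale
index = 100 + 15 * z.mean(axis=1)    # arbitrary affine rescaling

for name, value in zip(["new_model", "prev_model", "older_model"], index):
    print(f"{name}: index = {value:.0f}")
```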
Preview
Frontier Data Centers Webinar and Q&A | Epoch AI · Luma. Join us for a webinar and live Q&A on the Frontier Data Centers Hub, our open database that maps the construction, power, compute, and cost of the largest AI…

Data centers supporting AI training runs could require 1-5 GW by 2030, enough to power entire cities.

Join us for a live webinar/Q&A on our new Frontier Data Centers Hub, exploring what this infrastructure buildout means for AI.

Nov 20, 1-2 PM PT
luma.com/oste01d0

18.11.2025 20:17 · 👍 3    🔁 1    💬 0    📌 0
