
Epoch AI

@epochai.bsky.social

We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai

686 Followers  |  20 Following  |  582 Posts  |  Joined: 22.11.2024

Latest posts by epochai.bsky.social on Bluesky

Preview
Compute is not a bottleneck for robotic manipulation
Compute is not a bottleneck for robotics, while training data is. Frontier-level compute could accelerate progress if data improves.

The gap we observe suggests that compute available for cutting-edge AI could meaningfully accelerate robotics, if the field addresses challenges like data scarcity.

See our full analysis, methodology, and code here:
epoch.ai/data-insigh...

11.08.2025 22:57 — 👍 0    🔁 0    💬 0    📌 0

If compute is not yet a bottleneck for robotic manipulation, then what is?

Roboticists often point to data scarcity: compared to internet text and video, there’s far less robotics data (e.g. recorded actions for manual tasks). Without more data, scaling compute may be inefficient.

11.08.2025 22:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

A roboticist at a major AI company told us that naively scaling up now wouldn’t help: “If you magically allowed every leading group 100x the compute for their next hero run, I believe 99% of the field simply does not know how to use it, and would not see any improved results.”

11.08.2025 22:57 — 👍 1    🔁 0    💬 1    📌 0

Since 2018, leading manipulation models have typically trained on about 1% of the FLOP invested in frontier AI. Many of these robotics models are developed by the same labs behind frontier language models, so access to compute is unlikely to be the bottleneck right now.

11.08.2025 22:57 — 👍 0    🔁 0    💬 1    📌 0
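The ~1% figure in this post is just a ratio of training FLOP. A back-of-the-envelope sketch, using made-up illustrative FLOP counts (not Epoch AI's actual data):

```python
# Illustrative only: hypothetical FLOP budgets, not measured values.
frontier_flop = 2e25       # a frontier LLM training run
manipulation_flop = 2e23   # a leading manipulation model at ~1% of that

ratio = manipulation_flop / frontier_flop
print(f"Manipulation models train on ~{ratio:.0%} of frontier compute")
# -> Manipulation models train on ~1% of frontier compute
```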
Post image

The past 5 years have seen big successes in language, image and video generation, but relatively limited success in robotic manipulation. Why don’t we have laundry robots in every house?

One thing seems clear: training compute is not the blocker. 🧵

11.08.2025 22:57 — 👍 3    🔁 0    💬 1    📌 0

We’ll be watching closely for more evidence on how GPT-5 was trained. Stay tuned!

11.08.2025 17:57 — 👍 0    🔁 0    💬 0    📌 0

GPT-5’s compute scale has implications for AI's trajectory.

OpenAI might feel that scaling is relatively unpromising for now, perhaps due to inference costs.

But if GPT-5 doesn’t set a new compute frontier, they have headroom for faster iteration cycles and future scale-ups.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0

But for most models to date, the large majority of training compute went into pre-training. Efficiently scaling up RL will require research on data, environments, and reward models, and GPT-5 is probably too early to reach GPT-4.5 scale through RL alone, much less set a new frontier in compute.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

Companies are also rapidly scaling reinforcement learning, which follows traditional pretraining, to improve reasoning and other skills. For example, OpenAI scaled up RL compute by 10x between o1 and o3.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0

Our conclusion that GPT-5 isn’t a 100x scale-up from GPT-4 was confirmed by Rohan Pandey (formerly OpenAI), at least in terms of pre-training. x.com/khoomeik/st...

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

We don’t know how much data GPT-5 was trained on. But since scaling pre-training data was a major challenge for GPT-4.5 just six months ago, GPT-5 likely didn’t use significantly more real data. It also used synthetic data from o3, but with a focus on quality, not quantity.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0

Training compute scales with model size Γ— training data.

GPT-5 is fast and fairly cheap on the API, with output tokens 15x cheaper and served ~2-4x faster than GPT-4.5 on launch! This suggests GPT-5 is a much smaller model than GPT-4.5.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0
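The relation "training compute scales with model size × training data" is commonly formalized for dense transformers as C ≈ 6·N·D (compute ≈ 6 × parameters × tokens). A sketch with purely hypothetical model sizes, showing why a smaller model trained on the same data needs proportionally less compute:

```python
# Standard dense-transformer approximation: C ~= 6 * N * D.
# All parameter/token counts below are hypothetical placeholders.
def training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOP for a dense transformer."""
    return 6 * n_params * n_tokens

big = training_flop(2e12, 15e12)    # hypothetical 2T-param model, 15T tokens
small = training_flop(4e11, 15e12)  # hypothetical 400B-param model, same data
print(f"{big:.1e} vs {small:.1e} FLOP: the smaller model needs "
      f"{big / small:.0f}x less compute")
```

With data held fixed, the compute ratio equals the parameter-count ratio, which is why a faster, cheaper model usually implies a smaller one.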

GPT-4 was trained on 2e25 floating-point operations, and OpenAI said GPT-4.5 was about an order-of-magnitude (10x) scale-up.

We don’t have a rigorous estimate yet, but GPT-5’s compute scale may be *between* GPT-4 and GPT-4.5, and it is probably not a large scale-up from 4.5.

11.08.2025 17:57 — 👍 0    🔁 0    💬 1    📌 0
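The arithmetic behind these bounds can be made explicit, restating only the figures from this thread (2e25 FLOP for GPT-4, ~10x for GPT-4.5, ~100x per historical GPT generation):

```python
gpt4_flop = 2e25             # stated in the post
gpt45_flop = 10 * gpt4_flop  # "about an order-of-magnitude scale-up"

# If GPT-5 sits between GPT-4 and GPT-4.5, its compute is bounded by:
lo, hi = gpt4_flop, gpt45_flop
print(f"GPT-5 plausibly between {lo:.0e} and {hi:.0e} FLOP")

# The historical ~100x-per-generation trend would instead have predicted:
trend_prediction = 100 * gpt4_flop
print(f"Trend would predict ~{trend_prediction:.0e} FLOP")
```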

OpenAI has historically scaled up training compute by around 100x with each new generation of its GPT.

However, GPT-5 appears to be an exception to this trend. 🧵

11.08.2025 17:57 — 👍 2    🔁 1    💬 1    📌 1
Preview
We didn’t learn much from the IMO
The problems gave AI only a slim chance to show new capabilities

But, as it stands, we’ll have to wait for other evaluations to get a better sense of the new models’ limits.

Check out the full article for more baselines from prior AI systems, plus a deeper look into some of the problems and AI solutions.
epoch.ai/gradient-up...

11.08.2025 16:14 — 👍 1    🔁 0    💬 0    📌 0

Even if we can’t infer capabilities progress, the gold medals may still tell us something about reliability. The experimental systems managed logically flawless solutions in their one-and-only submissions, a feat we haven’t seen much from LLMs in hard-to-verify domains.

11.08.2025 16:14 — 👍 0    🔁 0    💬 1    📌 0
Post image

Looking into the problems themselves confirms this picture. The 5 solved ones are straightforward. The 6th is anything but: success here would have been very impressive, but failure doesn’t tell us much. Something between β€œmedium” and β€œbrutal” would have been more informative.

11.08.2025 16:14 — 👍 0    🔁 0    💬 1    📌 0

Solving these problems doesn’t represent much progress because previously-available models already do fine on them. Based on data from MathArena, in a best-of-4 setting Gemini 2.5 Pro solved the hardest problem (P3, MOHS β€œmedium”) and got substantial partial credit on two more.

11.08.2025 16:14 — 👍 0    🔁 0    💬 1    📌 0

The difficulty of this year’s IMO problems was unusually lopsided: the 5 problems solved by AI were all β€œeasy” to β€œmedium” difficulty, according to US IMO team coach Evan Chen’s Math Olympiad Hardness Scale (MOHS). The last time this many problems were this low on MOHS was 2001.

11.08.2025 16:14 — 👍 0    🔁 0    💬 1    📌 0
Post image

Multiple AI systems won gold medals at the 2025 International Mathematical Olympiad (IMO). Exciting as that sounds, @GregHBurnham argues that it represents little progress: an unlucky draw of problems made the event relatively uninformative.

Is that cope? Judge for yourself. 🧵

11.08.2025 16:14 — 👍 3    🔁 1    💬 1    📌 0
Preview
AI Benchmarking Dashboard
Our database of benchmark results, featuring the performance of leading AI models on challenging tasks. It includes results from benchmarks evaluated internally by Epoch AI as well as data collected from external sources. The dashboard tracks AI progress over time, and correlates benchmark scores with key factors like compute or model accessibility.

You can find results for the GPT-5 family of models on other benchmarks on our website!

epoch.ai/benchmarks

08.08.2025 11:33 — 👍 0    🔁 0    💬 0    📌 0

GPT-5 in the high reasoning setting hit the 100K token limit for our evaluations on 10/290 Tier 1-3 samples (3%). This means our evaluation might slightly underestimate the reasoning capabilities of GPT-5.

08.08.2025 11:33 — 👍 0    🔁 0    💬 1    📌 0
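A rough upper bound on that underestimate, using only the numbers in this thread (24.8% on Tiers 1-3, 10 of 290 samples truncated): even if every truncated run would otherwise have been solved, the score rises by at most 10/290 percentage points.

```python
n_total = 290
n_truncated = 10
measured_score = 0.248  # Tier 1-3 score reported in this thread

# Worst-case correction: assume every truncated sample was a missed solve.
max_possible_score = measured_score + n_truncated / n_total
print(f"Truncated fraction: {n_truncated / n_total:.1%}")
print(f"Score absent the cutoff is at most {max_possible_score:.1%}")
```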
Post image

GPT-5 sets a new record on FrontierMath! On our scaffold, GPT-5 with high reasoning effort scores 24.8% (Β±2.5%) and 8.3% (Β±4.0%) in tiers 1-3 and 4, respectively.

08.08.2025 11:33 — 👍 4    🔁 0    💬 1    📌 0
Preview
AI Benchmarking Dashboard

See more evaluations and trends in AI capabilities in the Epoch AI benchmarking hub!

epoch.ai/benchmarks

06.08.2025 14:57 — 👍 3    🔁 0    💬 0    📌 0
Post image

We have independently evaluated the new Claude Opus 4.1. We see 63% on SWE-bench Verified and 7% on FrontierMath Tiers 1-3, a minor improvement over Claude Opus 4.

06.08.2025 14:57 — 👍 5    🔁 0    💬 1    📌 0
Preview
Why China isn’t about to leap ahead of the West on compute
Chinese hardware is closing the gap, but major bottlenecks remain

You can read the full article here:
epoch.ai/gradient-up...

29.07.2025 23:04 — 👍 1    🔁 0    💬 0    📌 0

These obstacles don’t prevent Chinese developers from training and running frontier AI models. But they do make things much more costly for China, enough to put China at a significant disadvantage in scaling AI for at least the rest of the decade.

29.07.2025 23:04 — 👍 0    🔁 0    💬 1    📌 0

Thus, Chinese developers have used a hybrid approach, leveraging Huawei chips for inference but reserving limited NVIDIA GPUs for large-scale training. The Chinese government has also provided investment support, but manufacturing and software problems take time to resolve.

29.07.2025 23:04 — 👍 0    🔁 0    💬 1    📌 0

These challenges are compounded by the weakness of China’s software ecosystem. NVIDIA’s CUDA stack has been refined for ~2 decades, and is well-integrated into ML libraries like PyTorch. Huawei’s CANN framework is far newer and bug-prone, causing inefficiencies and raising costs.

29.07.2025 23:04 — 👍 1    🔁 0    💬 1    📌 0

Export controls make this harder, restricting China’s access to advanced lithography equipment. Without the best equipment, a smaller fraction of produced Chinese chips actually work: under 50% at SMIC, compared to TSMC’s ~90% at the 7nm node.

29.07.2025 23:04 — 👍 0    🔁 0    💬 1    📌 0
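The yield figures translate directly into cost per working chip: cost scales as 1/yield. A sketch using the yields cited above and a hypothetical per-die manufacturing cost:

```python
# Cost per *working* chip = cost per manufactured die / yield.
# The $100 die cost is a hypothetical placeholder; only the yields
# (<50% for SMIC, ~90% for TSMC at 7nm) come from the post.
def cost_per_good_chip(cost_per_die: float, yield_rate: float) -> float:
    return cost_per_die / yield_rate

smic = cost_per_good_chip(100.0, 0.50)
tsmc = cost_per_good_chip(100.0, 0.90)
print(f"Same die cost implies SMIC chips ~{smic / tsmc:.1f}x more expensive")
```

So even before software inefficiencies, the yield gap alone makes each working Chinese 7nm chip nearly twice as costly to produce.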
