The gap we observe suggests that compute available for cutting-edge AI could meaningfully accelerate robotics, if the field addresses challenges like data scarcity.
See our full analysis, methodology, and code here:
epoch.ai/data-insigh...
@epochai.bsky.social
We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai
If compute is not yet a bottleneck for robotic manipulation, then what is?
Roboticists often point to data scarcity: compared to internet text and video, there's far less robotics data (e.g. recorded actions for manual tasks). Without more data, scaling compute may be inefficient.
A roboticist at a major AI company told us that naively scaling up now wouldn't help: "If you magically allowed every leading group 100x the compute for their next hero run, I believe 99% of the field simply does not know how to use it, and would not see any improved results."
11.08.2025 22:57
Since 2018, leading manipulation models have typically trained on about 1% of the FLOP invested in frontier AI. Many of these robotics models are developed by the same labs behind frontier language models, so access to compute is unlikely to be the bottleneck right now.
11.08.2025 22:57
The past 5 years have seen big successes in language, image and video generation, but relatively limited success in robotic manipulation. Why don't we have laundry robots in every house?
One thing seems clear: training compute is not the blocker. 🧵
We'll be watching closely for more evidence on how GPT-5 was trained. Stay tuned!
11.08.2025 17:57
GPT-5's compute scale has implications for AI's trajectory.
OpenAI might feel that scaling is relatively unpromising for now, perhaps due to inference costs.
But if GPT-5 doesn't set a new compute frontier, they have headroom for faster iteration cycles and future scale-ups.
But most models to date have spent the large majority of their training compute on pre-training. Efficiently scaling up RL will require research on data, environments, and reward models, and GPT-5 probably comes too early to reach GPT-4.5's scale through RL alone, much less set a new compute frontier.
11.08.2025 17:57
Companies are also rapidly scaling reinforcement learning, which follows traditional pretraining, to improve reasoning and other skills. For example, OpenAI scaled up RL compute by 10x between o1 and o3.
11.08.2025 17:57
Our conclusion that GPT-5 isn't a 100x scale-up from GPT-4 was confirmed by Rohan Pandey (formerly OpenAI), at least in terms of pre-training. x.com/khoomeik/st...
11.08.2025 17:57
We don't know how much data GPT-5 was trained on. But since scaling pre-training data was a major challenge for GPT-4.5 just six months ago, GPT-5 likely didn't use significantly more real data. It also used synthetic data from o3, but with a focus on quality, not quantity.
11.08.2025 17:57
Training compute scales with model size × training data.
GPT-5 is fast and fairly cheap on the API, with output tokens 15x cheaper and served ~2-4x faster than GPT-4.5 on launch! This suggests GPT-5 is a much smaller model than GPT-4.5.
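The rule of thumb above can be sketched numerically. This is a minimal illustration, not an estimate of any actual model; the parameter and token counts below are made up:

```python
def training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOP as ~6 * N * D, a common
    rule of thumb: the factor 6 counts the multiply-adds per token
    in a forward plus backward pass."""
    return 6 * n_params * n_tokens

# Illustrative numbers only: a 100B-parameter model trained on 10T tokens.
flop = training_flop(100e9, 10e12)
print(f"{flop:.1e}")  # 6.0e+24
```

Under this rule, a smaller model (as GPT-5's API speed and price suggest) needs proportionally more tokens to reach the same compute scale.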
GPT-4 was trained on 2e25 floating-point operations, and OpenAI said GPT-4.5 was about an order-of-magnitude (10x) scale-up.
We don't have a rigorous estimate yet, but GPT-5's compute scale may be *between* GPT-4 and GPT-4.5, and it is probably not a large scale-up from 4.5.
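A quick sketch of the compute arithmetic in this thread, using the 2e25 FLOP figure for GPT-4 and the ~10x and historical ~100x multipliers mentioned above:

```python
# Figures from the thread: GPT-4 at ~2e25 FLOP, GPT-4.5 a ~10x scale-up,
# and a hypothetical 100x generational jump for comparison.
gpt4_flop = 2e25
gpt45_flop = 10 * gpt4_flop    # ~2e26
trend_flop = 100 * gpt4_flop   # ~2e27, if the historical 100x trend had held

print(f"GPT-4.5: ~{gpt45_flop:.0e} FLOP")
print(f"A 100x GPT-5 would imply: ~{trend_flop:.0e} FLOP")
```

If GPT-5 indeed sits between GPT-4 and GPT-4.5, it lands somewhere in the 2e25 to 2e26 range, one to two orders of magnitude below the historical trend line.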
OpenAI has historically scaled up training compute by around 100x with each new generation of its GPT.
However, GPT-5 appears to be an exception to this trend. 🧵
But, as it stands, we'll have to wait for other evaluations to get a better sense of the new models' limits.
Check out the full article for more baselines from prior AI systems, plus a deeper look into some of the problems and AI solutions.
epoch.ai/gradient-up...
Even if we can't infer capabilities progress, the gold medals may still tell us something about reliability. The experimental systems managed logically flawless solutions in their one-and-only submissions, a feat we haven't seen much from LLMs in hard-to-verify domains.
11.08.2025 16:14
Looking into the problems themselves confirms this picture. The 5 solved ones are straightforward. The 6th is anything but: success here would have been very impressive, but failure doesn't tell us much. Something between "medium" and "brutal" would have been more informative.
11.08.2025 16:14
Solving these problems doesn't represent much progress because previously-available models already do fine on them. Based on data from MathArena, in a best-of-4 setting Gemini 2.5 Pro solved the hardest problem (P3, MOHS "medium") and got substantial partial credit on two more.
11.08.2025 16:14
The difficulty of this year's IMO problems was unusually lopsided: the 5 problems solved by AI were all "easy" to "medium" difficulty, according to US IMO team coach Evan Chen's Math Olympiad Hardness Scale (MOHS). The last time this many problems were this low on MOHS was 2001.
11.08.2025 16:14
Multiple AI systems won gold medals at the 2025 International Mathematical Olympiad (IMO). Exciting as that sounds,
@GregHBurnham
argues that it represents little progress: an unlucky draw of problems made the event relatively uninformative.
Is that cope? Judge for yourself. π§΅
You can find results for the GPT-5 family of models on other benchmarks on our website!
epoch.ai/benchmarks
GPT-5 in the high reasoning setting hit the 100K token limit for our evaluations on 10/290 Tier 1-3 samples (3%). This means our evaluation might slightly underestimate the reasoning capabilities of GPT-5.
08.08.2025 11:33
GPT-5 sets a new record on FrontierMath! On our scaffold, GPT-5 with high reasoning effort scores 24.8% (±2.5%) and 8.3% (±4.0%) in tiers 1-3 and 4, respectively.
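The quoted ± figures are consistent with simple binomial sampling uncertainty. A sketch, assuming the 290 Tier 1-3 samples mentioned in the thread and that the interval is one standard error (both assumptions, not stated methodology):

```python
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Tier 1-3: score 24.8% over an assumed 290 samples.
se = binomial_se(0.248, 290)
print(f"+/-{se:.1%}")  # +/-2.5%
```

That this matches the reported ±2.5% suggests the error bars reflect sampling noise over the problem set, not run-to-run variance.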
08.08.2025 11:33
See more evaluations and trends in AI capabilities in the Epoch AI benchmarking hub!
epoch.ai/benchmarks
We have independently evaluated the new Claude Opus 4.1. We see 63% on SWE-bench Verified and 7% on FrontierMath Tier 1-3, a minor improvement over Claude Opus 4.
06.08.2025 14:57
You can read the full article here:
epoch.ai/gradient-up...
These obstacles donβt prevent Chinese developers from training and running frontier AI models. But they do make things much more costly for China, enough to put China at a significant disadvantage in scaling AI for at least the rest of the decade.
29.07.2025 23:04
Thus, Chinese developers have used a hybrid approach, leveraging Huawei chips for inference but reserving limited NVIDIA GPUs for large-scale training. The Chinese government has also provided investment support, but manufacturing and software problems take time to resolve.
29.07.2025 23:04
These challenges are compounded by the weakness of China's software ecosystem. NVIDIA's CUDA stack has been refined for ~2 decades, and is well-integrated into ML libraries like PyTorch. Huawei's CANN framework is far newer and bug-prone, causing inefficiencies and raising costs.
29.07.2025 23:04
Export controls make this harder, restricting China's access to advanced lithography equipment. Without the best equipment, a smaller fraction of produced Chinese chips actually work: under 50% at SMIC, compared to TSMC's ~90% at the 7nm node.
29.07.2025 23:04
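Yield feeds directly into cost per usable chip. A toy model: the wafer cost and die count below are made up for illustration, and only the yield figures come from the thread:

```python
def cost_per_good_die(wafer_cost: float, dies_per_wafer: int, yield_rate: float) -> float:
    """Cost per working chip: wafer cost spread over only the good dies."""
    return wafer_cost / (dies_per_wafer * yield_rate)

# Hypothetical wafer cost and die count; yields of <50% (SMIC) vs ~90% (TSMC).
smic = cost_per_good_die(10_000, 60, 0.50)
tsmc = cost_per_good_die(10_000, 60, 0.90)
print(f"SMIC/TSMC cost ratio: {smic / tsmc:.1f}x")  # 1.8x
```

Because the wafer cost and die count cancel in the ratio, the ~1.8x penalty follows from the yields alone: cost per good die scales as 1/yield.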