
Alexi Gladstone

@alexiglad.bsky.social

PhD @ UIUC advised by Heng Ji. RS Intern @ Meta, previously @ Palantir, UVA. Working on SSL, world models, multimodal learning. https://alexiglad.github.io/

15 Followers  |  3 Following  |  14 Posts  |  Joined: 09.12.2024

Latest posts by alexiglad.bsky.social on Bluesky

Energy-Based Transformers: Outscaling Transformers and Generalizable Reasoning
Learn how Energy-Based Transformers (EBTs) enable improved scalability over traditional transformers while generalizing reasoning/thinking capabilities to be learned on any problem. #AI #DeepLearning ...

[12/N] Website: energy-based-transformers.github.io
Paper: arxiv.org/abs/2507.02092
HF Daily Paper Page: huggingface.co/papers/2507....

We're just getting started with EBMs: we see them as a generalizing framework and anticipate a surge in their popularity!

07.07.2025 20:33 | 👍 0    🔁 0    💬 0    📌 1

[11/N] ⛓️‍💥 It's common wisdom that "a chain is only as strong as its weakest link."

Following this wisdom, we believe that each step in a chain of thought should receive sufficient computation to avoid failure "links" that result in bad reasoning, and EBTs enable exactly this.

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[10/N] We also compare EBTs to diffusion models on relatively toy image denoising tasks, where we observe that EBTs outperform diffusion models while using 99% fewer forward passes.

EBTs also learn better representations of images than diffusion models, achieving ~10x higher ImageNet accuracy.

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[9/N] The finding that EBTs outscale the Transformer++ also holds across modalities! We test this on video. 📹

We think this performance improvement occurs because verification is often easier than generation and because EBTs can learn to express uncertainty in continuous spaces.

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0

[8/N] In line with these results, we also find that even with the same or worse pretraining performance, EBTs usually perform better on downstream tasks than the feed-forward Transformer++, further suggesting improved generalization. 🎯

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[7/N] 🧠We can also investigate the thinking capabilities of EBTs compared to the Transformer++ by increasing the amount of compute at inference time.

We find that EBTs can out-generalize the Transformer++ on out-of-distribution data by thinking longer, and that thinking also improves with scale.
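A minimal sketch (not the authors' code) of what "thinking longer" can mean for an energy-based predictor: inference-time compute is simply the number of refinement steps spent descending the energy. The quadratic energy_fn, the think helper, and the step counts below are illustrative stand-ins, assuming predictions are refined by gradient descent on a learned energy.

```python
import torch

# Toy stand-in for a learned verifier/energy: squared distance of the prediction
# from a fixed target direction (purely illustrative, not a trained EBT).
target = torch.randn(32)

def energy_fn(y: torch.Tensor) -> torch.Tensor:
    return ((y - target) ** 2).sum(dim=-1)

def think(y_init: torch.Tensor, num_steps: int, step_size: float = 0.05) -> torch.Tensor:
    """Spend `num_steps` of inference compute refining the prediction by gradient
    descent on the energy; more steps = thinking longer about the same input."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(num_steps):
        grad, = torch.autograd.grad(energy_fn(y).sum(), y)
        y = (y - step_size * grad).detach().requires_grad_(True)
    return y.detach()

y0 = torch.randn(4, 32)                       # same initial guesses under both budgets
quick = think(y0, num_steps=2)                # little thinking
slow = think(y0, num_steps=50)                # much more thinking, same energy function
print(energy_fn(quick).mean().item(), energy_fn(slow).mean().item())  # slow should be lower
```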

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[6/N] Of particular note is the data scaling, where we consistently observe EBTs being more than 30% more data-efficient than the Transformer++. This is especially important because frontier labs say we are now data-constrained, making more data-efficient algorithms essential.

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[5/N] We compare autoregressive EBTs against the SOTA recipe (Transformer++) in language modeling. We observe that EBTs consistently scale at a higher rate than the Transformer++ with respect to data, batch size, depth, FLOPs, and parameters. 📈

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[4/N] So if EBMs are so promising, why are they uncommon, and why haven’t they been used at scale?

EBMs have struggled to scale due to issues with stability and parallelization. Therefore, we design Transformers specifically to solve these issues, which we call Energy-Based Transformers (EBTs).

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[3/N] So what are EBMs? 💭

EBMs learn to assign a scalar energy value denoting the compatibility of inputs.

Then, EBMs learn to optimize predictions to minimize this energy.

This allows EBMs to know when a problem is difficult (high energy) and to adjust resources until a good solution is found.
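A minimal sketch of this idea, assuming a toy MLP energy function and plain gradient descent on the prediction (the actual EBT architecture and optimization details are in the paper): the model scores (input, prediction) compatibility, refines the prediction to lower the energy, and the remaining energy acts as a difficulty signal for how much compute to keep spending.

```python
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    """Maps an (input, candidate prediction) pair to a scalar energy; lower = more compatible."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([x, y], dim=-1)).squeeze(-1)

def predict(model: ToyEnergyModel, x: torch.Tensor,
            max_steps: int = 50, step_size: float = 0.1, good_enough: float = 0.05):
    """Refine a random guess by descending the energy; high remaining energy signals a
    hard problem, so compute keeps being spent until the energy is low or the budget ends."""
    y = torch.randn_like(x, requires_grad=True)           # initial guess
    for _ in range(max_steps):
        energy = model(x, y)
        if energy.max().item() < good_enough:             # "good solution found": stop early
            break
        grad, = torch.autograd.grad(energy.sum(), y)
        y = (y - step_size * grad).detach().requires_grad_(True)
    return y.detach(), model(x, y).detach()               # prediction + its final energy

model = ToyEnergyModel()                                  # untrained, for illustration only
x = torch.randn(4, 32)
y_hat, final_energy = predict(model, x)                   # high final_energy = "this was hard"
```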

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

[2/N] 🤔 So how can models learn to think from unsupervised learning?

It turns out that there's an elegant solution: 💡
- Learn to verify predictions
- Optimize predictions with respect to this verifier

This is exactly what Energy-Based Models (EBMs) are! EBMs enable thinking for longer and self-verification.
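A hedged sketch of how this verify-then-optimize recipe could be trained end to end, under one common assumption (not necessarily the exact EBT objective): unroll a few differentiable refinement steps against the learned verifier and supervise the refined prediction with an ordinary regression loss, so the loss shapes the energy landscape itself. ToyVerifier, the identity-mapping task, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class ToyVerifier(nn.Module):
    """Learned verifier: scalar energy for (context, prediction) pairs (lower = better)."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def refine(verifier: ToyVerifier, x: torch.Tensor, y: torch.Tensor,
           steps: int = 3, step_size: float = 0.5) -> torch.Tensor:
    """Optimize the prediction against the verifier; create_graph=True keeps the unrolled
    steps differentiable so training can shape the energy landscape."""
    for _ in range(steps):
        grad, = torch.autograd.grad(verifier(x, y).sum(), y, create_graph=True)
        y = y - step_size * grad
    return y

verifier = ToyVerifier()
opt = torch.optim.Adam(verifier.parameters(), lr=1e-3)

for _ in range(100):                                  # toy task: the "correct" prediction is x itself
    x = torch.randn(32, 16)
    y0 = torch.randn_like(x).requires_grad_(True)     # start predictions from noise
    y_pred = refine(verifier, x, y0)                  # optimize w.r.t. the learned verifier
    loss = ((y_pred - x) ** 2).mean()                 # supervise the refined prediction...
    opt.zero_grad()
    loss.backward()                                   # ...backpropagating through the unrolled steps
    opt.step()
```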

07.07.2025 20:33 | 👍 1    🔁 0    💬 1    📌 0

[1/N] First, how can we generalize reasoning/System 2 Thinking to any problem/modality?🧐

Current approaches rely on verifiable rewards, but humans are able to think about any problem.

To achieve such general thinking, we argue that models should learn to think directly from unsupervised learning.

07.07.2025 20:33 | 👍 1    🔁 0    💬 1    📌 0

[0/N]
TLDR:
- EBTs are the first model to outscale the Transformer++ during pretraining across modalities and with respect to data, parameters, FLOPs, depth, etc.
- EBTs achieve a +29% improvement over the Transformer++ via thinking longer
- EBTs exhibit better generalization than existing models

07.07.2025 20:33 | 👍 0    🔁 0    💬 1    📌 0
Post image

How can we unlock generalized reasoning?

⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards.

🧵Thread:

07.07.2025 20:33 | 👍 9    🔁 3    💬 1    📌 0
