11/n Joint work with @sta8is.bsky.social, @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social, and Nikos Komodakis
Paper: arxiv.org/abs/2504.16064
Code: github.com/zelaki/ReDi
10/n We apply PCA to the DINOv2 features to retain their expressivity without letting them dominate model capacity. Just a few principal components suffice to significantly boost generative performance.
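Not the exact paper code, but roughly the idea: fit PCA over precomputed DINOv2 patch tokens and keep only the top-k components as the compact semantic target. Here `feats` (shape [N, T, 768]) and k=8 are illustrative assumptions, not the paper's settings.

```python
# Sketch: PCA-reduce DINOv2 patch features to a few principal components.
import torch

def fit_pca(feats: torch.Tensor, k: int = 8):
    """feats: [N, T, 768] DINOv2 patch tokens. Returns (mean, components)."""
    x = feats.reshape(-1, feats.shape[-1]).float()   # [N*T, 768]
    mean = x.mean(dim=0, keepdim=True)
    # top-k right singular vectors of the centered data
    _, _, v = torch.pca_lowrank(x - mean, q=k)       # v: [768, k]
    return mean, v

def project(feats: torch.Tensor, mean: torch.Tensor, v: torch.Tensor):
    """Project patch tokens onto the top-k PCs -> compact semantic tokens [..., k]."""
    return (feats - mean) @ v
```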
09/n Unconditional generation gets a huge upgrade too. ReDi + Representation Guidance (RG) nearly closes the gap with conditional models. E.g., unconditional DiT-XL/2 with ReDi+RG hits FID 22.6, close to the class-conditioned DiT-XL's FID of 19.5! 💪
08/n ReDi delivers state-of-the-art results, with exceptional generation performance across the board. 🔥
07/n Training speed? Massive improvements for both DiT and SiT:
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA.
6/n ReDi requires no extra distillation losses, only the standard diffusion objective, which significantly simplifies training. Plus, it unlocks Representation Guidance (RG), a new inference strategy that uses the learned semantics to steer and refine image generation. 🎯
05/n We explore two ways to fuse the image-latent and feature tokens (sketched after this list):
- Merged Tokens (MR): Efficient, keeps token count constant
- Separate Tokens (SP): More expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
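To make MR vs. SP concrete, here is a hedged sketch under assumed shapes (latent tokens [B, T, Dz], feature tokens [B, T, Ds]); the linear projections are illustrative and the paper's actual fusion layers may differ.

```python
# Sketch of the two token-fusion options.
import torch
import torch.nn as nn

class MergedTokens(nn.Module):
    """MR: fuse each latent/feature pair into ONE token -> token count stays T."""
    def __init__(self, dz: int, ds: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(dz + ds, d_model)

    def forward(self, z, s):
        return self.proj(torch.cat([z, s], dim=-1))                 # [B, T, d_model]

class SeparateTokens(nn.Module):
    """SP: keep latents and features as SEPARATE tokens -> 2T tokens, ~2x compute."""
    def __init__(self, dz: int, ds: int, d_model: int):
        super().__init__()
        self.proj_z = nn.Linear(dz, d_model)
        self.proj_s = nn.Linear(ds, d_model)

    def forward(self, z, s):
        return torch.cat([self.proj_z(z), self.proj_s(s)], dim=1)   # [B, 2T, d_model]
```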
4/n Integrating ReDi into DiT/SiT-style architectures is seamless (a rough training-step sketch follows the list):
- Apply noise to both image latents and semantic features
- Fuse them into one token sequence
- Denoise both with standard DiT/SiT
That's it.
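A rough sketch of what one training step could look like under simple assumptions: DDPM-style noising, a shared timestep for latents and features, and an unweighted MSE noise-prediction loss on both streams. These details are assumptions, not the paper's exact recipe.

```python
# Sketch of a dual-space diffusion training step (assumed DDPM-style formulation).
import torch
import torch.nn.functional as F

def training_step(model, z0, s0, alphas_cumprod):
    """z0: clean VAE latents [B, T, Dz]; s0: clean semantic tokens [B, T, Ds]."""
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    a = alphas_cumprod[t].view(B, 1, 1)

    eps_z, eps_s = torch.randn_like(z0), torch.randn_like(s0)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps_z      # noisy image latents
    st = a.sqrt() * s0 + (1 - a).sqrt() * eps_s      # noisy semantic features

    # `model` fuses the two streams (e.g. a fusion module like MergedTokens above)
    # and predicts the noise for both of them.
    pred_eps_z, pred_eps_s = model(zt, st, t)
    return F.mse_loss(pred_eps_z, eps_z) + F.mse_loss(pred_eps_s, eps_s)
```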
3/n ReDi builds on the insight that some latent representations are inherently easier to model (h/t @sedielem.bsky.social's blog), enabling a unified dual-space diffusion approach that generates coherent image-feature pairs from pure noise.
02/n The result?
🔹 A powerful new method for generative image modeling that bridges generation and representation learning.
➡️ Brings massive gains in performance and training efficiency, and a new paradigm for representation-aware generative modeling.
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
✅ Low-level image details (via VAE latents)
✅ High-level semantic features (via DINOv2) 🧵
10/n
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social and Nikos Komodakis
Paper: arxiv.org/abs/2502.09509
Code: github.com/zelaki/eqvae
HuggingFace Model: huggingface.co/zelaki/eq-va...
9/n How fast does EQ-VAE refine the latents?
We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.
8/n Why does EQ-VAE help so much?
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold.
🔹 This makes the latent space simpler and easier to model.
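The thread doesn't say which ID estimator was used; as one way to reproduce the comparison yourself, here is a generic Levina-Bickel MLE sketch you could run on flattened latents before and after EQ-VAE fine-tuning.

```python
# Sketch: Levina-Bickel MLE estimate of intrinsic dimension (an assumed estimator,
# not necessarily the one used in the paper).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension_mle(x: np.ndarray, k: int = 20) -> float:
    """x: [N, D] array of flattened latents; returns the average MLE intrinsic dimension."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x)
    dist, _ = nn.kneighbors(x)                        # dist[:, 0] is the point itself
    dist = dist[:, 1:]                                # [N, k] neighbor distances
    # per-point MLE: (k-1) / sum_j log(T_k / T_j)
    log_ratio = np.log(dist[:, -1:] / dist[:, :-1])   # [N, k-1]
    m = (k - 1) / log_ratio.sum(axis=1)
    return float(m.mean())
```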
7/n Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: training time 300 → 130 epochs (2× speedup)
6/n EQ-VAE provides a plug-and-play enhancement: no architectural changes are needed, and it works seamlessly with
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
5/n EQ-VAE fixes this by introducing a simple regularization objective:
It aligns reconstructions of transformed latents with the correspondingly transformed inputs (a minimal sketch of the loss follows).
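A minimal sketch of what such an equivariance term can look like, assuming a generic `vae` with `.encode()`/`.decode()` and a simple downscaling transform; the paper's actual set of transforms and loss weighting may differ.

```python
# Sketch of an EQ-VAE-style equivariance term: decode(tau(z)) should match tau(x).
import torch
import torch.nn.functional as F

def eq_loss(vae, x, scale: float = 0.5):
    """x: images [B, 3, H, W]; assumes the decoder's upsampling factor divides H and W."""
    z = vae.encode(x)                                             # latent [B, C, h, w]
    tau = lambda img: F.interpolate(img, scale_factor=scale,
                                    mode="bilinear", align_corners=False)
    x_hat = vae.decode(tau(z))                                    # reconstruct from transformed latent
    return F.mse_loss(x_hat, tau(x))                              # align with the transformed input
```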
4/n The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine.
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
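You can check this yourself in a few lines. Sketch only, assuming a generic `vae` with `.encode()`/`.decode()` and images `x` of shape [B, 3, H, W]: compare reconstructing a downscaled image against decoding a downscaled latent.

```python
# Sketch: measure how much reconstruction degrades when the latent (rather than
# the image) is scaled.
import torch
import torch.nn.functional as F

def equivariance_gap(vae, x, scale: float = 0.5):
    down = lambda img: F.interpolate(img, scale_factor=scale,
                                     mode="bilinear", align_corners=False)
    rec_from_scaled_image = vae.decode(vae.encode(down(x)))    # usually fine
    rec_from_scaled_latent = vae.decode(down(vae.encode(x)))   # degrades for vanilla VAEs
    target = down(x)
    return (F.mse_loss(rec_from_scaled_image, target).item(),
            F.mse_loss(rec_from_scaled_latent, target).item())
```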
3/n Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
2/n Why EQ-VAE?
🔹 Smoother latent space = easier to model & better generative performance.
🔹 No trade-off in reconstruction quality: rFID improves too!
🔹 Works as a plug-and-play enhancement, with no architectural changes needed!
1/n If you're working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.