Thodoris Kouzelis

@nicolabourbaki.bsky.social

1st-year PhD candidate at Archimedes, Athena RC & NTUA

36 Followers  |  38 Following  |  21 Posts  |  Joined: 18.02.2025

Latest posts by nicolabourbaki.bsky.social on Bluesky

Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image model...

11/n Joint work with @sta8is.bsky.social, @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social, and Nikos Komodakis
Paper: arxiv.org/abs/2504.16064
Code: github.com/zelaki/ReDi

25.04.2025 07:23 - 👍 1    🔁 0    💬 0    📌 0

10/n We apply PCA to the DINOv2 features so they retain expressivity without dominating model capacity. Just a few principal components suffice to significantly boost generative performance. (Sketch below.)
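
In code, the reduction step looks roughly like this (a sketch, not the repo code: the hub model, token key, and component count are illustrative, and the PCA basis is fit per batch here rather than once over a dataset):

```python
import torch

# Load a DINOv2 backbone from torch.hub (ViT-B/14 chosen for illustration).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def pca_features(images: torch.Tensor, n_components: int = 8) -> torch.Tensor:
    """images: (B, 3, H, W) with H, W multiples of 14 -> (B, N, n_components)."""
    feats = dinov2.forward_features(images)["x_norm_patchtokens"]  # (B, N, 768)
    B, N, D = feats.shape
    flat = feats.reshape(B * N, D)
    flat = flat - flat.mean(dim=0, keepdim=True)   # center before PCA
    # Top principal directions; in practice the basis would be fit once
    # over a dataset, not per batch.
    _, _, V = torch.pca_lowrank(flat, q=n_components, center=False)
    return (flat @ V).reshape(B, N, n_components)  # project onto top PCs
```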

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

9/n Unconditional generation gets a huge upgrade too. ReDi + Representation Guidance (RG) nearly closes the gap with conditional models. E.g., unconditional DiT-XL/2 with ReDi+RG hits FID 22.6, close to class-conditioned DiT-XL's FID 19.5! 💪

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

8/n ReDi delivers state-of-the-art generation performance across the board. 🔥

25.04.2025 07:23 - 👍 2    🔁 0    💬 1    📌 0

7/n Training speed? Massive improvements for both DiT and SiT:
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA. 🚀

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

6/n ReDi requires no extra distillation losses, just pure diffusion, significantly simplifying training. Plus, it unlocks Representation Guidance (RG), a new inference strategy that uses learned semantics to steer and refine image generation. 🎯
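
Purely to give intuition for "guidance from learned semantics": a hypothetical CFG-style sketch, not the paper's actual rule; the model interface and the feature-noising trick are assumptions:

```python
import torch

def representation_guided_eps(model, z_t, f_t, t, w: float = 1.5):
    """Hypothetical CFG-style guidance on the feature half of a joint
    image-latent/feature diffusion model. z_t: noisy image latents,
    f_t: noisy semantic features, w: guidance weight."""
    eps_full = model(z_t, f_t, t)                     # semantics informative
    eps_blind = model(z_t, torch.randn_like(f_t), t)  # semantics destroyed
    # Extrapolate toward the semantically informed prediction.
    return eps_blind + w * (eps_full - eps_blind)
```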

25.04.2025 07:23 - 👍 3    🔁 0    💬 1    📌 0

5/n We explore two ways to fuse the image-latent and feature tokens (sketch after this list):
- Merged Tokens (MR): efficient, keeps the token count constant
- Separate Tokens (SP): more expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
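
Roughly, the two fusion options look like this (a sketch; all dimensions and projection layers are illustrative):

```python
import torch
import torch.nn as nn

B, N, Dz, Df, D = 4, 256, 16, 8, 768  # batch, tokens, latent/feature/model dims
z_tok = torch.randn(B, N, Dz)         # noisy VAE-latent tokens
f_tok = torch.randn(B, N, Df)         # noisy (PCA-reduced) feature tokens

# Merged Tokens (MR): concatenate channels per token, then project.
# Sequence length stays N, so transformer compute is unchanged.
merge = nn.Linear(Dz + Df, D)
mr_seq = merge(torch.cat([z_tok, f_tok], dim=-1))        # (B, N, D)

# Separate Tokens (SP): embed each modality and concatenate along the
# sequence axis; the token count doubles (~2x attention compute).
z_emb, f_emb = nn.Linear(Dz, D), nn.Linear(Df, D)
sp_seq = torch.cat([z_emb(z_tok), f_emb(f_tok)], dim=1)  # (B, 2N, D)
```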

25.04.2025 07:23 - 👍 2    🔁 0    💬 1    📌 0

4/n Integrating ReDi into DiT/SiT-style architectures is seamless (see the sketch after this list):
- Apply noise to both the image latents and the semantic features
- Fuse them into one token sequence
- Denoise both with a standard DiT/SiT
That's it.
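
A minimal sketch of one training step under those three bullets, assuming a DDPM-style schedule; the alpha_bar lookup, the encoders, and the model interface are all assumptions:

```python
import torch
import torch.nn.functional as F

def redi_style_step(model, vae_encode, feat_extract, alpha_bar, x):
    """Hypothetical joint-diffusion training step: noise image latents
    and semantic features with shared timesteps, denoise both at once."""
    with torch.no_grad():
        z0 = vae_encode(x)            # low-level image latents, (B, N, Dz)
        f0 = feat_extract(x)          # e.g., PCA-reduced DINOv2 features
    t = torch.randint(0, 1000, (x.shape[0],))
    a = alpha_bar(t).view(-1, 1, 1)  # cumulative schedule, hypothetical lookup
    eps_z, eps_f = torch.randn_like(z0), torch.randn_like(f0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps_z  # noise both spaces
    f_t = a.sqrt() * f0 + (1 - a).sqrt() * eps_f
    pred_z, pred_f = model(z_t, f_t, t)           # fused inside the DiT/SiT
    return F.mse_loss(pred_z, eps_z) + F.mse_loss(pred_f, eps_f)
```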

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

3/n ReDi builds on the insight that some latent representations are inherently easier to model (h/t @sedielem.bsky.social's blog), enabling a unified dual-space diffusion approach that generates coherent image–feature pairs from pure noise.

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

2/n The result?
🔗 A powerful new method for generative image modeling that bridges generation and representation learning.
⚡️ Massive gains in performance and training efficiency, and a new paradigm for representation-aware generative modeling.

25.04.2025 07:23 - 👍 1    🔁 0    💬 1    📌 0

1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
- Low-level image details (via VAE latents)
- High-level semantic features (via DINOv2) 🧵

25.04.2025 07:23 - 👍 21    🔁 3    💬 1    📌 1
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model...

10/n
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social and Nikos Komodakis
Paper: arxiv.org/abs/2502.09509
Code: github.com/zelaki/eqvae
HuggingFace Model: huggingface.co/zelaki/eq-va...

18.02.2025 14:31 - 👍 0    🔁 0    💬 0    📌 0

9/n How fast does EQ-VAE refine the latents?
We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

8/n Why does EQ-VAE help so much?
We find a strong correlation between latent-space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold (one standard ID estimator is sketched below).
🔹 This makes the latent space simpler and easier to model.
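
For concreteness, one standard ID estimator that can be run on flattened latents (TwoNN, Facco et al. 2017; a generic sketch, not necessarily the estimator used in the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(x: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate for x: (n_samples, n_features).
    Uses the ratio of each point's 2nd- to 1st-nearest-neighbor distance."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(x).kneighbors(x)
    mu = dist[:, 2] / dist[:, 1]      # dist[:, 0] is the point itself
    return len(x) / np.log(mu).sum()  # Pareto maximum-likelihood estimate
```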

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

7/n Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: training time 300 → 130 epochs (2× speedup)

18.02.2025 14:26 - 👍 1    🔁 0    💬 1    📌 0

6/n EQ-VAE is a plug-and-play enhancement, with no architectural changes needed. It works seamlessly with:
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

5/n EQ-VAE fixes this with a simple regularization objective (sketched below):
👉 It aligns reconstructions of transformed latents with the correspondingly transformed inputs.
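
In pseudocode, the objective is roughly this (a sketch: the transformation family and how this term is weighted against the usual reconstruction losses follow the paper; here tau is a random downscale):

```python
import random
import torch.nn.functional as F

def equivariance_reg_loss(encoder, decoder, x):
    """Decode a *transformed latent* and match it against the
    *identically transformed input*: D(tau(E(x))) ~ tau(x)."""
    s = random.uniform(0.5, 1.0)  # illustrative: random downscaling factor
    tau = lambda img: F.interpolate(img, scale_factor=s, mode="bilinear",
                                    align_corners=False)
    z = encoder(x)                # latents are spatial: (B, C, h, w)
    return F.mse_loss(decoder(tau(z)), tau(x))
```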

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

4/n The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine.
❌ But if you scale the latent representation directly, the reconstruction degrades significantly. (A quick check is sketched below.)
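
A quick way to reproduce this check with a stock autoencoder (a sketch using diffusers' SD-VAE; preprocessing of x to [-1, 1] is assumed):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
half = lambda t: F.interpolate(t, scale_factor=0.5, mode="bilinear")

@torch.no_grad()
def compare(x):  # x: (1, 3, 512, 512), values in [-1, 1]
    z = vae.encode(x).latent_dist.mean
    rec_img = vae.decode(vae.encode(half(x)).latent_dist.mean).sample  # scale the image: fine
    rec_lat = vae.decode(half(z)).sample  # scale the latent: visibly degrades
    return rec_img, rec_lat
```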

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

3/n Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

2/n Why EQ-VAE?
🔹 Smoother latent space = easier to model and better generative performance.
🔹 No trade-off in reconstruction quality; rFID improves too!
🔹 Works as a plug-and-play enhancement; no architectural changes needed!

18.02.2025 14:26 - 👍 0    🔁 0    💬 1    📌 0

1/n 🚀 If you're working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models. 👇

18.02.2025 14:26 - 👍 18    🔁 8    💬 1    📌 1
