11/n Joint work with @sta8is.bsky.social, @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social, and Nikos Komodakis
Paper: arxiv.org/abs/2504.16064
Code: github.com/zelaki/ReDi
10/n We apply PCA to the DINOv2 features to retain their expressivity without letting them dominate model capacity. Just a few principal components suffice to significantly boost generative performance.
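Not the exact paper code, but roughly the idea: fit PCA over precomputed DINOv2 patch tokens and keep only the top-k components as the compact semantic target. Here `feats` (shape [N, T, 768]) and k=8 are illustrative assumptions, not the paper's settings.

```python
# Sketch: PCA-reduce DINOv2 patch features to a few principal components.
import torch

def fit_pca(feats: torch.Tensor, k: int = 8):
    """feats: [N, T, 768] DINOv2 patch tokens. Returns (mean, components)."""
    x = feats.reshape(-1, feats.shape[-1]).float()   # [N*T, 768]
    mean = x.mean(dim=0, keepdim=True)
    # top-k right singular vectors of the centered data
    _, _, v = torch.pca_lowrank(x - mean, q=k)       # v: [768, k]
    return mean, v

def project(feats: torch.Tensor, mean: torch.Tensor, v: torch.Tensor):
    """Project patch tokens onto the top-k PCs -> compact semantic tokens [..., k]."""
    return (feats - mean) @ v
```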
09/n Unconditional generation gets a huge upgrade too. ReDi + Representation Guidance (RG) nearly closes the gap with conditional models. E.g., unconditional DiT-XL/2 with ReDi+RG hits FID 22.6, close to the class-conditioned DiT-XL's FID of 19.5! 💪
08/n ReDi delivers state-of-the-art results, with exceptional generation performance across the board. 🔥
07/n Training speed? Massive improvements for both DiT and SiT:
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA.
6/n ReDi requires no extra distillation losses, only the standard diffusion objective, which significantly simplifies training. Plus, it unlocks Representation Guidance (RG), a new inference strategy that uses the learned semantics to steer and refine image generation. 🎯
05/n We explore two ways to fuse the image-latent and feature tokens (sketched after this list):
- Merged Tokens (MR): Efficient, keeps token count constant
- Separate Tokens (SP): More expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
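To make MR vs. SP concrete, here is a hedged sketch under assumed shapes (latent tokens [B, T, Dz], feature tokens [B, T, Ds]); the linear projections are illustrative and the paper's actual fusion layers may differ.

```python
# Sketch of the two token-fusion options.
import torch
import torch.nn as nn

class MergedTokens(nn.Module):
    """MR: fuse each latent/feature pair into ONE token -> token count stays T."""
    def __init__(self, dz: int, ds: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(dz + ds, d_model)

    def forward(self, z, s):
        return self.proj(torch.cat([z, s], dim=-1))                 # [B, T, d_model]

class SeparateTokens(nn.Module):
    """SP: keep latents and features as SEPARATE tokens -> 2T tokens, ~2x compute."""
    def __init__(self, dz: int, ds: int, d_model: int):
        super().__init__()
        self.proj_z = nn.Linear(dz, d_model)
        self.proj_s = nn.Linear(ds, d_model)

    def forward(self, z, s):
        return torch.cat([self.proj_z(z), self.proj_s(s)], dim=1)   # [B, 2T, d_model]
```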
4/n Integrating ReDi into DiT/SiT-style architectures is seamless (a rough training-step sketch follows the list):
- Apply noise to both image latents and semantic features
- Fuse them into one token sequence
- Denoise both with standard DiT/SiT
That's it.
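A rough sketch of what one training step could look like under simple assumptions: DDPM-style noising, a shared timestep for latents and features, and an unweighted MSE noise-prediction loss on both streams. These details are assumptions, not the paper's exact recipe.

```python
# Sketch of a dual-space diffusion training step (assumed DDPM-style formulation).
import torch
import torch.nn.functional as F

def training_step(model, z0, s0, alphas_cumprod):
    """z0: clean VAE latents [B, T, Dz]; s0: clean semantic tokens [B, T, Ds]."""
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    a = alphas_cumprod[t].view(B, 1, 1)

    eps_z, eps_s = torch.randn_like(z0), torch.randn_like(s0)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps_z      # noisy image latents
    st = a.sqrt() * s0 + (1 - a).sqrt() * eps_s      # noisy semantic features

    # `model` fuses the two streams (e.g. a fusion module like MergedTokens above)
    # and predicts the noise for both of them.
    pred_eps_z, pred_eps_s = model(zt, st, t)
    return F.mse_loss(pred_eps_z, eps_z) + F.mse_loss(pred_eps_s, eps_s)
```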
3/n ReDi builds on the insight that some latent representations are inherently easier to model (h/t @sedielem.bsky.social's blog), enabling a unified dual-space diffusion approach that generates coherent image-feature pairs from pure noise.
02/n The result?
🔹 A powerful new method for generative image modeling that bridges generation and representation learning.
➡️ Brings massive gains in performance and training efficiency, and a new paradigm for representation-aware generative modeling.
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
✅ Low-level image details (via VAE latents)
✅ High-level semantic features (via DINOv2) 🧵
10/n
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social and Nikos Komodakis
Paper: arxiv.org/abs/2502.09509
Code: github.com/zelaki/eqvae
HuggingFace Model: huggingface.co/zelaki/eq-va...
9/n How fast does EQ-VAE refine the latents?
We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.
8/n Why does EQ-VAE help so much?
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold.
🔹 This makes the latent space simpler and easier to model.
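The thread doesn't say which ID estimator was used; as one way to reproduce the comparison yourself, here is a generic Levina-Bickel MLE sketch you could run on flattened latents before and after EQ-VAE fine-tuning.

```python
# Sketch: Levina-Bickel MLE estimate of intrinsic dimension (an assumed estimator,
# not necessarily the one used in the paper).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension_mle(x: np.ndarray, k: int = 20) -> float:
    """x: [N, D] array of flattened latents; returns the average MLE intrinsic dimension."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x)
    dist, _ = nn.kneighbors(x)                        # dist[:, 0] is the point itself
    dist = dist[:, 1:]                                # [N, k] neighbor distances
    # per-point MLE: (k-1) / sum_j log(T_k / T_j)
    log_ratio = np.log(dist[:, -1:] / dist[:, :-1])   # [N, k-1]
    m = (k - 1) / log_ratio.sum(axis=1)
    return float(m.mean())
```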
7/n Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: training time 300 → 130 epochs (2× speedup)
6/n EQ-VAE provides a plug-and-play enhancement: no architectural changes are needed, and it works seamlessly with
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
5/n EQ-VAE fixes this by introducing a simple regularization objective:
It aligns reconstructions of transformed latents with the correspondingly transformed inputs (a minimal sketch of the loss follows).
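A minimal sketch of what such an equivariance term can look like, assuming a generic `vae` with `.encode()`/`.decode()` and a simple downscaling transform; the paper's actual set of transforms and loss weighting may differ.

```python
# Sketch of an EQ-VAE-style equivariance term: decode(tau(z)) should match tau(x).
import torch
import torch.nn.functional as F

def eq_loss(vae, x, scale: float = 0.5):
    """x: images [B, 3, H, W]; assumes the decoder's upsampling factor divides H and W."""
    z = vae.encode(x)                                             # latent [B, C, h, w]
    tau = lambda img: F.interpolate(img, scale_factor=scale,
                                    mode="bilinear", align_corners=False)
    x_hat = vae.decode(tau(z))                                    # reconstruct from transformed latent
    return F.mse_loss(x_hat, tau(x))                              # align with the transformed input
```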
4/n The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine.
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
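You can check this yourself in a few lines. Sketch only, assuming a generic `vae` with `.encode()`/`.decode()` and images `x` of shape [B, 3, H, W]: compare reconstructing a downscaled image against decoding a downscaled latent.

```python
# Sketch: measure how much reconstruction degrades when the latent (rather than
# the image) is scaled.
import torch
import torch.nn.functional as F

def equivariance_gap(vae, x, scale: float = 0.5):
    down = lambda img: F.interpolate(img, scale_factor=scale,
                                     mode="bilinear", align_corners=False)
    rec_from_scaled_image = vae.decode(vae.encode(down(x)))    # usually fine
    rec_from_scaled_latent = vae.decode(down(vae.encode(x)))   # degrades for vanilla VAEs
    target = down(x)
    return (F.mse_loss(rec_from_scaled_image, target).item(),
            F.mse_loss(rec_from_scaled_latent, target).item())
```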
3/n Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
2/n Why EQ-VAE?
🔹 Smoother latent space = easier to model & better generative performance.
🔹 No trade-off in reconstruction quality: rFID improves too!
🔹 Works as a plug-and-play enhancement, with no architectural changes needed!
1/n If you're working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.