
Samuel Lavoie

@lavoiems.bsky.social

PhD candidate @Mila_quebec, @UMontreal. Ex: FAIR @AIatMeta. Learning representations, minimizing free energy, running.

260 Followers  |  109 Following  |  13 Posts  |  Joined: 06.02.2024

Latest posts by lavoiems.bsky.social on Bluesky


Compositionality is a central desideratum for intelligent systems...but it's a fuzzy concept and difficult to quantify. In this blog post, lab member @ericelmoznino.bsky.social outlines ideas toward formalizing it & surveys recent work. A must-read for researchers interested in AI and neuroscience.

19.08.2025 13:50 — 👍 21    🔁 5    💬 0    📌 0

This work wouldnโ€™t exist without my amazing co-authors:
@mnoukhov.bsky.social & @AaronCourville 🙏

22.07.2025 14:41 — 👍 0    🔁 0    💬 0    📌 0
GitHub - lavoiems/DiscreteLatentCode: Official repository for the article "Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models" (https://arxiv.org/abs/2507.12318)

Code & Models are open source:
💾 github.com/lavoiems/Dis...
📜 https://arxiv.org/pdf/2507.12318

Reproduce, remix, build your own DLC-powered models.

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0

Example: There are no โ€œteapots on mountainsโ€ in ImageNet.

We verify this via nearest-neighbor search in DINOv2 space.
But our model can still create themโ€”by composing concepts it learned separately.

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
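The nearest-neighbor check described above can be sketched as a cosine-similarity lookup in embedding space. A minimal sketch, where random vectors stand in for real DINOv2 features (the embedding width, dataset size, and `k` are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768                                     # assumed embedding width

dataset = rng.standard_normal((10_000, D))  # stand-in for ImageNet embeddings
query = rng.standard_normal(D)              # stand-in for the generated image's embedding

def nearest_neighbors(q, X, k=5):
    """Return indices and cosine similarities of the k nearest rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    sims = Xn @ qn
    order = np.argsort(-sims)[:k]
    return order, sims[order]

idx, sims = nearest_neighbors(query, dataset)
# If none of the top neighbours resemble "a teapot on a mountain",
# the combination is absent from the training set.
```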

LLMs can speak in DLC!

We fine-tune a language model to sample DLC tokens from text, giving us a pipeline:
Text → DLC → Image
This also enables generation beyond ImageNet.

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
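The Text → DLC → Image pipeline reduces to two function calls. Both functions below are hypothetical stubs for the fine-tuned language model and the DLC-conditioned diffusion decoder (code length, vocabulary size, and image shape are assumptions):

```python
import numpy as np

L, V = 32, 64  # assumed: DLC length and per-token vocabulary size

def text_to_dlc(caption: str) -> np.ndarray:
    """Stub for the fine-tuned LM: caption -> sequence of DLC tokens."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.integers(0, V, size=L)

def dlc_to_image(dlc: np.ndarray) -> np.ndarray:
    """Stub for the DLC-conditioned diffusion decoder."""
    return np.zeros((3, 256, 256))  # placeholder "image"

# Text -> DLC -> Image
image = dlc_to_image(text_to_dlc("a teapot on a mountain"))
```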

DLCs are compositional.
Swap tokens between two images (🐕 Komondor + 🍝 Carbonara) → the model produces coherent hybrids never seen during training.

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
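A minimal sketch of the swap experiment, with random integer codes standing in for the two images' DLCs (which positions get swapped is an illustrative choice, not the paper's protocol):

```python
import numpy as np

rng = np.random.default_rng(0)
L, V = 32, 64                               # assumed code length / vocabulary

dlc_komondor = rng.integers(0, V, size=L)   # stand-in DLC for the dog image
dlc_carbonara = rng.integers(0, V, size=L)  # stand-in DLC for the pasta image

# Splice the first half of one code into the other.
hybrid = dlc_komondor.copy()
hybrid[: L // 2] = dlc_carbonara[: L // 2]
# Decoding `hybrid` is what yields the never-seen combination.
```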

🚀 Results:

DiT-XL/2 + DLC → FID 1.59 on unconditional ImageNet

Works well with and without classifier-free guidance

Learns faster and better than prior work using pre-trained encoders

🤯

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0

Unconditional generation pipeline:
1. Sample a DLC (e.g., with SEDD)
2. Decode it into an image (e.g., with DiT)

This ancestral sampling approach is simple but powerful.

22.07.2025 14:41 — 👍 1    🔁 0    💬 1    📌 0
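The two stages map onto two sampling calls. In this sketch, `sample_dlc` and `decode` are stubs standing in for a discrete prior such as SEDD and a DLC-conditioned decoder such as DiT; they are not the released API, and the shapes are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
L, V = 32, 64  # assumed DLC length and vocabulary size

def sample_dlc():
    """Stub for c ~ p(c), e.g. a discrete diffusion model such as SEDD."""
    return rng.integers(0, V, size=L)

def decode(dlc):
    """Stub for x ~ p(x | c), e.g. a DLC-conditioned DiT."""
    return rng.standard_normal((3, 256, 256))

c = sample_dlc()   # stage 1: sample a discrete latent code
x = decode(c)      # stage 2: decode it into an image
```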

DLCs enable exactly this.
Images → sequences of discrete tokens via a Simplicial Embedding (SEM) encoder

We take the argmax over the token distributions → get the DLC sequence

Think of it as “tokenizing” images—like words for LLMs.

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
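Concretely, if the SEM encoder outputs L groups of V-way logits for an image, the DLC is the per-group argmax. A sketch under assumed shapes (32 token slots, 64-way vocabulary; not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, V = 32, 64                          # assumed: 32 token slots, 64-way vocab each

logits = rng.standard_normal((L, V))   # stand-in for the SEM encoder output

# Softmax within each group gives L points on the simplex (the SEM)...
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# ...and the per-group argmax is the discrete latent code.
dlc = probs.argmax(-1)                 # shape (L,), entries in [0, V)
```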

Text models don’t have this problem! LLMs can model internet-scale corpora.

So… can we improve image generation of highly multimodal distributions by decomposing it into:

1. Generating discrete tokens - p(c)
2. Decoding tokens into images - p(x|c)

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
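In symbols, this decomposition is the standard mixture factorization over a discrete latent (notation mine, matching the p(c) and p(x|c) of the post):

```latex
p(x) \;=\; \sum_{c} p(x \mid c)\, p(c)
```

Sampling c first and then x given c turns one hard multimodal density into one easy discrete model plus one easy conditional model.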

Modeling highly multimodal distributions in continuous space is hard.
Even a simple 2D Gaussian mixture with a large number of modes may be tricky to model directly. Good conditioning solves this!

Could this be why large image generative models are almost always conditional? 🤔

22.07.2025 14:41 — 👍 0    🔁 0    💬 1    📌 0
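A runnable illustration of the point: a many-mode 2D Gaussian mixture is trivial to sample once you condition on a discrete mode index, which is exactly the role a discrete latent plays. All numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes = 16

# Mode centres on a 4x4 grid, uniform mixture weights.
centers = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float)
weights = np.full(n_modes, 1.0 / n_modes)

def sample(n):
    c = rng.choice(n_modes, size=n, p=weights)           # discrete: which mode
    x = centers[c] + 0.05 * rng.standard_normal((n, 2))  # continuous: x given c
    return c, x

c, x = sample(1000)
# Modeling p(x) directly must capture all 16 modes at once;
# p(x | c) is just one narrow Gaussian per mode.
```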

🧵 Everyone is chasing new diffusion models—but what about the representations they model from?
We introduce Discrete Latent Codes (DLCs):
- Discrete representations for diffusion models
- Unconditional generation SOTA FID (1.59 on ImageNet)
- Compositional generation
- Integrates with LLMs
🧱

22.07.2025 14:41 — 👍 5    🔁 3    💬 1    📌 0
Modeling Caption Diversity in Contrastive Vision-Language Pretraining There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like mo...

The code and model weights for Llip are finally out! I hope you will find this model useful!
Paper: arxiv.org/abs/2405.00740
Code: github.com/facebookrese...
Models:
- ViT-G: huggingface.co/lavoies/llip...
- ViT-B: huggingface.co/lavoies/llip...

17.07.2025 13:59 — 👍 6    🔁 1    💬 0    📌 1

Congrats Lucas! Looking forward to seeing what comes out of your lab in Zurich!

05.12.2024 12:55 — 👍 1    🔁 0    💬 0    📌 0
