
Michael Tschannen

@mtschannen.bsky.social

Research Scientist @GoogleDeepMind. Representation learning for multimodal understanding and generation. mitscha.github.io

855 Followers  |  382 Following  |  17 Posts  |  Joined: 21.11.2024

Latest posts by mtschannen.bsky.social on Bluesky

Preview: SigLIP2 - a google Collection

HF model collection for transformers:
huggingface.co/collections/...

HF model collection for OpenCLIP and timm:
huggingface.co/collections/...

And of course big_vision checkpoints:
github.com/google-resea...
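
For a quick start, here is a minimal zero-shot classification sketch using the transformers collection above (the checkpoint id and image path are illustrative placeholders):

```python
# Zero-shot classification sketch with a SigLIP 2 checkpoint from the HF
# collection. Checkpoint id and image path are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed id; pick one from the collection
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

# SigLIP-style models expect max-length padding on the text side.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Sigmoid, not softmax: each image-text pair is scored independently.
print(torch.sigmoid(out.logits_per_image))
```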

22.02.2025 15:34 · 👍 2    🔁 1    💬 0    📌 0
Preview: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper:
arxiv.org/abs/2502.14786

HF blog post from @arig23498.bsky.social et al. with a gentle intro to the training recipe and a demo:
huggingface.co/blog/siglip2

Thread with results overview from Xiaohua (only on X, sorry - these are all in the paper):
x.com/XiaohuaZhai/...

22.02.2025 15:34 · 👍 1    🔁 1    💬 1    📌 0

📢2️⃣ Yesterday we released SigLIP 2!

TL;DR: Improved high-level semantics, localization, dense features, and multilingual capabilities, as a drop-in replacement for v1.

Bonus: Variants supporting native aspect ratio and variable sequence length.

A thread with interesting resources 👇
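
For context, v2 keeps the sigmoid image-text objective from SigLIP v1 at its core; a minimal sketch of that loss (variable names are mine, normalization follows the SigLIP paper):

```python
# Sketch of the pairwise sigmoid loss from SigLIP v1, which SigLIP 2 builds on.
# x, y: L2-normalized image/text embeddings, shape (n, d); t, b: learned scalars.
import torch
import torch.nn.functional as F

def sigmoid_loss(x, y, t, b):
    logits = t * x @ y.T + b  # (n, n) pairwise image-text similarities
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(x.shape[0], device=x.device) - 1
    # -log sigmoid(label * logit), summed over texts, averaged over the batch
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()
```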

22.02.2025 15:34 · 👍 12    🔁 1    💬 1    📌 0

Looking for a small or medium-sized VLM? The PaliGemma 2 family spans a compute range of more than 150x!

Not sure yet if you want to invest the time 🪄fine-tuning🪄 it on your data? Give it a try with our ready-to-use "mix" checkpoints:

🤗 huggingface.co/blog/paligem...
🎀 developers.googleblog.com/en/introduci...
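
A minimal sketch of trying a mix checkpoint with transformers (the checkpoint id and prompt are illustrative; see the blog posts above for the released variants):

```python
# Trying a PaliGemma 2 "mix" checkpoint with transformers. Checkpoint id and
# prompt are illustrative placeholders, not pinned to a specific release.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

ckpt = "google/paligemma2-3b-mix-224"  # assumed id
model = PaliGemmaForConditionalGeneration.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")
inputs = processor(text="caption en", images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0], skip_special_tokens=True))
```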

19.02.2025 17:47 · 👍 19    🔁 7    💬 0    📌 0

Check out our detailed report about *Jet* 🌊 - a simple, transformer-based normalizing flow architecture without bells and whistles.

Jet is an important part of JetFormer's engine ⚙️. As a standalone model, it is very tame and behaves predictably (e.g., when scaled up).
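
For intuition, here is a generic affine coupling layer of the kind flows like Jet stack (a sketch, not the exact Jet block, which uses transformer layers where this uses an MLP):

```python
# Generic affine coupling layer: half the dimensions pass through unchanged and
# parameterize an affine map of the other half, so inversion and the
# log-determinant are exact and cheap.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=512):  # dim must be even
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)  # bounded scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)  # output, log|det J|

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(z1).chunk(2, dim=-1)
        s = torch.tanh(s)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=-1)
```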

20.12.2024 15:17 · 👍 5    🔁 0    💬 0    📌 0

Attending #NeurIPS2024? If you're interested in multimodal systems, building inclusive & culturally aware models, and how fractals relate to LLMs, we have 3 posters for you. I look forward to presenting them on behalf of our GDM team @ Zurich & collaborators. Details below (1/4)

07.12.2024 18:50 · 👍 12    🔁 5    💬 1    📌 0

🚀🚀 PaliGemma 2 is our updated and improved PaliGemma release, built on the Gemma 2 models and providing new pre-trained checkpoints for the full cross product of {224px, 448px, 896px} resolutions and {3B, 10B, 28B} model sizes.

1/7

05.12.2024 18:16 · 👍 69    🔁 21    💬 1    📌 5

It’s not, good catch.

03.12.2024 21:51 · 👍 0    🔁 0    💬 0    📌 0

Very nice! I knew of some soft-token TTS papers, but none so far using AR + normalizing flows. Thanks for sharing!

03.12.2024 09:44 · 👍 0    🔁 0    💬 0    📌 0

The noise curriculum guides the (image generation) learning process to first learn high-level, global structure and later low-level structure/texture. Maximum likelihood "tends to focus" mostly on the latter.

03.12.2024 08:02 · 👍 0    🔁 0    💬 1    📌 0

In arxiv.org/abs/2303.00848, @dpkingma.bsky.social and @ruiqigao.bsky.social had suggested that noise augmentation could be used to make other likelihood-based models optimise perceptually weighted losses, like diffusion models do. So cool to see this working well in practice!

02.12.2024 18:36 · 👍 53    🔁 11    💬 0    📌 0

I always dreamed of a model that simultaneously

1. optimizes NLL of raw pixel data,
2. generates competitive high-res. natural images,
3. is practical.

But it seemed too good to be true. Until today!

Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes.

🧵

02.12.2024 17:19 · 👍 37    🔁 5    💬 3    📌 0

Did you ever try to get an auto-regressive transformer to operate in a continuous latent space that is not fixed ahead of time but learned end-to-end from scratch?

Enter JetFormer: arxiv.org/abs/2411.19722 -- joint work with a dream team: @mtschannen.bsky.social and @kolesnikov.ch

02.12.2024 18:17 · 👍 14    🔁 2    💬 0    📌 0

Joint work with @asusanopinto.bsky.social and @kolesnikov.ch done at @googledeepmind.bsky.social.

8/8

02.12.2024 16:41 · 👍 3    🔁 0    💬 1    📌 0

To our knowledge, JetFormer is the first model capable of generating high-fidelity images and producing strong log-likelihood bounds.

So far we explored a simple setup (image/text pairs, no post-training), and hope JetFormer inspires more (visual) tokenizer-free models!

7/

02.12.2024 16:41 · 👍 4    🔁 0    💬 1    📌 0

Finally, why get rid of visual tokenizers/VQ-VAEs?
- They can induce information loss (e.g. small text)
- Removing specialized components was a key driver of recent progress (bitter lesson)
- Raw likelihoods are comparable across models (for hill climbing, scaling laws)

6/

02.12.2024 16:41 · 👍 3    🔁 0    💬 1    📌 0

Importantly, this is simple additive Gaussian noise on the training images (i.e., a data augmentation). JetFormer neither depends on it (or its parameters) nor is trained for denoising like diffusion models.

5/

02.12.2024 16:41 · 👍 3    🔁 0    💬 1    📌 0

Learning to generate high-fidelity images with maximum likelihood is tricky. To bias the model towards nicer-looking images, we introduce a noise curriculum: Gaussian noise added to the input image and annealed to 0 during training, s.t. high-level structure is learned first.
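
A sketch of what such a curriculum can look like as pure data augmentation (the linear schedule and sigma_max value below are illustrative, not the paper's exact settings):

```python
# Noise curriculum as plain data augmentation: Gaussian noise whose standard
# deviation is annealed to 0 over training. The model never sees the noise
# level and is never trained to denoise.
import torch

def noise_sigma(step, total_steps, sigma_max=64.0):
    return sigma_max * max(0.0, 1.0 - step / total_steps)  # linear anneal to 0

def augment(images, step, total_steps):
    # images: float tensor in [0, 255]; returns the noisy training input
    return images + noise_sigma(step, total_steps) * torch.randn_like(images)
```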

4/

02.12.2024 16:41 · 👍 4    🔁 0    💬 2    📌 1

Conceptually, the normalizing flow serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference.

We train JetFormer to maximize the likelihood of the multimodal data, without auxiliary losses (perceptual or similar).
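
Concretely, with an invertible flow the image likelihood is exact via the change-of-variables formula, log p(x) = log p(z) + log|det df/dx| with z = f(x); a sketch, assuming the flow returns its log-determinant:

```python
# Exact image log-likelihood via change of variables (sketch):
#   log p(x) = log p(z) + log|det df/dx|, with z = f(x) the flow's soft tokens
# and log p(z) given autoregressively by the multimodal transformer.
import torch

def image_nll(flow, prior_logprob, x):
    z, logdet = flow(x)                  # assumed: flow returns (latents, log|det J|)
    return -(prior_logprob(z) + logdet)  # NLL in nats; no auxiliary losses
```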

3/

02.12.2024 16:41 · 👍 5    🔁 0    💬 1    📌 0
Preview: GIVT: Generative Infinite-Vocabulary Transformers

We leverage a normalizing flow ("jet") to obtain a soft-token image representation that is end-to-end trained with a multimodal transformer for next-token prediction. The soft token distribution is modeled with a GMM à la GIVT.

arxiv.org/abs/2312.02116
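
A sketch of such a GMM head (GIVT-style; class name and the number of mixture components are illustrative):

```python
# GIVT-style GMM head (sketch): the transformer hidden state parameterizes a
# mixture of diagonal Gaussians over the next soft token.
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    def __init__(self, hidden_dim, token_dim, n_mix=16):
        super().__init__()
        self.n_mix, self.token_dim = n_mix, token_dim
        # mixture logits + per-component means and log-stddevs
        self.proj = nn.Linear(hidden_dim, n_mix * (1 + 2 * token_dim))

    def nll(self, h, target):
        p = self.proj(h)
        logits = p[..., : self.n_mix]
        mu, log_sigma = p[..., self.n_mix :].chunk(2, dim=-1)
        mu = mu.view(*h.shape[:-1], self.n_mix, self.token_dim)
        log_sigma = log_sigma.view(*h.shape[:-1], self.n_mix, self.token_dim)
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        lp = comp.log_prob(target.unsqueeze(-2)).sum(-1)  # (..., n_mix)
        # -log sum_k pi_k * N(target | mu_k, sigma_k)
        return -torch.logsumexp(torch.log_softmax(logits, -1) + lp, dim=-1)
```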

2/

02.12.2024 16:41 · 👍 5    🔁 1    💬 1    📌 0

Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)?

We have been pondering this over the summer and developed a new model: JetFormer 🌊🤖

arxiv.org/abs/2411.19722

A thread 👇

1/

02.12.2024 16:41 · 👍 155    🔁 36    💬 4    📌 7

Thank you!

25.11.2024 18:25 · 👍 1    🔁 0    💬 0    📌 0

🙋‍♂️

24.11.2024 20:14 · 👍 1    🔁 0    💬 1    📌 0
