
Stefan Baumann

@stefanabaumann.bsky.social

PhD Student at @compvis.bsky.social & @ellis.eu working on generative computer vision. Interested in extracting world understanding from models and more controlled generation. 🌐 https://stefan-baumann.eu/

1,254 Followers  |  649 Following  |  59 Posts  |  Joined: 17.11.2024

Latest posts by stefanabaumann.bsky.social on Bluesky

Congrats!

06.08.2025 11:28 — 👍 1    🔁 0    💬 0    📌 0

Did you also happen to participate in creating LLM preference annotations?

05.08.2025 06:39 — 👍 2    🔁 0    💬 1    📌 0

As an author, I honestly prefer forum-style comments over one-page rebuttals (as long as we get some way to include figures). As a reviewer, I prefer a single page

01.08.2025 11:08 — 👍 2    🔁 0    💬 0    📌 0

tl;dr: do importance weighting/sampling on a sequence level, not a token level.
Makes everything behave much better (see below) and makes more sense from a theoretical perspective, too.

Paper: www.arxiv.org/abs/2507.18071

26.07.2025 19:43 — 👍 3    🔁 0    💬 0    📌 0
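To make the sequence-vs-token distinction concrete, here is a minimal sketch of the two importance-ratio variants (my own illustration, not code from the paper), assuming per-token log-probs under the current and old policies are already available:

```python
import torch

def token_level_ratios(logp_new, logp_old, mask):
    """GRPO-style: one importance ratio per token.
    logp_new, logp_old, mask: (batch, seq_len); mask is 1 on real tokens."""
    return torch.exp((logp_new - logp_old) * mask)

def sequence_level_ratios(logp_new, logp_old, mask):
    """GSPO-style: one length-normalized ratio per sequence (the geometric
    mean of the token ratios), matching the fact that the reward itself
    is assigned to the whole sequence."""
    lengths = mask.sum(dim=-1).clamp(min=1)
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    return torch.exp(log_ratio)  # (batch,)
```

The sequence-level ratio is then clipped and applied to a sequence-level advantage, so a single high-variance token ratio can no longer dominate the update.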

I'm calling it now, GSPO will be the next big hype in LLM RL algos after GRPO.

It makes so much more sense intuitively to work on a sequence rather than on a token level when our rewards are on a sequence level.

26.07.2025 19:40 — 👍 4    🔁 0    💬 1    📌 0

Absolutely 100% this. Who would want to read papers like VGGT?

25.07.2025 09:41 — 👍 3    🔁 0    💬 1    📌 0
Genie: Generative Interactive Environments. We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-c...

Genie did this in a really cool manner: arxiv.org/abs/2402.15391

03.07.2025 16:11 — 👍 2    🔁 0    💬 0    📌 0

I don't think the implicit assumptions are likely to be problematic, as long as the frequency range is reasonable. Keep in mind that we add an MLP afterwards that can freely learn to modulate the model with different frequencies.

25.06.2025 12:16 — 👍 0    🔁 0    💬 1    📌 0
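For context, the "MLP afterwards" is the standard timestep-conditioning pipeline in diffusion models: a fixed Fourier/sinusoidal featurization of t followed by a small learned MLP. A minimal sketch (layer sizes are illustrative, not taken from any specific model):

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedding(nn.Module):
    """Sinusoidal featurization of t followed by a learned two-layer MLP.
    The MLP can re-weight and mix the fixed frequencies, which is why a
    merely reasonable (not perfect) frequency range is usually enough."""

    def __init__(self, fourier_dim=256, hidden_dim=1024):
        super().__init__()
        self.fourier_dim = fourier_dim
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, t):  # t: (batch,) raw timesteps
        half = self.fourier_dim // 2
        # Geometrically spaced frequencies, as in the usual DDPM-style setup;
        # the scale of t relative to these frequencies decides how much of
        # the embedding range is actually used (see the posts below).
        freqs = torch.exp(-math.log(10_000.0) * torch.arange(half, device=t.device) / half)
        angles = t[:, None] * freqs[None, :]
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(feats)
```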

approaches, but I don't think I've seen this in public yet. I considered doing it a while ago, but I never found a good justification to spend the time carefully ablating something like this. It might lead to some cool interpretable insights into the model's behavior across time though (3/3)

25.06.2025 07:02 — 👍 0    🔁 0    💬 1    📌 0

So just interpolating between two vectors likely wouldn't represent that too well. Similarly, interpolating between N vectors for a small N might not align nicely with the learned behavior. For a somewhat large N, this should work quite well and might be more efficient than the current (2/n)

25.06.2025 07:01 — 👍 0    🔁 0    💬 1    📌 0

If I understand your suggestion correctly, you're proposing to forgo Fourier embeddings completely though, just replacing them with an interpolation between two vectors. That should also work, but, imho, you can identify at least three somewhat distinct phases in diffusion sampling (1/n)

25.06.2025 06:59 — 👍 1    🔁 0    💬 1    📌 0
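The alternative being discussed, as I understand it, would look roughly like the sketch below (a hypothetical replacement for Fourier embeddings, not something from a published model): keep N learned anchor vectors and piecewise-linearly interpolate between the two nearest ones for a given t.

```python
import torch
import torch.nn as nn

class InterpolatedTimestepEmbedding(nn.Module):
    """Piecewise-linear interpolation between N learned vectors over t in [0, 1]."""

    def __init__(self, n_anchors=16, dim=1024):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(n_anchors, dim) * 0.02)

    def forward(self, t):  # t: (batch,) in [0, 1]
        n = self.anchors.shape[0]
        x = t.clamp(0, 1) * (n - 1)             # position on the anchor grid
        lo = x.floor().long().clamp(max=n - 2)  # left anchor index
        w = (x - lo.float())[:, None]           # interpolation weight in [0, 1]
        return (1 - w) * self.anchors[lo] + w * self.anchors[lo + 1]
```

With n_anchors = 2 this degenerates to the single interpolation the thread argues is too coarse; a larger N gives enough resolution to represent distinct phases of sampling.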

Random Fourier projections are totally fine for timestep embeddings (HDiT and some others use them), but you still have to get the frequency range right. If your variance is too high, you're gonna end up with the same problem; if it's too small, you'll have actual problems.

25.06.2025 06:58 — 👍 0    🔁 0    💬 1    📌 0
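"Random Fourier projections" here means the variant with frozen Gaussian-random frequencies, roughly as below (a generic sketch; HDiT's exact hyperparameters are not reproduced). The std of the random frequencies is the variance knob the post refers to:

```python
import torch
import torch.nn as nn

class RandomFourierEmbedding(nn.Module):
    """Frequencies drawn once from N(0, std^2) and frozen; only later layers learn."""

    def __init__(self, dim=256, std=30.0):
        super().__init__()
        # register_buffer: part of the module state, but receives no gradient
        self.register_buffer("freqs", torch.randn(dim // 2) * std)

    def forward(self, t):  # t: (batch,) in [0, 1]
        angles = 2 * torch.pi * t[:, None] * self.freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```

Too large a std gives features that oscillate so quickly they look like noise in t; too small a std maps all timesteps to nearly the same vector, which is exactly the under-utilization problem from the FLUX analysis below.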
Simo Ryu on X: "Right, it looks like roughly 25% of FLUX's timestep embedding is just identical values, and that is *with* a multiple of 1000 on t ∈ [0, 1]. theta = 300 looks much more informative (right side)" https://t.co/KH59kiT9aR

For the Fourier embedding of the timestep, most models are somewhere between slightly and severely suboptimal in how much of the embedding's range they actually use. Simo Ryu posted an analysis of this a while ago: x.com/cloneofsimo/...

25.06.2025 06:55 — 👍 0    🔁 0    💬 1    📌 0
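A quick way to check this utilization yourself (a small diagnostic sketch, not from any of the linked posts): sweep t over its range and count how many embedding dimensions barely change.

```python
import math
import torch

def sinusoidal_embedding(t, dim=256, max_period=10_000.0):
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

t = torch.linspace(0, 1, 1_000)
for scale in (1.0, 1_000.0):  # many models multiply t in [0, 1] by 1000 before embedding
    emb = sinusoidal_embedding(t * scale)
    # A dimension is "dead" if it is nearly constant over the whole sweep of t.
    span = emb.max(dim=0).values - emb.min(dim=0).values
    dead = (span < 1e-3).float().mean().item()
    print(f"scale={scale:6.0f}: {dead:.0%} of dimensions nearly constant")
```

Without the factor of 1000, a large chunk of the low-frequency dimensions never moves; with it, the range is used far better, which is the effect Simo Ryu's plots show.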

I guess the question is how much time is worth investing for all these sessions and when to make the harsh decision that a presentation isn't good enough. I feel the current system likely leads to improvements across the board, but definitely didn't make every presentation perfect

19.06.2025 22:57 — 👍 1    🔁 0    💬 0    📌 0

I should note that for us, the general feedback was that the presentation was already good enough, but could be improved. We still needed the full 30min to get good use out of the coaching. For presentations that start out somewhat problematic, I imagine 30min might be too tight.

19.06.2025 22:51 — 👍 1    🔁 0    💬 1    📌 0

For our presentation, we presented our talk to the coach once and then got a bunch of incredibly useful and, most importantly, constructive feedback. We then used that to create an improved second version

19.06.2025 22:48 — 👍 4    🔁 1    💬 1    📌 0

🎉 Excited to share that our lab has three papers accepted at CVPR 2025!

Come say hi in Nashville!
👋 Johannes, Ming, Kolja, Stefan, and Björn will be attending.

09.06.2025 07:28 — 👍 1    🔁 2    💬 1    📌 0
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment. Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM mode...

If you are interested, feel free to check the paper (arxiv.org/abs/2506.02221) or come by at CVPR:

📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208

06.06.2025 15:47 — 👍 5    🔁 2    💬 0    📌 0
History of Diffusion - Sander Dieleman (YouTube video by Bain Capital Ventures)

Here's the third and final part of Slater Stich's "History of diffusion" interview series!

The other two interviewees' research played a pivotal role in the rise of diffusion models, whereas I just like to yap about them 😬 This was a wonderful opportunity to do exactly that!

14.05.2025 16:11 — 👍 21    🔁 7    💬 0    📌 0

#KostasThoughts: Another major conference review drop is around the corner. In baseball, a .300 average is elite. In research, it’s a familiar reality: submitting to top conferences means rejections happen. Keep swinging!

07.05.2025 18:16 — 👍 4    🔁 1    💬 0    📌 0

Awesome, I'll look forward to it!

29.04.2025 15:05 — 👍 2    🔁 0    💬 0    📌 0

Yeah, your EDM2 results look great, both qualitatively and especially quantitatively!
I'd be really interested to see whether it scales to such big models with all of their complications like CFG, as it really messes up the relation with the diffusion loss

29.04.2025 15:01 — 👍 2    🔁 0    💬 1    📌 0
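For readers outside diffusion: classifier-free guidance (CFG) extrapolates from an unconditional prediction towards a conditional one at sampling time, which is why the sampled trajectory no longer matches what the diffusion loss was trained on. A generic sketch (model and cond are placeholders, not any specific codebase's API):

```python
def cfg_prediction(model, x_t, t, cond, guidance_scale=5.0):
    """Classifier-free guidance. `model` and `cond` stand in for whatever
    network and conditioning a given codebase uses."""
    eps_uncond = model(x_t, t, cond=None)
    eps_cond = model(x_t, t, cond=cond)
    # guidance_scale = 1 recovers the plain conditional prediction; larger
    # values push samples away from the training-time objective.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```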

I am very happy to share our latest work on the information theory of generative diffusion:

"Entropic Time Schedulers for Generative Diffusion Models"

We find that the conditional entropy offers a natural data-dependent notion of time during generation

Link: arxiv.org/abs/2504.13612

29.04.2025 13:17 — 👍 25    🔁 5    💬 2    📌 0
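A rough sketch of how such a schedule could be used at sampling time (my own illustration of the idea, not the paper's algorithm): given an estimate of the conditional entropy along the diffusion, place the sampling timesteps so that they are uniform in entropy rather than uniform in t.

```python
import numpy as np

def entropic_timesteps(t_grid, entropy, n_steps):
    """t_grid:  (K,) increasing times in [0, 1].
    entropy: (K,) estimated conditional entropy at each grid time,
             assumed monotone in t for this sketch.
    Returns n_steps times at which equal amounts of entropy are traversed."""
    # Normalize entropy to an "entropic time" in [0, 1] ...
    s = (entropy - entropy.min()) / (entropy.max() - entropy.min())
    # ... then invert it: find the t's where entropic time is uniformly spaced.
    return np.interp(np.linspace(0.0, 1.0, n_steps), s, t_grid)
```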

I always wanted to do this, thank you for making it a reality! Will this also work with sampling tricks such as CFG? Also, have you tried this on larger-scale models such as T2I ones (e.g., FLUX)?

29.04.2025 14:43 — 👍 3    🔁 0    💬 1    📌 0
Generative modelling in latent space: Latent representations for generative models.

New blog post: let's talk about latents!
sander.ai/2025/04/15/l...

15.04.2025 09:43 — 👍 74    🔁 18    💬 3    📌 5

And the CVPR oral decisions are out! (on OpenReview)

04.04.2025 15:25 — 👍 4    🔁 3    💬 1    📌 0

Ah sorry, it seems I didn't read the thread closely enough 😅

20.03.2025 13:01 — 👍 2    🔁 0    💬 0    📌 0

Or we just make the model classify points as sky or not and, if they are sky, ignore the predicted depth, like some papers in depth estimation have been doing. Then everything is still neatly in one model, but you don't have the problem of having to threshold confidences etc.

20.03.2025 11:48 — 👍 0    🔁 0    💬 1    📌 0
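As a sketch of that design (hypothetical head and tensor names, no specific paper's implementation): the network predicts an extra per-pixel sky logit, and the depth output is simply marked invalid wherever the sky class wins, instead of thresholding a continuous confidence.

```python
import torch

def masked_depth(depth_pred, sky_logits):
    """depth_pred: (B, H, W) predicted depth.
    sky_logits:  (B, H, W) per-pixel logit for the 'sky' class.
    Sky pixels get NaN depth instead of an unreliable regressed value."""
    is_sky = sky_logits > 0  # logit > 0  <=>  p(sky) > 0.5
    invalid = torch.full_like(depth_pred, float("nan"))
    return torch.where(is_sky, invalid, depth_pred)
```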

Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds!

Project Page: vgg-t.github.io
Code & Weights: github.com/facebookrese...

17.03.2025 02:08 — 👍 41    🔁 14    💬 3    📌 1

Oh, interesting. I wasn't aware of this. Thank you, I'm gonna take a look at this!

11.03.2025 17:33 — 👍 0    🔁 0    💬 0    📌 0
