
Stefan Baumann

@stefanabaumann.bsky.social

PhD Student at @compvis.bsky.social & @ellis.eu working on generative computer vision. Interested in extracting world understanding from models and more controlled generation. 🌐 https://stefan-baumann.eu/

1,283 Followers  |  651 Following  |  126 Posts  |  Joined: 17.11.2024

Posts by Stefan Baumann (@stefanabaumann.bsky.social)

Sure, but how do you explain an H100 being lower than an A100? It should be better in every (memory-related) way. That single data point pair excludes most reasonable alternatives

31.01.2026 15:12 — 👍 1    🔁 0    💬 0    📌 0

I thought about that, but it doesn't track either - an H100 has a much higher bandwidth than an A100

31.01.2026 10:30 — 👍 0    🔁 0    💬 0    📌 0

To clarify: I'm talking about the fact that the points do not correspond to the amount of VRAM, contrary to what's implied

31.01.2026 00:22 — 👍 0    🔁 0    💬 0    📌 0

I'm talking about the fact that GPUs that have the same amount of VRAM are at different positions. This has nothing to do with axis scaling

31.01.2026 00:18 — 👍 0    🔁 0    💬 2    📌 0

The y axis for memory capacity looks a bit weird 🤔

30.01.2026 16:49 — 👍 3    🔁 0    💬 4    📌 0

For me, looking at both the reviews on my submissions and others' submissions, I see only ~10% clearly LLM-written reviews. Still bad, but better than last year's conferences imo

22.01.2026 22:38 — 👍 1    🔁 0    💬 0    📌 0

Interesting, thanks for the additional context! I assumed that a modern architecture with good PE should mostly fix these problems purely by inductive bias

12.01.2026 12:48 — 👍 0    🔁 0    💬 0    📌 0

Isn't this a problem primarily caused by additive PE that should be lessened significantly by attention-only PEs (RoPE, ALiBi)?

12.01.2026 10:41 — 👍 1    🔁 0    💬 1    📌 0
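For context, a minimal illustrative sketch (pure Python, not any specific model's implementation) of the relative-position property that attention-level PEs like RoPE provide: because queries and keys are rotated by a position-dependent angle, the attention logit depends only on the offset between positions, not on their absolute values, unlike additive PE, which bakes absolute positions into the token embeddings.

```python
import math

def rotate(v, angle):
    """Rotate a 2-D vector by `angle` radians (one RoPE frequency pair)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def rope_score(q, k, pos_q, pos_k, theta=0.1):
    """Attention logit between a query at pos_q and a key at pos_k under RoPE."""
    return dot(rotate(q, pos_q * theta), rotate(k, pos_k * theta))

q, k = (1.0, 0.5), (0.3, -0.2)
s1 = rope_score(q, k, 3, 1)  # offset 2
s2 = rope_score(q, k, 7, 5)  # same offset 2, shifted absolute positions
# s1 == s2 up to float error: the logit only sees the relative offset.
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the translation invariance additive PE lacks.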

For LLMs, NoPE is a thing because of the causal attention mask - I don't quite see how you're imagining these findings should transfer to vision

12.01.2026 10:39 — 👍 0    🔁 0    💬 0    📌 0

I love it, as long as I have the time to do it. Personally, I prefer doing it for complex problems (e.g., developing our lab's shared large-scale distributed training codebase).

I also actively try to use vibe coding in risk-free places to learn to use those tools better

10.01.2026 18:18 — 👍 1    🔁 0    💬 0    📌 0

Feels like something that could be vibe coded quite easily (and safely): monitor that repo and auto-create a PR to yours (assuming you maintain your bot similarly) with missing dates. No risk, as you'd approve any changes, and less work, as you'd get notified once dates are known

10.01.2026 16:20 — 👍 0    🔁 0    💬 1    📌 0
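A sketch of what the diff-computation core of such a bot could look like (pure Python; the data schema and conference names are made up for illustration, and the actual fetching and PR creation via the GitHub API are omitted):

```python
def missing_deadlines(upstream, local):
    """Return entries known upstream but absent (or still undated) locally.

    `upstream` and `local` map conference keys like "CVPR 2026" to an
    ISO deadline string, or None when the date isn't known yet.
    (Hypothetical schema, purely for illustration.)
    """
    return {
        conf: date
        for conf, date in upstream.items()
        if date is not None and local.get(conf) is None
    }

upstream = {"CVPR 2026": "2025-11-14", "ICCV 2027": None}
local = {"CVPR 2026": None}
todo = missing_deadlines(upstream, local)  # {"CVPR 2026": "2025-11-14"}
```

The bot would then open a PR adding the entries in `todo`; since a human approves the PR, a wrong date can't slip through unreviewed.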
GitHub - ccfddl/ccf-deadlines: ⏰ Collaboratively track worldwide conference deadlines (Website, Python Cli, Wechat Applet) / If you find it useful, please star this project, thanks~

If you want to automate some of this, the repo at github.com/ccfddl/ccf-d... is openly licensed and quite reliable. You could auto-add dates for some conferences to the bot once they're added there

10.01.2026 16:14 — 👍 1    🔁 0    💬 1    📌 0

Last year Molmo set SOTA on image benchmarks + pioneered image pointing. Millions of downloads later, Molmo 2 brings Molmo's grounded multimodal capabilities to video 🎥 and leads many open models on challenging industry video benchmarks. 🧵

16.12.2025 16:51 — 👍 14    🔁 3    💬 1    📌 0

Iirc, that's the nickname under which the exploit was circulated on Chinese social media

28.11.2025 15:51 — 👍 1    🔁 0    💬 0    📌 0

Oof

28.11.2025 14:52 — 👍 18    🔁 4    💬 2    📌 1

You shall be forgiven ;)

20.11.2025 13:37 — 👍 1    🔁 0    💬 0    📌 0

Awesome work! Casually fumbled naming the sections "harder", "better", "faster", "denser" though

20.11.2025 13:02 — 👍 2    🔁 0    💬 1    📌 0

RoMa v2 is now out! (github.com/Parskatt/rom..., arxiv.org/abs/2511.15706)

Here are the main improvements we made since RoMa:

20.11.2025 09:25 — 👍 35    🔁 5    💬 3    📌 1

One of the first works applying transformers to diffusion models actually had such skip connections.
Similarly, at 1024^2, pixel-space U-Nets and HDiT effectively have a 64^2 patch size in the middle while still doing eps/EDM prediction, which is enabled by just having high-resolution skips

18.11.2025 17:38 — 👍 1    🔁 0    💬 1    📌 0

Also, just having a skip connection from early in the network to late in the network should let you sidestep the problem analyzed in the paper almost completely

18.11.2025 17:34 — 👍 1    🔁 0    💬 1    📌 0

Works that actually achieved *very* high-fidelity generation in pixel space all tend to be very expensive (still). While JiT gets a good FID on ImageNet while being less expensive, I'm not (yet) convinced it reaches as high a fidelity as it would need to be competitive with them

18.11.2025 17:33 — 👍 1    🔁 0    💬 1    📌 0

Pixel-space generation has been working well for ~2 years now, but is just too expensive to be practically relevant for anything we'd consider mainstream in generative computer vision. People in computational photography etc. care about quality enough, but for a meme, you don't need the quality

18.11.2025 17:13 — 👍 0    🔁 0    💬 1    📌 0

Generally, independently of this noise bottleneck, you get significant improvements when decreasing patch size (or even just increasing resolution while keeping patch size & params the same) because of added capacity in the attention. So I don't know why I would ever choose a large patch size

18.11.2025 17:12 — 👍 2    🔁 0    💬 0    📌 0
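The capacity argument is easy to see numerically (a trivial sketch; the resolution and patch sizes are just example values):

```python
def num_tokens(resolution, patch_size):
    """Sequence length of a ViT/DiT-style model at a square resolution."""
    assert resolution % patch_size == 0
    return (resolution // patch_size) ** 2

# Halving the patch size quadruples the token count, so self-attention,
# which is quadratic in sequence length, gets 16x more pairwise interactions.
t16 = num_tokens(256, 16)  # 256 tokens
t8 = num_tokens(256, 8)    # 1024 tokens
```

The same arithmetic applies to raising resolution at a fixed patch size and parameter count: the extra capacity lives in the attention map, not the weights.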

Their general insights are actually independent of whether it's on pixels or latents, although no sane person (imo) would use sufficiently high patch sizes for things to matter with latents - they're just too information-dense

18.11.2025 17:10 — 👍 2    🔁 0    💬 1    📌 0

Yes, you're correct. Currently, I'm not aware of any models predicting the clean data, very few that predict the noise (this used to be common years ago), and most either predict some kind of velocity or use the preconditioner from EDM (Karras et al.), where the model also doesn't predict clean data

18.11.2025 17:09 — 👍 1    🔁 0    💬 0    📌 0

Their insights only really apply if you choose huge patch sizes, which is not what we do in any practical model (currently). For any practical model, we typically have hidden dim >> data dim, where noise transport is not gonna be a bottleneck

18.11.2025 17:07 — 👍 2    🔁 0    💬 1    📌 0

It was both

17.11.2025 10:31 — 👍 1    🔁 0    💬 1    📌 0

Looking back at these reviews now, after spending twice as much time reviewing in this community, I could've done some things better. None of them were things that the system suggested

17.11.2025 09:54 — 👍 0    🔁 0    💬 0    📌 0

It was a mixture of things; the primary comments I got were trying to help me "strengthen this critique" or "make it more actionable", with suggestions that either just changed the tone from concise to long and LLM-like without adding substance, or requested additional ablations that completely miss the point

17.11.2025 09:49 — 👍 0    🔁 0    💬 2    📌 0

I guess the main problem could be that it was intended to be one revision cycle back then. For good improvements with current-gen LLMs, I think you still need a few rounds of back and forth

17.11.2025 09:42 — 👍 0    🔁 0    💬 0    📌 0