Sure, but how do you explain an H100 being lower than an A100? It should be better in every (memory-related) way. That single data point pair excludes most reasonable alternatives
31.01.2026 15:12

@stefanabaumann.bsky.social
PhD Student at @compvis.bsky.social & @ellis.eu working on generative computer vision. Interested in extracting world understanding from models and more controlled generation. https://stefan-baumann.eu/
I thought about that, but that also doesn't track though - an H100 has a much higher bandwidth than an A100
31.01.2026 10:30

To clarify: I'm talking about the fact that the points do not correspond to the amount of VRAM, contrary to what's implied
31.01.2026 00:22

I'm talking about the fact that GPUs that have the same amount of VRAM are at different positions. This has nothing to do with axis scaling
31.01.2026 00:18

The y axis for memory capacity looks a bit weird
30.01.2026 16:49

For me, looking at both the reviews on my submissions and others' submissions, I see only ~10% clearly LLM-written reviews. Still bad, but better than last year's conferences imo
22.01.2026 22:38

Interesting, thanks for the additional context! I assumed that a modern architecture with good PE should mostly fix these problems purely by inductive bias
12.01.2026 12:48

Isn't this a problem primarily caused by additive PE that should be lessened significantly by attention-only PEs (RoPE, ALiBi)?
12.01.2026 10:41

For LLMs, NoPE is a thing because of the causal attention mask - I don't quite see how you're imagining these findings should transfer to vision
12.01.2026 10:39
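A way to see the distinction in the PE thread above: additive PE injects position into the token embeddings themselves, while RoPE only rotates queries and keys inside each attention call, so position enters through the attention scores alone. A minimal NumPy sketch (the half-split channel pairing and all dimensions here are illustrative, not taken from any specific model):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, dim).

    Each 2D channel pair is rotated by an angle that grows with the
    token position, so only relative offsets show up in q @ k.T.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Additive PE would change the token content itself; RoPE only changes
# how q and k interact in the attention scores:
q = np.random.randn(8, 16)
k = np.random.randn(8, 16)
scores = rope(q) @ rope(k).T  # position enters here, not in the values
```

Because the rotation angle is linear in position, the score between positions p and m depends only on p - m, which is the relative-offset property the replies contrast with additive PE.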
I love it, as long as I have the time to do it. Personally, I prefer doing it for complex problems (e.g., developing our lab's shared large-scale distributed training codebase).
I also actively try to use vibe coding in risk-free places to learn to use those tools better
Feels like something that could be vibe coded quite easily (and safely): monitor that repo and auto-create a PR to yours (assuming you maintain your bot similarly) with missing dates. No risk, since you'd approve any changes, and less work, since you'd get notified once dates are known
10.01.2026 16:20

If you want to automate some of this, the repo at github.com/ccfddl/ccf-d... is openly licensed and quite reliable. You could auto-add dates for some conferences to the bot once they're added there
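A rough shape the suggested automation could take. Everything here is hypothetical: the data layout and the example conferences/dates are invented for illustration and are not the actual format of the ccfddl repo; a real bot would fetch both sides (e.g. via the GitHub API) and open a PR for manual approval rather than apply anything directly.

```python
def new_deadlines(upstream: dict, mine: dict) -> dict:
    """Return conference deadlines present upstream but missing or
    different locally - the candidates worth proposing in a PR."""
    return {conf: date for conf, date in upstream.items()
            if mine.get(conf) != date}

# Invented example data (not real deadlines):
upstream = {"CVPR 2027": "2026-11-14", "ICCV 2027": None}
mine = {"CVPR 2027": None}

# Drop entries upstream hasn't dated yet, then diff against our bot:
todo = new_deadlines({k: v for k, v in upstream.items() if v}, mine)
# todo now holds the dates worth proposing in a PR for human review
```

The human-approval step is what keeps this in the "safe to vibe code" category: the script only surfaces a diff, it never pushes dates on its own.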
10.01.2026 16:14

Last year Molmo set SOTA on image benchmarks + pioneered image pointing. Millions of downloads later, Molmo 2 brings Molmo's grounded multimodal capabilities to video - and leads many open models on challenging industry video benchmarks.
16.12.2025 16:51

Iirc, that's the nickname under which the exploit was circulated on Chinese social media
28.11.2025 15:51

Oof
28.11.2025 14:52

You shall be forgiven ;)
20.11.2025 13:37

Awesome work! Casually fumbled naming the sections "harder", "better", "faster", "denser" though
20.11.2025 13:02
RoMa v2 is now out! (github.com/Parskatt/rom..., arxiv.org/abs/2511.15706)
Here are the main improvements we made since RoMa:
One of the first works applying transformers to diffusion models actually had such skip connections.
Similarly, at 1024^2, pixel-space U-Nets and HDiT effectively have a 64^2 patch size in the middle while still doing eps/EDM prediction, which is enabled by just having high-resolution skips
Also, just having a skip connection from early in the network to late in the network should let you sidestep the problem analyzed in the paper almost completely
18.11.2025 17:34

The works that actually achieved *very* high-fidelity generation in pixel space all tend to be very expensive (still). While JiT gets a good FID on ImageNet at lower cost, I'm not (yet) convinced it reaches a high enough fidelity to be competitive with them
18.11.2025 17:33

Pixel-space generation has been working well for ~2 years, but is just too expensive to be practically relevant for anything we'd consider mainstream in generative computer vision. People in computational photography etc care about quality enough, but for a meme, you don't need the quality
18.11.2025 17:13

Generally, independently of this noise bottleneck, you get significant improvements when decreasing patch size (or even just increasing resolution while keeping patch size & params the same) because of added capacity in the attention. So I don't know why I would ever choose a large patch size
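Back-of-the-envelope for the claim above, assuming standard ViT-style patchification with no token merging: halving the patch size quadruples the token count, and the number of pairwise attention interactions grows with the square of that.

```python
def tokens(resolution: int, patch: int) -> int:
    """Number of tokens for a square image under ViT-style patchify."""
    return (resolution // patch) ** 2

# At a fixed 256^2 resolution:
t4 = tokens(256, 4)   # patch size 4
t2 = tokens(256, 2)   # patch size 2

assert t2 == 4 * t4             # halving patch size -> 4x tokens
assert t2 ** 2 == 16 * t4 ** 2  # -> 16x pairwise attention interactions
```

The same arithmetic covers the parenthetical: doubling resolution at a fixed patch size also quadruples the token count, which is where the extra attention capacity comes from.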
18.11.2025 17:12

Their general insights are actually independent of whether it's on pixels or latents, although no sane person (imo) would use sufficiently high patch sizes for things to matter with latents - they're just too information-dense
18.11.2025 17:10

Yes, you're correct. Currently, I'm not aware of any models predicting the clean data, very few that predict the noise (this used to be common years ago), and most either predict some kind of velocity or use the preconditioner from EDM (Karras et al.), where the model also doesn't predict clean data
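For concreteness, the parameterizations mentioned above relate to the clean data through the usual interpolation x_t = a_t * x0 + s_t * eps; a sketch of the noise and velocity targets and how a sampler recovers x0 from a velocity prediction (variable names are illustrative, and the exact coefficients differ between frameworks):

```python
import numpy as np

def targets(x0, eps, a_t, s_t):
    """Noisy input and velocity target under a generic diffusion schedule.

    The network can be trained to predict eps (noise, formerly common),
    x0 (clean data, rarely used now), or v = a_t*eps - s_t*x0 (velocity).
    """
    x_t = a_t * x0 + s_t * eps
    v = a_t * eps - s_t * x0
    return x_t, v

x0 = np.random.randn(8)
eps = np.random.randn(8)
a_t, s_t = 0.8, 0.6                # chosen so that a_t**2 + s_t**2 == 1

x_t, v = targets(x0, eps, a_t, s_t)
x0_rec = a_t * x_t - s_t * v       # exact when a_t**2 + s_t**2 == 1
```

The point of the recovery line: even a velocity-predicting model never outputs x0 directly; the clean estimate is only reassembled algebraically inside the sampler, which is what the reply above means by "the model also doesn't predict clean data".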
18.11.2025 17:09

Their insights only really apply if you choose huge patch sizes, which is not what we do in any practical model (currently). For any practical model, we typically have hidden dim >> data dim, where noise transport is not gonna be a bottleneck
18.11.2025 17:07

It was both
17.11.2025 10:31

Looking back at these reviews now, after spending twice as much time reviewing in this community, I could've done some things better. None of them were things that the system suggested
17.11.2025 09:54

It was a mixture of things; the primary comments I got were trying to help me "strengthen this critique" or "make it more actionable", with suggestions either just changing the tone from concise to long and LLM-like without adding substance, or requesting additional ablations that completely miss the point
17.11.2025 09:49

I guess the main problem could be that it was intended to be one revision cycle back then. For good improvements with current-gen LLMs, I think you still need a few rounds of back and forth
17.11.2025 09:42