@wimmerthomas.bsky.social
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision. https://wimmerth.github.io
All the links can be found here. Great collaborators!
bsky.app/profile/odue...
🎉 Just accepted to ICCV 2025!
In DIY-SC, we improve foundation-model features using a lightweight adapter trained with carefully filtered and refined pseudo-labels.
🔧 Drop-in alternative to plain DINOv2 features!
📦 Code + pre-trained weights available now.
🔥 Try it in your next vision project!
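For illustration only, here is a minimal sketch of what a drop-in adapter on top of frozen DINOv2 features could look like. The adapter architecture below is hypothetical, not the released DIY-SC model; use the official code and weights for real projects:

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone (this torch.hub entry point is the official one).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval().requires_grad_(False)

class FeatureAdapter(nn.Module):
    """Hypothetical lightweight adapter that refines per-patch features."""
    def __init__(self, dim=768):  # 768 = ViT-B/14 feature dimension
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats):
        # Residual refinement keeps the output shape identical to the input,
        # so the adapter stays a drop-in replacement for plain DINOv2 features.
        return feats + self.mlp(feats)

adapter = FeatureAdapter()

def refined_features(images):
    # images: (B, 3, H, W) with H and W divisible by the 14-pixel patch size.
    with torch.no_grad():
        feats = backbone.forward_features(images)["x_norm_patchtokens"]
    return adapter(feats)  # (B, num_patches, 768), same shape as before
```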
The CVML group at @mpi-inf.mpg.de has been busy for CVPR. Check out our papers and come by the presentations!
11.06.2025 12:07

Hello world, we are now on Bluesky 🦋! Follow us to receive updates on exciting research and projects from our group!
#computervision #machinelearning #research
We only use open-source models, and the implementation of our method is readily available. Please check out the paper website for more details:
wimmerth.github.io/gaussians2li...
We can animate arbitrary 3D scenes within 10 minutes on an RTX 4090 while keeping scene appearance and geometry intact.
Note that since I worked on this, open-source video diffusion models have improved significantly, which will directly improve the results of this method as well.
🧵⬇️
Improving multi-view consistency of generated videos through latent interpolation. In addition to rendering the dynamic scene f from the current viewpoint with the rendering function g, giving g(f)_s, we compute the latent embedding of the warped video output v_{s-1} from the previous optimization step (taken from a different viewpoint). We linearly interpolate the two latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The result is decoded into the new video output v_s.
While we can now transfer motion into 3D, we still have to deal with a fundamental problem: the lack of 3D consistency in generated videos.
With limited resources, we can't fine-tune or retrain a VDM to be pose-conditioned. Thus, we propose a zero-shot technique to generate more 3D-consistent videos!
🧵⬇️
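To make the figure concrete, here is a minimal sketch of a single latent-interpolation step. The encoder, decoder, and VDM are placeholder callables, and the blending weight alpha is an assumption of this sketch, not a value from the paper:

```python
def consistent_video_step(encode, decode, vdm, rendering, v_prev, static_view, alpha=0.5):
    """One optimization step s: blend the latent of the current rendering
    g(f)_s with the latent of the warped previous output v_{s-1} before
    refining it with the video diffusion model. All callables here are
    placeholders for concrete models."""
    z_render = encode(rendering)  # latent of g(f)_s from the current viewpoint
    z_prev = encode(v_prev)       # latent of warped v_{s-1} (previous viewpoint)
    z = alpha * z_render + (1.0 - alpha) * z_prev  # linear latent interpolation
    z_refined = vdm(z, static_view)  # conditioned on the static scene view
    return decode(z_refined)         # new video output v_s
```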
Method overview for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D. The 4D (dynamic 3D) Gaussians are initialized with the static 3D scene input.
Standard practices like Score Distillation Sampling (SDS) fail for this task, as VDMs provide a guidance signal that is too noisy, resulting in "exploding" scenes.
Instead, we propose to employ several pre-trained 2D models to directly lift motion from tracked points in the generated videos to 3D Gaussians.
🧵⬇️
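For the mechanics of the lifting step, here is a minimal sketch of the unprojection under a standard pinhole camera model. It is my illustration; as described in the post, the 2D tracks and the aligned depth values are assumed to come from off-the-shelf pre-trained models:

```python
import torch

def lift_tracks_to_3d(tracks_2d, depth, K, cam_to_world):
    """Unproject 2D point tracks into world-space 3D tracks.
    tracks_2d:    (T, N, 2) pixel coordinates over T frames
    depth:        (T, N) aligned depth sampled at the track locations
    K:            (3, 3) camera intrinsics
    cam_to_world: (T, 4, 4) camera-to-world poses per frame
    """
    T, N, _ = tracks_2d.shape
    ones = torch.ones(T, N, 1)
    pix = torch.cat([tracks_2d, ones], dim=-1)            # homogeneous pixels
    rays = torch.einsum("ij,tnj->tni", K.inverse(), pix)  # camera-space rays
    pts_cam = rays * depth.unsqueeze(-1)                  # scale rays by depth
    pts_h = torch.cat([pts_cam, ones], dim=-1)            # homogeneous points
    pts_world = torch.einsum("tij,tnj->tni", cam_to_world, pts_h)
    return pts_world[..., :3]                             # (T, N, 3) 3D tracks
```

These world-space tracks can then supervise the motion of the dynamic 3D Gaussians.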
Had the honor to present "Gaussians-to-Life" at #3DV2025 yesterday. In this work, we used video diffusion models to animate arbitrary 3D Gaussian Splatting scenes.
This work was a great collaboration with @moechsle.bsky.social, @miniemeyer.bsky.social, and Federico Tombari.
🧵⬇️
Can you do reasoning with diffusion models?
The answer is yes!
Take a look at Spatial Reasoning Models. Hats off to the authors for this amazing work!
I wonder to what degree one could artificially make real images (with ground-truth depth) more abstract during training, so that depth models learn the priors we humans have (like green = field, blue = sky), and whether that would actually give us any benefit, like increased robustness...
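If one wanted to try this, a simple way to make training images more "abstract" is to flatten each superpixel to its mean color. A sketch using scikit-image; this is purely my illustration of the idea, not an evaluated recipe:

```python
import numpy as np
from skimage.segmentation import slic

def abstractify(image, n_segments=150):
    """Flatten an RGB image (H, W, 3), float in [0, 1], into superpixel mean
    colors, approximating a 'green = field, blue = sky' level of abstraction."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    out = image.copy()
    for label in np.unique(labels):
        mask = labels == label
        out[mask] = image[mask].mean(axis=0)  # mean color of this superpixel
    return out
```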
14.02.2025 14:30

Ah, thanks, I overlooked that :)
14.02.2025 14:19

Nice experiments! What model did you use?
14.02.2025 14:07

🏔️⛷️ Looking back on a fantastic week full of talks, research discussions, and skiing in the Austrian mountains!
31.01.2025 19:38

Give a warm welcome to @janericlenssen.bsky.social!
16.01.2025 17:29

Well well, it turns out that GIFs aren't yet supported on this platform. Here is the teaser video as an MP4 instead:
15.01.2025 17:27

This work was led by @mohammadasim98.bsky.social and is a collaboration with Christopher Wewer, Bernt Schiele, and Jan Eric Lenssen.
Check out the website with lots of nice visuals that show how our metric works, and use it in your next diffusion model project!
geometric-rl.mpi-inf.mpg.de/met3r/
Important note: our metric is not meant to measure the visual quality or appearance of generated content. Instead, it acts orthogonally to existing image quality metrics by focusing on the 3D consistency of generated frames.
15.01.2025 17:21

Especially for video generation methods where no ground-truth camera poses are given, our proposed metric can help shed light on the quality of the generated videos, rather than just reporting results from yet another human survey.
15.01.2025 17:21

Speaking of multi-view diffusion models, we also trained a new open-source multi-view latent diffusion model built on top of Stable Diffusion and heavily inspired by the closed-source CAT3D model.
Weights and code are already public. Check it out!
github.com/mohammadasim...
The MEt3R scores correlate well with the 3D awareness of different multi-view image generation methods, as we show in our experiments. The metric is also differentiable, which means you could even use it for training! The code is easy to run and already open source!
github.com/mohammadasim...
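Because the metric is differentiable, one could in principle plug it into a training loop as an auxiliary consistency loss. A hypothetical sketch: the callables are stand-ins, and the assumption that lower scores mean more consistent views is mine, so check the repository for the actual API:

```python
def training_step(generate_views, diffusion_loss, consistency_metric, optimizer, batch, w=0.1):
    """generate_views(batch) -> (B, V, 3, H, W) generated multi-view images;
    consistency_metric(views) -> differentiable per-sample scores
    (assumed: lower = more 3D-consistent). Both are placeholders."""
    views = generate_views(batch)
    loss = diffusion_loss(views, batch) + w * consistency_metric(views).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```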
We propose MEt3R, a new metric for measuring multi-view consistency in generated images. Our method is built upon DUSt3R and evaluates the consistency of projected DINO features between two views. It accurately captures the 3D consistency of generated images.
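Conceptually, the measurement warps dense features from one view into the other using predicted geometry and compares them. A simplified sketch of that idea, with the DUSt3R-based projection and the upsampled DINO features stubbed out as placeholder callables:

```python
import torch.nn.functional as F

def pairwise_consistency(img_a, img_b, features, cross_projection):
    """features(img) -> (1, C, H, W) dense (upsampled) DINO feature map;
    cross_projection(img_a, img_b) -> (1, H, W, 2) sampling grid in
    normalized [-1, 1] coordinates, mapping view b pixels to view a,
    e.g. derived from DUSt3R pointmaps. Both callables are stand-ins."""
    feat_a, feat_b = features(img_a), features(img_b)
    grid = cross_projection(img_a, img_b)
    feat_a_in_b = F.grid_sample(feat_a, grid, align_corners=False)  # warp a -> b
    similarity = F.cosine_similarity(feat_a_in_b, feat_b, dim=1)    # (1, H, W)
    return 1.0 - similarity.mean()  # lower = more 3D-consistent pair
```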
15.01.2025 17:21

Quantitative evaluation of diffusion model outputs is hard!
We realized that we are often lacking metrics for comparing the quality of video and multi-view diffusion models. In particular, quantifying multi-view 3D consistency across frames is difficult.
But not anymore: Introducing MEt3R 🧵
MEt3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen
tl;dr: DUSt3R + DINO + FeatUp together want to be FID for multiview generation
arxiv.org/abs/2501.06336
Now that my general computer vision starter pack is full (150/150 entries reached), here is one specific to 3D Vision: go.bsky.app/Cfm9XFe
21.11.2024 08:15