@wimmerthomas.bsky.social
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision. https://wimmerth.github.io
All the links can be found here. Great collaborators!
bsky.app/profile/odue...
🎉 Just accepted to ICCV 2025!
In DIY-SC, we improve foundation-model features using a lightweight adapter trained with carefully filtered and refined pseudo-labels.
🔧 Drop-in alternative to plain DINOv2 features!
📦 Code + pre-trained weights available now.
🔥 Try it in your next vision project!
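For illustration only, here is a minimal sketch of what a drop-in adapter on top of frozen DINOv2 features could look like. The adapter architecture below is hypothetical, not the released DIY-SC model; use the official code and weights for real projects:

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone (this torch.hub entry point is the official one).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval().requires_grad_(False)

class FeatureAdapter(nn.Module):
    """Hypothetical lightweight adapter that refines per-patch features."""
    def __init__(self, dim=768):  # 768 = ViT-B/14 feature dimension
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats):
        # Residual refinement keeps the output shape identical to the input,
        # so the adapter stays a drop-in replacement for plain DINOv2 features.
        return feats + self.mlp(feats)

adapter = FeatureAdapter()

def refined_features(images):
    # images: (B, 3, H, W) with H and W divisible by the 14-pixel patch size.
    with torch.no_grad():
        feats = backbone.forward_features(images)["x_norm_patchtokens"]
    return adapter(feats)  # (B, num_patches, 768), same shape as before
```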
The CVML group at @mpi-inf.mpg.de has been busy for CVPR. Check out our papers and come by the presentations!
11.06.2025 12:07

Hello world, we are now on Bluesky 🦋! Follow us to receive updates on exciting research and projects from our group!
#computervision #machinelearning #research
We only use open-source models, and the implementation of our method is readily available. Please check out the paper website for more details:
wimmerth.github.io/gaussians2li...
We can animate arbitrary 3D scenes within 10 minutes on an RTX 4090 while keeping scene appearance and geometry intact.
Note that since I worked on this, open-source video diffusion models have improved significantly, which will directly improve the results of this method as well.
🧵⬇️
Improving multi-view consistency of generated videos through latent interpolation. In addition to rendering the dynamic scene f from the current viewpoint with the rendering function g, giving g(f)_s, we compute the latent embedding of the warped video output v_{s-1} from the previous optimization step (taken from a different viewpoint). We linearly interpolate the two latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The result is decoded into the new video output v_s.
While we can now transfer motion into 3D, we still have to deal with a fundamental problem: the lack of 3D consistency in generated videos.
With limited resources, we can't fine-tune or retrain a VDM to be pose-conditioned. Thus, we propose a zero-shot technique to generate more 3D-consistent videos!
🧵⬇️
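To make the figure concrete, here is a minimal sketch of a single latent-interpolation step. The encoder, decoder, and VDM are placeholder callables, and the blending weight alpha is an assumption of this sketch, not a value from the paper:

```python
def consistent_video_step(encode, decode, vdm, rendering, v_prev, static_view, alpha=0.5):
    """One optimization step s: blend the latent of the current rendering
    g(f)_s with the latent of the warped previous output v_{s-1} before
    refining it with the video diffusion model. All callables here are
    placeholders for concrete models."""
    z_render = encode(rendering)  # latent of g(f)_s from the current viewpoint
    z_prev = encode(v_prev)       # latent of warped v_{s-1} (previous viewpoint)
    z = alpha * z_render + (1.0 - alpha) * z_prev  # linear latent interpolation
    z_refined = vdm(z, static_view)  # conditioned on the static scene view
    return decode(z_refined)         # new video output v_s
```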
Method overview for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D. The 4D (dynamic 3D) Gaussians are initialized with the static 3D scene input.
Standard practices like Score Distillation Sampling (SDS) fail for this task, as VDMs provide a guidance signal that is too noisy, resulting in "exploding" scenes.
Instead, we propose to employ several pre-trained 2D models to directly lift motion from tracked points in the generated videos to 3D Gaussians.
🧵⬇️
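For the mechanics of the lifting step, here is a minimal sketch of the unprojection under a standard pinhole camera model. It is my illustration; as described in the post, the 2D tracks and the aligned depth values are assumed to come from off-the-shelf pre-trained models:

```python
import torch

def lift_tracks_to_3d(tracks_2d, depth, K, cam_to_world):
    """Unproject 2D point tracks into world-space 3D tracks.
    tracks_2d:    (T, N, 2) pixel coordinates over T frames
    depth:        (T, N) aligned depth sampled at the track locations
    K:            (3, 3) camera intrinsics
    cam_to_world: (T, 4, 4) camera-to-world poses per frame
    """
    T, N, _ = tracks_2d.shape
    ones = torch.ones(T, N, 1)
    pix = torch.cat([tracks_2d, ones], dim=-1)            # homogeneous pixels
    rays = torch.einsum("ij,tnj->tni", K.inverse(), pix)  # camera-space rays
    pts_cam = rays * depth.unsqueeze(-1)                  # scale rays by depth
    pts_h = torch.cat([pts_cam, ones], dim=-1)            # homogeneous points
    pts_world = torch.einsum("tij,tnj->tni", cam_to_world, pts_h)
    return pts_world[..., :3]                             # (T, N, 3) 3D tracks
```

These world-space tracks can then supervise the motion of the dynamic 3D Gaussians.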
Had the honor to present "Gaussians-to-Life" at #3DV2025 yesterday. In this work, we used video diffusion models to animate arbitrary 3D Gaussian Splatting scenes.
This work was a great collaboration with @moechsle.bsky.social, @miniemeyer.bsky.social, and Federico Tombari.
🧵⬇️
Can you do reasoning with diffusion models?
The answer is yes!
Take a look at Spatial Reasoning Models. Hats off to the authors for this amazing work!
I wonder to what degree one could artificially make real images (with ground-truth depth) more abstract during training, so that depth models learn the priors we humans have (like green = field, blue = sky), and whether that would actually give us any benefit, like increased robustness...
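If one wanted to try this, a simple way to make training images more "abstract" is to flatten each superpixel to its mean color. A sketch using scikit-image; this is purely my illustration of the idea, not an evaluated recipe:

```python
import numpy as np
from skimage.segmentation import slic

def abstractify(image, n_segments=150):
    """Flatten an RGB image (H, W, 3), float in [0, 1], into superpixel mean
    colors, approximating a 'green = field, blue = sky' level of abstraction."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    out = image.copy()
    for label in np.unique(labels):
        mask = labels == label
        out[mask] = image[mask].mean(axis=0)  # mean color of this superpixel
    return out
```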
14.02.2025 14:30

Ah, thanks, I overlooked that :)
14.02.2025 14:19

Nice experiments! What model did you use?
14.02.2025 14:07

🏔️⛷️ Looking back on a fantastic week full of talks, research discussions, and skiing in the Austrian mountains!
31.01.2025 19:38

Give a warm welcome to @janericlenssen.bsky.social!
16.01.2025 17:29

Well well, it turns out that GIFs aren't yet supported on this platform. Here is the teaser video as an MP4 instead:
15.01.2025 17:27

This work was led by @mohammadasim98.bsky.social and is a collaboration with Christopher Wewer, Bernt Schiele, and Jan Eric Lenssen.
Check out the website with lots of nice visuals that show how our metric works, and use it in your next diffusion model project!
geometric-rl.mpi-inf.mpg.de/met3r/
Important note: our metric is not meant to measure the visual quality or appearance of generated content. Instead, it acts orthogonally to existing image quality metrics by focusing on the 3D consistency of generated frames.
15.01.2025 17:21

Especially for video generation methods where no ground-truth camera poses are given, our proposed metric can help shed light on the quality of the generated videos, rather than just reporting results from yet another human survey.
15.01.2025 17:21

Speaking of multi-view diffusion models, we also trained a new open-source multi-view latent diffusion model built on top of Stable Diffusion and heavily inspired by the closed-source CAT3D model.
Weights and code are already public. Check it out!
github.com/mohammadasim...
The MEt3R scores correlate well with the 3D awareness of different multi-view image generation methods, as we show in our experiments. The metric is also differentiable, which means you could even use it for training! The code is easy to run and already open source!
github.com/mohammadasim...
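Because the metric is differentiable, one could in principle plug it into a training loop as an auxiliary consistency loss. A hypothetical sketch: the callables are stand-ins, and the assumption that lower scores mean more consistent views is mine, so check the repository for the actual API:

```python
def training_step(generate_views, diffusion_loss, consistency_metric, optimizer, batch, w=0.1):
    """generate_views(batch) -> (B, V, 3, H, W) generated multi-view images;
    consistency_metric(views) -> differentiable per-sample scores
    (assumed: lower = more 3D-consistent). Both are placeholders."""
    views = generate_views(batch)
    loss = diffusion_loss(views, batch) + w * consistency_metric(views).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```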
We propose MEt3R, a new metric for measuring multi-view consistency in generated images. Our method is built upon DUSt3R and evaluates the consistency of projected DINO features between two views. It accurately captures the 3D consistency of generated images.
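Conceptually, the measurement warps dense features from one view into the other using predicted geometry and compares them. A simplified sketch of that idea, with the DUSt3R-based projection and the upsampled DINO features stubbed out as placeholder callables:

```python
import torch.nn.functional as F

def pairwise_consistency(img_a, img_b, features, cross_projection):
    """features(img) -> (1, C, H, W) dense (upsampled) DINO feature map;
    cross_projection(img_a, img_b) -> (1, H, W, 2) sampling grid in
    normalized [-1, 1] coordinates, mapping view b pixels to view a,
    e.g. derived from DUSt3R pointmaps. Both callables are stand-ins."""
    feat_a, feat_b = features(img_a), features(img_b)
    grid = cross_projection(img_a, img_b)
    feat_a_in_b = F.grid_sample(feat_a, grid, align_corners=False)  # warp a -> b
    similarity = F.cosine_similarity(feat_a_in_b, feat_b, dim=1)    # (1, H, W)
    return 1.0 - similarity.mean()  # lower = more 3D-consistent pair
```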
15.01.2025 17:21

Quantitative evaluation of diffusion model outputs is hard!
We realized that we are often lacking metrics for comparing the quality of video and multi-view diffusion models. In particular, quantifying multi-view 3D consistency across frames is difficult.
But not anymore: Introducing MEt3R 🧵
MEt3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen
tl;dr: DUSt3R + DINO + FeatUp together want to be FID for multiview generation
arxiv.org/abs/2501.06336
Now that my general computer vision starter pack is full (150/150 entries reached), here is one specific to 3D Vision: go.bsky.app/Cfm9XFe
21.11.2024 08:15