I am trying to. We will probably hear more about this around the next submission deadline.
18.03.2025 04:51
It seems so (from a quick glance only). The techniques used by Fast3R can also be applied to VGGT.
18.03.2025 04:49
Haha, this probably serves as an indirect validation of NVIDIA's stock value.
18.03.2025 04:48
Currently, this training approach is not very stable, but I believe that's likely because I haven't yet found the correct training method. I hope it can achieve better results in the future, which would avoid explicitly modelling point maps.
17.03.2025 12:40
Finally, great work together with Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny!
@oxford-vgg.bsky.social
Interesting observation: VGGT's camera & depth predictions are highly accurate and consistent. Unprojecting our predicted depth with predicted camera parameters yields even more precise point clouds than directly predicted point maps! Try this yourself using the Hugging Face demo 🤗
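For readers who want to see what "unprojecting depth with camera parameters" means in practice, here is a minimal pure-Python sketch. It is not VGGT's own code: the function name `unproject_depth` is hypothetical, and it assumes a pinhole camera, depth stored as z-distance along the optical axis, and world-to-camera extrinsics of the form x_cam = R x_world + t. The conventions in the released code may differ.

```python
def unproject_depth(depth, fx, fy, cx, cy, R, t):
    """Lift a depth map to world-space 3D points (illustrative sketch).

    depth: H x W nested list of z-depths along the camera axis (assumption).
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    R, t: world-to-camera rotation (3x3) and translation (3,), assuming
          x_cam = R @ x_world + t.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:  # skip pixels without a valid depth
                continue
            # Back-project pixel (u, v) into the camera frame.
            xc = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Invert the rigid transform: x_world = R^T (x_cam - t).
            diff = [xc[i] - t[i] for i in range(3)]
            xw = tuple(sum(R[i][j] * diff[i] for i in range(3)) for j in range(3))
            points.append(xw)
    return points


# Toy example: a 1x2 depth map seen by a camera at the world origin.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cloud = unproject_depth([[2.0, 2.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0,
                        R=identity, t=(0.0, 0.0, 0.0))
```

Running this per image and concatenating the results gives a single point cloud in the shared world frame, which is the quantity the post compares against the directly predicted point maps.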
17.03.2025 02:12
Compared to concurrent CVPR'25 Transformer-based 3D reconstruction methods, VGGT achieves significantly higher accuracy, with speed similar to the fastest of them, Fast3R.
17.03.2025 02:12
Bonus insight: Using pretrained VGGT significantly enhances downstream tasks like:
- Non-rigid point tracking
- Feed-forward novel view synthesis
A strong advantage of our method is the ability to predict 3D attributes without any expensive optimization. For example:
- VGGT can easily process ~200 images in ~10s on a single 40GB A100 GPU
- 50x faster than optimization-based methods, using far less memory
17.03.2025 02:11
Try our demo live on Hugging Face Spaces!
🤗: huggingface.co/spaces/faceb...
(See demo illustration below)
No expensive optimization needed, yet delivers SOTA results for:
✅ Camera Pose Estimation
✅ Multi-view Depth Estimation
✅ Dense Point Cloud Reconstruction
✅ Point Tracking
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds!
Project Page: vgg-t.github.io
Code & Weights: github.com/facebookrese...