Jiatao Gu

@jgu32.bsky.social

Machine Learning Researcher @Apple MLR. Incoming Assistant Professor @Penn CIS. See more details: https://jiataogu.me

355 Followers  |  356 Following  |  10 Posts  |  Joined: 20.11.2024
Posts by Jiatao Gu (@jgu32.bsky.social)

World-consistent Video Diffusion with Explicit 3D Modeling
Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still ...

Joint work by our awesome research intern Qihang Zhang, together with colleagues Shuangfei, Miguel, Kevin, Alex and Josh at Apple MLR!

Want to dive deeper? Check out our paper for full details

ArXiv: arxiv.org/abs/2412.01821
Project page: zqh0253.github.io/wvd/ (9/n, n=9)

04.12.2024 13:41 | 👍 2  🔁 0  💬 0  📌 0

WVD also supports controllable video generation. Given a single image, we estimate the 3D geometry via standard WVD inference, and project it to get partial XYZ images. Finally, WVD generates the RGB images jointly with the projected XYZ images through in-painting. (6/n)
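
The projection step can be illustrated with a minimal sketch. It assumes a pinhole camera and assumes that a partial XYZ image simply stores, at every pixel a 3D point lands on, that point's world coordinates, leaving the other pixels empty for in-painting. The function name `project_to_partial_xyz` and all of its parameters are hypothetical and are not taken from the WVD code.

```python
def project_to_partial_xyz(points_world, world_to_cam, fx, fy, cx, cy, height, width):
    """Splat world-space points into a target view, keeping the nearest point per pixel.

    Returns (xyz_image, mask): xyz_image[v][u] is a world-space point or None,
    and mask[v][u] is True where a point landed (the known region).
    """
    xyz_image = [[None] * width for _ in range(height)]
    nearest = [[float("inf")] * width for _ in range(height)]
    for p in points_world:
        ph = (p[0], p[1], p[2], 1.0)
        # Transform the world point into the target camera's frame.
        x, y, z = (sum(world_to_cam[i][k] * ph[k] for k in range(4)) for i in range(3))
        if z <= 0:
            continue  # the point is behind the camera
        # Pinhole projection to integer pixel coordinates.
        u = round(fx * x / z + cx)
        v = round(fy * y / z + cy)
        # Simple z-buffer: only the closest point survives at each pixel.
        if 0 <= u < width and 0 <= v < height and z < nearest[v][u]:
            nearest[v][u] = z
            xyz_image[v][u] = tuple(p)
    mask = [[cell is not None for cell in row] for row in xyz_image]
    return xyz_image, mask


# Toy example: identity camera, 2 x 2 target image.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, -1.0)]
partial, mask = project_to_partial_xyz(points, identity, 1.0, 1.0, 0.0, 0.0, 2, 2)
```

In this toy case two points compete for pixel (0, 0) and the nearer one wins, one point is dropped for being behind the camera, and the two uncovered pixels stay empty. Those empty pixels are the region a generative model would be asked to fill.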

04.12.2024 13:41 | 👍 1  🔁 0  💬 1  📌 0

For example, WVD can be directly applied to various single-image tasks. WVD can also take unposed images (video) as input, and infer XYZ images via an “in-painting” strategy. With a post-optimization procedure, the XYZ images can be converted to camera poses and depth maps. (5/n)
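
One common way to realize such an in-painting strategy is to keep the observed channels fixed and let the model fill in the rest. The sketch below shows only that masking and blending step, under loud assumptions: each frame is reduced to a single 6-channel pixel (R, G, B, X, Y, Z) for brevity, and the helper names `make_channel_mask` and `blend_known` are hypothetical. In a generic diffusion sampler this blend would be repeated at every denoising step with the observation noised to the current level; the thread does not confirm that WVD does exactly this.

```python
def make_channel_mask(num_frames, known="rgb"):
    """Per-frame mask over 6 channels (R, G, B, X, Y, Z); True marks observed values."""
    rgb_known = known == "rgb"
    return [[rgb_known] * 3 + [not rgb_known] * 3 for _ in range(num_frames)]


def blend_known(sample, observed, mask):
    """Keep observed values where the mask is True and the model's sample elsewhere."""
    return [
        [o if m else s for s, o, m in zip(s_frame, o_frame, m_frame)]
        for s_frame, o_frame, m_frame in zip(sample, observed, mask)
    ]


# Toy example: one frame, RGB observed, XYZ to be in-painted.
model_sample = [[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]  # stand-in for a denoiser output
observation = [[1, 2, 3, 0, 0, 0]]               # XYZ entries are placeholders
mask = make_channel_mask(1, known="rgb")
blended = blend_known(model_sample, observation, mask)
```

Calling `make_channel_mask(1, known="xyz")` flips the roles, which corresponds to generating RGB from given XYZ.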

04.12.2024 13:41 | 👍 1  🔁 0  💬 1  📌 0

At inference time, this joint distribution can be leveraged to estimate conditional distributions, such as P(XYZ | RGB) or P(RGB | XYZ). This capability makes WVD a foundation for supporting a wide range of downstream tasks. (4/n)
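
For reference, the relationship between the joint distribution P(RGB, XYZ) and the two conditionals is the standard one from probability theory. The block below only restates those identities; a diffusion model does not evaluate the integrals explicitly, and the thread does not spell out the exact sampling procedure.

```latex
\begin{align}
P(\mathrm{XYZ} \mid \mathrm{RGB}) &= \frac{P(\mathrm{RGB}, \mathrm{XYZ})}{P(\mathrm{RGB})}, &
P(\mathrm{RGB}) &= \int P(\mathrm{RGB}, \mathrm{XYZ}) \, d\mathrm{XYZ}, \\
P(\mathrm{RGB} \mid \mathrm{XYZ}) &= \frac{P(\mathrm{RGB}, \mathrm{XYZ})}{P(\mathrm{XYZ})}, &
P(\mathrm{XYZ}) &= \int P(\mathrm{RGB}, \mathrm{XYZ}) \, d\mathrm{RGB}.
\end{align}
```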

04.12.2024 13:41 | 👍 0  🔁 0  💬 1  📌 0

During training, WVD learns to generate 6D (RGB + XYZ) videos by modeling the joint probability P(RGB, XYZ), effectively capturing their interdependent structures and features. (3/n)

04.12.2024 13:41 | 👍 0  🔁 0  💬 1  📌 0

Existing multi-view/video diffusion models usually lack explicit 3D supervision (or guarantees), leading to potential 3D inconsistency and inefficient training.

In contrast, WVD jointly models multi-view images and explicit 3D geometry. Specifically, we represent the 3D geometry via XYZ images. (2/n)
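
To make the XYZ-image idea concrete, here is a minimal sketch, assuming an XYZ image stores one 3D point in a shared world frame for every pixel and assuming a simple pinhole camera. The function name `depth_to_xyz_image` and the parameters `fx`, `fy`, `cx`, `cy`, and `cam_to_world` are hypothetical and are not taken from the WVD code.

```python
def depth_to_xyz_image(depth, fx, fy, cx, cy, cam_to_world):
    """Turn a depth map into an XYZ image (one world-space point per pixel).

    depth          : H x W nested list of z-depths along the camera axis
    fx, fy, cx, cy : pinhole intrinsics (assumed, for illustration only)
    cam_to_world   : 4 x 4 nested list, camera-to-world rigid transform
    """
    xyz_image = []
    for v, row in enumerate(depth):
        xyz_row = []
        for u, z in enumerate(row):
            # Back-project pixel (u, v) with depth z into camera coordinates.
            point_cam = ((u - cx) / fx * z, (v - cy) / fy * z, z, 1.0)
            # Move the point into the shared world frame.
            point_world = tuple(
                sum(cam_to_world[i][k] * point_cam[k] for k in range(4))
                for i in range(3)
            )
            xyz_row.append(point_world)
        xyz_image.append(xyz_row)
    return xyz_image


# Toy example: 2 x 2 depth map, camera shifted by +1 along the world X axis.
pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
xyz = depth_to_xyz_image(
    [[2.0, 2.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0, cam_to_world=pose
)
```

Because every view writes its points in the same world frame, XYZ images from different cameras can be compared pixel by pixel, which is what makes them a convenient carrier of explicit 3D geometry.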

04.12.2024 13:41 | 👍 0  🔁 0  💬 1  📌 0
WVD Pipeline

🤔 Image-to-3D, monocular depth estimation, camera pose estimation, …, can we achieve all of this with just ONE model easily?

🚀 Our answer is Yes -- Excited to introduce our latest work: World-consistent Video Diffusion (WVD) with Explicit 3D Modeling!

arxiv.org/abs/2412.01821

04.12.2024 13:41 | 👍 14  🔁 6  💬 1  📌 0

More interesting research work 🤔

29.11.2024 14:37 | 👍 1  🔁 0  💬 1  📌 0

Can anyone help add me to some starter pack 🥲😰

29.11.2024 05:48 | 👍 1  🔁 0  💬 1  📌 0
Doctoral Program

I am seeking multiple PhD students passionate about Generative Intelligence and its applications in empowering AI agents to interact with the physical world to join us at UPenn CIS for the 2024-2025 academic cycle. You can find more information at www.cis.upenn.edu/graduate/pro...

27.11.2024 01:18 | 👍 17  🔁 4  💬 0  📌 1