Cheng and Shaikh et al., "MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry"
Simple, practical idea that works. Sort->split->merge for faster, scalable reconstruction.
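The paper's actual pipeline isn't in these notes, but the merge step of a sort→split→merge scheme can be sketched with a toy: chunks "reconstructed" in arbitrary rigid frames, stitched back together by Procrustes alignment on their overlapping points (2D, rotation+translation only, no noise; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def procrustes_2d(src, dst):
    """Rigid transform (R, t) minimizing ||R @ src + t - dst|| (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(1, keepdims=True), dst.mean(1, keepdims=True)
    U, _, Vt = np.linalg.svd((dst - mu_d) @ (src - mu_s).T)
    R = U @ np.diag([1, np.linalg.det(U @ Vt)]) @ Vt  # guard against reflection
    return R, mu_d - R @ mu_s

# Ground-truth 2D "scene": a smooth trajectory of points (2 x N).
N = 100
t = np.linspace(0, 4 * np.pi, N)
world = np.stack([t, np.sin(t)])

# Split into overlapping chunks; each chunk is "reconstructed" in its own
# arbitrary rigid frame (simulated by a random rotation + translation).
size, overlap = 40, 10
chunks = []
for s in range(0, N - size + 1, size - overlap):
    a = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    chunks.append((s, R @ world[:, s:s + size] + rng.normal(size=(2, 1))))

# Merge: align each chunk to the running reconstruction via the overlap.
s0, merged = chunks[0]
for s, pts in chunks[1:]:
    k = s0 + merged.shape[1] - s          # number of shared points
    R, tr = procrustes_2d(pts[:, :k], merged[:, -k:])
    merged = np.hstack([merged, (R @ pts + tr)[:, k:]])

# The merged map matches the ground truth up to one global rigid transform.
R, tr = procrustes_2d(merged, world)
print(np.abs(R @ merged + tr - world).max())  # near machine precision
```

A real system additionally has to handle scale ambiguity, noisy overlaps, and loop closure; this only shows the stitching idea.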
Jain et al., "NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code"
Given current capabilities, it makes sense to stop spending human time reproducing work. Let agents read the paper, write the code, and verify with nerfstudio.
Zhu et al., "Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates"
With monocular depth estimators becoming ever more accurate, they are now often the better choice for SfM. Good, but not the best on IMC phototourism yet!
Cong et al., "Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning"
More data is better, and you can get it from unlabeled videos with off-the-shelf dense flow predictors: 800k unlabeled videos to further enhance geometry estimation.
Adamkiewicz et al., "When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators"
Interesting that while we have "better" image generators, their usefulness as synthetic data generators is declining. Do we need a pivot?
Xie, Sun, Neall et al., "Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control"
Camera & 3D hand pose conditioned video generation with DiTs.
Ye et al., "World Action Models are Zero-shot Policies"
Video + action training. A LOT of engineering gems that make it work on an actual robot.
Liu et al., "Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge"
Train a diffusion bridge that maps images to self-supervised foundation features and back, enabling image-to-image translation.
Ke et al., "CAPA: Depth Completion as Parameter-Efficient Test-Time Adaptation"
Fine-tune your foundation model at test time on sparse measurements. Makes a lot of sense if you have, e.g., LiDAR measurements at hand.
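CAPA's actual parameter-efficient machinery isn't in these notes; the spirit can be shown with the simplest possible test-time adaptation, fitting a global scale-and-shift of a relative depth prediction to a handful of sparse (hypothetical LiDAR) points. Everything below is a made-up toy setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth metric depth and a "foundation model" prediction
# that is only correct up to an unknown affine transform (scale + shift).
H, W = 64, 64
metric = 2.0 + np.abs(rng.standard_normal((H, W))).cumsum(1) * 0.05
pred_rel = 0.3 * metric + 1.7 + rng.normal(0, 0.01, (H, W))

# Sparse measurements, e.g. a few hundred LiDAR returns.
idx = rng.choice(H * W, size=200, replace=False)
z = metric.ravel()[idx]

# "Adapt" at test time: least-squares fit of scale s and shift t on the
# sparse points only, then apply to the dense prediction.
A = np.stack([pred_rel.ravel()[idx], np.ones_like(z)], axis=1)
(s, t), *_ = np.linalg.lstsq(A, z, rcond=None)
adapted = s * pred_rel + t

print(np.abs(pred_rel - metric).mean())  # large raw error
print(np.abs(adapted - metric).mean())   # small after adaptation
```

The paper fine-tunes actual model parameters rather than a two-parameter affine map, but the test-time signal is the same: sparse measurements anchoring a dense prediction.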
Yue et al., "Image Generation with a Sphere Encoder"
Sampling Gaussian noise independently for each pixel means you actually sample more-or-less on a sphere, so let the latent space live on a sphere in the first place.
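The "more-or-less on a sphere" part is just Gaussian concentration of measure: in d dimensions, the norm of i.i.d. N(0,1) noise concentrates tightly around sqrt(d). A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64 * 64  # e.g. a 64x64 single-channel latent
x = rng.standard_normal((10_000, d))
r = np.linalg.norm(x, axis=1)

# Norms cluster around sqrt(d) = 64 with std near 1/sqrt(2),
# i.e. about a 1% spread: effectively a thin spherical shell.
print(r.mean(), r.std())
```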
Qin and Sun et al., "Variation-aware Flexible 3D Gaussian Editing"
Distill 2D edits into a feed-forward model that predicts edit fields. Quick 3D editing of 3D Gaussians.
Bajpai et al., "FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference"
You can use multi-armed bandits with flow matching models to accelerate inference by 2.6x, with no training and minimal overhead.
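The paper's actual bandit formulation isn't detailed in these notes; as a toy stand-in, here is a UCB1 bandit choosing the number of Euler integration steps to trade accuracy against compute, with the flow model replaced by the ODE dx/dt = -x (the reward function and constants are invented for illustration):

```python
import math
import numpy as np

# Arms: candidate step counts for an Euler sampler.
arms = [2, 4, 8, 16, 32]

def reward(n_steps):
    """Quality-minus-cost for integrating dx/dt = -x from x=1 over [0, 1].

    Stand-in for "sample quality at n_steps"; a real system would score
    generated samples. Euler yields (1 - 1/n)^n, the exact answer is e^-1.
    """
    x = 1.0
    for _ in range(n_steps):
        x += (-x) / n_steps
    error = abs(x - math.exp(-1))
    return -(error + 0.002 * n_steps)  # penalize extra function evaluations

# UCB1: pull each arm once, then pick argmax of mean + exploration bonus.
pulls = np.zeros(len(arms))
means = np.zeros(len(arms))
for t in range(1, 401):
    if t <= len(arms):
        a = t - 1
    else:
        a = int(np.argmax(means + np.sqrt(2 * math.log(t) / pulls)))
    r = reward(arms[a])
    pulls[a] += 1
    means[a] += (r - means[a]) / pulls[a]

best = arms[int(np.argmax(means))]
print(best)  # settles on a mid-range step count (8 here)
```

With deterministic rewards the bandit is overkill, of course; the interesting case is noisy per-sample quality estimates, which is exactly what bandits handle.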
Luo et al., "4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere"
Encode videos into 4D latents, then query motion and geometry between any two frames.
Mauel and Hübers et al., "Foundation Inference Models for Ordinary Differential Equations"
Feed-forward solver for generic ODEs, also easily fine-tunable for a given task. These things always fascinate me: sort of learning to learn (solve?).
Shavin and Benaim, "Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation"
When distilling vision foundation models with a focus on geometric consistency, insert a feed-forward Gaussian Splatting in the middle.