Generating 3D-Consistent Videos from Unposed Internet Photos

Cornell University, Adobe Research

TL;DR We propose KFC-W (KeyFrame-Conditioned video generation in-the-Wild), a method that generates a video by interpolating between unposed internet photos. Commercial video models such as Luma Dream Machine can fail to produce 3D-consistent videos; in the videos below, for instance, they tend to hallucinate new buildings. We therefore jointly train video synthesis with a scalable 3D-aware objective, which teaches our model to identify the scene's geometry and layout.

The generated video simulates a camera movement linking all input images, progressing from input 1 to n.
The green border denotes the first video frame, which should correspond to the first input image.
The red border denotes the last video frame, which should correspond to the last input image.

Unposed internet photos from the Phototourism Dataset


Unposed photos from the Re10k Dataset

Abstract

We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms commercial models in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Self-Supervised Learning without 3D Supervision

We design two objectives:

1. Multiview inpainting addresses geometric understanding by training the model to extract 3D relationships from wide-baseline, unposed images.
2. View interpolation addresses temporal coherence by training the model to generate smooth, consistent camera trajectories, which is our desired output.
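To make the joint training concrete, here is a minimal PyTorch-style sketch of how the two objectives could share a single masked denoising loss and differ only in which frames are revealed as clean context. The denoiser interface, masking ratio, and noising rule are assumptions for illustration, not our actual implementation.

    import torch

    def training_step(denoiser, batch, objective):
        # Hypothetical joint-training step: both objectives use the same masked
        # denoising loss; they differ only in which frames are kept as clean context.
        frames = batch["frames"]                       # (B, T, C, H, W) latent frames
        B, T = frames.shape[:2]

        if objective == "multiview_inpainting":
            # Unordered internet photos: reveal a random subset, inpaint the rest.
            keep = torch.rand(B, T, device=frames.device) < 0.3   # assumed ratio
        else:  # "view_interpolation"
            # Video clip: reveal only the first and last frame, interpolate between.
            keep = torch.zeros(B, T, dtype=torch.bool, device=frames.device)
            keep[:, 0] = True
            keep[:, -1] = True

        noise = torch.randn_like(frames)
        t = torch.rand(B, device=frames.device)                    # diffusion time
        noisy = frames + noise * t.view(B, 1, 1, 1, 1)              # placeholder noising rule
        model_in = torch.where(keep[..., None, None, None], frames, noisy)

        pred = denoiser(model_in, t, context_mask=keep)             # assumed interface
        loss = ((pred - noise) ** 2)[~keep].mean()                  # loss only on masked frames
        return loss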

Multiview Inpainting of Internet Photos

View Interpolation of Videos

Model Architecture

We fine-tune a pretrained text-to-video diffusion transformer. Left: training; Right: inference.
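As a rough illustration of the inference side, the sketch below places the unposed input photos at keyframe positions and denoises the intermediate frames from noise. The denoiser interface matches the training sketch above; the frame count, slot placement, and sampler update are assumptions, not the released code.

    import torch

    @torch.no_grad()
    def generate_video(denoiser, keyframes, num_frames=48, num_steps=50):
        # keyframes: (K, C, H, W) encoded input photos, ordered along the desired path.
        # Returns (num_frames, C, H, W) latent frames interpolating between them.
        K, C, H, W = keyframes.shape

        # Spread the K input photos evenly across the output sequence.
        slots = torch.linspace(0, num_frames - 1, K).round().long()
        keep = torch.zeros(num_frames, dtype=torch.bool)
        keep[slots] = True

        frames = torch.randn(num_frames, C, H, W)   # intermediate frames start as noise
        frames[slots] = keyframes                   # keyframes stay fixed throughout

        for step in reversed(range(num_steps)):
            t = torch.full((1,), (step + 1) / num_steps)
            pred = denoiser(frames.unsqueeze(0), t, context_mask=keep.unsqueeze(0))[0]
            # Placeholder Euler-style update; a real sampler (e.g. DDIM) differs.
            frames = torch.where(keep[:, None, None, None],
                                 frames, frames - pred / num_steps)

        return frames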

Qualitative Comparisons


Input Images

Applications

3D Reconstruction via COLMAP

We validate whether our generated frames are geometrically consistent, and therefore suitable for downstream applications such as 3D reconstruction. We first run COLMAP on the original input views alone, then rerun it with our generated frames included. The improvement in reconstruction success rate shows that our generated frames provide reliable feature correspondences that connect distant views.
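For reference, a standard COLMAP sparse-reconstruction pipeline over a folder containing the input views (and, in a second run, the generated frames as well) could look like the sketch below; the directory names are hypothetical and the exact settings used in our experiments may differ.

    import subprocess
    from pathlib import Path

    def run_colmap(image_dir: str, workspace: str) -> None:
        # Standard COLMAP sparse pipeline: features -> exhaustive matching -> mapping.
        ws = Path(workspace)
        ws.mkdir(parents=True, exist_ok=True)
        db = ws / "database.db"
        sparse = ws / "sparse"
        sparse.mkdir(exist_ok=True)

        subprocess.run(["colmap", "feature_extractor",
                        "--database_path", str(db),
                        "--image_path", image_dir], check=True)
        subprocess.run(["colmap", "exhaustive_matcher",
                        "--database_path", str(db)], check=True)
        subprocess.run(["colmap", "mapper",
                        "--database_path", str(db),
                        "--image_path", image_dir,
                        "--output_path", str(sparse)], check=True)

    # Hypothetical usage: reconstruct from the input views alone, then again with
    # our generated frames added to the same scene folder.
    # run_colmap("scene/input_views", "scene/colmap_inputs_only")
    # run_colmap("scene/input_views_plus_generated", "scene/colmap_with_generated")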


3D Gaussian Splatting via InstantSplat

We also experiment with running 3D Gaussian Splatting (3DGS) on our generated frames. Internet photos from the Phototourism dataset have wide baselines, significant occlusions, and varying illumination, which make it very difficult to train 3DGS methods that rely on a pixel-wise rendering loss. Our generated frames are denser and have more consistent illumination, leading to substantial improvements in reconstruction metrics.

Left: 3DGS trained on the input images; Right: 3DGS trained on our generated frames.

Takeaways

We posit that brute-force scaling alone will not help video models understand the physical world, as even the most advanced video models today have difficulty understanding physics or scene layouts. Rather than incorporating conditions such as camera poses, which can be difficult to estimate reliably at scale, we jointly train with a scalable 3D-aware objective. We suggest that this idea can be applied to other tasks as well, such as modeling motion that respects physical constraints. Additionally, our model generates videos from internet photos even though it never sees this specific input-output pairing during training, suggesting that multitask learning leads to emergent capabilities.

Acknowledgments

We thank Kalyan Sunkavalli and Nathan Carr for supporting this project. Gene Chou was supported by an NSF graduate fellowship.

BibTeX


    @misc{chou2024kfcw,
      title={Generating 3D-Consistent Videos from Unposed Internet Photos},
      author={Gene Chou and Kai Zhang and Sai Bi and Hao Tan and Zexiang Xu and Fujun Luan and Bharath Hariharan and Noah Snavely},
      year={2024},
      eprint={2411.13549},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13549},
    }