SimVS: Simulating World Inconsistencies for Robust View Synthesis

1UC San Diego 2Google DeepMind 3Google Research
4Columbia University 5University of Maryland

CVPR 2025

Teaser: an inconsistent in-the-wild capture is made consistent.

TL;DR: Turn inconsistent captures into consistent multiview images

Challenge

3D reconstruction requires everything in a scene to be frozen, but the real world constantly breaks this assumption: objects move and the lighting naturally changes over time. Gathering paired data to address this problem at scale would be extremely difficult. Instead, we propose a generative augmentation strategy that simulates these inconsistencies, and we train a second generative model to sample consistent multiview images conditioned on the inconsistent inputs. At test time, our model generalizes to the real world, benefiting from the steadily improving quality of video models.

How it works

Given a multiview dataset, we perform generative augmentation with a video model, simulating inconsistencies from the individual images and generated inconsistency text prompts. The augmented images are fed to a multiview generative model along with a held-out "target state" image. The model is trained to predict a consistent set of images corresponding to the target state, i.e., the original multiview images.

A visualization of the training pipeline of our method.
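
To make the data flow concrete, below is a minimal sketch of how one training example could be assembled. The functions `video_inconsistency_augment` and `make_training_example` are hypothetical stand-ins (not the paper's actual models, which use Lumiere and a multiview diffusion model); the sketch only illustrates the augment-then-reconstruct setup described above.

```python
# Minimal sketch of SimVS-style training data construction (assumptions:
# toy stand-in augmentation instead of a real video model).

import random
import numpy as np

def video_inconsistency_augment(image, prompt, rng):
    """Stand-in for the video model: perturb an image to simulate an
    inconsistency (e.g., object motion or a lighting change)."""
    noise = rng.normal(scale=0.05, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def make_training_example(multiview_images, inconsistency_prompts, rng):
    """Build one (inputs, targets) pair from a consistent multiview capture."""
    n = len(multiview_images)
    target_idx = int(rng.integers(n))            # held-out "target state" view
    target_state = multiview_images[target_idx]

    inconsistent_inputs = []
    for i, img in enumerate(multiview_images):
        if i == target_idx:
            inconsistent_inputs.append(img)      # target state stays untouched
        else:
            prompt = random.choice(inconsistency_prompts)
            inconsistent_inputs.append(
                video_inconsistency_augment(img, prompt, rng))

    # The multiview generative model is trained to map the inconsistent set
    # (plus the target-state image) back to the original consistent set.
    inputs = (inconsistent_inputs, target_state)
    targets = multiview_images
    return inputs, targets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = [rng.random((64, 64, 3)) for _ in range(4)]   # toy multiview set
    prompts = ["a person walks through the scene", "the lights turn warmer"]
    (inconsistent, target), consistent = make_training_example(views, prompts, rng)
    print(len(inconsistent), target.shape, len(consistent))
```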

Harmonizing Sparse Images of Dynamic Scenes

We take (unordered) sparse image sets from DyCheck and make them consistent with the image shown in orange. Compare the renders, depth maps, and diffusion samples of our method, SimVS (right), with those of CAT3D (left). Note that the diffusion samples follow the input video trajectory, which is not smooth.

Harmonizing Sparse Images of Varying Illumination

We collect our own dataset of varying illumination with an iPhone camera, in which the same scene is observed under three different lighting conditions. The models reconstruct the scene under the target state specified in orange, conditioned on the images with varying illumination. Compare the renders, depth maps, and diffusion samples of our method, SimVS (right), with those of CAT3D (left). In this case, the CAT3D baseline samples supervise a GLO-based NeRF, and the GLO embedding corresponding to the target-state image is used at test time.
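
For context, a GLO-based NeRF augments the radiance field with a learned per-image appearance code. The sketch below, assuming a toy MLP field in PyTorch with hypothetical class and parameter names (not the baseline's actual implementation), shows how such codes condition the field and how the target-state code is reused at test time.

```python
# Minimal sketch of a GLO-style per-image appearance embedding for a
# radiance field (assumptions: toy MLP field, illustrative names only).

import torch
import torch.nn as nn

class GLORadianceField(nn.Module):
    def __init__(self, num_images, embed_dim=16, hidden=128):
        super().__init__()
        # One learned appearance code per training image, optimized jointly
        # with the field (Generative Latent Optimization style).
        self.appearance_codes = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, points, image_ids):
        # points: (N, 3) sample locations; image_ids: (N,) index of the
        # training image each ray came from.
        codes = self.appearance_codes(image_ids)
        out = self.mlp(torch.cat([points, codes], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        density = torch.relu(out[..., 3:])
        return rgb, density

# At test time, the appearance code of the target-state image is used for all
# rendered views, so the reconstruction matches that lighting condition.
field = GLORadianceField(num_images=3)
pts = torch.rand(1024, 3)
target_ids = torch.full((1024,), 2, dtype=torch.long)  # target-state image id
rgb, density = field(pts, target_ids)
```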

Interactive Grid of Video Samples

Click on the interactive grid to view our generative video augmentation in comparison to the original image and two heuristic augmentation strategies. These videos were sampled from Lumiere using the inconsistency prompt generation methods specified in the paper. We use this data for training SimVS. Note that some videos include flashing lights.

Acknowledgements

We would like to thank Paul-Edouard Sarlin, Jiamu Sun, Songyou Peng, Linyi Jin, Richard Tucker, Rick Szeliski and Stan Szymanowicz for insightful conversations and help. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang and Amir Hertz for training the base text-to-image latent diffusion model. This work was supported in part by an NSF Fellowship, ONR grant N00014-23-1-2526, gifts from Google, Adobe, Qualcomm and Rembrand, the Ronald L. Graham Chair, and the UC San Diego Center for Visual Computing.

BibTeX