You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao∗, Haoge Deng∗, Zhengxiong Luo, Tiejun Huang, Lulu Tang†, Xinlong Wang†

Beijing Academy of Artificial Intelligence (BAAI)

Abstract

Recent 3D generation models typically rely on limited-scale 3D ‘gold-labels’ or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data: You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition, a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single-view and sparse-view reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Additionally, our model naturally supports other image-conditioned 3D creation tasks, such as 3D editing, without further fine-tuning. Please refer to our project page at: https://vision.baai.ac.cn/see3d.

1 Introduction

Recent advances in 3D generation are essential for fields like virtual reality, entertainment, and simulation, offering the potential not only to recreate intricate real-world structures but also to expand human imagination. Nevertheless, developing these models is constrained by the scarcity and high cost of accessible 3D datasets. Although recent industry efforts [94, 125, 117] have created extensive proprietary 3D assets, these initiatives come with substantial financial and operational burdens. Currently, building such a large-scale 3D dataset for academia remains prohibitively expensive. This motivates us to pursue a scalable, accessible, and affordable data source that can compete with advanced closed-source solutions, thereby enabling the broader research community to train high-performance 3D generation models.

Human perception of the 3D world does not rely on a specific 3D representation (e.g., point clouds [19], voxel grids [39], meshes [98], or neural fields [65]) or on precise camera conditions. Instead, our 3D awareness is shaped by multi-view observations accumulated throughout our lives. This raises the question: Can models similarly learn universal 3D priors from large collections of multi-view images? Fortunately, Internet videos offer a rich source of multi-view images, captured from various locations with diverse sensors and complex camera trajectories, providing a scalable, accessible, and cost-effective data source. Thus, how can we effectively learn 3D knowledge from Internet videos?

The core challenges in achieving this goal are twofold: 1) filtering relevant, 3D-aware video data from raw sources, specifically static scenes with varied camera viewpoints that provide sufficient multi-view observations; and 2) learning generic 3D priors from videos lacking explicit 3D geometry and camera pose annotations (i.e., pose-free videos). This work addresses both challenges and introduces a pose-free, visual-conditional multi-view diffusion (MVD) model, See3D, for open-world 3D creation.

Specifically, we establish a novel video data curation pipeline that automatically filters out data with dynamic content or restricted camera viewpoints from source videos. The resulting dataset, termed WebVi3D, comprises 15.99M video clips from 25.48M source videos, totaling 4.41 years in duration, which is orders of magnitude larger than previous 3D datasets such as DL3DV (0.01M) [50], RealEstate10K (0.08M) [129], MVImgNet (0.22M) [122], and Objaverse (0.8M) [15].

MVD models have recently gained widespread attention because they integrate the generative capabilities of 2D diffusion models while maintaining consistency across multiple views [80, 56, 51, 88, 128]. Typically, these models rely on precise camera poses [80, 23, 33, 54, 66, 2, 110, 111, 28, 124, 45] or warped images rendered according to camera position [121, 95] as conditional inputs to guide 3D-consistent view generation. We refer to these conditions, derived from pose or 3D annotations, as 3D-inductive conditions. However, annotating web-scale videos is prohibitively costly, or even intractable in some cases, posing significant challenges for scaling. To address this, we propose a novel, pose-free visual-condition derived from pixel-space hints within videos. It is a purely 2D-inductive visual signal, created by introducing time-dependent noise to masked input videos, free from any 3D-inductive bias. This enables training the MVD model at scale, without requiring pose annotations.

Intuitively, the proposed visual-condition can generalize effectively to tasks that rely on pixel-space hints distinct from those in videos, such as warping-based 3D generation [12, 81] and mask-based 3D editing [10], without requiring additional training (see the demo figure). For instance, in warping-based 3D generation, pixels from a reference image are rearranged through warping operations, creating a specific visual-condition that indicates camera movement. However, these warped images often exhibit artifacts or distortions, causing a significant domain gap compared to video frames. In contrast, our visual-condition is generic and can accommodate such unnatural images.

To further harness the potential of See3D, we introduce an innovative visual-conditional 3D generation framework utilizing a warping-based pipeline. This framework first constructs the visual-condition using See3D, then iteratively refines the geometry of novel views to build comprehensive scene observations. Finally, the generated images are used for Gaussian Splatting reconstruction [41, 35], which can be rendered from arbitrary viewpoints or converted into meshes through post-processing [59]. In summary, our key contributions are as follows.

Figure 1: Overview of See3D. (a) We propose a four-step data curation pipeline to select multi-view images from Internet videos, forming the WebVi3D dataset, which includes $\sim$16M video clips across diverse categories and concepts. (b) Given multiple views, we corrupt the original data into corrupted images $c_t^i$ at timestep $t$ by applying random masks and time-dependent noise. We then reweight the guidance of $c_t^i$ and the noisy latent $x_t^i$ for the diffusion model to form the visual-condition $v_t^i$ through a time-dependent mixture. (c) The MVD model can be trained at scale to generate multi-view images conditioned on $v_t^i$, without requiring pose annotations. Since $v_t^i$ is a task-agnostic visual signal formed through time-dependent noise and mixture, it enables the trained model to robustly adapt to various downstream tasks.

Lifting 2D Generation into 3D. Recent advances in 3D generation have been largely driven by the success of 2D diffusion models [84, 85, 31, 76], which have revolutionized image and video generation. These works typically optimize 3D representations by maximizing the likelihood evaluated by 2D diffusion priors [72, 63, 90, 103, 42, 48, 118, 87, 53]. An alternative approach uses a warping-inpainting pipeline, integrating an offline depth estimator with a 2D diffusion-based inpainting model to iteratively generate 3D content [12, 17, 119, 121, 66, 32, 97]. However, 2D priors do not readily translate into coherent 3D representations. As a result, 2D lifting-based approaches often struggle to preserve high geometric fidelity, leading to issues like multi-view inconsistency and poor global geometry [120].

Directly Learning 3D Priors. To better preserve geometric features, some works focus on directly learning 3D priors. For instance, feed-forward approaches [33, 25, 100, 47, 106, 131, 94, 91, 115, 116, 78, 60, 132, 89, 55, 11, 8, 46] take single or few views as input and directly output 3D representations using an encoder-decoder architecture, eliminating the need for an additional per-instance optimization process. Another line of research involves training diffusion models to predict 3D representations, such as point clouds [123, 67], meshes [61, 1, 38], and implicit neural representations [62, 9, 108, 125]. However, these methods generally focus on object-level generation [15, 109, 91, 132, 125], limiting their applicability to scene-level generation. Although recent research has made strides in building scene-level 3D datasets [43, 50, 3, 13], their scale remains relatively limited. The reliance on costly, limited-scale 3D datasets restricts generalization to open-world or highly imaginative scenarios. In contrast, our approach curates a large-scale, richly diverse dataset of multi-view images from Internet videos. Trained at scale, the model effectively supports both object-level and scene-level 3D creation.

Learning Multi-view Priors for 3D Generation. The MVD model inherits the generative capabilities of 2D diffusion models while capturing multi-view correlations, achieving both generalizability and 3D consistency. These merits have made it a focal point in recent 3D generation research [80, 79, 52, 121, 99, 58, 56, 26, 73, 77, 23]. However, as 2D diffusion models are typically trained on 2D datasets, they lack precise control over image pose. To address this, MVD-based approaches often train their models on images paired with camera poses [105, 54, 77, 107, 24], where poses serve as essential conditional inputs, represented by camera extrinsics [77, 80], relative poses [54, 79, 56], or Plücker rays [23, 111]. Yet pose-conditional models rely heavily on costly pose-annotated data, which restricts training to smaller 3D datasets and constrains their adaptability to out-of-distribution scenarios. In contrast, we introduce a novel visual-conditional approach that supports scalable, pose-free MVD model training for open-world 3D generation.

3 Method

The primary objective of this work is to build a robust 3D generative model by scaling up the dataset. Previous works [15, 95, 75] laboriously collect 3D data from artist-designed assets, stereo matching, or Structure from Motion (SfM), which can be costly and sometimes infeasible. In contrast, multi-view images offer a highly scalable alternative, as they can be automatically extracted from the vast and rapidly growing pool of Internet videos. By using multi-view prediction as a pretext task, we demonstrate that the learned 3D priors enable various 3D creation applications, including single-view generation, sparse-view reconstruction, and 3D editing in open-world scenarios.

The following sections outline our approach (Fig.1). Sec. 3.1 details the data curation pipeline, which selects static scenes with sufficient multi-view observations from raw video footage. Sec. 3.2 introduces our visual-conditional multi-view diffusion model, which effectively learns general 3D priors from pose-free videos. Finally, Sec. 3.3 demonstrates a new visual-conditional 3D generation framework utilizing a warping-based pipeline.

3.1 Video Data Curation

High-quality, large-scale video data rich in 3D knowledge is essential for learning accurate and reliable 3D priors. A well-defined 3D-aware video clip should exhibit two key properties: 1) temporally static scene content and 2) significant viewpoint variation caused by the camera’s ego-motion. The first property ensures consistent 3D geometry across different viewpoints, while dynamic content can distort scene geometry and introduce biases that may degrade generation performance (Fig. 2a-Row1). The second property guarantees sufficient 3D observations from diverse viewpoints. When the model is trained on videos with limited viewpoint variation (Fig. 2a-Row2), it risks focusing on views adjacent to the reference view, rather than developing a comprehensive 3D understanding.

To obtain a massive volume of 3D data, we collect approximately 25.48M open-sourced raw videos, totaling 44.98 years from the Internet, covering a wide range of categories, such as landscapes, drones, animals, plants, games, and actions. Specifically, our dataset is sourced from four websites: Pexels [69], Artgrid [36], Airvuz [70], and Skypixel [96]. We follow Emu3 [102] to split the videos with PySceneDetect [7] to identify content changes and fade-in/out events. Additionally, we remove clips with excessive text using PaddleOCR [92]. The detailed composition of our WebVi3D dataset is presented in Tab. 1.

Table 1: WebVi3D Dataset. Sourced from four open websites, we curate $\sim$2.30M videos, which are divided into 15.99M clips featuring temporally static scenes with a large range of viewpoints.

However, identifying 3D-aware videos presents a nontrivial challenge. As most videos are derived from real-world footage, they often contain dynamic scenes or only small camera movements. To address this, we propose a pipeline that automatically selects relevant, high-quality 3D-aware data (i.e., multi-view images) by leveraging priors from instance segmentation [29], optical flow [93], and pixel tracking [40]. This pipeline comprises four core steps:

a) Temporal-Spatial Downsampling. To improve data filtering efficiency, we first downsample each video clip both temporally and spatially. The final resolution is set to 480p, and the temporal downsampling rate is set to 2. Note that this downsampling operation is applied only during data curation, not during model training.

b) Semantic-Based Dynamic Recognition. We employ the instance segmentation model, Mask R-CNN [29], to generate motion masks for potential dynamic objects, such as humans, animals, and sports equipment. A threshold is applied to filter out videos based on the proportion of frames containing these objects, as they are more likely associated with dynamic scenes.

c) Flow-Based Dynamic Filtering. To precisely filter out videos with dynamic regions, we use offline optical flow estimation [93] to obtain dense matching, which enables us to identify dynamic motion masks in video frames. These masks are then analyzed based on their locations to further determine whether the video contains dynamic content.

d) Tracking-Based Small Viewpoint Filtering. The previous three steps yield videos with static scenes. To further ensure that these videos contain multi-view images captured over a sufficiently large range of camera viewpoints, we track the motion trajectories of key points across frames and calculate the radius of the minimum enclosing circle of each trajectory. Videos with a small trajectory radius are then filtered out. More details about the data curation pipeline are provided in Appendix B.
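The viewpoint filter in step d) can be sketched as follows. The code computes the radius of the smallest circle that encloses one tracked key point's 2D trajectory and keeps a clip only if that radius is large enough. The brute-force search, the function names, the median aggregation over tracks, and the `min_radius` value are all illustrative assumptions rather than the criterion used to build WebVi3D.

```python
import math
from itertools import combinations
from statistics import median

def _circle_from_two(a, b):
    # Circle whose diameter is the segment a-b: (cx, cy, r).
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, math.dist(a, b) / 2)

def _circle_from_three(a, b, c):
    # Circumscribed circle of a triangle; None if the points are collinear.
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy, math.dist((ux, uy), a))

def trajectory_radius(points, eps=1e-9):
    """Radius of the minimum enclosing circle of 2D points (brute force, O(n^4))."""
    pts = [(float(x), float(y)) for x, y in points]
    if len(pts) < 2:
        return 0.0
    # The optimal circle is always defined by two or three of the points.
    candidates = [_circle_from_two(a, b) for a, b in combinations(pts, 2)]
    for tri in combinations(pts, 3):
        circle = _circle_from_three(*tri)
        if circle is not None:
            candidates.append(circle)
    radii = [r for (cx, cy, r) in candidates
             if all(math.dist((cx, cy), p) <= r + eps for p in pts)]
    return min(radii)

def has_enough_viewpoint_change(tracks, min_radius=20.0):
    # Hypothetical rule: keep the clip if the median track radius is large enough.
    return median(trajectory_radius(t) for t in tracks) >= min_radius
```

The brute-force search is adequate for the short, subsampled trajectories of a single clip; a linear-time method such as Welzl's algorithm would be the natural replacement at scale.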

Figure 2: (a-Row1): Dynamic content modifies scene geometry across views; (a-Row2): Limited camera movement provides insufficient multi-view observations; (b) Our WebVi3D comprises static scenes with diverse camera trajectories.

Finally, we curate approximately 320M multi-view images from 15.99M video clips with static content and sufficient multi-view observations (see Fig.2b). To validate the effectiveness of our data acquisition method, we randomly select 10,000 video clips for human annotation, of which 8,859 were labeled as 3D-aware, representing 88.6% of the total. This indicates that our pipeline effectively identifies 3D-aware videos from massive source videos. As the volume of Internet videos continues to grow, this pipeline can continuously acquire more 3D-aware data, allowing for ongoing expansion of our dataset.

3.2 Visual Conditional Multi-View Diffusion Model

Preliminary.

Diffusion models [84, 85, 31] operate by perturbing the training data $X_0 \sim q(X_0)$ through a forward diffusion process and learning to reverse it. The forward diffusion process $X_t \sim q_{t|0}(X_t \mid X_0)$ can be formally represented by $X_t = \sqrt{\bar{\alpha}_t}\, X_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, where $\bar{\alpha}_t$ is given by the variance schedule of the noise scheduler. In theory, $X_t$ approximates an isotropic Gaussian distribution for sufficiently large timesteps $t$. The training objective is to learn the reverse process.
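As a concrete illustration of the forward process above, the following minimal sketch samples $X_t$ from $X_0$ in closed form. The linear beta schedule, the number of steps, and the function names are assumptions chosen for illustration; the schedule actually used by the model is not specified in this section.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (an assumption); alpha_bar_t is the running product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    # X_t = sqrt(alpha_bar_t) * X_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8, 8))           # toy stand-in for N = 4 latent frames
x_early, _ = q_sample(x0, 10, alpha_bar, rng)   # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure Gaussian noise
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, the signal term vanishes at large timesteps, which is why $X_t$ approaches an isotropic Gaussian.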

Objective. We aim for multi-view prediction: generating novel views along specified camera trajectories from a single or sparse input while ensuring consistency with the input appearance. The MVD model inherits the generalizability of the 2D diffusion model while capturing cross-view consistency, which naturally aligns with our goal. Following this line, we present See3D, a pose-free, visual-conditional MVD model trained on Internet videos to enable robust 3D generation, as shown in Fig.1.

Challenge. The main technical challenge lies in learning precise camera control from pose-free videos. Previous works commonly incorporate camera parameters for both input and target views into diffusion models to guide multi-view generation from specified viewpoints. However, training these models generally requires expensive 3D data with precise camera pose annotations, which limits scalability. To address this, we explore an alternative approach that conditions on 2D-inductive visual hints to implicitly control camera movement during training, thereby avoiding the need for hard-to-obtain camera trajectories.

Formulation. Formally, we propose training the MVD model conditioned on 2D-inductive visual signals, referred to as visual-condition, without incorporating camera parameters. This task can be formulated as designing a conditional distribution, achieved by a conditional diffusion model that minimizes:

$$\mathbb{E}_{X_0, Y_0, \epsilon, t}\left[\left\lVert \epsilon_\theta(X_t, Y_0, V, t) - \epsilon \right\rVert_2^2\right], \tag{1}$$

where $X_t$ denotes the noisy latent. $X_0 = \{x_0^i\}_{i=1}^{N}$ represents a multi-view observation of 3D content, formed by sampling one clip from WebVi3D as described in Section 3.1, with $N = S + L$ being the number of frames in each clip. From $X_0$, $S$ frames are randomly selected as reference views, denoted $Y_0 = \{y_0^i\}_{i=1}^{S}$, while the remaining $L$ frames are treated as target images, denoted $G = \{g^i\}_{i=1}^{L}$. Our approach focuses on constructing the visual-condition $V$, which guides the diffusion model to generate plausible 3D content estimates from target viewpoints, ensuring consistency with the appearance of $Y_0$.

3.2.1 Principle of Visual-Condition

A desirable visual-condition should meet the following criteria: a) it can be constructed without the need for additional 3D annotations, b) it is independent of specific downstream tasks, and c) it offers sufficient generalization to support various task-specific visual conditions, enabling precise control of camera movements.

Ideally, this visual-condition can be derived from pixel-space hints within the original videos, implicitly guiding the model to learn camera control. Moreover, it should be robust enough to handle domain gaps between task-specific visual cues and pixels extracted from video data. For example, in warping-based generation, warped images often suffer from issues like self-occlusions, artifacts, and distortions, creating a significant gap compared to real video data as shown in Fig.5 and Fig.4.

3.2.2 Time-dependent Visual Condition

Building on the analysis above, we propose constructing the visual-condition by applying masks, noise, and mixture to the input video data.

Random Masking: We first corrupt the target images $G$ through random irregular masking to reduce reliance on direct pixel-space visual signals, helping the model partially mitigate the domain gap between task-specific visual cues and video data. Meanwhile, we keep the reference images $Y_0$ clean to provide effective appearance signals.

Time-dependent Noise: We further add noise to video data to approximate a Gaussian distribution. For downstream tasks, task-specific visual inputs are similarly noised, aligning their distributions with this Gaussian profile and further bridging the gap between video data and task-specific inputs. A key challenge lies in determining the optimal noise level: excessive noise weakens conditional signals, resulting in poor visual quality and inaccurate camera control, whereas insufficient noise preserves too many details from the target images, causing the model to over-rely on visual hints from the video data.

Previous studies [66, 83, 34, 127, 18] have explored modulating noise levels by adding noise to input data. Notably, as pointed out in previous work [127], diffusion models tend to over-rely on the conditional image at larger timesteps, leading to signal leakage. Inspired by this observation [127], we introduce time-dependent noise to the corrupted target images. In addition, we develop a function $t' = f(t)$ to regulate signal leakage, preventing excessive noise from completely obscuring visual cues and disabling camera control. Specifically, we define:

$$C_t = \sqrt{\bar{\alpha}_{t'}}\,(1 - M)\, X_0 + \sqrt{1-\bar{\alpha}_{t'}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}). \tag{2}$$

Here, $f$ is a strictly monotonically increasing function with $t' = f(t) < t$, so that $C_t$ contains at least as much information as $X_t$ at earlier timesteps. $\bar{\alpha}_{t'}$ are the variances used in DDIM [85]. A detailed explanation of $f(t)$ can be found in Appendix C.3.
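The corruption in Eq. 2 can be sketched as below. Since the actual $f(t)$ is defined in Appendix C.3 and not reproduced here, the code uses a placeholder $f(t) = \gamma t$ with $0 < \gamma < 1$, which is strictly increasing and satisfies $t' < t$ for $t > 0$. The linear beta schedule, the interpolation of $\bar{\alpha}$ at non-integer $t'$, and all names are likewise assumptions.

```python
import numpy as np

def shifted_timestep(t, gamma=0.5):
    # Placeholder for f(t): strictly increasing, with f(t) < t for t > 0 (NOT the paper's f).
    return gamma * t

def corrupted_condition(x0, mask, t, alpha_bar, rng, gamma=0.5):
    """Eq. 2: C_t = sqrt(a) * (1 - M) * X_0 + sqrt(1 - a) * eps, with a = alpha_bar at t' = f(t).

    mask uses 1 for masked-out pixels and 0 for visible ones.
    """
    t_prime = shifted_timestep(t, gamma)
    # Interpolate the schedule because t' is generally not an integer.
    a = np.interp(t_prime, np.arange(len(alpha_bar)), alpha_bar)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * (1.0 - mask) * x0 + np.sqrt(1.0 - a) * eps

# Toy usage with an assumed linear beta schedule.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 8, 8))                      # N frames of toy data
mask = (rng.random((4, 1, 8, 8)) < 0.3).astype(float)       # random mask, broadcast over channels
c_t = corrupted_condition(x0, mask, t=600, alpha_bar=alpha_bar, rng=rng)
```

Because $t' < t$, the signal coefficient $\sqrt{\bar{\alpha}_{t'}}$ of the condition is larger than that of the noisy latent at the same step, so visible pixels in $C_t$ remain informative even at large $t$.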

Time-dependent Mixture: However, as $t$ decreases, lower noise levels increase the risk of signal leakage, causing a domain gap between the video data and task-specific visual condition distributions. To address this issue, we propose gradually replacing the corrupted data $C_t$ with the noisy latent variables $X_t$ as the timestep decreases. This encourages the model to rely more on pixel-space signals from video data at larger timesteps, and to transition to $X_t$ at smaller timesteps. To achieve this, we further introduce a weighting factor $W_t \in [0, 1]$, which decreases monotonically as the timestep $t$ decreases, to combine $C_t$ and $X_t$. Formally, our final visual-condition is defined as:

$$V_t = [\,W_t \ast C_t + (1 - W_t) \ast X_t;\ M\,], \tag{3}$$

where $M = \{m^{0:S} \cup m^{S+1:N}\}$, with $m^{0:S}$ as a zero matrix, keeping the reference images $Y_0$ unmasked, and $m^{S+1:N}$ as random irregular masks applied to the target images $G$. $V_t = \{v_t^i\}_{i=1}^{N}$ represents a mixture of $C_t$ and $X_t$, concatenated with the masks $M$ along the channel dimension. In practice, an additional processing step assigns $v_t^{0:S}$ to the reference images $Y_0$ directly, in order to inject the clean information of $Y_0$ into the model, facilitating alignment between the predicted images and the reference images. Consequently, Eq. 1 can be reformulated as $\mathbb{E}_{X_0, Y_0, \epsilon, t}\left[\left\lVert \epsilon_\theta(X_t, Y_0, V_t, t) - \epsilon \right\rVert_2^2\right]$. A more detailed definition of $W_t$ is provided in Appendix C.3.
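A minimal sketch of the mixture in Eq. 3 follows. The true $W_t$ is defined in Appendix C.3; here a linear ramp stands in for it, chosen so that the weight on $C_t$ shrinks as $t$ decreases, in line with the description above. The array layout and all function names are assumptions.

```python
import numpy as np

def mixing_weight(t, num_steps):
    # Placeholder for W_t in [0, 1]: near 0 at small t, 1 at the largest t (NOT the paper's W_t).
    return t / (num_steps - 1)

def visual_condition(c_t, x_t, mask, y0, ref_idx, t, num_steps):
    """Eq. 3: V_t = [W_t * C_t + (1 - W_t) * X_t ; M].

    c_t, x_t: (N, C, H, W); mask: (N, 1, H, W); y0: (S, C, H, W); ref_idx: indices of the S references.
    """
    w = mixing_weight(t, num_steps)
    mixed = w * c_t + (1.0 - w) * x_t
    mixed[ref_idx] = y0                              # reference views are injected clean
    return np.concatenate([mixed, mask], axis=1)     # append M along the channel axis
```

With this placeholder, the condition equals the corrupted data $C_t$ at the last timestep and the noisy latent $X_t$ at $t = 0$, so the pixel-space hint fades out exactly where signal leakage would otherwise be strongest.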

3.2.3 Model Architecture

Our model architecture is based on a video diffusion model [6]. However, we remove the time embedding, as we aim for the model to control the camera movement purely through visual conditions, rather than inferring movement trends based on temporal cues. To further minimize the effect of temporality, we shuffle the frames in each video clip, treating the data as an unordered set $X_0$. Specifically, we randomly select a subset of frames from a video clip as reference images, with the remaining frames as target images. The number of reference images is randomly selected to accommodate different downstream tasks. The multi-view diffusion model is optimized by calculating the loss only on the target images, as described in Eq. 1. Additional details regarding the model architecture, including the design of self-attention layers, Zero-Initialize, trainable parameters, noise schedule, and cross-attention, can be found in Appendix C.1.
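The data handling described in this paragraph (frame shuffling, a random number of reference views, and a loss restricted to target views) can be sketched as follows. The upper bound `max_ref`, the convention of placing references first after shuffling, and the function names are hypothetical choices made for illustration.

```python
import numpy as np

def sample_training_views(clip, rng, max_ref=3):
    """Shuffle a clip of shape (N, ...) and mark which frames are targets.

    Returns the shuffled clip and a boolean vector that is False for the S
    reference frames and True for the L = N - S target frames.
    """
    n = clip.shape[0]
    shuffled = clip[rng.permutation(n)]                          # discard temporal order
    num_ref = int(rng.integers(1, min(max_ref, n - 1) + 1))      # random S with 1 <= S <= max_ref
    is_target = np.arange(n) >= num_ref                          # first S shuffled frames are references
    return shuffled, is_target

def target_only_mse(eps_pred, eps_true, is_target):
    # Noise-prediction loss of Eq. 1, averaged over target views only.
    diff = eps_pred[is_target] - eps_true[is_target]
    return float(np.mean(diff ** 2))
```

Excluding reference views from the loss matters because they are fed to the model clean; including them would reward the network for copying its input rather than for predicting unseen views.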

3.3 Visual Conditional 3D Generation


Figure 3: See3D for Multi-View Generation: From iteratively generated views (brown camera), we randomly select a few anchor views (yellow stars) to guide the generation of target views along the gray camera trajectory. Keypoint matching is first performed to establish correspondences between the anchor views. Next, monocular depth estimation is applied to the latest anchor view, followed by our Iterative Sparse Pixel-Wise Depth Alignment to refine the depth and recover a dense map. This dense depth is then used to warp images along the gray camera viewpoints. Subsequently, the warped images and anchor images are combined and processed according to Eq.2 and Eq.3, without random masking, forming the visual-condition, which guides the MVD model to produce 3D-consistent target views. Finally, the gray camera turns to brown, guiding multi-view generation in the next iteration.

Overview. This section demonstrates the application of See3D for domain-free 3D generation, supporting long-sequence novel view synthesis with complex camera trajectories. Starting with one or a few input views, we iteratively generate warped images as visual hints, guided by predefined camera poses and estimated global depth [5]. See3D is then utilized to generate novel views along the predefined camera trajectory, conditioned on the proposed visual-condition. This iterative pipeline is illustrated in Fig.3, where the brown cameras represent the already generated views, and the gray cameras indicate the target views we aim to generate.

Challenge. Recent warping-based 3D generation approaches [12, 22, 44] rely on monocular depth or point clouds, and perform global point-cloud alignment to recover the actual geometry for subsequent generations. However, as the reference view often provides a limited scene observation, using offline methods tends to suffer from scale ambiguity and geometric estimation errors. Moreover, previous methods often overlook correcting these geometric errors, leading to distortions and stretching artifacts. These errors accumulate during iterative generation, severely degrading the generation quality. To address this, we propose an iterative strategy with sparse pixel-wise depth alignment, comprising two core steps: pixel-wise depth scale alignment and global metric depth recovery.

Pixel-wise Depth Scale Alignment. We introduce pixel-wise depth scale alignment using sparse keypoints. This approach performs high-degree-of-freedom independent optimization for all keypoints by leveraging multi-view matching priors from anchor views. Each keypoint independently identifies its multi-view correspondences, allowing for the recovery of both depth scale and surrounding geometry. The corrected scale is then propagated across the entire depth map using 2D distances between keypoints and their neighbors.

Specifically, let $\{T_{i}\}_{i=0}^{N}$ denote the predefined camera trajectory. Assuming we have generated $n$ images $\{I_{i}\}_{i=0}^{n}$, we now proceed to generate the next $m$ views using the warped image from the last anchor view $I_{n}$, referred to as the source view. We first utilize the pre-trained MoGe [101] to estimate the affine-invariant depth $\bm{\hat{D}}_{n}$ of $I_{n}$. Inspired by [112], we perform sparse alignment with $1024$ pairs of matching keypoints $\{\mathbf{m}_{n},\mathbf{m}_{i}\}_{k}$, obtained by the pre-trained extractor SuperPoint [16] and feature matcher LightGlue [49]. For each matched point, we optimize the corresponding scale $\alpha^{k}$ and shift $\beta^{k}$ parameters, where $k\in[0,1024]$. Our core idea is to recover the depth scaling by minimizing the $L_{2}$ distance of re-projection between matching points. For each iteration, the warping operation $\Pi_{n\rightarrow i}$ transforms pixels from the source image's coordinate frame to the target image's coordinate frame, formulated as $\Pi_{n\rightarrow i}(\hat{d}_{n})=\hat{d}_{n}K_{i}T_{i}T_{n}^{-1}K_{n}^{-1}$, where $K_{i},K_{n},T_{i},T_{n}$ represent the intrinsic and extrinsic parameters of the source and target frames, respectively. The alignment for each pair is performed using normalized coordinates, ensuring that the warping aligns with the matching prior:

$$\alpha^{k*},\beta^{k*}=\mathop{\mathrm{arg\,min}}\limits_{\alpha^{k},\beta^{k}}\left\lVert\hat{d}_{n}^{k*}K_{i}T_{i}T_{n}^{-1}K_{n}^{-1}m_{n}^{t}-m_{i}^{t}\right\rVert_{2}^{2},\tag{4}$$

where the recovered depth of the $k$-th pixel is $\hat{d}_{n}^{k*}=\alpha^{k}\odot\hat{d}_{n}^{k}+\beta^{k}$, and $\odot$ is the pixel-wise Hadamard product. We minimize the matching loss via gradient descent to obtain the best scale $\alpha^{k*}$ and shift $\beta^{k*}$ parameters for each pixel. By performing individual scale recovery and geometry correction, we decouple the depth correlation among different points, achieving accurate single-view reconstruction.
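To make the per-keypoint optimization of Eq.4 concrete, the following is a minimal sketch for a single keypoint and a single target view. It is not the authors' implementation. It assumes normalized image coordinates (so the intrinsics drop out), a relative pose given as a rotation `R` and translation `t` (standing in for $T_{i}T_{n}^{-1}$), and hand-picked values for the hypothetical `lr` and `steps` parameters. With only one correspondence, only the combined depth $\alpha^{k}\hat{d}_{n}^{k}+\beta^{k}$ is determined, not the two parameters individually.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def align_keypoint_depth(m_src, m_tgt, d_hat, R, t, lr=1.0, steps=500):
    """Fit scale alpha and shift beta for ONE keypoint by gradient descent.

    m_src, m_tgt: matched keypoints in normalized image coordinates (x, y).
    d_hat: affine-invariant depth predicted at m_src.
    R, t: relative pose mapping source-camera points into the target camera.
    """
    alpha, beta = 1.0, 0.0
    ray = [m_src[0], m_src[1], 1.0]
    a = mat_vec(R, ray)                  # derivative of the 3D point w.r.t. depth
    for _ in range(steps):
        d = alpha * d_hat + beta         # corrected depth
        X = [d * a[i] + t[i] for i in range(3)]
        rx = X[0] / X[2] - m_tgt[0]      # re-projection residuals
        ry = X[1] / X[2] - m_tgt[1]
        # Chain rule: derivative of each residual w.r.t. the corrected depth.
        drx = (a[0] * X[2] - X[0] * a[2]) / X[2] ** 2
        dry = (a[1] * X[2] - X[1] * a[2]) / X[2] ** 2
        g = 2.0 * (rx * drx + ry * dry)  # dL/d(depth)
        alpha -= lr * g * d_hat          # dL/dalpha = g * d_hat
        beta -= lr * g                   # dL/dbeta  = g
    return alpha, beta
```

For example, with an identity rotation, a translation of 0.5 along x, and a keypoint whose true depth is 2.0 but whose predicted depth is 1.0, the corrected depth `alpha * d_hat + beta` moves toward 2.0 as the re-projection residual shrinks.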

Global Metric Depth Recovery. After that, we set these recovered positions as sparse guidance $\hat{d}_{n}^{*}$, and introduce Locally Weighted Linear Regression [112] (marked as LWLR in Fig.3) to recover the whole depth map based on the relative locations of the guided points and the other target points. Let $(u,v)$ denote the 2D positions of the remaining target points; their depth $\bm{\hat{D}}_{n}$ can be fitted to the sparse guided points by minimizing the squared locally weighted distance, which is reweighted by the diagonal weight matrix:

$$\begin{gathered}\bm{W}_{u,v}=\mathrm{diag}(w_{1},w_{2},\ldots,w_{m}),\\ w_{i}=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\mathrm{dist}_{i}^{2}}{2b^{2}}\right),\end{gathered}\tag{5}$$

where $b$ is the bandwidth of the Gaussian kernel, and $\mathrm{dist}$ is the Euclidean distance between the guided point and the underestimated target point. Denoting by $\bm{X}$ the homogeneous representation of $\bm{\hat{D}}_{n}$, the scale map $\bm{S}_{scale}$ and shift map $\bm{S}_{shift}$ of target points can be calculated by iterating over every location in the whole image, which can be formulated as:

$$\begin{gathered}\min_{\bm{\beta}_{u,v}}(\hat{d}_{n}^{*}-\bm{X}\bm{\beta}_{u,v})^{\mathsf{T}}\bm{W}_{u,v}(\hat{d}_{n}^{*}-\bm{X}\bm{\beta}_{u,v})+\lambda\bm{S}_{shift}^{2},\\ \bm{\hat{\beta}}_{u,v}=(\bm{X}^{\mathsf{T}}\bm{W}_{u,v}\bm{X}+\lambda)^{-1}\bm{X}^{\mathsf{T}}\bm{W}_{u,v}\hat{d}_{n}^{*},\\ \bm{\beta}_{u,v}=[\bm{S}_{scale},\bm{S}_{shift}]_{u,v}^{\mathsf{T}},\\ \bm{D}_{n}=\hat{d}_{n}^{*}\oplus\bm{S}_{scale}\odot\bm{\hat{D}}_{n}+\bm{S}_{shift},\end{gathered}\tag{6}$$

where $\bm{D}_{n}$ is the scaled whole depth map, $\oplus$ is the concatenation operator, and $\lambda$ is an $l_{2}$ regularization hyperparameter that restricts the solution to be simple. In addition, the explicit constraint of the source frame with the target frames allows each novel view to maintain contextual consistency from preceding generations.
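The closed-form solve in Eq.6 can be illustrated for a single query pixel. The sketch below is an assumption-laden example rather than the authors' code: it uses the Gaussian weights of Eq.5, treats each row of $\bm{X}$ as `[relative_depth, 1]`, and, following the objective (which penalizes only the shift), adds $\lambda$ to the shift entry of the normal equations. The function name, argument names, and default bandwidth are hypothetical.

```python
import math


def lwlr_scale_shift(query, guide_xy, guide_rel, guide_aligned, b=50.0, lam=0.0):
    """Fit aligned = scale * rel + shift at one query pixel with Gaussian weights.

    query: (u, v) position of the target pixel.
    guide_xy: 2D positions of the sparse guided points.
    guide_rel: affine-invariant depth at the guided points.
    guide_aligned: aligned depth at the guided points (the sparse guidance).
    b: Gaussian bandwidth; lam: regularization on the shift (both assumed values).
    """
    sw = swx = swxx = swy = swxy = 0.0
    for (gx, gy), x, y in zip(guide_xy, guide_rel, guide_aligned):
        dist2 = (gx - query[0]) ** 2 + (gy - query[1]) ** 2
        w = math.exp(-dist2 / (2.0 * b * b)) / math.sqrt(2.0 * math.pi)  # Eq. 5
        sw += w
        swx += w * x
        swxx += w * x * x
        swy += w * y
        swxy += w * x * y
    # Normal equations: [[swxx, swx], [swx, sw + lam]] @ [scale, shift] = [swxy, swy]
    a11, a12, a22 = swxx, swx, sw + lam
    det = a11 * a22 - a12 * a12  # zero if all guided depths are identical and lam == 0
    scale = (a22 * swxy - a12 * swy) / det
    shift = (a11 * swy - a12 * swxy) / det
    return scale, shift
```

Repeating this solve at every pixel location yields the dense scale and shift maps; nearby guided points dominate each local fit because their weights decay with squared distance.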

Novel View Generation. After obtaining the aligned depth $\bm{D}_{n}$, we generate target visual hints through warping as $\hat{I}_{j}=\Pi_{n\rightarrow j}(\bm{D}_{n})$. The warped images $\{\hat{I}_{j}\}_{j=n}^{n+m}$ contain unfilled regions, as indicated by the binary warping mask $\{M_{j}\}_{j=n}^{n+m}$, providing strong visual hints for See3D to perform novel view generation. To ensure strong multi-view consistency between the newly generated sequence and the previous content, we randomly select $k$ anchor views $\{I_{k}\},k\in[1,N]$ from the earlier generated frames to guide subsequent generation. The generation process is formulated as $I_{j}=\textbf{See3D}(\hat{I}_{j},M_{j},\{I_{0},I_{k}\})$. We iteratively perform depth estimation, alignment, warping, and generation until all predefined multi-view images are obtained.
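The depth-based warping that produces a warped image and its binary mask can be sketched with a simple forward warp and z-buffer. This is an illustrative toy version on nested lists, not the pipeline's actual renderer. The shared pinhole intrinsics `K = (fx, fy, cx, cy)`, the relative pose `(R, t)`, nearest-pixel splatting, and the mask polarity (1 = filled, 0 = unfilled) are all assumptions of the example.

```python
def forward_warp(image, depth, K, R, t):
    """Forward-warp a grayscale image into a target view; returns (warped, mask).

    image, depth: H x W nested lists. K = (fx, fy, cx, cy), shared by both views.
    R, t: relative pose mapping source-camera points into the target camera.
    mask[v][u] is 1 where a source pixel landed and 0 for unfilled regions.
    """
    fx, fy, cx, cy = K
    H, W = len(image), len(image[0])
    warped = [[0.0] * W for _ in range(H)]
    mask = [[0] * W for _ in range(H)]
    zbuf = [[float("inf")] * W for _ in range(H)]
    for v in range(H):
        for u in range(W):
            d = depth[v][u]
            # Unproject pixel (u, v) to a 3D point in the source camera.
            P = [(u - cx) / fx * d, (v - cy) / fy * d, d]
            # Move the point into the target camera frame.
            Q = [sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3)]
            if Q[2] <= 0:
                continue  # behind the target camera
            ut = int(round(fx * Q[0] / Q[2] + cx))
            vt = int(round(fy * Q[1] / Q[2] + cy))
            # Keep the closest surface when several pixels land on the same spot.
            if 0 <= ut < W and 0 <= vt < H and Q[2] < zbuf[vt][ut]:
                zbuf[vt][ut] = Q[2]
                warped[vt][ut] = image[v][u]
                mask[vt][ut] = 1
    return warped, mask
```

Pixels that receive no source sample remain zero in both outputs; these are the unfilled regions that the generative model is expected to complete.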

3D Reconstruction. We reconstruct the 3D scene using 3D Gaussian Splatting (3DGS) [41]. The training objective is to minimize the sum of photometric loss and SSIM loss, consistent with the original 3DGS approach. Additionally, we introduce a perceptual loss (LPIPS [126]) to mitigate subtle inter-frame discrepancies in multi-view generated images during 3DGS reconstruction. LPIPS emphasizes higher-level semantic consistency between Gaussian-rendered and generated multi-view images, rather than focusing on minor high-frequency differences. Furthermore, the potential inner-frame diversity may lead to inconsistencies with the corresponding camera poses. Following [20], we implement joint pose-Gaussian optimization, treating camera parameters as learnable variables alongside Gaussian attributes, thereby reducing gaps between generated viewpoints and their corresponding camera poses.

4 Experiments

In Sec. 4.1 and Sec. 4.2, we present single-view and sparse-view reconstruction with See3D as the prior. Next, we conduct ablation experiments in Sec. 4.3 to validate the effectiveness of the proposed modules. Additional implementation details, more results on open-world 3D creation, and further ablation experiments are provided in the Appendix.

Table 2: Quantitative Comparison of Single/Sparse Views Generation. The top rows are results given single view as input, where ViewCrafter∗ indicates our re-implemented result. The bottom rows are novel view rendering quality given 3 views as input, where Zip-NeRF† and ZeroNVS† are modified versions with sparse views input as reported in CAT3D.

4.1 Single View to 3D

Experimental Setting. See3D supports multi-view generation from a single input view. Following prior work [121], our evaluation is conducted on the test split of three real-world datasets with various camera trajectories: Tanks-and-Temples [43], RealEstate10K [129], and CO3D [75]. We follow the approach in ViewCrafter [121] for constructing easy/hard evaluation sets based on different sampling rates applied to the original videos. We re-implement ViewCrafter using the official code released by [121] to validate our easy/hard set splitting, with results shown as ViewCrafter* in Tab. 2. We compare against the warping-based baseline LucidDreamer [12], the camera-conditional video generation model MotionCtrl [104], the warped-image-conditional ViewCrafter [121], and the multi-view diffusion model ZeroNVS [77]. For fair comparison, we use the same point cloud rasterization as proposed in ViewCrafter [121], instead of depth-based warping, to generate visual conditions. Following [121], we evaluate only the visual quality of images generated by multi-view diffusion without rendering novel views through 3D reconstruction. We report PSNR, SSIM, and LPIPS [126] as evaluation metrics. Among these, PSNR is a traditional pixel-level metric that measures image similarity, which is significantly affected by viewpoint shifts. As such, PSNR reflects the accuracy of viewpoint control provided by our proposed visual-condition in multi-view generation.
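Since PSNR is used here as a proxy for viewpoint accuracy, a short sketch of the standard definition, $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$, may help. The function below operates on flat lists of pixel values in $[0, \texttt{max\_val}]$; the name and the flat-list input format are illustrative choices, not part of the evaluation code used in the paper.

```python
import math


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because the error is computed pixel by pixel, even a small camera offset misaligns the whole image and lowers the score, which is why the metric is sensitive to viewpoint shifts.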

Results. The quantitative comparison results are presented in the top rows of Tab. 2. Only average metrics for the easy and hard sets are reported here; detailed values are available in Appendix D.1. The results for ViewCrafter* are comparable to those reported in its original paper, confirming successful alignment between our method and the baselines. Numerically, our approach outperforms all baseline methods across all metrics. Specifically, compared to the re-implemented ViewCrafter, our approach achieves a 4.63 dB improvement, demonstrating its capability to generate high-quality novel views. PSNR further demonstrates significant gains, indicating our proposed visual-condition enables precise camera control. Qualitative results are shown in the top rows of Fig. 5. See3D generates high-quality, realistic content within minutes. Despite limited visual cues provided by the warped images, our method produces more reliable and realistic results with fewer artifacts.

4.2 Sparse Views to 3D

Experimental Setting. We extend our model to the sparse-view reconstruction task, evaluating it on three datasets: LLFF [64], DTU [37], and Mip-NeRF 360 [3]. We compare our method against several few-shot 3D reconstruction baselines, including the optimization-based methods MuRF [113], FSGS [130], and BGGS [27]; the diffusion-based methods CAT3D [23], ZeroNVS (modified to handle multi-view input) [77], and ReconFusion [107]; as well as the feed-forward method DepthSplat [114]. Following the evaluation protocols from [68, 130, 107], we use 3, 6, and 9 views as input. For few-shot reconstruction, dense multi-view images are generated from sparse views, similar to CAT3D [23], and 3DGS reconstruction is performed with pose optimization to render test views for evaluation. We report PSNR, SSIM, and LPIPS [126] to evaluate novel view synthesis performance.

Results. Quantitative and qualitative results are presented in Tab. 2 and Fig. 5, respectively, with additional comparisons for 3, 6, and 9 input views available in Appendix D.2. The 3DGS model, trained on dense multi-view images generated by See3D, surpasses state-of-the-art reconstruction models in novel view rendering. This indicates that See3D can provide high-quality, consistent multi-view support for 3D reconstruction without imposing additional constraints. Compared to ReconFusion [107] and CAT3D [23], which also leverage diffusion priors for sparse-view reconstruction, our model exhibits effective scalability. Qualitative comparisons in Fig. 5 reveal that NVS results produced by See3D exhibit fewer floating artifacts, suggesting its capability to generate more consistent and high-fidelity multi-view images.


Figure 4: Top: Qualitative ablation of the visual-condition; Bottom: Visualization of how the visual-condition changes as the timestep decreases.


Figure 5: Qualitative Comparison of Single/Sparse View Generation. The top three rows are results with a single view input. The bottom two rows are novel view renderings from 3DGS, where Ours is trained on dense multi-view generation given 3 views as input. Our method outperformed other baselines in capturing high-frequency details, such as text and stairs.

4.3 Ablation Study

Scaling up Data. We investigate the impact of training data by ablating different proportions of our training dataset. The model is trained with 10%, 20%, 40%, 80%, and 100% of the training set, and its single-view generation performance is evaluated on RealEstate10K, achieving PSNR values of 19.32, 21.04, 22.57, 24.08, and 25.01, respectively. Additionally, training with unfiltered data results in generated content that often exhibits movement or deformation, leading to a substantial performance drop with a PSNR of 19.55. We attribute this degradation to the lack of stationary and geometrically invariant properties in much of the source video content, which undermines multi-view consistency. In summary, these findings highlight the critical importance of data quality and diversity for effectively training large-scale MVD models.

Visual-condition. Excluding the benefits of data scaling, we investigate the effectiveness of our visual-condition on pose-free data. Previous work [121] has demonstrated that warped images can serve as a pivot condition to guide the model to generate the target viewpoint. However, due to the reliance on the annotated camera to control the projection and unprojection, warp-based conditions are inherently unscalable. Therefore, we compare the model’s ability to control cameras conditioned on pose-free visual-condition and conditioned on warped images. Specifically, we extract a subset of MVImageNet [122] for training and testing.

For each multi-view sequence in the training set, we select the point cloud of the first frame and render it into the subsequent 5 camera planes along the camera trajectory, based on the 3D annotations in the dataset. We obtain warped images and form pairs with the ground-truth multi-views to train an MVD model, referred to as MV-Posed. With the same experimental settings (training set, network architecture, batch size and predicted sequence length), we train an additional model without any 3D annotations, except for the modification of warp condition to the time-dependent visual-condition $V_{t}$ described in Sec.3.2, called MV-UnPoseT. Meanwhile, we employ randomly masked multiple views as condition to train the model as an additional baseline, called MV-UnPoseM.

Table 3: Ablation Study on Visual-condition.

The results are reported in Tab.3 and Fig.4, where the performance of MV-Posed and MV-UnPoseT is comparable. In contrast, MV-UnPoseM struggles to handle the gap between the warped images and the masked images in cases of geometric distortion and self-occlusion. These findings indicate that the visual-condition offers a viable alternative to 3D-reliant warped conditions. Despite a significant domain gap between $V_{t}$ and warped images as shown in Fig.4, our model robustly handles this discrepancy, thanks to the time-dependent nature of the proposed condition.

5 Conclusion

We propose a scalable 3D generation framework from the perspective of dataset scaling, offering a systematic solution that includes: 1) a new dataset, WebVi3D, curated via an automated pipeline, with the potential to evolve with the growing volume of Internet data; 2) a new model, See3D, capable of scalable training without pose annotations, aligning with the concept of ‘Get 3D by solely Seeing’; and 3) a novel See3D-based 3D generation framework that supports long-sequence view generation with complex camera trajectories. We show that the 3D priors learned by See3D enable a range of 3D creation applications, including single-view generation, sparse view reconstruction, and 3D editing in open-world scenarios. We believe See3D provides a new direction for advancing the upper bound of 3D generation through dataset scaling. We hope our efforts will encourage the 3D research community to pay more attention to large-scale unposed data, bypassing the costly 3D data barrier and chasing parity with powerful closed-source 3D solutions.

Acknowledgments.

We thank Wenyuan Zhang and Yu-Shen Liu from Tsinghua University, as well as Yance Jiao, Hua Zhou, Liao Zhang, Yaohui Chen, Jinxin Xie, Yiwen Shao, and other colleagues from BAAI, for their valuable support and contributions to the See3D project.

References


Appendix

Appendix A Broader Impact and Limitations

Broader Impact: Our model facilitates open-world 3D content creation from large-scale video data, eliminating the need for costly 3D annotations. This can make 3D generation more accessible to industries like gaming, virtual reality, and digital media. By leveraging visual data from the rapidly growing Internet videos, it accelerates 3D creation in real-world applications. However, careful consideration of ethical issues, such as potential misuse in generating misleading or harmful content, is crucial. Ensuring that the data used is curated responsibly to avoid bias and privacy concerns is vital for safe deployment.

Limitations: While our model excels at long-sequence generation, it comes with some limitations regarding: 1) Inference Speed: The model requires several minutes for inference, making it challenging for real-time applications. Future work should aim to improve inference speed for real-time generation. 2) Focus on 3D Generation: The current model focuses only on 3D generation, avoiding the modeling of object motion. Future research could extend the model to simultaneously generate 3D and 4D content for dynamic scenes. 3) Model Scalability: While the data scaling approach is effective, the scalability of the model itself has not been explored. Expanding the model’s architecture could enhance its capability to handle more complex and diverse 3D content.

Appendix B Video Data Curation

Our WebVi3D dataset is sourced from Internet videos through an automated four-step data curation pipeline. In this section, we provide further details on this pipeline process.

Step 1: Temporal-Spatial Downsampling.

To enhance data curation efficiency, we downsample each video both temporally and spatially. Temporally, we retain one frame for every two by downsampling with a factor of two. Spatially, we adjust the downsampling factor according to the original resolution to ensure consistent visual appearance across different video aspect ratios. The final resolution is standardized to 480p in our experiment.

Step 2: Semantic-Based Dynamic Recognition

We perform content recognition on each frame to identify potential dynamic regions. Following [57], we use the off-the-shelf instance segmentation model Mask R-CNN [29] to generate coarse motion masks $\mathcal{M}_m$ for potentially dynamic objects, including humans, animals, and sports activities. If motion masks are present in more than half of the video frames, the sequence is deemed likely to contain dynamic regions and is excluded from further processing.

Step 3: Flow-Based Dynamic Filtering

After filtering out videos with common dynamic objects, we implement a precise strategy to identify and exclude videos containing potential dynamic regions, such as drifting water and swaying trees. Following [57], we use the pretrained RAFT [93] to compute the optical flow between consecutive frames. Based on the optical flow, we calculate the Sampson distance, which measures the distance of each pixel to its corresponding epipolar line. Pixels exceeding a predefined threshold are marked to create a dynamic motion mask $\mathcal{M}_s$. The number of pixels in $\mathcal{M}_s$ serves as an indicator of the likelihood of motion in the current frame.
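The Sampson-distance test can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a fundamental matrix `F` is already available (the paper does not state how it is estimated), and the function names and the `threshold` argument are hypothetical.

```python
import numpy as np

def sampson_distance(F, pts1, pts2):
    """First-order distance of each correspondence to its epipolar geometry.

    F: (3, 3) fundamental matrix with x2^T F x1 = 0 for static points (assumed given).
    pts1, pts2: (N, 2) pixel coordinates in two consecutive frames.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])            # homogeneous coordinates, (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                          # row i is F @ x1_i
    Ftx2 = x2 @ F                           # row i is F^T @ x2_i
    num = np.sum(x2 * Fx1, axis=1) ** 2     # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def dynamic_mask_from_flow(F, flow, threshold):
    """Mark pixels whose flow violates the epipolar constraint.

    flow: (H, W, 2) optical flow (dx, dy), e.g. as produced by RAFT.
    threshold: hypothetical cutoff; the paper only says "predefined".
    """
    h, w = flow.shape[:2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pts1 = np.stack([u, v], axis=-1).reshape(-1, 2).astype(float)
    pts2 = pts1 + flow.reshape(-1, 2)
    return (sampson_distance(F, pts1, pts2) > threshold).reshape(h, w)
```

Summing the returned boolean mask gives the per-frame pixel count used as the motion indicator.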

However, relying solely on this metric is unreliable, as most data are captured as real-world footage, where dynamic objects of interest are often concentrated near the center of the imaging plane. These moving regions may not occupy a significant portion of the frame. Therefore, we also consider the spatial location of the dynamic mask and propose a dynamic score $\mathcal{S}$ to evaluate the motion probability for each frame. Let $H, W$ denote the height and width of an image, respectively. We define the central region as starting at $W' = 0.25 \times W$, $H' = 0.25 \times H$. The proportions of the mask occupying the entire image, $\Theta_i$, and the central area, $\Theta_c$, are calculated as:

$$\Theta_i = \frac{\sum_{u,v=0}^{W,H} \mathcal{M}_s(u,v)}{H \times W}, \qquad \Theta_c = \frac{\sum_{u,v=W',H'}^{W-W',H-H'} \mathcal{M}_s(u,v)}{H/2 \times W/2}. \tag{7}$$

The dynamic score $\mathcal{S}_i$ of a frame can be formulated as:

$$\mathcal{S}_i = \begin{cases} 2, & \Theta_i \geq 0.12 \ \&\ \Theta_c \geq 0.35 \\ 1.5, & \Theta_i \geq 0.12 \ \&\ 0.2 \leq \Theta_c < 0.35 \\ 1, & \Theta_i < 0.12 \ \&\ 0.2 \leq \Theta_c < 0.35 \\ 0.5, & \Theta_i < 0.12 \ \&\ \Theta_c < 0.2 \end{cases}. \tag{8}$$

This strategy targets the dynamic regions near the image center, enhancing data filtering accuracy. The final dynamic score $\mathcal{S}$ for the entire sequence is calculated as:

$$\mathcal{S} = \sum_{i=0}^{N} \mathcal{S}_i, \tag{9}$$

where $N$ represents the total number of extracted frames. If $\mathcal{S} \geq 0.25 \times N$, the sequence is classified as dynamic and subsequently excluded.
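As a rough illustration of Eqs. 7 to 9, the sketch below computes the two mask proportions and the per-frame score from a binary motion mask. The function names are hypothetical, and returning 0 for threshold combinations that Eq. 8 does not list is an assumption made here, not something the text specifies.

```python
import numpy as np

def mask_proportions(mask):
    """Eq. 7: fraction of dynamic pixels in the whole image and in the central region."""
    h, w = mask.shape
    h0, w0 = int(0.25 * h), int(0.25 * w)          # H' and W'
    theta_i = mask.sum() / (h * w)
    theta_c = mask[h0:h - h0, w0:w - w0].sum() / ((h / 2) * (w / 2))
    return float(theta_i), float(theta_c)

def frame_score(theta_i, theta_c):
    """Eq. 8: per-frame dynamic score."""
    if theta_i >= 0.12 and theta_c >= 0.35:
        return 2.0
    if theta_i >= 0.12 and 0.2 <= theta_c < 0.35:
        return 1.5
    if theta_i < 0.12 and 0.2 <= theta_c < 0.35:
        return 1.0
    if theta_i < 0.12 and theta_c < 0.2:
        return 0.5
    return 0.0  # assumption: combinations not covered by Eq. 8

def sequence_score(masks):
    """Eq. 9: sum of per-frame scores over all extracted frames."""
    return sum(frame_score(*mask_proportions(m)) for m in masks)
```

The resulting sequence score would then be compared against $0.25 \times N$ as described above.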

Step 4: Tracking-Based Small Viewpoint Filtering.

The previous steps produce videos with static scenes. We further require videos that contain multi-view images captured over a wide range of camera viewpoints. To achieve this, we track the motion trajectory of key points across frames and calculate the radius of the minimum circumscribed circle for each trajectory. Videos with a substantial number of radii below a defined threshold are classified as having small camera trajectories and are excluded. This procedure includes keypoint extraction, trajectory tracking, and circle fitting using RANSAC (Random Sample Consensus) [21].

Keypoint Extraction. To reduce computational complexity, we downsample the extracted video frames by selecting every fourth frame. SuperPoint [16] is then used to extract keypoints $\mathbf{K} \in \mathbb{R}^{N \times 2}$ from the first frame, where $N = 100$ represents the number of detected keypoints used to initialize tracking.

Trajectory Tracking. Keypoints are tracked across all frames using the pretrained CoTracker [40], which generates trajectories and visibility over time as:

$$\mathbf{T}_{\text{pred}}, \mathbf{V}_{\text{pred}} = \text{CoTracker}(\mathbf{I}, \text{queries} = \mathbf{K}). \tag{10}$$

Here, $\mathbf{I}$ denotes the input frames, $\mathbf{T}_{\text{pred}} \in \mathbb{R}^{1 \times T \times N \times 2}$ represents the tracked positions of each keypoint over time, and $\mathbf{V}_{\text{pred}} \in \mathbb{R}^{1 \times T \times N \times 1}$ indicates the visibility of each point.

Circle Fitting. For each tracked keypoint, a circle fitting method is applied to its trajectory, selecting only frames where the keypoint is visible ($\mathbf{V}_{\text{pred}} = 1$). Let $\mathbf{T}_{\text{visible}} \in \mathbb{R}^{M \times 2}$ be the filtered points, where $M$ is the number of visible points. We then use the RANSAC-based circle fitting algorithm on $\mathbf{T}_{\text{visible}}$ to determine the circle's center $\mathbf{c} = (c_x, c_y)$ and radius $r$:

$$\mathbf{c}, r = \text{RANSAC}(\mathbf{T}_{\text{visible}}). \tag{11}$$

The RANSAC algorithm selects random subsets of three points to define candidate circles, computes the inliers, and optimizes for the circle with the highest inlier count and smallest radius. Finally, we count the number of circles with a radius smaller than a specified threshold, $r \leq 20$:

$$\text{count} = \sum_{i=1}^{N} \mathbb{I}(r_i \leq 20), \tag{12}$$

where $\mathbb{I}$ is the indicator function. The mean radius is also computed to provide an overall measure of circular motion. If the number of small-radius circles exceeds 40 and the average circular motion is less than 5, we classify this video as having small camera trajectories.
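The circle-fitting step (Eqs. 11 and 12) can be sketched as below. This is a minimal NumPy version under stated assumptions: the iteration count `n_iters`, the inlier tolerance `inlier_tol`, and all function names are hypothetical, and ties in inlier count are broken in favor of the smaller radius, following the description above.

```python
import numpy as np

def circle_from_three_points(p):
    """Return (center, radius) of the circle through three 2D points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p
    a = np.array([[x2 - x1, y2 - y1], [x3 - x1, y3 - y1]], dtype=float)
    if abs(np.linalg.det(a)) < 1e-9:
        return None
    b = 0.5 * np.array([x2**2 - x1**2 + y2**2 - y1**2,
                        x3**2 - x1**2 + y3**2 - y1**2])
    center = np.linalg.solve(a, b)           # equidistant from all three points
    radius = float(np.linalg.norm(center - np.array([x1, y1])))
    return center, radius

def ransac_circle(points, n_iters=200, inlier_tol=1.0, seed=0):
    """Fit a circle to one visible trajectory of shape (M, 2), with M >= 3."""
    rng = np.random.default_rng(seed)
    best_key, best_fit = None, None
    for _ in range(n_iters):
        idx = rng.choice(len(points), size=3, replace=False)
        fit = circle_from_three_points(points[idx])
        if fit is None:
            continue
        center, radius = fit
        residual = np.abs(np.linalg.norm(points - center, axis=1) - radius)
        inliers = int((residual < inlier_tol).sum())
        key = (inliers, -radius)             # most inliers first, then smallest radius
        if best_key is None or key > best_key:
            best_key, best_fit = key, (center, radius)
    return best_fit

def count_small_radii(radii, threshold=20.0):
    """Eq. 12: number of trajectories whose fitted radius is at most the threshold."""
    return int(np.sum(np.asarray(radii) <= threshold))
```

Running `ransac_circle` once per tracked keypoint yields the radii whose count and mean feed the final small-trajectory decision.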

User Study.

To verify the effectiveness of our data curation pipeline, we conducted a user study on 10,000 video clips randomly selected before filtering. Users evaluated each video on two aspects: static content and large-baseline trajectories. Only videos meeting both criteria are classified as 3D-aware videos. Of these 10,000 clips, 1,163 met both criteria, accounting for 11.6% of the validation set. After applying our data curation pipeline, we again randomly selected 10,000 video clips for annotation. In this filtered set, 8,859 videos were identified as 3D-aware, a ratio of 88.6%, which is 77 percentage points higher than before filtering. These results demonstrate the efficacy of our pipeline in filtering 3D-aware videos from large-scale Internet videos.

Figure 6: Single-view to 3D. Compared with LucidDreamer [12] and ViewCrafter [121], which are also conditioned on warped images, our model can consistently generate high-fidelity views with detailed texture and structural information.

Figure 7: Sparse-views to 3D. Given 3 input views, our model generates clear, high-fidelity novel views that closely match the ground truth (GT), without artifacts or blurring. Note that the results from DepthSplat [114] are cropped and resized following the same data processing as the official source code.

Figure 8: Examples of Open-world 3D Editing. (a) Occlusion-free Editing: An Asian-style attic is added, and novel views are generated realistically. (b) Full Replacement Editing: A vase is replaced with a toy fox, seamlessly integrated into the scene from various viewpoints. (c) Occluded Editing: Hidden regions in the masked areas are inferred and completed to produce novel views.

Appendix C Technical Implementations

C.1 Model Architecture

The main backbone of the See3D model is based on the structure of 2D diffusion models but integrates 3D self-attention to connect the latents of multiple images, as shown in prior work [80]. Specifically, we adapt the existing 2D self-attention layers of the original 2D diffusion model into 3D self-attention by inflating different views within the self-attention layers. To incorporate visual conditions, we introduce the necessary convolutional kernels and biases using zero initialization [82]. The model is initialized from a pretrained 2D diffusion model [71] and all of its parameters are fine-tuned, leveraging FlashAttention for acceleration. In accordance with prior work [79], switching from a scaled-linear noise schedule to a linear schedule is essential for achieving improved global consistency across multiple views. Additionally, we implement cross-attention between the latents of multiple views and per-token CLIP embeddings of reference images using a linear guidance mechanism [86]. For training, we randomly select a subset of frames from a video clip as reference images, with the remaining frames serving as target images. The number of reference images is randomly chosen to accommodate different downstream tasks. The multi-view diffusion model is optimized by calculating the loss only on the target images, as outlined in Eq. 1.

C.2 Training Details

Brightness Control. We observe that the visual-condition effectively guides camera movement but cannot control brightness changes, posing a significant limitation. Determining the light source position is particularly challenging with limited observations from single or sparse views. In our real-world test data, camera movement often causes random highlighting or darkening in some regions of scenes, which has a significant impact on pixel-level metrics like PSNR. This issue highlights a key problem: the inability to control brightness undermines the reliability of pixel-level metrics, as brightness variations affect these metrics more than the actual quality of the generated content. To achieve illumination control, 1) we preprocess the training data by converting corrupted images into HSV format, which represents hue, saturation, and value (brightness). 2) We define a $w \times h$ window and calculate the average brightness difference within this window between the ground truth image and the corrupted data. Using this difference, we apply a scaling factor to the brightness channel of the corrupted data while preserving hue and saturation, before converting the image back to RGB. This ensures brightness adjustment in the visual-condition without leaking color or content from the ground truth.

During training, we randomly drop this preprocessing with a probability of 0.5, enabling the model to infer lighting changes on its own during inference when brightness control is not required. In our evaluation experiments, brightness scaling is applied to the unmasked regions of warped images to align them with the ground truth, reducing the impact of brightness and thus yielding a higher correlation between the generated content and pixel-level metrics. Hue and saturation are kept unchanged to avoid content or color leakage. Additionally, the model enables user-controlled brightness adjustments for specific regions in multi-view generation by modifying the visual-condition as needed.
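The brightness-scaling idea can be illustrated with the following sketch. It rests on several assumptions that are not stated in the text: the scaling factor is taken as the ratio of mean brightness inside the window, the window is given as pixel bounds, and the function name is hypothetical. It also uses the fact that the HSV value channel equals `max(R, G, B)`, so multiplying RGB by a single factor scales brightness while leaving hue and saturation unchanged (before clipping), which avoids an explicit HSV round trip.

```python
import numpy as np

def match_brightness(corrupted, reference, window):
    """Scale the brightness of `corrupted` toward `reference` inside a window.

    corrupted, reference: float RGB images in [0, 1] with shape (H, W, 3).
    window: (y0, y1, x0, x1) pixel bounds of the comparison window (hypothetical format).
    """
    y0, y1, x0, x1 = window
    v_cor = corrupted.max(axis=-1)          # HSV value channel: V = max(R, G, B)
    v_ref = reference.max(axis=-1)
    mean_cor = v_cor[y0:y1, x0:x1].mean()
    mean_ref = v_ref[y0:y1, x0:x1].mean()
    scale = mean_ref / max(mean_cor, 1e-8)  # assumption: ratio of mean brightness
    # A uniform RGB scale changes V only; hue and saturation are preserved.
    return np.clip(corrupted * scale, 0.0, 1.0)
```

Only a scalar brightness statistic from the reference enters the output, so no color or content is copied from it.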

Training Configuration. We initialize the See3D model from MVDream [80] and employ a progressive training strategy. First, the model is trained at a resolution of 512 × 512 with a sequence length of 5. This phase runs for 120,000 iterations, using 1 reference view and 4 target views. Because the sequence length is relatively small, a larger batch size of 560 is used to enhance stability and accelerate convergence. Next, the sequence length is increased to 16, and the model is trained for 200,000 iterations with 1 or 3 reference views and 15 or 13 target views, respectively, at the same resolution of 512 × 512. In this phase, the batch size is reduced to 228. Finally, a multi-view super-resolution model is trained using the same network structure. It takes the multi-view predictions from See3D as input and outputs multi-view-consistent target images at a resolution of 1024 × 1024, using a batch size of 114. In all stages, all parameters of the diffusion model are fine-tuned with a learning rate of 1e-5. Additionally, we render multi-view images or extract clips from datasets such as Objaverse [15], CO3D [75], RealEstate10K [129], MVImgNet [122], and DL3DV [50], forming a supplemental 3D dataset with fewer than 0.5M samples; please refer to Section E.2 for the corresponding analysis and ablation. During training, this supplemental data is randomly sampled and incorporated into our WebVi3D dataset (~16M). To enhance training efficiency, we utilize FlashAttention [14] alongside DeepSpeed with the ZeRO stage-2 optimizer [74] and bf16 precision. We also implement classifier-free guidance (CFG) [30] by randomly dropping visual conditions with a probability of 0.1. The See3D model is trained on 114 × NVIDIA-A100-SXM4-40GB GPUs over approximately 25 days using a progressive training scheme. During inference, a DDIM sampler [85] with classifier-free guidance is employed.

C.3 Definition of $f(t)$ and $W_t$

Definition for $f(t)$.

In Eq. 2, $C_t$ is formulated as $C_t = \sqrt{\bar{\alpha}_{t'}}(1 - M)\mathbf{X}_0 + \sqrt{1 - \bar{\alpha}_{t'}}\,\bm{\epsilon}$, where $\bar{\alpha}_{t'}$ is a composite function that depends on $\alpha$ and $t'$, with $t' = f(t)$ and $f(t) = \beta \cdot t$. In our experiments, we set the hyper-parameter $\beta = 0.2$, which controls the noise level added to $C_t$. A larger $\beta$ increases the noise in $C_t$. As $\beta$ approaches 1, $C_t$ converges toward a Gaussian distribution, improving robustness but reducing the correlation between $C_t$ and $\mathbf{X}_0$, thereby weakening camera control. Conversely, as $\beta$ approaches 0, the distributions of $C_t$ and $\mathbf{X}_0$ become more similar, improving controllability. However, for downstream tasks, a very small $\beta$ creates a significant domain gap between task-specific visual cues and the video data, compromising robustness. Thus, $\beta$ serves as a trade-off parameter, balancing camera control and robustness.
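A minimal NumPy sketch of how $C_t$ could be formed from this definition is shown below. The noise schedule is an assumption (a linear schedule from 1e-4 to 0.02 over 1000 steps, which may differ from the one actually used), as are rounding $t' = \beta \cdot t$ to the nearest integer step and the function names.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - schedule_beta_t) for an assumed linear schedule."""
    schedule_betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - schedule_betas)

def visual_condition(x0, mask, t, alpha_bar, beta=0.2, rng=None):
    """C_t = sqrt(abar_{t'}) * (1 - M) * X_0 + sqrt(1 - abar_{t'}) * eps, with t' = beta * t.

    x0: clean frames X_0; mask: M, equal to 1 where pixels are masked out.
    beta: the trade-off hyper-parameter from the text (not a schedule beta).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    t_prime = int(round(beta * t))           # assumption: round to an integer step
    a = alpha_bar[t_prime]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * (1.0 - mask) * x0 + np.sqrt(1.0 - a) * eps
```

With $\beta = 0.2$, the condition at diffusion step $t$ carries only the noise level of step $0.2t$, so it stays much closer to the masked frames than $X_t$ does.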

Figure 9: Piecewise function $W_t$, showing linear decay as the timestep $t$ decreases from 1000 to 300, and exponential decay toward 0 for $t < 300$.

Formulation for $W_t$.

Recapping Eq. 3 from the main manuscript, $V_t = [W_t \ast C_t + (1 - W_t) \ast X_t; M]$, where $W_t$ is defined as a piecewise function of $t$:

$$W_t = \begin{cases} v_{\text{decay\_end}} \cdot e^{-b \cdot (t_{\text{decay\_end}} - t)}, & \text{if } t < t_{\text{decay\_end}}, \\ 1 - \left(1 - v_{\text{decay\_end}}\right) \cdot \frac{t_{\text{peak}} - t}{t_{\text{peak}} - t_{\text{decay\_end}}}, & \text{if } t \geq t_{\text{decay\_end}}, \end{cases}$$

where $t_{\text{peak}} = 1000$, $t_{\text{decay\_end}} = 300$, $v_{\text{decay\_end}} = 0.8$, and $b = 0.075$. To ensure that $W_t$ remains within the range $[0, 1]$, it is clamped as $W_t = \text{clamp}(W_t, 0, 1)$. As shown in Figure 9, 1) for $t$ between 300 and 1000, $W_t$ decreases linearly from 1 to 0.8 as $t$ decreases; 2) for $t < 300$, $W_t$ decays exponentially toward 0 as $t$ decreases.

The rationale behind this design is to ensure that when $C_t$ has significant noise, it exerts a stronger influence on $V_t$, thus affecting MVD generation. Conversely, as the noise in $C_t$ diminishes, $X_t$ rapidly replaces $C_t$, reducing the risk of information leakage from $C_t$ and improving the robustness of task-specific visual cues. The formulation of $W_t$ enables flexible parameter tuning, such as $v_{\text{decay\_end}}$ and $b$, to control its monotonic behavior. Smaller parameter values emphasize the impact of $C_t$ on MVD, while larger values prioritize robustness.
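For reference, the piecewise weight can be written directly from the definition above. The function name `w_t` is hypothetical; the default values are the ones stated in the text.

```python
import math

def w_t(t, t_peak=1000, t_decay_end=300, v_decay_end=0.8, b=0.075):
    """Blending weight W_t between the visual condition C_t and the noisy latent X_t."""
    if t < t_decay_end:
        # Exponential branch: falls from v_decay_end toward 0 as t decreases.
        w = v_decay_end * math.exp(-b * (t_decay_end - t))
    else:
        # Linear branch: 1 at t_peak, v_decay_end at t_decay_end.
        w = 1.0 - (1.0 - v_decay_end) * (t_peak - t) / (t_peak - t_decay_end)
    return min(max(w, 0.0), 1.0)  # clamp to [0, 1]
```

The two branches meet at $t = 300$ with value 0.8, so the weight is continuous there, and at $t = 290$ it has already dropped to roughly 0.38.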

Appendix D More Experimental Results

Leveraging the developed web-scale dataset WebVi3D, our model supports both object- and scene-level 3D creation tasks, including single-view-to-3D, sparse-view-to-3D, and 3D editing. Additional experimental results for these tasks are presented below.

D.1 Single View to 3D

Table 4 presents a quantitative comparison of zero-shot novel view synthesis performance on the Tanks-and-Temples [43], RealEstate10K [129], and CO3D [75] datasets. Our method consistently outperforms all others on both easy and hard sets, achieving the best results in every evaluation metric. Qualitative results are shown in Figure 6. Compared to warping-based competitors such as LucidDreamer [12] and ViewCrafter [121], our approach more effectively captures both geometric structure and texture details, producing more realistic 3D scenes. These results highlight the robustness and versatility of our method in synthesizing high-quality novel views across diverse and challenging scenarios.

Table 4: Zero-shot Novel View Synthesis (NVS) on the Tanks-and-Temples [43], RealEstate10K [129], and CO3D [75] datasets.

D.2 Sparse Views to 3D

Quantitative comparisons using 3, 6, and 9 input views are presented in Table 5. The 3DGS model trained on multi-view images generated by See3D outperformed state-of-the-art models in novel view rendering, demonstrating its ability to provide consistent multi-view support for 3D reconstruction without additional constraints. Qualitative comparisons in Figure 7 reveal fewer floating artifacts in the NVS results, indicating See3D generates higher-quality and more consistent multi-view images.

Table 5: Quantitative Comparison of Sparse-view 3D Reconstruction

D.3 3D Editing

Our model, trained on large-scale videos, naturally supports open-world 3D editing without the need for additional fine-tuning. Figure 8 illustrates three distinct editing scenarios: a) Occlusion-free Editing. An Asian-style attic is placed next to a toy bulldozer in the original image, which serves as the reference view. Our model generates highly realistic images containing the Asian-style attic from various new viewpoints. b) Full Replacement Editing. The vase in the original image is completely replaced with a toy fox. Our model generates new scenes from different viewpoints, seamlessly incorporating the toy fox into the designated area with no residual traces of the vase. c) Occluded Editing. Given an occluded edited image as a reference view, our model can generate multiple novel views within the specified masked regions, inferring and filling in the hidden details of the occluded parts.

Appendix E Additional Ablation Studies

E.1 Effectiveness of Pixel-level Depth Alignment

We conducted additional ablation experiments to validate the effectiveness of the proposed pixel-level depth alignment. Specifically, we enabled and disabled pixel-level depth alignment when generating novel views through warping and visualized the warped results at a specific generation step. As shown in Figure 10, the left image shows the reference GT image, the middle image corresponds to warping with pixel-level aligned depth, and the right one depicts warping without pixel-level aligned depth. The results demonstrate that pixel-level depth alignment not only effectively restores the scale of the depth map but also significantly corrects errors in monocular depth estimation (e.g., the toy’s neck and the tabletop). Consequently, integrating pixel-level depth alignment into our proposed 3D generation pipeline improves generation quality.

E.2 Efficacy of Scaling up Data

Figure 10: Ablation on Pixel-level Depth Alignment.

Table 6: Ablation on Supplementary 3D Data.

In the main manuscript, we conducted an ablation study on the 3D dataset MVImgNet [122] to evaluate the effectiveness of the proposed visual-condition. Table 3 shows that: 1) When conditioned on purely masked images, the MV-UnPoseM model performed the worst, struggling with the domain gap issue. 2) When conditioned on pose-guided warped images, the MV-Posed model achieved the best results, benefiting from pose annotations. 3) Our MV-UnPoseT model, conditioned on the time-dependent visual-condition, demonstrated performance very close to that of the MV-Posed model.

Figure 11: Examples of Long-sequence Generation. High-quality novel views generated along complex camera trajectories, maintaining spatial consistency and visual realism across extended sequences.

Figure 12: More Examples of Long-sequence Generation.

Intuitively, models trained entirely on 3D data tend to achieve optimal performance at a specific data scale, establishing an upper bound at that scale. When the volume of video data matches that of 3D data, models trained on 3D data still set the performance ceiling. However, because video data is virtually unlimited, scaling up the dataset can raise this upper bound.

Following the same settings as in Table 3, we further investigate the impact of supplementing multi-view data with 3D annotations on model performance. We conduct an ablation study using the MV-UnPoseT model, trained on unposed multi-view data with the visual-condition. In this study, we progressively introduce 3D pose annotations at levels of 10%, 20%, 60%, and 100% into the training set. When the training data is entirely composed of 3D annotations, the model configuration is equivalent to the MV-Posed model. The results in Table 6 indicate that our MV-UnPoseT model, initially trained on unposed data, improves steadily as 3D annotations are introduced. For instance, with only 20% 3D data (MV-UnPoseT-20%), the model’s performance closely approaches that of the fully 3D-annotated MV-Posed model. This suggests that even a small amount of 3D data in a largely unposed multi-view dataset can significantly boost model performance, approaching that of models trained on fully annotated 3D datasets.

This insight is essential because unposed multi-view data is cost-effective and can be easily collected in large quantities. By incorporating a small volume of high-quality 3D data, we can achieve performance comparable to models trained on large, expensive 3D datasets. Therefore, in our proposed WebVi3D dataset (16M samples), we incorporated a small portion (0.5M samples) of 3D data to optimize model performance.

Appendix F Additional Visualizations

Open-world 3D Generation with Long Sequences. We manually configured complex camera trajectories, including rotation, translation, zooming in, zooming out, focus distance adjustments and various random combinations, as shown in Figure 11 and Figure 12. Our model consistently generates high-quality, continuous novel views along these trajectories. Experimental visualizations demonstrate that the model effectively preserves spatial consistency and visual realism across long sequences. This highlights its robustness in handling intricate camera paths, including rapid transitions and diverse perspectives, making it highly applicable to open-world scenarios.