MVDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors

Honghua Chen, Yushi Lan, Yongwei Chen, Yifan Zhou, Xingang Pan
S-Lab, Nanyang Technological University

Abstract

Drag-based editing has become popular in 2D content creation, driven by the capabilities of image generative models. However, extending this technique to 3D remains a challenge. Existing 3D drag-based editing methods, whether employing explicit spatial transformations or relying on implicit latent optimization within limited-capacity 3D generative models, fall short in handling significant topology changes or generating new textures across diverse object categories. To overcome these limitations, we introduce MVDrag3D, a novel framework for more flexible and creative drag-based 3D editing that leverages multi-view generation and reconstruction priors. At the core of our approach is the usage of a multi-view diffusion model as a strong generative prior to perform consistent drag editing over multiple rendered views, which is followed by a reconstruction model that reconstructs 3D Gaussians of the edited object. While the initial 3D Gaussians may suffer from misalignment between different views, we address this via view-specific deformation networks that adjust the position of Gaussians to be well aligned. In addition, we propose a multi-view score function that distills generative priors from multiple views to further enhance the view consistency and visual quality. Extensive experiments demonstrate that MVDrag3D provides a precise, generative, and flexible solution for 3D drag-based editing, supporting more versatile editing effects across various object categories and 3D representations. Video demos can be found on our project webpage: https://chenhonghua.github.io/MyProjects/MvDrag3D/.

Figure 1: Comparison of our MVDrag3D with state-of-the-art approaches. The first two rows present results of dragging on meshes, while the last two focus on 3D Gaussians. Notably, APAP (Yoo et al., 2024) is specifically designed for mesh structures, and thus, it was not tested on 3D Gaussians. Overall, our method demonstrates the ability to produce more plausible and generative editing results, showing better performance across both 3D Gaussians and meshes.

1 Introduction

Deforming 3D shapes by dragging point handles has been an essential interactive tool in computer graphics, enabling intuitive manipulation of complex shapes and structures. Traditionally, such drag-based 3D editing is often defined on mesh structures, utilizing optimization functions to preserve specific properties under the constraint of control handles. These properties include the mesh Laplacian (Lipman et al., 2004; 2005; Sorkine et al., 2004), local rigidity (Igarashi et al., 2005; Sorkine & Alexa, 2007), and surface Jacobians (Aigerman et al., 2022; Gao et al., 2023), as well as more recent considerations of perceptual plausibility (Yoo et al., 2024). However, these methods are constrained by the fixed topology of mesh structures, limiting their flexibility, especially in complex edits that require substantial changes to the topology or the generation of new textures, e.g., editing a bird to open its wings.

In light of the recently introduced 3D Gaussian splatting (Kerbl et al., 2023) that is more expressive and easy to edit, Interactive3D (Dong et al., 2024) introduces a series of deformable and rigid 3D operations to directly manipulate local 3D Gaussians. This is followed by Gaussian-to-NeRF reformatting and refinement through Score Distillation Sampling (SDS) (Poole et al., 2022). However, this method suffers from prolonged NeRF optimization and the typical limitations of vanilla SDS, such as over-saturation. PhysGaussian (Xie et al., 2024) also simulates drag-induced motion by integrating physically grounded dynamics into 3D Gaussians. However, it requires an accurate predefinition of the physical properties involved, which can be difficult to obtain. Besides, both methods still face challenges in making large structural changes and generating new content.

Notably, recent drag-based editing has seen considerable success in the 2D domain (Pan et al., 2023; Mou et al., 2023; 2024; Zhang et al., 2024; Shin et al., 2024), largely due to the capabilities of powerful image generative models, such as GANs (Karras et al., 2020) and diffusion models (Rombach et al., 2022). These models encompass a latent space that enables various harmonious manipulations, including object deformation, layout adjustments, and coherent new content generation. Building on this success, some 3D editing methods have begun to explore generative 3D dragging within a 3D latent space. For instance, Drag3D (Tang, 2023) adapts DragGAN (Pan et al., 2023) by incorporating a 3D GAN (Shen et al., 2021) into a motion-based latent optimization framework. Similarly, CNS-Edit (Hu et al., 2024) employs a latent-based method but combines it with a 3D neural volume diffusion model (Hui et al., 2022). However, this approach requires training separate models for each shape category, making it less flexible and more resource-intensive. Both approaches are therefore limited by the capacity and generalization of current 3D generative models.

In pursuit of a stronger generative prior for more powerful drag-based 3D editing, we have observed the following from existing 3D generation and reconstruction work: 1) most 3D representations can be rendered into multiple views; 2) 3D objects can be faithfully reconstructed from four or more views (Tang et al., 2024a; Xu et al., 2024b); and 3) existing multi-view diffusion models provide a strong prior for generating consistent images across four orthogonal views (Shi et al., 2023b; Kant et al., 2024). These observations inspire us to explore the potential of leveraging both large-scale multi-view generation and reconstruction models as 3D priors, agnostic to 3D representations, to facilitate precise, generative, and general 3D dragging. Ideally, we expect the 3D dragging operation to exhibit the following properties: 1) Accuracy: the ability to precisely drag any point on a 3D object’s surface to a target spatial position; 2) Generative capability: the ability to generate visually plausible new content to match the drag intention; and 3) Versatility: compatibility with various input object categories and most 3D representations, such as 3D Gaussians or meshes.

To this end, we introduce MVDrag3D, a novel framework for drag-based 3D editing that leverages multi-view generation and reconstruction priors. Our method begins by rendering four orthogonal views of a 3D object and projecting the dragging points onto the corresponding views. To ensure consistent 3D edits, we extend the score-based gradient guidance mechanism within a multi-view diffusion model and propose a multi-view guidance energy function, enabling consistent edits across all four views. Thanks to the generative capabilities of the multi-view diffusion model, edits across four views can faithfully reflect significant structural changes or newly synthesized textures. The edited views are then fused into a 3D Gaussian representation using a multi-view Gaussian reconstruction model. Although the initial 3D Gaussian appears complete, we observe a loss of appearance detail, and the 3D Gaussians in the overlapping regions between views do not align accurately, leading to noticeable discrepancies in the 2D rendering. To address these issues, we employ a deformation network that predicts the displacement of each Gaussian to correct the 3D alignment. Additionally, we formulate an image-conditioned multi-view score function to distill generative priors from the multiple views simultaneously, ensuring high-fidelity results while preserving details across all views. We summarize our contributions as follows:

    1. We propose MVDrag3D, a drag-based 3D editing framework that leverages multi-view generation-reconstruction priors. It is accurate, generative, and adaptable to diverse input categories and most 3D representations, such as 3D Gaussians and meshes.
    2. We extend the gradient guidance mechanism into a multi-view diffusion model and introduce multi-view guidance energy, which ensures consistent drag-based edits across four views.
    3. We design a lightweight deformation network that corrects each 3D Gaussian’s position and enhances geometric consistency. Furthermore, we introduce an image-conditioned multi-view score function to iteratively refine the 3D Gaussian, ensuring high-fidelity appearance and preserving fine details across all views.

2 Related Work

We review prior research, starting from drag-based 2D image editing techniques and progressing to more recent developments in drag-based 3D editing and 3D generation-reconstruction priors.

Drag-based image editing. Drag-based image manipulation allows users to exert precise control over specific areas of the image via manual interactions like dragging and clicking. Most existing techniques employ iterative optimization in the latent space, and they can be roughly divided into two categories: methods that rely on motion tracking (Pan et al., 2023; Shi et al., 2024; Zhang et al., 2024; Cui et al., 2024; Liu et al., 2024a; Ling et al., 2024) and those based on guidance gradients (Mou et al., 2023; 2024). DragGAN (Pan et al., 2023), for instance, optimizes the latent space of GANs using iterative motion supervision and point tracking. Later, diffusion-based methods, including DragDiffusion (Shi et al., 2024), GoodDrag (Zhang et al., 2024), StableDrag (Cui et al., 2024), DragNoise (Liu et al., 2024a), and FreeDrag (Ling et al., 2024), have further improved these motion-driven techniques. Meanwhile, DragonDiffusion (Mou et al., 2023) and DiffEditor (Mou et al., 2024) utilize a gradient-based approach by optimizing an energy function (Epstein et al., 2023) to achieve desired edits. Since both motion- and gradient-based methods require time-consuming iterations, SDEDrag (Nie et al., 2024) and FastDrag (Zhao et al., 2024) have been proposed to accelerate the editing process. More recently, InstantDrag (Shin et al., 2024) decomposes the dragging task into two components: learning motion dynamics and generating images conditioned on motion, achieving a better balance among interactivity, speed, and quality.

Drag-based 3D editing. To achieve drag-based 3D editing, classical mesh deformation techniques are commonly employed. These methods often design optimization functions to preserve specific geometric properties, such as the mesh Laplacian (Lipman et al., 2004; 2005; Sorkine et al., 2004), local rigidity (Igarashi et al., 2005; Sorkine & Alexa, 2007), and surface Jacobians (Aigerman et al., 2022; Gao et al., 2023), under the constraints of user-interactive handles like key points or cages. Despite their widespread use, these techniques frequently result in unnatural shape distortion, primarily due to their inability to ensure perceptual plausibility. To address this limitation, APAP (Yoo et al., 2024) introduced an innovative approach by incorporating SDS loss to optimize the Jacobian deformation field. However, like previous mesh deformation methods, APAP is constrained by the fixed topology of mesh structures, limiting its flexibility, particularly for complex edits that require generating entirely new content. On the other hand, Interactive3D (Dong et al., 2024) introduces a series of deformable and rigid 3D point operations on 3D Gaussians and also employs SDS to optimize the deformed or transformed Gaussians/NeRFs. PhysGaussian (Xie et al., 2024) also supports certain types of drag-related motion by integrating physically grounded dynamics into 3D Gaussians; however, it requires a suitable predefinition of the physics involved. Although these latter two methods employ more expressive 3D representations, they often require labor-intensive post-processing and face challenges in refining fine details or generating coherent new content.

As drag-based image editing techniques evolve, some 3D editing methods have begun to explore generative 3D dragging within a 3D latent space. For instance, Drag3D (Tang, 2023), built upon DragGAN (Pan et al., 2023), integrates a 3D GAN model into a motion-based latent optimization framework. However, the approach is inherently limited by the capacity and generalization constraints of current 3D GAN models. Later, CNS-Edit (Hu et al., 2024) introduces a coupled neural shape representation to facilitate 3D shape editing. This method utilizes a latent code to capture high-level global semantics, while a 3D neural feature volume provides spatial context for local shape modifications. However, CNS-Edit’s category-specific design requires separate models for different 3D shape categories. In contrast, we achieve 3D generative dragging within a more powerful multi-view latent space.

Multi-view Image Generation. 2D diffusion models (Rombach et al., 2022; Saharia et al., 2022) initially focused on generating a single-view image. Recently, several models (Shi et al., 2023b; Wang & Shi, 2023; Shi et al., 2023a; Li et al., 2023b; Long et al., 2024; Kant et al., 2024; Tang et al., 2024b; Liu et al., 2024b) have adopted a 3D-aware multi-view diffusion approach, incorporating camera poses as additional inputs and fine-tuning the diffusion model on multi-view data (Deitke et al., 2023). This strategy enables the consistent generation of multi-view images representing the same object. Essentially, these multi-view diffusion models capture a rich, generalizable distribution of 3D data, agnostic to a specific 3D representation. Also, given the limitations of current “pure” 3D generative models (those trained directly on 3D data), we believe that leveraging multi-view diffusion models as a 3D prior proxy could offer a promising solution for flexible 3D editing.

Feed-forward Multi-view 3D Reconstruction. By generating 3D-consistent multi-view images, various optimization techniques can be employed to reconstruct 3D objects (Shi et al., 2023b; Wang & Shi, 2023; Liu et al., 2023). To improve generation speed and quality, more recent work has explored large-scale reconstruction models using multi-view images (e.g., 4 or 6) (Wang et al., 2023; Xu et al., 2023; Li et al., 2023a; Wang et al., 2024; Xu et al., 2024a). These approaches leverage transformers to directly regress triplane-based NeRF representations. Newer methods like LGM (Tang et al., 2024a) and GRM (Xu et al., 2024b) replace triplane NeRF with 3D Gaussians (Kerbl et al., 2023), achieving high-fidelity rendering at faster speeds. In summary, these recent feed-forward multi-view reconstruction models provide a robust 3D reconstruction prior, enabling the fast and faithful recreation of complete 3D objects from sparse-view images. In this work, we use a 4-view reconstruction model (Tang et al., 2024a) and a 4-view diffusion model (Shi et al., 2023b) as our generation-reconstruction priors.

3 Method

In this section, we briefly introduce score-based guidance energy for image editing, followed by a detailed explanation of our method.

3.1 Preliminary

Score-based gradient guidance for image editing. Recently, DragonDiffusion (Mou et al., 2023) and DiffEditor (Mou et al., 2024) have applied score-based gradient guidance (Dhariwal & Nichol, 2021) to efficient and flexible image-editing tasks. The score function enables sampling from a more enriched distribution, generally defined as:

$$\tilde{\bm{\epsilon}}_{\theta}^{t}(\mathbf{x}_{t})=\bm{\epsilon}_{\theta}^{t}(\mathbf{x}_{t})+\eta\cdot\nabla_{\mathbf{x}_{t}}\mathcal{E}(\mathbf{x}_{t},\mathbf{y}), \tag{1}$$

where the first term is the unconditional denoiser, and the second term is the conditional gradient produced by an energy function. Here, $\eta$ is the learning rate, and $\mathbf{y}$ represents the edit target, such as a text embedding. During the diffusion sampling process, the gradient guidance from the energy function aligns with the editing target, gradually modifying the input image to meet the desired edit.
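As a minimal sketch of Eq. (1), the snippet below adds a scaled energy gradient to a denoiser output. The quadratic energy `E(x, y) = 0.5 * ||x - y||^2` and the helper names `guided_noise` and `toy_energy_grad` are illustrative assumptions, not part of the method; in practice the gradient would come from automatic differentiation through the diffusion features.

```python
def guided_noise(eps, grad_energy, eta):
    """Eq. (1): denoiser output plus eta times the energy gradient, element-wise."""
    return [e + eta * g for e, g in zip(eps, grad_energy)]


def toy_energy_grad(x_t, y):
    """Gradient w.r.t. x of the hypothetical energy E(x, y) = 0.5 * ||x - y||^2."""
    return [xi - yi for xi, yi in zip(x_t, y)]


x_t = [1.0, 2.0, 3.0]   # current noisy sample (toy values)
y = [0.0, 2.0, 5.0]     # edit target (toy values)
eps = [0.1, -0.2, 0.3]  # stand-in for the unconditional denoiser output
eps_tilde = guided_noise(eps, toy_energy_grad(x_t, y), eta=0.5)
print(eps_tilde)
```

The guidance term vanishes wherever the sample already matches the target (the second coordinate here), so only mismatched coordinates are corrected.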

In recent 2D dragging work (Mou et al., 2023; 2024), the guidance energy function is constructed based on image feature correspondence within a pre-trained diffusion model as follows:

$$\nabla_{\mathbf{z}_{t}}\log q(\mathbf{y}|\mathbf{z}_{t})=\alpha\cdot\mathbf{m}_{edit}\cdot\nabla_{\mathbf{x}_{t}}\mathcal{E}_{edit}+\beta\cdot(1-\mathbf{m}_{edit})\cdot\nabla_{\mathbf{x}_{t}}\mathcal{E}_{content}, \tag{2}$$

where $\mathbf{m}_{edit}$ is the editing region mask. The energy function $\mathcal{E}_{edit}$ measures the diffusion feature similarity between areas near the dragging start and destination points, while $\mathcal{E}_{content}$ ensures that unedited content stays consistent with the original image. $\alpha$ and $\beta$ are balance weights. In our work, we extend both the editing energy and content energy to a multi-view version. This ensures that modifications made in one view are coherently reflected across all views.
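The masked combination in Eq. (2) can be illustrated with a short sketch. The flat-list layout and the function name `combine_gradients` are assumptions made for illustration; real gradients would be tensors shaped like the latent.

```python
def combine_gradients(grad_edit, grad_content, mask_edit, alpha, beta):
    """Eq. (2): the edit gradient acts inside the mask, the content gradient outside it."""
    return [
        alpha * m * ge + beta * (1.0 - m) * gc
        for ge, gc, m in zip(grad_edit, grad_content, mask_edit)
    ]


# Two pixels: the first lies in the editing region, the second does not.
guidance = combine_gradients(
    grad_edit=[2.0, 2.0],
    grad_content=[3.0, 3.0],
    mask_edit=[1.0, 0.0],
    alpha=4.0,
    beta=0.5,
)
print(guidance)  # [8.0, 1.5]
```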

Figure 2: Method overview. Given a 3D model and multiple pairs of 3D dragging points, we first render the model into four orthogonal views, each with corresponding projected dragging points. Then, to ensure consistent dragging across these views, we define a multi-view guidance energy within a multi-view diffusion model. The resulting dragged images are used to regress an initial set of 3D Gaussians. Our method further employs a two-stage optimization process: first, a deformation network adjusts the positions of the Gaussians for improved geometric alignment, followed by image-conditioned multi-view score distillation to enhance the visual quality of the final output.

3.2 Overview

The entire process is visualized in Fig. 2. Given a 3D model $M$ to be edited and $k$ pairs of 3D dragging points $\{(\mathbf{p}_{j}^{3D},\mathbf{q}_{j}^{3D})\}_{j=1}^{k}$, we first render $M$ into four orthogonal images $\mathcal{I}=\{\mathbf{I}_{i}\}_{i=1}^{4}$, along with the corresponding dragging points (Sec. 3.3). We then propose a multi-view guidance energy function (Sec. 3.4), which ensures consistent and coherent dragging across all views. The edited images $\mathcal{I}_{e}=\{\mathbf{I}_{e,i}\}_{i=1}^{4}$ are used to regress 3D Gaussians using LGM (Tang et al., 2024a). While the initial reconstruction appears complete, we further use a deformation network and introduce an image-conditioned multi-view score distillation to correct the misalignment between Gaussians in the overlapping regions of each view and enhance the visual appearance across all views, resulting in the final edited results (represented in 3D Gaussians) (Sec. 3.5).

3.3 3D-2D Rendering and Projection

We decompose the 3D dragging operation in a multi-view manner. First, we render the 3D model $M$ into four orthogonal images $\{\mathbf{I}_{i}\}_{i=1}^{4}$ using any suitable renderer. Since MVDream typically generates images with gray backgrounds, we adopt a similar gray background for rendering. In terms of camera setup, we adopt the same configuration as MVDream (Shi et al., 2023b) and LGM (Tang et al., 2024a), which serve as our generation-reconstruction priors. Specifically, the four views are chosen at orthogonal azimuths $(0^{\circ},90^{\circ},180^{\circ},270^{\circ})$ and a fixed elevation $(0^{\circ})$. Then, the $k$ pairs of 3D dragging points can be projected onto the corresponding views, represented as $\{(\mathbf{p}_{i,j}^{2D},\mathbf{q}_{i,j}^{2D})\}_{j=1}^{k}$. However, due to potential occlusions in certain views, we discard a point pair in view $i$ if the $z$-axis value (depth) of $\mathbf{p}_{i,j}^{2D}$ or $\mathbf{q}_{i,j}^{2D}$ exceeds the rendered depth at the corresponding 2D position.
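The projection and occlusion test described above can be sketched as follows. This is a simplified illustration under stated assumptions: a pinhole camera orbiting the y-axis at a hypothetical radius of 1.5 and looking at the origin, a normalized focal length, a 256-pixel square image, and a caller-supplied `depth_at(row, col)` lookup standing in for the rendered depth map. The actual camera parameters follow MVDream and LGM and are not reproduced here.

```python
import math

AZIMUTHS_DEG = (0.0, 90.0, 180.0, 270.0)  # four orthogonal views, elevation 0


def project(point, azimuth_deg, radius=1.5, focal=1.0, size=256):
    """Project a 3D point into one view; returns (row, col, depth)."""
    a = math.radians(azimuth_deg)
    cam = (radius * math.sin(a), 0.0, radius * math.cos(a))
    forward = (-math.sin(a), 0.0, -math.cos(a))  # camera looks at the origin
    right = (math.cos(a), 0.0, -math.sin(a))     # forward x up, with up = +y
    d = [p - c for p, c in zip(point, cam)]
    x_cam = sum(di * ri for di, ri in zip(d, right))
    y_cam = d[1]
    depth = sum(di * fi for di, fi in zip(d, forward))
    col = (0.5 + 0.5 * focal * x_cam / depth) * size
    row = (0.5 - 0.5 * focal * y_cam / depth) * size
    return row, col, depth


def visible_pairs(pairs, azimuth_deg, depth_at, tol=1e-3):
    """Keep a (p, q) pair only if neither point lies behind the rendered surface."""
    kept = []
    for p, q in pairs:
        pr, pc, pz = project(p, azimuth_deg)
        qr, qc, qz = project(q, azimuth_deg)
        if pz <= depth_at(pr, pc) + tol and qz <= depth_at(qr, qc) + tol:
            kept.append(((pr, pc), (qr, qc)))
    return kept


# Toy depth map: a surface at depth 1.2 everywhere.
pairs = [((0.0, 0.0, 0.5), (0.0, 0.0, 0.6))]
front = visible_pairs(pairs, 0.0, lambda r, c: 1.2)
back = visible_pairs(pairs, 180.0, lambda r, c: 1.2)
print(len(front), len(back))  # 1 0
```

In this toy setup the pair is kept in the front view and discarded in the back view, where both points lie behind the surface.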

3.4 Multi-view gradient guidance for dragging

Figure 3: Effect of DDIM inversion with random noise. For the rendered four images, when inverted into MVDream’s data distribution, the resulting noise deviates from a Gaussian distribution (b). By adding random noise ($\mathcal{N}(0,0.01)$) to the background’s pixel domain, we help the latent variables conform more closely to a Gaussian distribution (c). The resulting multi-view edits are shown in (d) and (e). Yellow arrows indicate the views with evident identity changes.

Since a 3D object can be rendered into multiple images and numerous drag-based 2D editing methods already exist, a straightforward approach to achieve drag-based 3D editing would be to independently edit each view and then reconstruct the 3D model. However, this leads to significant 3D inconsistencies (see the results of DiffEditor (Mou et al., 2024) in Fig. 1), as the editing results of each image become misaligned across various factors such as pose, layout, texture, and more. Based on the observation that multi-view diffusion models can simultaneously generate a consistent set of multi-view images, and recognizing the effectiveness of score-based gradient guidance in image editing, we extend gradient guidance to a multi-view version.

Specifically, we first apply DDIM inversion (Song et al., 2020) to transform each of $\{\mathbf{I}_{i}\}_{i=1}^{4}$ into a Gaussian distribution. These distributions are combined and represented as $\mathbf{z}_{T}\in\mathcal{R}^{4\times H\times W\times C}$ within the latent space of MVDream. Using $\mathbf{z}_{T}$, we can extract an intermediate feature $\mathbf{F}$ from the UNet decoder. Note that MVDream reshapes $\mathbf{z}_{T}$ into a $4HW\times C$ format, thus extending self-attention to the cross-view version. This ensures that guidance from one view can influence the others. With this, we follow Mou et al. (2023) and define a multi-view guidance energy:

$$\begin{aligned}
\mathcal{E}_{edit} &= \sum_{i=1}^{4}\frac{1}{0.5\cdot\cos\left(\mathbf{F}_{i,t}^{edi}[\mathbf{m}^{edi}_{i}],\ sg(\mathbf{F}_{i,t}^{ori}[\mathbf{m}^{ori}_{i}])\right)+0.5}, \\
\mathcal{E}_{content} &= \sum_{i=1}^{4}\frac{1}{0.5\cdot\cos\left(\mathbf{F}_{i,t}^{edi}[\mathbf{m}^{unedited}_{i}],\ sg(\mathbf{F}_{i,t}^{ori}[\mathbf{m}^{unedited}_{i}])\right)+0.5},
\end{aligned} \tag{3}$$

where $\mathbf{F}_{i,t}^{edi}$ and $\mathbf{F}_{i,t}^{ori}$ are intermediate features of $\mathbf{z}_{i,t}^{edi}$ and $\mathbf{z}_{i,t}^{ori}$. $\mathbf{z}_{i,t}^{ori}$ corresponds to the latent variable of the original image at time step $t$, while $\mathbf{z}_{i,t}^{edi}$ represents the edited latent variable. $sg(\cdot)$ is the stop-gradient operation, so the original-image features act as constants. In the dragging operation, $\mathbf{m}^{ori}$ (or $\mathbf{m}^{edi}$) is a $3\times 3$ rectangular patch centered around the 2D dragging points $\mathbf{p}^{2D}$ (or $\mathbf{q}^{2D}$). $\mathbf{m}^{unedited}$ denotes the areas without editing. To enhance readability, the index labels on each image are omitted. Note also that all layers of the UNet decoder features are used to compute the guidance energy, ensuring more comprehensive and robust results. The gradient of $\mathcal{E}_{edit}$ is then used to generate consistently edited images $\{\mathbf{I}_{e,i}\}_{i=1}^{4}$, while $\mathcal{E}_{content}$ is employed to preserve the appearance of the unedited regions, keeping them as close to the original images as possible.
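
To make the content energy concrete, the following minimal sketch evaluates it on plain Python lists. It is an illustration under simplifying assumptions rather than the actual implementation: each view contributes a single flattened feature vector already restricted to the unedited mask, and the function names `cosine_similarity` and `content_energy` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def content_energy(edited_feats, original_feats):
    """Sum over views of 1 / (0.5 * cos + 0.5).

    Each argument holds one flattened feature vector per view, assumed to be
    already restricted to the unedited mask. The original features play the
    role of sg(.): plain floats carry no gradient, so nothing extra is needed.
    """
    energy = 0.0
    for f_edi, f_ori in zip(edited_feats, original_feats):
        cos = cosine_similarity(f_edi, f_ori)
        # cos = 1 gives the minimum term 1; the term grows as cos approaches -1.
        energy += 1.0 / (0.5 * cos + 0.5)
    return energy
```

With four views whose edited and original features coincide, every term equals 1 and the energy reaches its minimum of 4; the energy increases as the unedited regions drift away from the original features.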

DDIM inversion with random noise. During DDIM inversion, we observed that for the given four images, their latent noise does not follow a Gaussian distribution, as depicted in Fig. 3 (b). This discrepancy often causes instability during the editing process, making it difficult to preserve the object’s identity (see Fig. 3 (d)). We believe this issue arises because MVDream was never trained on images with smooth, noise-free regions like the background, leading to a domain gap during inversion (Ouyang et al., 2024). To address this issue, we found that introducing small, nearly imperceptible perturbations to the pixel domain—especially in smooth areas like the background—significantly improves the inversion process. These subtle disturbances help the latent variables conform more closely to a Gaussian distribution (see Fig. 3 (c)). The final results exhibit smoother transitions and better overall fidelity in the edited images, as shown in Fig. 3 (e).
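
The perturbation step can be sketched as follows. This is a hypothetical helper, not the released code: `perturb_background`, the boolean foreground mask, and the list-based image layout are assumptions made for illustration. The default `sigma=0.01` mirrors the $\mathcal{N}(0,0.01)$ noise reported in the implementation details, where we assume 0.01 is the standard deviation (the text does not state whether it is a standard deviation or a variance).

```python
import random


def perturb_background(image, foreground_mask, sigma=0.01, seed=None):
    """Add zero-mean Gaussian noise to background pixels only.

    image: 2D list of floats (one channel, assumed to lie in [0, 1]).
    foreground_mask: 2D list of bools, True where the object is.
    sigma: noise standard deviation (assumed interpretation of 0.01).
    """
    rng = random.Random(seed)
    noisy_image = []
    for row, mask_row in zip(image, foreground_mask):
        new_row = []
        for value, is_foreground in zip(row, mask_row):
            if is_foreground:
                new_row.append(value)  # keep object pixels untouched
            else:
                new_row.append(value + rng.gauss(0.0, sigma))
        noisy_image.append(new_row)
    return noisy_image
```

Clipping the result back to the valid pixel range is left to the caller, since clipping a pure white or black background would remove part of the noise.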

3.5 3D Gaussian Reconstruction and Refinement

Once we obtain the four edited images, we employ LGM (Tang et al., 2024a) to regress a set of partial 3D Gaussians for each view and then fuse them into a unified 3D Gaussian representation. However, we encountered two significant challenges: (1) because we only use four orthogonal views, the predicted Gaussians in the overlapping regions between views are usually not aligned correctly, resulting in noticeable discrepancies in the 2D rendering (see Fig. 4 (c)), and (2) the appearance details are frequently lost during LGM’s regression process, reducing the visual fidelity of the final 3D reconstruction (see Fig. 5 (c)).

Figure 4: Effect of Gaussian position optimization. (c) shows that the initial 3D reconstruction may exhibit structural misalignment. By employing a deformation network to optimize the Gaussian positions, we achieve better compactness and consistency among the Gaussians across different views, as shown in (d).

In our early tests, to address these issues, we applied vanilla SDS on the initial reconstruction, incorporating a multi-view reconstruction loss across the four views. However, these adjustments did not resolve the underlying issues. We attribute this to the inherent ambiguity in the SDS and reconstruction losses. Specifically, it is difficult to directly optimize independent Gaussians consistently without regularization, and the losses do not effectively indicate when to adjust the positions or when to densify or prune the Gaussians, resulting in suboptimal outcomes. We therefore propose a two-step approach: first, we adjust the Gaussians’ positions via deformation fields to achieve better geometric alignment, and then we focus on enhancing visual quality.

Figure 5: Effect of image-conditioned multi-view SDS. (c) presents the reconstruction results without appearance optimization, while (d) displays the corresponding results after optimization, which are noticeably sharper and clearer.

Gaussian position optimization. The geometric misalignment across views mainly involves low-frequency, overall structural changes, and the Gaussians belonging to the same view should move consistently. For each view’s Gaussian set, we therefore propose to use an individual deformation network $f$ to predict each Gaussian’s movement $(\delta x_{i},\delta y_{i},\delta z_{i})$. This means we employ a total of four lightweight individual MLPs, one for each view. Besides, since standard MLPs are generally ineffective for low-dimensional coordinate-based regression tasks (Tancik et al., 2020), we enhance the model by applying Fourier positional embeddings ($pe(\cdot)$) to each Gaussian’s $(x,y,z)$ coordinates. The new position for each Gaussian is then calculated as: $(x^{\prime},y^{\prime},z^{\prime})=(x,y,z)+f(pe((x,y,z)))$. The training loss is the VGG-based LPIPS loss, applied to the four images. This helps maintain perceptual similarity and ensures better alignment across views: $\mathcal{L}_{\text{LPIPS}}=\sum_{i=1}^{4}\text{LPIPS}(\mathbf{I}_{e,i},\mathbf{I}^{\text{render}}_{e,i})$, where $\mathbf{I}^{\text{render}}_{e,i}$ is the image rendered from the optimized Gaussians after their positions have been corrected. Note that Gaussian densification and pruning are not performed at this stage. Fig. 4 (d) shows the effectiveness of the Gaussian position optimization stage.
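
The position update $(x^{\prime},y^{\prime},z^{\prime})=(x,y,z)+f(pe((x,y,z)))$ can be illustrated with a small forward-pass sketch. Everything beyond the formula is an assumption made for illustration: the number of frequencies, the hidden width, the power-of-two frequency bands, and the zero-initialized output layer (chosen here so that the network starts as the identity map). Training with the LPIPS loss needs a differentiable renderer and is omitted.

```python
import math
import random


def fourier_embed(point, num_freqs=4):
    """Map (x, y, z) to [sin(2^k * pi * c), cos(2^k * pi * c)] features."""
    features = []
    for coord in point:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


class DeformationMLP:
    """Linear -> ReLU -> Linear, predicting a 3D offset for one view's Gaussians."""

    def __init__(self, in_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        # Assumed zero initialization: the predicted offset starts at zero.
        self.w2 = [[0.0] * hidden_dim for _ in range(3)]
        self.b2 = [0.0] * 3

    def offset(self, features):
        hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


def deform(point, mlp, num_freqs=4):
    """Residual update: new position = old position + predicted offset."""
    delta = mlp.offset(fourier_embed(point, num_freqs))
    return tuple(c + d for c, d in zip(point, delta))
```

In the setting described above, one such network would be instantiated per view (four in total), so that all Gaussians regressed from the same view share a smooth deformation field.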

Gaussian appearance optimization. The deformation network described above is limited to optimizing the positions of the Gaussians. When extending MLPs to optimize other Gaussian properties, such as spherical harmonics, we observe no significant improvement in appearance details. Inspired by ReconFusion (Wu et al., 2024a), we propose to frame the Gaussian appearance enhancement task as an image-conditioned multi-view SDS optimization problem. Our objectives are two-fold: (1) ensuring multi-view consistency across novel camera angles beyond the initial four views and (2) preserving the identity of the edited four views. To achieve this, we define the edited-image conditioned multi-view score function:

$$\nabla_{\phi}\mathcal{L}_{\textrm{SDS}}=\mathbb{E}_{t,\epsilon,o}\left[(\epsilon_{\theta}(\hat{I};t,\mathbf{I}_{e,i},o)-\epsilon)\frac{\partial\hat{I}}{\partial\phi}\right],\ \text{and }i=1,2,3,\text{or }4, \tag{4}$$

where $\hat{I}$ represents the rendered batch images from any four orthogonal views, and $o$ denotes the corresponding camera poses. During each SDS iteration, we randomly render four orthogonal views and randomly select one edited image $\mathbf{I}_{e,i}$ as a condition to compute the SDS loss. The multi-view diffusion model employed is ImageDream (Wang & Shi, 2023), which can be seen as an image-conditioned version of MVDream. This allows it to be seamlessly integrated into our framework. In each iteration, we also compute $\mathcal{L}_{\text{LPIPS}}$. Note that all Gaussian properties are optimized during this process, with densification and pruning operations enabled.
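
The per-iteration sampling described above can be sketched as follows. This is only an illustration of the random choices involved: the function name `sample_sds_iteration` is hypothetical, and we assume that only the azimuth is randomized while the four views stay 90 degrees apart (elevation and camera distance are not modeled).

```python
import random


def sample_sds_iteration(num_edited=4, rng=random):
    """Pick four orthogonal azimuths and one conditioning-image index."""
    start = rng.uniform(0.0, 360.0)
    # Four views spaced 90 degrees apart, starting from a random azimuth.
    azimuths = [(start + 90.0 * k) % 360.0 for k in range(4)]
    # 0-based index of the edited image used as the condition this iteration.
    cond_index = rng.randrange(num_edited)
    return azimuths, cond_index
```

The rendered views at `azimuths` would then be noised and passed to the image-conditioned multi-view diffusion model together with the selected edited image.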

4 Experiments

4.1 Experimental Setup

Implementation Details. We conducted all experiments on a single 48 GB A6000 GPU. For multi-view image dragging, we employed DDIM sampling with 150 steps, applying random Gaussian noise $\mathcal{N}(0,0.01)$ to the background. In the Gaussian deformation stage, we used 4 MLPs, each trained for 2,000 iterations with a learning rate of 0.00001. Each MLP consists of a linear layer, a ReLU activation, and another linear layer arranged in a residual structure. For multi-view SDS optimization, we performed 1,000 iterations, gradually decaying $T_{\mathrm{max}}$ from 0.49 to 0.02.
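
As an illustration of the $T_{\mathrm{max}}$ decay, the sketch below uses a linear schedule over the 1,000 SDS iterations. The linear shape is an assumption (the text only says the value decays gradually), and `t_max_schedule` is a hypothetical helper name.

```python
def t_max_schedule(step, total_steps=1000, t_start=0.49, t_end=0.02):
    """Linearly interpolate the maximum timestep ratio for a given SDS step."""
    if total_steps <= 1:
        return t_end
    # Clamp the step so out-of-range inputs stay on the schedule endpoints.
    clamped = min(max(step, 0), total_steps - 1)
    frac = clamped / (total_steps - 1)
    return t_start + (t_end - t_start) * frac
```

Early iterations can thus sample large noise levels that allow coarse corrections, while later iterations are restricted to small noise levels that mostly refine texture details.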

Datasets. We perform dragging on two of the most popular 3D representations: meshes and 3D Gaussians. For the mesh experiments, we collected 8 meshes from Yoo et al. (2024) and Genie (Luma AI). For the 3D Gaussian experiments, we collected 8 3D Gaussians from Tang et al. (2024a). We collect data that are representative for demonstrating drag editing and do not cherry-pick based on any results. The 3D drag points are manually specified using MeshLab, following Yoo et al. (2024).

Metrics. In this work, we employ two assessment metrics for quantitative evaluation: Dragging Accuracy Index (DAI) (Zhang et al., 2024) and GPTEval3D (Wu et al., 2024b). DAI measures the effectiveness of a method in transferring source content to a target point. While DAI effectively measures drag accuracy, it is insufficient on its own because the editing process sometimes introduces overall distortions or artifacts, resulting in unrealistic or unnatural results. To address this, we use GPTEval3D, which leverages GPT-4V and customizable 3D-aware prompts to offer flexible comparisons between two 3D assets based on a set of specific evaluation criteria. For more details about these metrics, please refer to Sec. A.2.

4.2 Results

Figure 6: 3D dragging results on meshes and 3D Gaussians. The first three rows show the results for the mesh, and the last three rows show the results for the 3D Gaussians. Black dashed circles indicate some detailed differences.

Baselines. One baseline comparison involves leveraging a 2D drag method to edit each view independently. In this setup, we use DiffEditor (Mou et al., 2024) to drag the four rendered views, followed by the same reconstruction and optimization steps as ours to produce the final 3D results. During our initial experiments, we observed that when editing many more than four views, such as 120, DiffEditor introduced significant 2D inconsistencies. Thus, for a fair comparison, we limit the process to four images as in our approach. We also compare our method with APAP, the state-of-the-art drag-based mesh deformation technique. Additionally, we include PhysGaussian (Xie et al., 2024), which enables user control over Gaussian-based dynamics. For this comparison, we start with a 3D model, render four images, reconstruct a 3D Gaussian, and feed it into the PhysGaussian simulator. For the detailed drag setup for PhysGaussian, please refer to Sec. A.3. Note that since the released code of Interactive3D (Dong et al., 2024) cannot be run successfully, we are unable to include it in our comparisons. Conceptually, however, our approach provides a stronger multi-view diffusion prior compared to the SDS loss in Interactive3D, as we can also observe in our comparison with APAP.

Visual Comparisons. We first conduct a visual comparison of the proposed MVDrag3D against baselines, as demonstrated in Fig. 6. The first three rows present results of dragging on meshes, while the last three rows show results on 3D Gaussians. For each method, we render two views to highlight the respective editing results. Taking the wolf model in the first row as an example, we aim to lift its left leg. While APAP deforms the leg, it bends rather than lifts it, resulting in a less realistic motion. In contrast, our method produces an articulation-like motion that is more natural. DiffEditor generates a successful edit in some views but fails in others, leading to inconsistent 3D results. As for PhysGaussian, it relies on predefined physical properties. Since the optimal parameters are unknown, its results exhibit some distortion. Additionally, it is unable to generate new content. For more visual results, please refer to the supplemental video demo.

Table 1: Quantitative comparison with state-of-the-art methods on both meshes and 3D Gaussians. Left side of “/”: Mesh. Right side: 3D Gaussians. $\gamma$ represents the patch radius, which defines the neighborhood around the 2D dragging points. APAP was not tested on 3D Gaussians. In the last column, we report a rough average running time.

| Method | $\gamma=1$ (↓) | $\gamma=3$ (↓) | $\gamma=5$ (↓) | $\gamma=7$ (↓) | $\gamma=10$ (↓) | Time |
| --- | --- | --- | --- | --- | --- | --- |
| APAP | 0.2154 / – | 0.2467 / – | 0.2150 / – | 0.1859 / – | 0.1672 / – | 6 minutes |
| PhysGaussian | 0.1763 / 0.2468 | 0.1887 / 0.2331 | 0.1671 / 0.2153 | 0.1448 / 0.1979 | 0.1296 / 0.1814 | 1 minute |
| DiffEditor | 0.1564 / 0.1722 | 0.1452 / 0.1735 | 0.1348 / 0.1619 | 0.1299 / 0.1486 | 0.1300 / 0.1358 | 6 minutes |
| Ours (LGM) | 0.1153 / 0.1702 | 0.1080 / 0.1588 | 0.0989 / 0.1397 | 0.0890 / 0.1260 | 0.0865 / 0.1130 | 3 minutes |
| Ours + deformation | 0.1121 / 0.1269 | 0.1044 / 0.1150 | 0.0975 / 0.1081 | 0.0908 / 0.1017 | 0.0881 / 0.0937 | 5 minutes |
| Ours + deformation + SDS | 0.1461 / 0.1159 | 0.1292 / 0.1074 | 0.1175 / 0.1020 | 0.1064 / 0.0960 | 0.0994 / 0.0900 | 8 minutes |

Table 2: Evaluation results of GPTEval3D. “Ours + deformation + SDS” performs almost the best across all criteria on both meshes and 3D Gaussians.

Quantitative Comparisons. In addition to the visual comparisons, we conducted a quantitative evaluation to assess the effectiveness of all compared methods in terms of dragging accuracy (DAI) and overall editing quality (GPTEval3D). Table 1 reports the DAI of each method across varying patch radius values $\gamma$. As $\gamma$ increases from 1 to 10, our method, both with and without SDS, shows consistently lower error than other approaches such as APAP, PhysGaussian, and DiffEditor. In Table 2, the GPTEval3D evaluation reveals that the “Ours + deformation + SDS” method performs almost the best across all criteria on both meshes and 3D Gaussians. Notably, the SDS version of our method does not always achieve the best (lowest) DAI score, which is understandable: SDS tends to sharpen visual details, which can slightly worsen the numerical DAI, but it ultimately results in more visually pleasing outputs. This is further supported by the GPTEval3D results, where the SDS version achieves the highest score in texture details.

4.3 Ablation and Discussion

Ablation. We start with the initial reconstruction from LGM (Tang et al., 2024a) as a baseline (Ours (LGM)) and progressively integrate our two-step optimizations: (i) Gaussian position optimization (Ours + deformation), and (ii) image-conditioned multi-view SDS (Ours + deformation + SDS). Table 1 presents a clear comparison of the impact of each stage on both mesh data and 3D Gaussians. Fig. 4 and Fig. 5 also visually demonstrate the effectiveness of our proposed optimization strategy.

Figure 7: Results of dragging on image-conditioned multi-view diffusion model. We extend the dragging stage to ImageDream (Wang & Shi, 2023). The results are less flexible as indicated by black arrows.

Drag on image-conditioned diffusion model. Considering the existence of several image-conditioned multi-view diffusion models, such as ImageDream (Wang & Shi, 2023) and Zero123++ (Shi et al., 2023a), an intuitive idea is to extend the multi-view dragging stage to these models. Here, we specifically extend it to ImageDream. Fig. 7 shows two cases. The conditioning image is the front view of each input. Under this setting, we observe that the results are less visually pleasing. We suspect the reason is that the image condition is too strong, thereby restricting the editing effects. In Mou et al. (2024), the authors introduce the use of both image and text for fine-grained image editing by tuning a new encoder, enabling a more detailed description of the desired changes. We see this as a potential direction for our work, aiming to enhance precision and flexibility in multi-view editing.

5 Conclusion

In this work, we introduce MVDrag3D, a novel paradigm that harnesses the power of multi-view generation-reconstruction priors for creative 3D editing. MVDrag3D first applies a multi-view dragging technique to ensure consistent edits across four orthogonal views. Following this, a reconstruction model generates 3D Gaussians of the edited object. To refine these initial 3D Gaussians, we introduce a deformation network that aligns the Gaussians across different views, complemented by a multi-view score function to enhance visual sharpness and consistency. Extensive experiments showcase the precision, generative capabilities, and flexibility of our method, making it a versatile solution for 3D editing across various object categories and representations.

References

Appendix A Appendix

A.1 Additional Parameters for multi-view dragging

For multi-view image dragging, parameters such as the editing and content energy balance weights $\alpha$ and $\beta$ (see Eq. 2) and the classifier-free guidance (CFG) need to be configured. We leave these as open parameters for users, as the optimal settings may vary depending on the specific edit target.

A.2 Metric explanation

DAI. DAI measures the effectiveness of a method in transferring semantic content to a target point. Specifically, it evaluates whether the content at the source position, denoted as $\bm{p}_{j}$, has been successfully moved to the target location $\bm{q}_{j}$ in the edited 3D object. For each 3D object, the DAI is computed over four views and considers all non-occluded dragging points as follows:

$${\rm DAI}=\dfrac{1}{4}\sum_{i=1}^{4}\sum_{j=1}^{k}\dfrac{\left\|\mathbf{I}_{i}\cdot\mathrm{\Omega}(\bm{p}_{i,j}^{2D},\gamma)-\mathbf{I}_{e,i}\cdot\mathrm{\Omega}(\bm{q}_{i,j}^{2D},\gamma)\right\|_{2}^{2}}{(1+2\gamma)^{2}}, \tag{5}$$

where $\mathrm{\Omega}(\bm{p}_{i,j}^{2D},\gamma)$ represents a patch centered at $\bm{p}_{i,j}^{2D}$ with radius $\gamma$. Eq. 5 calculates the mean squared error between the patch at $\bm{p}_{j}^{2D}$ of $\mathbf{I}$ and the patch at $\bm{q}_{j}^{2D}$ of $\mathbf{I}_{e}$. By adjusting the radius $\gamma$, the metric can focus on different levels of context. A smaller $\gamma$ provides a precise evaluation of differences at the exact control points, while a larger $\gamma$ includes a broader region, allowing for an assessment of the surrounding context. This adaptability makes DAI a flexible tool for examining various aspects of editing quality. Given that the image resolution is $256\times 256$, we set $\gamma\in\{1,3,5,7,10\}$.
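
Eq. 5 can be illustrated with the following sketch for single-channel images stored as nested lists. The helper names `extract_patch` and `dai` are hypothetical, the code averages over however many views are passed in (four in the paper), and it assumes that occluded points have already been removed and that every patch lies fully inside the image.

```python
def extract_patch(image, center, gamma):
    """Return the (2*gamma+1)^2 pixels around center=(row, col) as a flat list."""
    row, col = center
    return [image[r][c]
            for r in range(row - gamma, row + gamma + 1)
            for c in range(col - gamma, col + gamma + 1)]


def dai(originals, edited, sources, targets, gamma):
    """DAI in the spirit of Eq. 5 for single-channel images.

    originals / edited: one 2D list per view.
    sources / targets: per view, a list of (row, col) drag points; occluded
    points are assumed to have been filtered out beforehand.
    """
    total = 0.0
    for img, img_e, pts_p, pts_q in zip(originals, edited, sources, targets):
        for p, q in zip(pts_p, pts_q):
            patch_p = extract_patch(img, p, gamma)    # source patch, original view
            patch_q = extract_patch(img_e, q, gamma)  # target patch, edited view
            sq_err = sum((a - b) ** 2 for a, b in zip(patch_p, patch_q))
            total += sq_err / (1 + 2 * gamma) ** 2
    return total / len(originals)
```

A perfect drag, where the content around the source point reappears unchanged around the target point, yields a DAI of zero; larger values indicate that the content at the target differs from the source content.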

GPTEval3D. While DAI effectively measures drag accuracy, it is not sufficient on its own because the editing process can introduce distortions or artifacts, leading to unrealistic or unnatural results. Therefore, evaluating the naturalness and fidelity of the edited results is crucial for a comprehensive quality assessment. This task is particularly challenging due to the absence of ground-truth edited 3D objects for reference. To address this, we utilize GPTEval3D, which leverages GPT-4V with customizable 3D-aware prompts. GPTEval3D aligns well with human judgment across several dimensions, including text-to-asset alignment, 3D plausibility, texture-geometry coherence, texture details, and geometry details. Specifically, GPTEval3D prompts GPT-4V to compare two 3D assets generated by different methods using four rendered images and normal maps. The pairwise comparisons are then used to calculate Elo ratings, which reflect each method’s performance. For more details, please refer to Wu et al. (2024b).
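
To give an intuition for how pairwise outcomes turn into ratings, the sketch below applies the classical sequential Elo update. It is not necessarily the fitting procedure used by GPTEval3D; the K-factor, the initial rating, and the function names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_from_comparisons(methods, comparisons, k=32.0, initial=1000.0):
    """Sequentially update ratings from a list of (winner, loser) pairs."""
    ratings = {m: initial for m in methods}
    for winner, loser in comparisons:
        exp_win = expected_score(ratings[winner], ratings[loser])
        # The winner scored 1 against an expectation of exp_win; the loser's
        # update is the mirror image, so the total rating is conserved.
        ratings[winner] += k * (1.0 - exp_win)
        ratings[loser] -= k * (1.0 - exp_win)
    return ratings
```

Beating a higher-rated method moves the ratings more than beating a lower-rated one, so a method that wins most of its comparisons across criteria ends up with the highest rating.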

Fig. 8 presents a pairwise comparison example of GPTEval3D on two versions of our method: Ours (LGM) and the full version, Ours + deformation + SDS. The visual results on the left show that Ours (LGM) produces somewhat blurry output with noticeable noise in the geometry, particularly around the tail region. This can be attributed to the lack of optimization provided by the deformation network and SDS in this version. On the right side of the figure, GPT-4V’s judgment aligns with our observations, concluding that the second method, Ours + deformation + SDS, outperforms Ours (LGM) across all five evaluation criteria.

Figure 8: An analysis example of GPTEval3D on two versions of our method: Ours (LGM) and the full version, Ours + deformation + SDS. The left side of the figure shows selected four-view results from both methods, including both the appearance image and the normal map. On the right, GPT-4V’s evaluation is presented, which aligns with human observations. The final line on the right confirms that the second method, Ours + deformation + SDS, outperforms the first, Ours (LGM), across all five evaluation criteria.

A.3 Drag setup for PhysGaussian

In PhysGaussian (Xie et al., 2024), we use the translation function as a proxy for the drag operation. We set the drag starting points as the center points and use the direction from the starting points to the destination points to define the initial velocity. For each dragging point pair, we assign a translation movement, and the simulation continues until either the starting point reaches the destination or the iteration count reaches the set maximum (75 by default).
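
The stopping rule of this proxy can be sketched with a purely kinematic stand-in. This is not the PhysGaussian simulator and involves no physics: `simulate_drag`, the constant `step_size`, and the tolerance are hypothetical, and only the direction definition and the iteration cap of 75 come from the description above.

```python
import math


def simulate_drag(start, target, step_size=0.05, max_iters=75, tol=1e-6):
    """Translate `start` toward `target` until arrival or max_iters steps."""
    position = list(start)
    diff = [t - s for s, t in zip(start, target)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= tol:
        return tuple(position), 0
    # Unit vector from the starting point to the destination point.
    direction = [d / dist for d in diff]
    for iteration in range(1, max_iters + 1):
        remaining = math.sqrt(sum((t - p) ** 2 for p, t in zip(position, target)))
        step = min(step_size, remaining)  # never overshoot the destination
        position = [p + step * d for p, d in zip(position, direction)]
        if remaining - step <= tol:
            return tuple(position), iteration
    return tuple(position), max_iters
```

If the destination is too far away for the chosen step size, the loop stops at the iteration cap and the point ends up only part of the way along the drag direction.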

Figure 9: Effect of different text prompts. When editing images, a text prompt that better aligns with the drag intention can help query more meaningful features from the diffusion model, ultimately leading to more visually pleasing results. Black dashed circles highlight edit differences.

A.4 Running time statistics

The last column of Table 1 also summarizes the rough average running time for each method. APAP, DiffEditor, and the full version of our method are slower than PhysGaussian, Ours (LGM), and “Ours + deformation”, mainly because the latter three do not involve SDS optimization. PhysGaussian runs the fastest since it does not involve any optimization process.

A.5 Text prompt

Interestingly, during our early tests, we observed that the text input serves as a crucial cue for generative editing. As shown in Fig. 9, when dragging the dog’s mouth to open, using a more specific text prompt like “a dachshund with an open mouth” can effectively guide the process. This highlights the significance of prompt design in aligning the diffusion model’s features with the intended edits. In all our experiments, we provide a more detailed text prompt when the drag intention is clear. However, for cases where the intention is less defined, we use a more general description instead.

Figure 10: An example of local identity change. In this example, our goal is to drag the owl suit. Although our method successfully closes the suit, the tie part of the suit changes during the multi-view dragging process, as shown in the dashed circle region.

A.6 Limitations

Despite achieving consistent results, the four-view image editing process sometimes requires significant parameter tuning, highlighting the need for a simpler, more user-friendly multi-view editing tool, akin to InstantDrag (Shin et al., 2024). Additionally, the editing process can occasionally alter the object’s local identity (e.g., the tie of the owl suit in Fig. 10), and achieving more precise local control is non-trivial. Finally, while we use multi-view images as a 3D proxy, dragging points can sometimes become occluded in all views. This limitation motivates future work on training a “pure” 3D generative model to enable more flexible and accurate 3D editing.