Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics

, Chang Yu UCLA , Wenxin Du UCLA , Ying Jiang UCLA , Tianyi Xie UCLA , Yunuo Chen UCLA , Yin Yang University of Utah and Chenfanfu Jiang UCLA

Abstract.

Recent advances in large models have significantly improved image-to-3D reconstruction. However, the generated models are often fused into a single piece, limiting their applicability in downstream tasks. This paper focuses on 3D garment generation, a key area for applications like virtual try-on with dynamic garment animations, which require garments to be separable and simulation-ready. We introduce Dress-1-to-3, a novel pipeline that reconstructs physics-plausible, simulation-ready separated garments with sewing patterns and humans from an in-the-wild image. Starting with the image, our approach combines a pre-trained image-to-sewing pattern generation model for creating coarse sewing patterns with a pre-trained multi-view diffusion model to produce multi-view images. The sewing pattern is further refined using a differentiable garment simulator based on the generated multi-view images. Extensive experiments demonstrate that our optimization approach substantially enhances the geometric alignment of the reconstructed 3D garments and humans with the input image. Furthermore, by integrating a texture generation module and a human motion generation module, we produce customized physics-plausible and realistic dynamic garment demonstrations. Our project page is https://dress-1-to-3.github.io/.



Figure 1. Dress-1-to-3 can reconstruct simulation-ready textured clothed humans from casually posed single-view images.

1. Introduction

Creating digital assets of clothed humans is crucial for a wide range of applications, including virtual reality (VR), the film industry, fashion design, and gaming. However, the traditional pipeline for digital human and garment creation involves multiple intricate steps, such as concept design, material selection, garment modeling, human pose generation, garment fitting, and animation. These processes are often labor-intensive and time-consuming.

In recent years, significant advancements in image-to-3D asset reconstruction have been driven by the development of powerful image and video generation models. Among these, multiview diffusion models [Chen et al., 2024c; Liu et al., 2023a; Gao et al., 2024] have emerged as a promising approach, effectively leveraging multiview images as intermediate representations to capture 3D information. When fine-tuned on human datasets, these models generalize well to avatar reconstructions from in-the-wild images [Li et al., 2024d; He et al., 2024a]. However, the generated results are often fused into a single piece, making them unsuitable for downstream tasks such as garment animation and interaction.

In the meantime, sewing patterns, a foundational representation in the garment design industry, have been adopted as intermediate reconstruction outputs to recover garment geometries [Liu et al., 2023b; Li et al., 2024b]. This representation is particularly advantageous due to its seamless integration with downstream applications such as physics simulation and garment editing. Despite their promise, these feed-forward approaches face significant limitations stemming from the scarcity of high-quality 3D data. As a result, the reconstructed garments are often constrained by the distribution of the training dataset, leading to inaccuracies in aligning with input images. This limitation hinders their ability to produce detailed and diverse reconstructions reflective of real-world garment variations. The question then arises: can we keep the advantages of the simulation-ready representation of sewing patterns while leveraging the powerful priors in large multi-view diffusion models to reconstruct garments from solely an in-the-wild image?

To address this problem, we introduce Dress-1-to-3, a novel garment reconstruction pipeline that accurately transforms an in-the-wild image into a simulation-ready representation of separated human and garment by leveraging the strengths of both 2D multi-view diffusion and 3D sewing pattern reconstruction. To bridge those two parts, we propose a generalized and unified IPC differentiable framework for garment optimization, which enables the optimization of 3D sewing patterns using 2D generative multi-view RGB images and normal maps as guidance. By refining imperfect generative outputs to align with the geometry encoded in multiview images, our approach allows the reconstruction of out-of-distribution garment shapes with high fidelity. Our contributions include:

2.1. Multi-view Diffusion

Owing to their powerful predictive ability, Diffusion Probabilistic Models [Ho et al., 2020] have been applied to image [Nichol et al., 2021; Zhang et al., 2023; Dhariwal and Nichol, 2021; Ruiz et al., 2023; Saharia et al., 2022], video [Chen et al., 2024d; Ho et al., 2022], and 3D shape synthesis tasks [Long et al., 2024; Yu et al., 2024b; Tang et al., 2024]. However, applying image diffusion models to generate multi-view images separately poses significant challenges in maintaining consistency across different views. To address multi-view inconsistency, multi-view attentions and camera pose controls are adopted to fine-tune pre-trained image diffusion models, enabling the simultaneous synthesis of multi-view images [Shi et al., 2024; Wang and Shi, 2023; Xu et al., 2024; Yang et al., 2024; Shi et al., 2023; Long et al., 2024], though these methods might result in compromised geometric consistency due to the lack of inherent 3D biases. To ensure both global semantic consistency and detailed local alignment in multi-view diffusion models, 3D-adapters [Chen et al., 2024a] propose a plug-in module designed to infuse 3D geometry awareness. Nevertheless, the images generated by these models cover only sparse views. To address this issue, CAT3D [Gao et al., 2024] introduces an efficient parallel sampling strategy to generate a large set of camera poses, and MVDiffusion++ [Tang et al., 2025] adopts a pose-free architecture and a view dropout strategy to reduce computational costs, generating dense, high-resolution images.

Generating consistent images from multi-view diffusions offers guidance for further 3D shape reconstruction [Gao et al., 2024]. PSHuman [Li et al., 2024d] integrates a body-face cross-scale diffusion with an SMPL-X conditioned multi-view diffusion for clothed human reconstruction with high-quality face details. Recent work, MagicMan [He et al., 2024a], utilizes a hybrid human-specific multi-view diffusion model with 3D SMPL-X-based body priors and 2D diffusion priors to consistently generate dense multi-view RGB images and normal maps, supporting high-quality human mesh reconstruction. Different from these works, we exploit multi-view diffusions to generate multi-view normals and RGB images as guidance to optimize sewing patterns and stitches instead of human meshes.

2.2. Garment Reconstruction

Previous work focusing on clothed human reconstruction [Xiu et al., 2022, 2023] typically generates garments fused with digital human models, limiting them to basic skinning-based animations and requiring extra segmentation and editing to separate the garments from the human body. In contrast, our approach focuses on reconstructing separately wearable, simulation-ready garments and human models. Other closely related works include Li et al. [2024a] and Yu et al. [2024a], which also generate simulation-ready clothes via differentiable simulation, but at the cost of requiring clothing templates created by artists, precise point clouds captured by scanners, or 3D shapes of garments. NeuralTailor [Korosteleva and Lee, 2022] utilizes point-level attention for pattern shape and stitching information regression, enabling the reconstruction of garment meshes from point clouds. In contrast, our paper focuses on reconstructing non-watertight garments and humans separately from a single image without additional inputs.

To reconstruct separated non-watertight garments from a single image, GarVerseLOD [Luo et al., 2024] recovers garment details hierarchically in a coarse-to-fine framework. However, it fails to reconstruct complex skirts or dresses with slits or with complex human poses due to the limited representation of such features in the training data. ClothWild [Moon et al., 2022] exploits a weakly supervised pipeline with DensePose-based loss to further increase robustness on in-the-wild images. BCNet [Jiang et al., 2020] introduces a layered garment representation and a generic skinning weight generation network to model garments with different topologies. Deep Fashion3D [Zhu et al., 2020] refines adaptable templates with rich annotations to fit garment shapes. Because these works are limited to the garment categories in their training datasets, they fail to reconstruct complex categories such as jumpsuits. Additionally, they require nearly frontal images as input, limiting reconstruction from different views. AnchorUDF [Zhao et al., 2021] explores a learnable unsigned distance function to query both 3D position features and pixel-aligned image features via anchor points, which reconstructs the coarse garment shape but lacks the generation of high-quality geometric details.

Instead of directly reconstructing garment meshes, some works [Liu et al., 2023b; He et al., 2024b; Korosteleva and Sorkine-Hornung, 2023] treat sewing patterns as intermediate representations to generate garments by stitching them together. Recent work, GarmentRecovery [Li et al., 2024b], introduces implicit sewing patterns (ISP) to provide shape priors integrated with deformation priors for further garment recovery, though it builds specialized models for each individual garment or garment type. Both SewFormer [Liu et al., 2023b] and PanelFormer [Chen et al., 2024b] utilize Transformers to predict sewing patterns and stitches. Concurrent works, SewingLDM, AIpparel, Design2GarmentCode, and ChatGarment [Liu et al., 2024; Nakayama et al., 2024; Zhou et al., 2025; Bian et al., 2025], exploit multimodal models to synthesize sewing patterns. However, their garment results lack physical material parameters. Therefore, they fail to reconstruct diverse shapes for garments with different physical materials. While Wang et al. [2018] and Yang et al. [2018] leverage a shared latent space and joint material-pose optimization to generate 3D garments and 2D sewing patterns, their approaches rely heavily on large-scale datasets of garment templates and human-body models, limiting their ability to generalize to out-of-distribution garments and body shapes. Our work aims to generate diverse, image-aligned, simulation-ready garments with high-quality details from in-the-wild images by optimizing sewing patterns and stitches with physical parameters via differentiable simulations.

2.3. Differentiable Simulation

Differentiable simulation has seen widespread application in recent research, particularly for system identification and the inference of material parameters from both synthetic [Li et al., 2023a, 2024c] and real-world [Huang et al., 2024; Si et al., 2024] observations. The scope of exploration spans various domains, including fluid dynamics and control [McNamara et al., 2004; Schenck and Fox, 2018; Li et al., 2023b, 2024c], rigid-body dynamics [Freeman et al., 2021; Strecke and Stueckler, 2021; Xu et al., 2023], articulated systems [Geilinger et al., 2020; Qiao et al., 2021; Xu et al., 2021], soft-body dynamics [Hahn et al., 2019; Hu et al., 2019b; Du et al., 2021; Jatavallabhula et al., 2021; Huang et al., 2024], cloth [Li et al., 2022; Stuyck and Chen, 2023; Li et al., 2024a], inelasticity [Huang et al., 2021; Li et al., 2023a], inflatable structures [Panetta et al., 2021], and Voronoi diagrams [Numerow et al., 2024].

Cloth-based applications, whether for static optimization or dynamic simulation [Santesteban et al., 2022; Grigorev et al., 2023], frequently involve extensive frictional contact. Consequently, many works focus on robust methods for handling dry frictional contact in differentiable simulations. Bartle et al. [2016] proposes a physics-driven pattern adjustment for garment editing using fixed-point optimization, which does not account for gradients. Liang et al. [2019] is the first to introduce a fully functional differentiable cloth simulator with frictional contact and self-collision, formulating a quadratic programming problem. Jatavallabhula et al. [2021] employs a penalty-based frictional contact model, while Du et al. [2021] and Li et al. [2022] leverage the adjoint method for Projective Dynamics [Bouaziz et al., 2014] with friction. Building on Position-Based Dynamics [Müller et al., 2007; Macklin et al., 2016], Stuyck and Chen [2023] and Li et al. [2024a] introduce differentiable formulations for compliant constraint dynamics, and Huang et al. [2024] presents an adjoint-based framework for differentiable Incremental Potential Contact (IPC) [Li et al., 2020, 2021].

The finite difference (FD) method [Renardy and Rogers, 2006] is a standard approach to numerical differentiation. The complex-step finite difference technique [Luo et al., 2019; Shen et al., 2021] offers an alternative that mitigates issues such as subtractive cancellation and accumulated numerical errors by leveraging complex Taylor expansions [Brezillon et al., 1981]. Both can be used to optimize low-DoF systems [Zheng et al., 2025]. Automatic differentiation (AD) [Naumann, 2011; Margossian, 2019] and code transformation libraries like NVIDIA Warp [Macklin, 2022], DiffTaichi [Hu et al., 2019b, a], and others [Herholz et al., 2024] automatically compute gradients based on forward simulation, allowing for greater reuse of existing code. However, they can introduce code constraints, incur a high memory footprint, and may cause gradient explosion if applied naively. Our framework combines NVIDIA Warp’s AD with an adjoint method to achieve both development efficiency and high performance.
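To make the contrast between ordinary finite differences and the complex-step technique concrete, the following is a minimal sketch on a scalar test function. The function `f`, the step sizes, and the evaluation point are illustrative assumptions and are not taken from any of the cited works.

```python
import cmath
import math


def f(z):
    """Illustrative test function; cmath lets it accept real or complex input."""
    return cmath.exp(z) * cmath.sin(z)


def exact_derivative(x):
    """Closed-form derivative of exp(x) * sin(x), used only as a reference."""
    return math.exp(x) * (math.sin(x) + math.cos(x))


def forward_difference(x, h=1e-8):
    """Classic FD: subtracts two nearly equal numbers, so rounding error grows as h shrinks."""
    return (f(x + h).real - f(x).real) / h


def complex_step(x, h=1e-20):
    """Complex-step derivative: Im(f(x + ih)) / h involves no subtraction of close values."""
    return f(complex(x, h)).imag / h


x0 = 1.0
fd_error = abs(forward_difference(x0) - exact_derivative(x0))
cs_error = abs(complex_step(x0) - exact_derivative(x0))
print("forward-difference error:", fd_error)
print("complex-step error:     ", cs_error)
```

Because the complex-step formula has no difference of nearly equal values, the step can be made extremely small without amplifying rounding error. The cost is that the forward computation must support complex arithmetic, and one evaluation is still needed per input parameter, which fits the low-DoF setting mentioned above.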

3. Differentiable Garment Simulation

3.1. Forward Simulation

We use Codimensional Incremental Potential Contact (CIPC) [Li et al., 2021] as our underlying garment simulation method, which is the state-of-the-art in cloth simulation regarding accuracy and robustness. It ensures non-penetration through distance-based log barrier energy and continuous collision detection (CCD). Below, we summarize the simulation pipeline, with further details available in Li et al. [2021].

The simulated codimensional surface is discretized into triangles defined by vertices $\bm{V}$ and faces $\bm{F}$. Let $\bm{X}$ denote the vertex positions in the undeformed state, and let $\bm{x}^{n}$ and $\bm{v}^{n}$ represent the vertex positions and velocities, respectively, at time step $t^{n}$. CIPC employs an optimization-based time integrator to achieve the state transition from time step $t^{n}$ to $t^{n+1}=t^{n}+h$, minimizing the following energy:

$$\bm{x}^{n+1}=\operatorname*{arg\,min}_{\bm{x}}E(\bm{x})=\frac{1}{2}\|\bm{x}-\tilde{\bm{x}}\|_{\bm{M}}^{2}+\Psi(\bm{x};\bm{X})+B(\bm{x}). \tag{1}$$

Here, $\tilde{\bm{x}}=\bm{x}^{n}+\bm{v}^{n}h+\bm{g}h^{2}$ represents the predictive position under backward Euler integration. $\|\cdot\|_{\bm{M}}$ denotes the $L^{2}$-norm weighted by the vertex mass $\bm{M}_{ii}$. $\Psi(\bm{x};\bm{X})$ is the elastic energy, encompassing both stretching and bending energies, depending on the user’s choice. $B(\bm{x})$ is the log barrier energy introduced by IPC, defined over all contacting vertex-triangle and edge-edge pairs. The barrier energy for each pair of primitives increases from zero to infinity as the gap decreases from a threshold $\hat{d}$ to $0$, providing sufficient repulsion to prevent penetrations.

Newton’s method with line search is employed to solve the optimization problem, requiring the analytical computation of the gradient and Hessian matrix of the energy at each iteration. The step size upper bound in each line search is clamped by CCD to ensure that all intermediate states remain intersection-free, provided that $\bm{x}^{n}$ is initially intersection-free. Finally, the new velocity is updated as $\bm{v}^{n+1}=(\bm{x}^{n+1}-\bm{x}^{n})/h$.
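The time step described above can be illustrated with a deliberately tiny sketch: a single particle in one dimension falling toward a floor at height zero. It is not the CIPC implementation. The elastic term $\Psi$ is omitted because a lone particle has nothing to stretch or bend, the barrier stiffness `kappa` is a hypothetical parameter, the barrier form $-(d-\hat{d})^{2}\ln(d/\hat{d})$ is the one commonly associated with IPC and is assumed here, and the "CCD" is just the closed-form bound that keeps the particle above the floor.

```python
import math


def barrier(d, dhat):
    """Assumed IPC-style log barrier: returns (value, first derivative, second derivative).

    It is zero for d >= dhat and grows to +infinity as d -> 0+.
    """
    if d >= dhat:
        return 0.0, 0.0, 0.0
    r = d - dhat
    log_ratio = math.log(d / dhat)
    value = -r * r * log_ratio
    grad = -2.0 * r * log_ratio - r * r / d
    hess = -2.0 * log_ratio - 4.0 * r / d + (r / d) ** 2
    return value, grad, hess


def energy(x, x_tilde, m, kappa, dhat):
    """Toy incremental potential: inertia term plus scaled barrier (no elastic term)."""
    b, db, ddb = barrier(x, dhat)
    value = 0.5 * m * (x - x_tilde) ** 2 + kappa * b
    grad = m * (x - x_tilde) + kappa * db
    hess = m + kappa * ddb
    return value, grad, hess


def step(x_n, v_n, h, m=1.0, g=-9.8, kappa=0.5, dhat=0.01, tol=1e-9, max_iters=50):
    """One optimization-based time step solved by Newton's method with line search."""
    x_tilde = x_n + v_n * h + g * h * h  # predictive position
    x = x_n                              # start from the intersection-free state
    for _ in range(max_iters):
        e, grad, hess = energy(x, x_tilde, m, kappa, dhat)
        if abs(grad) < tol:
            break
        p = -grad / hess  # Newton direction (hess > 0 because the energy is convex)
        alpha = 1.0
        if p < 0.0:
            # Stand-in for CCD: never travel more than 90% of the way to the floor.
            alpha = min(1.0, 0.9 * x / (-p))
        for _ in range(60):  # bounded backtracking until the energy does not increase
            if energy(x + alpha * p, x_tilde, m, kappa, dhat)[0] <= e:
                break
            alpha *= 0.5
        x += alpha * p
    v = (x - x_n) / h  # velocity update
    return x, v


# The particle would end up below the floor without the barrier (x_tilde < 0 here).
x_new, v_new = step(x_n=0.05, v_n=-5.0, h=0.01)
print("new position:", x_new, "new velocity:", v_new)
```

In this toy setting the clamp plays the role of CCD: every trial position stays strictly positive, so the logarithm in the barrier is always well defined and the final state remains above the floor.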

3.2. Differentiable CIPC

Huang et al. [2024] provided an analytical derivation of differentiable IPC using the adjoint method. However, their derivation is closely tied to specific choices of constitutive models. To extend their framework to support cloth simulation, tedious derivations of analytical derivatives are required. In this work, we present a simple and unified framework that leverages both automatic differentiation and the adjoint method.

The governing equation of CIPC simulation can be expressed as an implicit nonlinear system of equations derived from the first-order optimality condition of the minimizer for Equation 1:

$$\bm{G}(\bm{x}^{*};\bm{x}^{n},\bm{v}^{n},\bm{\varsigma}^{n})=\nabla E(\bm{x}^{*};\bm{x}^{n},\bm{v}^{n},\bm{\varsigma}^{n})=\bm{0}, \tag{2}$$

$$\bm{x}^{n+1}=\bm{x}^{*},\quad\bm{v}^{n+1}=\frac{1}{h}(\bm{x}^{*}-\bm{x}^{n}). \tag{3}$$

Here, $\bm{x}^{*}$ is the minimizer of the system energy $E$, $\bm{x}^{n},\bm{v}^{n}$ represents the last system state, and $\bm{\varsigma}^{n}$ denotes the set of all continuous parameters of the implicit equation, including shape parameters $\bm{X}$, mass matrix $\bm{M}$, elastic moduli, and others. We assume $\bm{\varsigma}^{n}$ are independent, although they may share the same values. This abstraction allows the simulator to function as a differentiable layer with $\bm{x}^{n},\bm{v}^{n},\bm{\varsigma}^{n}$ as input and $\bm{x}^{n+1},\bm{v}^{n+1}$ as output. The computational graph can be handled by any auto-differentiable framework such as PyTorch. The backward operator computes $\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}}$, $\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}}$, and $\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}$ given $\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}$ and $\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}$ for a given training loss function $\mathcal{L}$.

Taking the full derivatives of Equation 2 with respect to $\bm{x}^{n},\bm{v}^{n},\bm{\varsigma}^{n}$ on both sides, we obtain:

$$\frac{\partial\bm{G}}{\partial\bm{x}^{*}}\left[\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{x}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{\varsigma}^{n}}\right]+\left[\frac{\partial\bm{G}}{\partial\bm{x}^{n}},\frac{\partial\bm{G}}{\partial\bm{v}^{n}},\frac{\partial\bm{G}}{\partial\bm{\varsigma}^{n}}\right]=\bm{0}, \tag{4}$$

which leads to

$$\left[\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{x}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{\varsigma}^{n}}\right]=-\left[\frac{\partial\bm{G}}{\partial\bm{x}^{*}}\right]^{-1}\left[\frac{\partial\bm{G}}{\partial\bm{x}^{n}},\frac{\partial\bm{G}}{\partial\bm{v}^{n}},\frac{\partial\bm{G}}{\partial\bm{\varsigma}^{n}}\right]. \tag{5}$$
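Equation 5 is an application of the implicit function theorem, and it can be checked numerically on a scalar toy problem. The sketch below is not the paper's simulator: the energy $E(x)=\frac{1}{2}(x-\tilde{x})^{2}+\frac{k}{4}x^{4}$ with unit mass, the stiffness `k` standing in for the parameters $\bm{\varsigma}^{n}$, and all function names are assumptions made for illustration. It compares the implicit sensitivities with central finite differences of the full solve.

```python
def solve(x_n, v_n, k, h=0.01, g=-9.8):
    """Find x* with G(x*) = (x* - x_tilde) + k * x*^3 = 0 using Newton's method."""
    x_tilde = x_n + v_n * h + g * h * h
    x = x_tilde
    for _ in range(50):
        G = (x - x_tilde) + k * x ** 3
        H = 1.0 + 3.0 * k * x * x  # dG/dx, positive for k >= 0
        dx = -G / H
        x += dx
        if abs(dx) < 1e-14:
            break
    return x


def sensitivities(x_n, v_n, k, h=0.01, g=-9.8):
    """Implicit differentiation: (dx*/dx_n, dx*/dv_n, dx*/dk) = -(dG/dx*)^-1 * partials."""
    x = solve(x_n, v_n, k, h, g)
    dG_dx_star = 1.0 + 3.0 * k * x * x
    dG_dxn = -1.0   # G depends on x_n only through x_tilde
    dG_dvn = -h     # d(x_tilde)/d(v_n) = h
    dG_dk = x ** 3  # explicit dependence on the stiffness parameter
    return tuple(-partial / dG_dx_star for partial in (dG_dxn, dG_dvn, dG_dk))


def finite_difference_check(x_n, v_n, k, eps=1e-6):
    """Central differences of the full solve, used only to validate the formula."""
    d_xn = (solve(x_n + eps, v_n, k) - solve(x_n - eps, v_n, k)) / (2.0 * eps)
    d_vn = (solve(x_n, v_n + eps, k) - solve(x_n, v_n - eps, k)) / (2.0 * eps)
    d_k = (solve(x_n, v_n, k + eps) - solve(x_n, v_n, k - eps)) / (2.0 * eps)
    return d_xn, d_vn, d_k


analytic = sensitivities(1.0, 0.0, 2.0)
numeric = finite_difference_check(1.0, 0.0, 2.0)
print("implicit differentiation:", analytic)
print("finite differences:      ", numeric)
```

The implicit route needs only one forward solve plus partial derivatives evaluated at the converged state, whereas the finite-difference check re-runs the solver twice per parameter.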

By the chain rule, we have:

$$\begin{aligned}\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}\right]={}&\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}\left[\frac{\text{d}\bm{x}^{n+1}}{\text{d}\bm{x}^{n}},\frac{\text{d}\bm{x}^{n+1}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{n+1}}{\text{d}\bm{\varsigma}^{n}}\right]\\&+\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}\left[\frac{\text{d}\bm{v}^{n+1}}{\text{d}\bm{x}^{n}},\frac{\text{d}\bm{v}^{n+1}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{v}^{n+1}}{\text{d}\bm{\varsigma}^{n}}\right].\end{aligned} \tag{6}$$

Here, we assume $\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}}$, $\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}}$, and $\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}$ are all row vectors to ensure dimension consistency. From Equation 3, we have:

$$\text{d}\bm{x}^{n+1}=\text{d}\bm{x}^{*},\quad\text{d}\bm{v}^{n+1}=\frac{1}{h}(\text{d}\bm{x}^{*}-\text{d}\bm{x}^{n}). \tag{7}$$

Plugging Equation 7 into Equation 6, we obtain:

$$\begin{aligned}\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}\right]={}&\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}\left[\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{x}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{\varsigma}^{n}}\right]\\&+\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}\left[\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{x}^{n}}-\bm{I},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{\varsigma}^{n}}\right].\end{aligned} \tag{8}$$

With some rearrangements, we arrive at:

(9) $$\begin{split}\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}}&=\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}+\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}\right]\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{x}^{n}}-\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}},\\ \left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}\right]&=\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}+\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}\right]\left[\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{v}^{n}},\frac{\text{d}\bm{x}^{*}}{\text{d}\bm{\varsigma}^{n}}\right].\end{split}$$

Denote $\mathcal{A}=\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n+1}}+\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}}\right]\left[\frac{\partial\bm{G}}{\partial\bm{x}^{*}}\right]^{-1}$. By Equation 5, we have:

(10) $$\frac{\text{d}\mathcal{L}}{\text{d}\bm{x}^{n}}=-\mathcal{A}\frac{\partial\bm{G}}{\partial\bm{x}^{n}}-\frac{1}{h}\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n+1}},$$

(11) $$\left[\frac{\text{d}\mathcal{L}}{\text{d}\bm{v}^{n}},\frac{\text{d}\mathcal{L}}{\text{d}\bm{\varsigma}^{n}}\right]=-\mathcal{A}\left[\frac{\partial\bm{G}}{\partial\bm{v}^{n}},\frac{\partial\bm{G}}{\partial\bm{\varsigma}^{n}}\right].$$

Observe that $\mathcal{A}$ is obtained by solving a linear system, where the coefficient matrix $\frac{\partial\bm{G}}{\partial\bm{x}^{*}}$ is the Hessian matrix of the system energy $E$. The term $\mathcal{A}\left[\frac{\partial\bm{G}}{\partial\bm{x}^{n}},\frac{\partial\bm{G}}{\partial\bm{v}^{n}},\frac{\partial\bm{G}}{\partial\bm{\varsigma}^{n}}\right]$ back-propagates the differentials in $\mathcal{A}$ to $\bm{x}^{n}$, $\bm{v}^{n}$, and $\bm{\varsigma}^{n}$ through $\bm{G}$, respectively. This process can be implemented by treating $\bm{G}$ as a differentiable layer that supports auto-differentiation. Using AutoDiff, we eliminate the need to manually derive the analytical expressions for $\frac{\partial\bm{G}}{\partial\bm{v}^{n}}$ and $\frac{\partial\bm{G}}{\partial\bm{\varsigma}^{n}}$. All other components required for forward simulations have already been derived.
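To make Equations 10 and 11 concrete, the following is a minimal sketch on a toy problem, not the CIPC simulator itself. It assumes a linear elastic energy with a single scalar stiffness `stiff` standing in for $\bm{\varsigma}$, so that the residual is $\bm{G}=\bm{M}(\bm{x}^{*}-\bm{x}^{n}-h\bm{v}^{n})+h^{2}\,\texttt{stiff}\,\bm{K}_0\bm{x}^{*}=\bm{0}$. The names `step`, `backward`, `loss`, `M`, and `K0` are illustrative, and the partial derivatives of $\bm{G}$ are written by hand here, whereas an AutoDiff layer would supply them in practice.

```python
import numpy as np

def step(x_n, v_n, stiff, M, K0, h):
    """One implicit Euler step for the toy residual
    G = M (x* - x_n - h v_n) + h^2 * stiff * K0 x* = 0 (an assumed model)."""
    H = M + h**2 * stiff * K0                 # dG/dx*, Hessian of the toy energy
    x_next = np.linalg.solve(H, M @ (x_n + h * v_n))
    v_next = (x_next - x_n) / h               # velocity update used in the derivation
    return x_next, v_next, H

def backward(g_x, g_v, x_next, H, M, K0, h):
    """Adjoint pass: g_x = dL/dx^{n+1}, g_v = dL/dv^{n+1}."""
    adj = np.linalg.solve(H.T, g_x + g_v / h)  # the row vector A, via one linear solve
    dL_dx = adj @ M - g_v / h                  # Eq. 10 with dG/dx^n = -M
    dL_dv = h * (adj @ M)                      # Eq. 11 with dG/dv^n = -h M
    dL_ds = -adj @ (h**2 * (K0 @ x_next))      # Eq. 11 with dG/dstiff = h^2 K0 x*
    return dL_dx, dL_dv, dL_ds

def loss(x_next, v_next):
    # Hypothetical loss on the next state, chosen only for the demo.
    return 0.5 * x_next @ x_next + 0.5 * v_next @ v_next

rng = np.random.default_rng(0)
n, h, stiff = 4, 0.1, 3.0
M = np.diag(rng.uniform(1.0, 2.0, n))          # lumped mass matrix
B = rng.normal(size=(n, n))
K0 = B @ B.T                                   # symmetric positive semi-definite stiffness
x_n, v_n = rng.normal(size=n), rng.normal(size=n)

x_next, v_next, H = step(x_n, v_n, stiff, M, K0, h)
# For this loss, dL/dx^{n+1} = x_next and dL/dv^{n+1} = v_next.
dL_dx, dL_dv, dL_ds = backward(x_next, v_next, x_next, H, M, K0, h)
```

The backward pass costs a single linear solve with the same Hessian used in the forward step, which is the point of introducing $\mathcal{A}$.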


Figure 2. Dress-1-to-3 Pipeline. Starting with a single-view input image of a clothed human, we first derive an initial estimation of the sewing pattern. Additionally, we employ multi-view diffusion to generate orbital camera views, which serve as ground-truth 3D information for both human pose and garment shape. Next, we utilize differentiable simulation to sew and drape the pattern onto the posed human model, optimizing its shape and physical parameters in conjunction with geometric regularizers. Finally, the optimized garment shape provides a physically plausible rest shape in its static state and is readily animatable using a physical simulator.

4. Method Overview

We start our pipeline by estimating an initial garment sewing pattern from a single-view image. Next, we generate consistent multi-view RGB images and their corresponding normal maps, based on which we predict the human body pose. The 3D garment is initialized by stitching and draping the 2D patterns onto the predicted human model. The garment’s interaction with the human body is simulated using a differentiable CIPC simulator, allowing us to optimize the physical parameters and the shapes of the sewing patterns guided by the previously generated multi-view RGB images, normal maps, and segmentation results. The optimized state produces a simulation-ready scene with a human model wearing well-fitted 3D outfits that align with the input. Garment textures are automatically generated using a visual-language model and image diffusion. Finally, by applying our CIPC simulator, we can simulate dynamic scenes where the predicted human body wears the optimized garments while performing various motion sequences. An illustration of the pipeline is shown in Figure 2. We elaborate on each component of the pipeline in the following sections.

5. Pre-Optimization Steps

5.1. Simulatable Sewing Pattern Generation

From a single-view image, our pipeline starts by generating an initial sewing pattern decomposition along with stitch information using SewFormer [Liu et al., 2023b]. Following SewFormer’s convention, the sewing pattern is represented as a set of quadratic Bézier curves on a 2D plane, forming a collection of disconnected patches. The curves for each patch are connected to form a loop. Let $\mathcal{E}$ denote the set of all curves, with its control parameters comprising the set of curve vertices $\mathcal{P}=\{\bm{P}_{i}\}$ and the set of control points $\mathcal{K}=\{\bm{K}^{e}\}$ for each edge curve $e\in\mathcal{E}$. To enable garment simulation, the patches are discretized into triangle meshes. First, we apply arc-length parameterization to achieve uniform sampling along the patch boundaries. For stitched patch edges, we ensure they share the same number of sampled points. This consistency allows us to apply vertex-to-vertex stitch constraints in garment simulations, simplifying the sewing process. Using the sampled boundary points, we then perform Delaunay triangulation [Shewchuk, 2008] independently for the interior of each patch.
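The boundary sampling step can be sketched as follows. This is an illustration under assumptions rather than the pipeline's implementation: the arc length is approximated with a dense polyline and inverted by linear interpolation, and the helper names `bezier` and `arc_length_params` are hypothetical. Two stitched edges are sampled with the same point count so that their samples can be paired vertex to vertex.

```python
import numpy as np

def bezier(P0, K, P1, t):
    """Evaluate a quadratic Bezier curve at parameters t (array of shape (m,))."""
    t = np.asarray(t, dtype=float)[:, None]
    return (1 - t) ** 2 * P0 + 2 * (1 - t) * t * K + t ** 2 * P1

def arc_length_params(P0, K, P1, n, dense=2001):
    """Return n parameters t_i whose curve points are (approximately)
    equally spaced in arc length, endpoints included."""
    ts = np.linspace(0.0, 1.0, dense)
    pts = bezier(P0, K, P1, ts)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative polyline length
    targets = np.linspace(0.0, s[-1], n)          # equally spaced arc lengths
    return np.interp(targets, s, ts)              # invert s(t) by interpolation

# Two stitched edges of different shape, sampled with the same count.
P0, K, P1 = np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 0.0])
Q0, Kq, Q1 = np.array([5.0, 0.0]), np.array([5.5, 1.0]), np.array([7.0, 0.5])
n_samples = 9
t_a = arc_length_params(P0, K, P1, n_samples)
t_b = arc_length_params(Q0, Kq, Q1, n_samples)
V_a = bezier(P0, K, P1, t_a)   # samples on the first edge
V_b = bezier(Q0, Kq, Q1, t_b)  # samples on its stitched partner
```

Because both edges yield `n_samples` points, sample `i` on one edge can be stitched directly to sample `i` (or the reversed index, depending on edge orientation) on the other.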

[Uncaptioned image]

5.1.1. Patch Symmetrization

The sewing patterns generated by SewFormer often display certain symmetries, which we aim to preserve during garment optimization. SewFormer generates a fixed number of patches with a predefined order for patch names, though some patches may remain inactive. Symmetry information, including self-symmetry and inter-symmetry, is embedded in these patch names. Symmetric edge pairs can be automatically identified by overlapping a patch with its mirrored symmetric counterpart or, in the case of self-symmetry, with the mirrored version of itself. Given the set of symmetric edge pairs $\mathcal{E}_{S}=\{(i,j)\sim(k,l)\}$, we define the validated curve vertices $\{\hat{\bm{P}}_{i}\}$ of the patches prior to triangulation by solving the following quadratic optimization problem:

(12) $$\min_{\{\hat{\bm{P}}_{i}\}}\sum_{(i,j)\sim(k,l)}\left\|(\hat{\bm{P}}_{i}-\hat{\bm{P}}_{j})+\bm{R}_{S}(\hat{\bm{P}}_{k}-\hat{\bm{P}}_{l})\right\|^{2}_{2}+\epsilon\sum_{i}\left\|\hat{\bm{P}}_{i}-\bm{P}_{i}\right\|^{2}_{2}.$$

Here, $\bm{R}_{S}=\begin{bmatrix}-1&0\\ 0&1\end{bmatrix}$ represents the flip matrix, assuming the symmetry axis is vertical. This optimization involves solving a fixed-coefficient, positive definite linear system, which ensures differentiability. The validated edge control points $\{\hat{\bm{K}}^{e}\}$ are computed analytically by symmetrizing their relative coordinates. The symmetrization constraints are illustrated in the inset figure. Throughout this paper, we omit the hat notation for validated vertices and control points, as all computations are based on the symmetrized patches. However, it is important to note that the underlying garment optimization variables retain the original, non-symmetry-enforced geometry parameters.
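As a small illustration of why Equation 12 reduces to a fixed-coefficient, positive definite linear system, the sketch below stacks the symmetry residuals into a matrix $\bm{C}$ and solves the normal equations $(\bm{C}^{\top}\bm{C}+\epsilon\bm{I})\,\hat{\bm{p}}=\epsilon\,\bm{p}$. The function name `symmetrize`, the toy vertex data, and the value of `eps` are assumptions made for the example.

```python
import numpy as np

R_S = np.array([[-1.0, 0.0], [0.0, 1.0]])  # flip about a vertical symmetry axis

def symmetrize(P, pairs, eps=1e-6):
    """Minimize sum ||(P_i - P_j) + R_S (P_k - P_l)||^2 + eps * sum ||P_hat - P||^2.

    P: (n, 2) original curve vertices; pairs: list of ((i, j), (k, l))."""
    n = len(P)
    C = np.zeros((2 * len(pairs), 2 * n))
    I2 = np.eye(2)
    for r, ((i, j), (k, l)) in enumerate(pairs):
        rows = slice(2 * r, 2 * r + 2)
        C[rows, 2 * i:2 * i + 2] += I2
        C[rows, 2 * j:2 * j + 2] -= I2
        C[rows, 2 * k:2 * k + 2] += R_S
        C[rows, 2 * l:2 * l + 2] -= R_S
    # The coefficient matrix depends only on the pairing, not on P,
    # and eps > 0 makes it positive definite.
    A = C.T @ C + eps * np.eye(2 * n)
    p_hat = np.linalg.solve(A, eps * P.reshape(-1))
    return p_hat.reshape(n, 2)

# Edge (0, 1) on a left patch and its slightly asymmetric mirror edge (2, 3).
P = np.array([[-2.0, 0.0], [-1.0, 1.0], [1.1, 0.9], [2.05, 0.1]])
pairs = [((0, 1), (2, 3))]
P_hat = symmetrize(P, pairs)
residual = (P_hat[0] - P_hat[1]) + R_S @ (P_hat[2] - P_hat[3])
```

Since the solution is a linear function of `P` with a constant matrix, gradients with respect to the original vertices pass through the same solve.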

5.1.2. Sewing Pattern Discretization

To enable direct optimization of Bézier curves, we make the sampling from boundary curve parameters to mesh vertices differentiable. Both boundary sampling and interior sampling are conceptualized as fixed-coordinate sampling based on their control points. Each boundary edge curve $e\in\mathcal{E}$ is defined by the starting vertex $\bm{P}_{0}^{e}$, the control point $\bm{K}^{e}$, and the endpoint $\bm{P}_{1}^{e}$ (which is also the starting point of the next edge). The curve can be differentiably parameterized as $\bm{P}^{e}(t)=(1-t)^{2}\bm{P}_{0}^{e}+2(1-t)t\bm{K}^{e}+t^{2}\bm{P}_{1}^{e}$. Uniform sampling along the curve in terms of arc length is represented as a set of parameters $\{t_{1}^{e},\ldots,t_{n_{e}}^{e}\}$, with $\bm{V}^{e}_{i}=\bm{P}^{e}(t_{i}^{e})$ being the sampled points. The number of sampled points $n_{e}$ may vary for different edges. After independent triangulation for each patch, we compute the harmonic coordinate matrix $\bm{H}\in\mathbb{R}^{n_{I}\times n_{B}}$ [Joshi et al., 2007] for all the interior points, where $n_{I}$ is the number of interior vertices and $n_{B}$ is the total number of boundary vertices. With a slight abuse of notation, we reparameterize the $j$-th interior vertex as $\bm{V}_{j}^{I}=\sum_{i}\bm{H}_{ji}\bm{V}^{B}_{i}$, with $\bm{H}_{ji}$ denoting its harmonic weight relative to the $i$-th boundary point $\bm{V}_{i}^{B}$. Here $\bm{H}_{ji}$ is zero if $\bm{V}_{j}^{I}$ and $\bm{V}^{B}_{i}$ do not belong to the same patch. During backpropagation, we fix the boundary sampling coordinates $\bigcup_{e\in\mathcal{E},\,i\leq n_{e}}\{t_{i}^{e}\}$ and the interior harmonic coordinate matrix $\bm{H}$, so that the triangulation is analytically determined by the original parameters of the Bézier curves. These coordinates are updated only after remeshing is performed, which will be discussed in the garment optimization section.
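The interior reparameterization can be illustrated with a minimal sketch. It makes a simplifying assumption that is not taken from the paper: harmonic weights are computed from a uniform-weight graph Laplacian on a small regular grid, rather than from the actual patch triangulation. The function `harmonic_matrix` and the grid setup are hypothetical.

```python
import numpy as np

def harmonic_matrix(n_vertices, edges, boundary, interior):
    """Return H of shape (n_I, n_B) such that V_I = H @ V_B solves the
    discrete Laplace equation with the boundary positions held fixed."""
    L = np.zeros((n_vertices, n_vertices))
    for a, b in edges:                      # uniform edge weights (an assumption)
        L[a, a] += 1.0
        L[b, b] += 1.0
        L[a, b] -= 1.0
        L[b, a] -= 1.0
    L_II = L[np.ix_(interior, interior)]
    L_IB = L[np.ix_(interior, boundary)]
    return np.linalg.solve(L_II, -L_IB)

# A 4 x 4 grid standing in for one patch: 12 boundary and 4 interior vertices.
side = 4
idx = lambda r, c: r * side + c
V = np.array([[c, r] for r in range(side) for c in range(side)], dtype=float)
edges = [(idx(r, c), idx(r, c + 1)) for r in range(side) for c in range(side - 1)]
edges += [(idx(r, c), idx(r + 1, c)) for r in range(side - 1) for c in range(side)]
boundary = [idx(r, c) for r in range(side) for c in range(side)
            if r in (0, side - 1) or c in (0, side - 1)]
interior = [i for i in range(side * side) if i not in boundary]

H = harmonic_matrix(side * side, edges, boundary, interior)

# Keep H fixed, move the boundary (stretch in x); the interior follows linearly.
V_B_new = V[boundary] * np.array([2.0, 1.0])
V_I_new = H @ V_B_new
```

With `H` and the boundary parameters held fixed, every mesh vertex is a linear function of the Bézier control parameters, so gradients on mesh vertices map back to the curves through two constant matrices.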

5.2. Multi-view Image Generation

Given a single-view image of a full-body clothed human, we generate a set of multi-view RGB images and normal maps under orbital camera views using a pre-trained multi-view diffusion model, MagicMan [He et al., 2024a]. These multi-view images of the clothed human are treated as ground truth data for human pose and garment shape in the subsequent reconstruction steps.

5.3. Human Body Reconstruction

The generated garment is statically draped on a fixed human mesh. To reduce the gap between the reconstructed garment and the image, an accurate human body is required to correctly support the garment. We use SMPL-X [Pavlakos et al., 2019] as our parameterized human model. First, we apply OSX [Lin et al., 2023] to the input single-view image to obtain an initial pose estimation $\bm{\theta}$ and shape estimation $\bm{\beta}$. This initial estimation typically does not perfectly align with other views, and the scaling and rotation are inconsistent across the multi-view images. Subsequently, we fine-tune the pose based on multi-view images using a coarse-to-fine strategy.

In the coarse stage, we estimate joint landmarks on the images using DWPose [Yang et al., 2023]. Here, we optimize only the global scaling $S$ and rotation $\bm{R}$ of the SMPL-X model based on the following landmark loss:

(13) $$\mathcal{L}_{\text{Land}}^{\text{P}}=\frac{1}{|\Omega|}\sum_{i}\left\|\bm{w}_{i}\cdot\left(\operatorname{Proj}(\bm{J}(S,\bm{R},\bm{\theta},\bm{\beta});\Omega_{i})-\bar{\bm{J}}_{i}\right)\right\|^{2}_{2},$$

where $\Omega=\{\Omega_{i}\}$ represents the set of camera parameters, $\bm{J}$ is the 3D joint location map provided by the SMPL-X model, $\operatorname{Proj}$ is the projection operator from world space to screen space, $\bar{\bm{J}}_{i}$ is the 2D joint location estimated by DWPose, and $\bm{w}_{i}$ is the per-landmark confidence score of the estimation. We use $\|\cdot\|^{2}_{2}$ to denote the mean square error (MSE). This optimization essentially estimates the model-to-world matrix of the SMPL-X model. To further refine pose and shape parameters, in the fine stage, we additionally incorporate the following RGB loss and mask loss:

(14) $$\mathcal{L}_{\text{RGB}}^{\text{P}}=\frac{1}{|\Omega|}\sum_{i}\left\|(\bm{M}^{o}_{i})^{c}\cdot\left(\bm{I}(S,\bm{R},\bm{\theta},\bm{\beta},\bm{C}_{H};\Omega_{i})-\bar{\bm{I}}_{i}\right)\right\|_{1},$$

(15) $$\mathcal{L}_{\text{Mask}}^{\text{P}}=\frac{1}{|\Omega|}\sum_{i}\left\|(\bm{M}^{o}_{i})^{c}\cdot\left(\bm{M}(S,\bm{R},\bm{\theta},\bm{\beta};\Omega_{i})-\bar{\bm{M}}_{i}\right)\right\|_{1}.$$

Here, $\bm{C}_{H}$ represents the optimizable human body vertex color, while $\bm{I}(\cdot)$ and $\bm{M}(\cdot)$ denote the posed human body RGB rendering process and contour rendering process under camera view $\Omega_{i}$, implemented using Nvdiffrast [Laine et al., 2020]. $\bar{\bm{I}}_{i}$ and $\bar{\bm{M}}_{i}$ are the generated multi-view RGB images and masks, respectively. $\bm{M}^{o}_{i}$ represents the occluded region of the human body, which includes the garment region $\bm{M}^{\beta}_{i}$ and other non-garment occlusions $\bm{M}^{\alpha}_{i}$ (such as footwear, accessories, and hair). These regions are generated using SegFormer [Xie et al., 2021]. The notation $(\cdot)^{c}$ denotes the complement of the specified region. We use $\|\cdot\|_{1}$ to denote the mean absolute error (MAE). By excluding the loss computation in the occluded region, we can accommodate loosely fitted garments. In summary, we optimize using the following loss in the fine stage:

(16) $$\mathcal{L}^{\text{P}}(S,\bm{R},\bm{\theta},\bm{\beta},\bm{C}_{H})=\mathcal{L}^{\text{P}}_{\text{RGB}}+\mathcal{L}^{\text{P}}_{\text{Mask}}+\lambda_{1}\mathcal{L}^{\text{P}}_{\text{Land}}+\lambda_{2}\|\bm{\theta}-\bm{\theta}_{0}\|_{1}+\lambda_{3}\|\bm{\beta}-\bm{\beta}_{0}\|_{1}.$$

Here, we also regularize the pose and shape parameters, where $\bm{\theta}_0$ and $\bm{\beta}_0$ are their initial estimates provided by OSX.

5.4. Garment Initialization

The generated sewing patterns are positioned near the human body and sewn together to be dressed. SewFormer provides an initial placement around the T-posed SMPL-X model. To ensure proper layering, we adopt a bottom-to-top strategy for fitting the entire set of garments onto the human body, allowing the top garments to overlay the bottom ones. Connected components are identified by treating stitched vertices as connected. These components are sorted vertically and sequentially fitted from bottom to top through simulations using CIPC. After completing the T-pose fitting, the human body is interpolated from the T-pose to the reconstructed pose, and the entire cloth-human interaction is simulated by treating the human in motion as a moving boundary condition. To secure the bottom garments and prevent them from slipping during pose interpolation, we shrink the rest shape of the triangles near the waist to generate sufficient friction.
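The component identification and bottom-to-top ordering described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function name `garment_components`, the union-find helper, and the use of the mean vertex height as the sorting key are assumptions introduced here.

```python
from collections import defaultdict


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def garment_components(num_vertices, edges, stitches, heights):
    """Group vertices into connected components and sort them bottom to top.

    edges    -- mesh edges as (a, b) vertex index pairs
    stitches -- stitched vertex pairs, treated as connected (as in the text)
    heights  -- per-vertex vertical coordinate (hypothetical sorting key)
    """
    parent = list(range(num_vertices))
    for a, b in list(edges) + list(stitches):
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for v in range(num_vertices):
        groups[find(parent, v)].append(v)

    components = list(groups.values())
    # Lowest component first, so that it is fitted before the ones above it.
    components.sort(key=lambda c: sum(heights[v] for v in c) / len(c))
    return components
```

Each returned component would then be draped in order, so that upper garments are simulated on top of the already fitted lower ones.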

6. Garment Optimization

6.1. Optimization Overview

In the garment optimization phase, we iteratively fine-tune the parameters of the sewing pattern so that the statically draped garments on a posed human body match the generated multi-view images in all views. We optimize the curve vertex set $\mathcal{P}$ and the control point set $\mathcal{K}$ of the Bézier curves using differentiable CIPC simulation based on the generated multi-view images. To further leverage RGB information for assisting the optimization, we also optimize the vertex colors $\bm{C}_G$ of the discretized garment mesh for RGB renderings. Additionally, we optimize the global stretching stiffness $\kappa_s$ and the global bending stiffness $\kappa_b$ to automatically discover a set of physical parameters that align with the 2D observations.

For each optimization iteration, we use CIPC simulation to statically drape the garment onto the fixed-posed human body mesh. Leveraging the robustness of CIPC, we simulate a single time step of 1 second to directly reach near-static equilibrium. Since the static equilibrium does not locally depend on the initial state, meaning that the Jacobian matrix of the simulated state with respect to the initial state is zero, we update the initial state of iteration $n$, $\bm{x}_0^n$, to the previously simulated state:

$$\bm{x}_0^n = \operatorname{Sim}\big(\bm{x}_0^{n-1}; \bm{\varsigma}(\kappa_s, \kappa_b, \mathcal{P}, \mathcal{K})\big). \tag{17}$$

Here, $\operatorname{Sim}$ represents the simulation process described in Section 3. The initial state, $\bm{x}_0^0$, is obtained from the initial garment fitting described in Section 5.4. $\bm{\varsigma}(\kappa_s, \kappa_b, \mathcal{P}, \mathcal{K})$ denotes the simulation rest shape data, including nodal mass, per-stencil elastic stiffness, undistorted material space, and similar properties. To make the simulation as path-independent as possible, we avoid adding friction during the process. To prevent the bottom garments from slipping down, the boundary loop of the bottom component near the waist area is fixed.

In summary, we solve the following optimization problem:

$$\min \mathcal{L}(\mathcal{P}, \mathcal{K}, \kappa_s, \kappa_b; \bm{x}, \bm{x}_0), \tag{18}$$

where $\bm{x}$ represents the simulated state starting from the initial state $\bm{x}_0$, which is iteratively updated to the previously simulated state. We elaborate on the training losses in $\mathcal{L}$ in the following sections. We observe that the edge curvatures $\mathcal{K}$ are more sensitive than the vertex positions $\mathcal{P}$. Therefore, we employ a two-stage training approach in which $\mathcal{K}$ is frozen during the first stage.
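The schedule above (warm-starting the initial state from the previous equilibrium, and freezing the curvature-like parameters in the first stage) can be illustrated on a toy problem. Everything in this sketch is a stand-in introduced for illustration: `toy_sim` is a scalar closed-form "equilibrium" rather than CIPC, `p` and `k` play the roles of $\mathcal{P}$ and $\mathcal{K}$, and the gradients are written by hand instead of being obtained from a differentiable simulator.

```python
def toy_sim(x0, p, k):
    """Toy stand-in for Sim: the equilibrium depends only on the parameters.

    The warm-start state x0 is accepted but unused, mirroring the statement
    that the static equilibrium does not locally depend on the initial state.
    """
    return p + 0.5 * k


def optimize(target, p=0.0, k=0.0, x0=0.0, lr=0.2,
             stage1_iters=20, stage2_iters=20):
    """Two-stage gradient descent on (x - target)^2 with warm starts."""
    for it in range(stage1_iters + stage2_iters):
        x = toy_sim(x0, p, k)
        residual = x - target
        grad_p = 2.0 * residual * 1.0   # d(loss)/dp, since dx/dp = 1
        grad_k = 2.0 * residual * 0.5   # d(loss)/dk, since dx/dk = 0.5
        p -= lr * grad_p
        if it >= stage1_iters:          # k stays frozen during stage one
            k -= lr * grad_k
        x0 = x                          # warm start for the next iteration
    return p, k, x0
```

In the toy setting, `k` only starts to move once the first stage has brought `p` close to a fit, which is the intent of freezing the more sensitive parameters early on.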

6.2. Rendering Losses

6.2.1. Garment Mask Loss

The dominant rendering loss we employ is the garment mask loss. Given the multi-view ground-truth images, we use SegFormer [Xie et al., 2021] to segment top, bottom, and dress garment masks, assigning each component a distinct color. The mask loss is defined as follows:

$$\mathcal{L}_{\text{Mask}} = \frac{1}{|\Omega|} \sum_i \left\| (\bm{M}^\alpha_i)^c \cdot \big(\bm{M}(\bm{x}; \bm{C}_C, \Omega_i) - \bar{\bm{M}}_i\big) \right\|_1. \tag{19}$$

Here, $\bm{x}$ represents the simulated state of garments draped over the human body. $\bm{C}_C$ denotes the component color, which is discussed in the following section. The rendered colored mask $\bm{M}(\bm{x}; \bm{C}_C, \Omega_i)$ is obtained by assigning $\bm{C}_C$ to the corresponding garment vertices and setting the human body to black, ensuring that only the non-occluded parts of the garments are rendered. $\bar{\bm{M}}_i$ is the set of colored garment masks generated from multi-view RGB images. We also exclude the loss computation in the occluded regions $\bm{M}^\alpha_i$ caused by hair and accessories to avoid incorrect mask guidance.
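As a concrete reading of the mask loss, the following NumPy sketch evaluates the masked mean absolute error on already rendered images. It assumes the renderings are given as arrays of shape (H, W, 3) and the occluder masks $\bm{M}^\alpha_i$ as binary arrays of shape (H, W); the rendering itself and the function names are outside the source and purely illustrative.

```python
import numpy as np


def masked_l1(rendered, target, keep):
    """Mean absolute error of keep * (rendered - target) over pixels and channels."""
    diff = keep[..., None] * (rendered - target)
    return np.abs(diff).mean()


def mask_loss(rendered_views, target_views, occluder_masks):
    """Average the masked MAE over camera views.

    The complement (1 - M_alpha) removes pixels covered by hair, footwear,
    or accessories from the loss.
    """
    losses = [
        masked_l1(r, t, 1.0 - m_alpha)
        for r, t, m_alpha in zip(rendered_views, target_views, occluder_masks)
    ]
    return float(np.mean(losses))
```

Pixels inside an occluder mask contribute zero residual, so a mismatch hidden behind hair or accessories does not pull the garment boundary.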

Initialization of Component Colors

The component color $\bm{C}_C$ is automatically assigned prior to garment optimization. SewFormer typically predicts garments with one or two connected components. We vertically sort the sewn garment components and the 2D mask regions from the first camera view. The component colors are then assigned accordingly. If only one component is predicted but multiple garment masks are present, we adjust the multi-view garment masks to use a single color.

6.2.2. RGB and Normal Rendering Loss

We also utilize RGB and normal rendering losses to improve garment optimization. These losses are introduced to stabilize the training process, as the gradient of the mask rendering loss within the interior regions of the garment is zero. They are formulated similarly to the mask rendering loss:

$$\mathcal{L}_{\text{RGB}} = \frac{1}{|\Omega|} \sum_i \left\| \bm{M}^\beta_i \cdot \big(\bm{I}(\bm{x}; \bm{C}_G, \Omega_i) - \bar{\bm{I}}_i\big) \right\|_1, \tag{20}$$

$$\mathcal{L}_{\text{Normal}} = \frac{1}{|\Omega|} \sum_i \left\| \bm{M}^\beta_i \cdot \big(\bm{N}(\bm{x}; \Omega_i) - \bar{\bm{N}}_i\big) \right\|_1. \tag{21}$$

Here, $\bm{I}(\bm{x}; \bm{C}_G, \Omega_i)$ represents the garment RGB rendering of the vertex color $\bm{C}_G$ under the camera view $\Omega_i$, and $\bm{N}(\bm{x}; \Omega_i)$ denotes the corresponding normal map rendering. The sets $\{\bar{\bm{I}}_i\}$ and $\{\bar{\bm{N}}_i\}$ are the multi-view RGB and normal images generated by the multi-view diffusion process. The loss computation is restricted to the garment regions $\bm{M}^\beta_i$.

6.3. Geometric Regularizers

The sewing pattern optimization under rendering losses alone is ill-posed because, for the same sewn 3D garment mesh, there are infinitely many ways to decompose the mesh into flattened patches. Therefore, we incorporate several geometric losses to regularize the sewing pattern optimization.

6.3.1. Area Ratio Loss

We use the following area ratio loss to preserve the relative area of each patch with respect to the connected component it belongs to:

$$\mathcal{L}_{\text{AR}} = \frac{1}{N_P} \sum_p \left( \frac{\bar{A}_p(\bm{X})}{\bar{A}_p(\bm{X}^0)} - 1 \right)^2, \tag{22}$$

where $N_P$ is the number of garment patches, and $\bar{A}_p$ is the operator that computes the ratio between the area of the $p$-th patch and the area of the component it belongs to. $\bm{X}$ represents the current 2D discretization of the garment patches, and $\bm{X}^0$ denotes the initial sampling.
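A small NumPy sketch of the area ratio loss is given below. It assumes each patch is represented by its ordered 2D boundary polygon and uses the shoelace formula for the area; the actual pipeline operates on the triangulated patches, so the representation and the names `polygon_area`, `area_ratio_loss`, and `component_of` are illustrative assumptions.

```python
import numpy as np


def polygon_area(pts):
    """Unsigned area of a simple 2D polygon via the shoelace formula."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


def area_ratio_loss(patches, patches0, component_of):
    """Mean squared relative change of each patch's share of its component area.

    patches, patches0 -- lists of (n_i, 2) boundary polygons (current, initial)
    component_of      -- component index of each patch
    """
    comp = np.asarray(component_of)

    def ratios(polys):
        areas = np.array([polygon_area(p) for p in polys])
        totals = np.zeros(comp.max() + 1)
        np.add.at(totals, comp, areas)   # sum of patch areas per component
        return areas / totals[comp]

    r, r0 = ratios(patches), ratios(patches0)
    return float(np.mean((r / r0 - 1.0) ** 2))
```

Because only ratios enter the loss, uniformly scaling all patches of a component leaves it at zero; the loss reacts only when one patch grows or shrinks relative to its neighbors.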

6.3.2. Corner Regularizers

[Uncaptioned image]

Boundary Corner Regularizer

For the boundary loops of garment components, we identify all corner vertices of the original Bézier curves. At these corners, where two patches are typically sewn together, we apply the following boundary corner regularizer to penalize deviations of corner angles from right angles, as illustrated in the inset figure:

$$\mathcal{L}_{\text{BC}} = \frac{1}{N_{BC}} \sum_c \left(1 - \bm{d}^c_1 \times \bm{d}^c_2\right). \tag{23}$$

Here, $N_{BC}$ represents the total number of boundary corners, and $\bm{d}^c_1$ and $\bm{d}^c_2$ denote two consecutive unit tangent vectors at corner $c$.
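The boundary corner regularizer can be evaluated with a 2D scalar cross product, as in the sketch below. It assumes the tangents are given as arrays of shape (N, 2) and that the boundary is traversed so that a right-angle corner yields a cross product of +1; the orientation convention and the helper names are assumptions, since the source does not state them.

```python
import numpy as np


def cross2(a, b):
    """Scalar 2D cross product a_x * b_y - a_y * b_x, applied row-wise."""
    return a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]


def boundary_corner_loss(d1, d2):
    """Mean of 1 - d1 x d2 over corners, with tangents normalized to unit length."""
    d1 = d1 / np.linalg.norm(d1, axis=-1, keepdims=True)
    d2 = d2 / np.linalg.norm(d2, axis=-1, keepdims=True)
    return float(np.mean(1.0 - cross2(d1, d2)))
```

For unit tangents the cross product equals the sine of the signed corner angle, so each term vanishes at a right angle and grows as the corner flattens or closes.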

Small-Angle Corner Regularizers

Small angles at patch corners can introduce instabilities into optimization and simulation; thus, we use the following regularizer to penalize such angles:

$$\mathcal{L}_{\text{SAC}} = -\frac{1}{N_C} \sum_c s_c(\bm{X})\, \widehat{(\bm{V}_1^c - \bm{V}_0^c)} \times \widehat{(\bm{V}_2^c - \bm{V}_0^c)}. \tag{24}$$

[Uncaptioned image]

Here, $N_C$ is the number of patch corners, $(\bm{V}_1^c, \bm{V}_0^c, \bm{V}_2^c)$ is the tuple of three consecutive discrete boundary sampling points at corner $c$, and $\widehat{(\cdot)}$ is the vector normalization operator. $s_c(\bm{X})$ is a non-differentiable sign function: $s_c(\bm{X}) = 0$ if the discretized corner angle is larger than a given threshold; otherwise, $s_c(\bm{X})$ equals the sign of the cross product on the initial sewing pattern. This regularization applies to the two cases illustrated in the inset figure. It encourages the angle to keep its initial sign and prevents it from becoming too small. However, Bézier curves may still intersect at corners even when the discretized corner triangles are valid. We therefore use the following discretization consistency regularizer to align the curves' end tangents with the discrete edge directions:

$$\mathcal{L}_{\text{DC}} = \frac{1}{N_C} \sum_c \left( 2 - \bm{\tau}_1^c \cdot \widehat{(\bm{V}_1^c - \bm{V}_0^c)} - \bm{\tau}_2^c \cdot \widehat{(\bm{V}_2^c - \bm{V}_0^c)} \right), \tag{25}$$

where $\bm{\tau}_1^c$ and $\bm{\tau}_2^c$ are two consecutive end tangents of the Bézier curves at corner $c$.
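The two corner terms above can be sketched in NumPy as follows. Several details are assumptions made for illustration: the corner angle is measured as the unsigned angle between the two discrete edges, the threshold `min_angle` is a free parameter in radians, and the end tangents `tau1` and `tau2` are taken to be unit vectors pointing away from the corner.

```python
import numpy as np


def cross2(a, b):
    """Scalar 2D cross product, applied row-wise."""
    return a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]


def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def small_angle_corner_loss(V0, V1, V2, V0_init, V1_init, V2_init, min_angle):
    """Negative mean of s_c * (e1 x e2), active only for corners below min_angle."""
    e1, e2 = normalize(V1 - V0), normalize(V2 - V0)
    cross = cross2(e1, e2)
    angle = np.arccos(np.clip(np.sum(e1 * e2, axis=-1), -1.0, 1.0))
    # Sign of the corner on the initial pattern; treated as a constant.
    sign0 = np.sign(cross2(V1_init - V0_init, V2_init - V0_init))
    s = np.where(angle > min_angle, 0.0, sign0)
    return float(-np.mean(s * cross))


def discretization_consistency_loss(tau1, tau2, V0, V1, V2):
    """Mean of 2 - tau1 . e1 - tau2 . e2 over corners."""
    e1, e2 = normalize(V1 - V0), normalize(V2 - V0)
    dots = np.sum(tau1 * e1, axis=-1) + np.sum(tau2 * e2, axis=-1)
    return float(np.mean(2.0 - dots))
```

Minimizing the first term pushes the signed sine of a small corner angle toward the sign it had initially, which opens the corner up; the second term is zero exactly when both end tangents coincide with the discrete edge directions.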

6.3.3. Comfort Loss

In addition to the appearance of the fitting matching the observation, we also aim to ensure that the fitting is comfortable. We use the stretching elasticity energy to evaluate the tightness of the fitting. To prevent overly tight fitting, we introduce the following comfort regularizer:

$$\mathcal{L}_{\text{Comfort}} = \int \left\| \bm{F}(\bm{x}, \bm{X}) - \bm{R}(\bm{F}) \right\|^2 d\bm{X}, \tag{26}$$

where $\bm{R}(\bm{F})$ represents the closest rotation matrix to $\bm{F}$. This is the same as the as-rigid-as-possible (ARAP) stretching energy used in the forward simulation, except that here we assume the global stiffness is 1.
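On a triangle mesh, the integral in the comfort regularizer becomes an area-weighted sum over triangles. The sketch below assumes a piecewise-constant $3 \times 2$ deformation gradient per triangle and takes $\bm{R}(\bm{F})$ to be the polar factor $\bm{U}\bm{V}^\top$ of the thin SVD; this is one common discretization and may differ in detail from the simulator's own.

```python
import numpy as np


def comfort_loss(x, X, faces):
    """Area-weighted sum of ||F - R(F)||_F^2 over triangles.

    x     -- (n, 3) simulated vertex positions
    X     -- (n, 2) rest (pattern-space) vertex positions
    faces -- iterable of (i, j, k) vertex index triples
    """
    total = 0.0
    for i, j, k in faces:
        Dm = np.column_stack([X[j] - X[i], X[k] - X[i]])   # 2x2 rest edges
        Ds = np.column_stack([x[j] - x[i], x[k] - x[i]])   # 3x2 deformed edges
        F = Ds @ np.linalg.inv(Dm)                          # 3x2 deformation gradient
        U, _, Vt = np.linalg.svd(F, full_matrices=False)
        R = U @ Vt                                          # closest matrix with orthonormal columns
        rest_area = 0.5 * abs(np.linalg.det(Dm))
        total += rest_area * np.sum((F - R) ** 2)
    return float(total)
```

Since $\|\bm{F} - \bm{R}\|_F^2$ equals the sum of squared deviations of the singular values from 1, a rigidly placed triangle contributes nothing, while stretched or compressed triangles are penalized.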

6.3.4. Laplacian Loss

To ensure the smoothness of the fitting, we include a Laplacian regularizer:

$$\mathcal{L}_{\text{Lap}} = \|\Delta \bm{x}\|_2, \tag{27}$$

where $\Delta$ represents the node-area-weighted Laplacian operator on triangle meshes, and $\bm{x}$ denotes the simulated garment vertex positions.

6.3.5. Seam Losses

The stitched curved edge pairs should have the same shape to prevent undesired wrinkles near the seams. To achieve this, we use a seam length regularization similar to [Li et al., 2024a] to regularize the paired stitched edges:

$$\mathcal{L}_{\text{SL}} = \frac{1}{N_S} \sum_{e_i \sim e_j} \left| \int \|\dot{\bm{P}}^{e_i}(t)\|\, dt - \int \|\dot{\bm{P}}^{e_j}(t)\|\, dt \right|, \tag{28}$$

where $N_S$ is the number of stitched seams, $e_i$ and $e_j$ iterate over all stitched edge pairs, and $\dot{\bm{P}}^e(t)$ represents the tangent vector. The integral is computed using finite differences and a Riemann sum. Additionally, we regularize the seam curvatures on these pairs to preserve their initial curvatures:

$$\mathcal{L}_{\text{SC}} = \frac{1}{2N_S} \sum_{e_i \sim e_j} \left\| \bar{\mathcal{K}}^{e_i} - \bar{\mathcal{K}}^{e_i,0} \right\| + \left\| \bar{\mathcal{K}}^{e_j} - \bar{\mathcal{K}}^{e_j,0} \right\|, \tag{29}$$

where $\bar{\mathcal{K}}^e$ represents the relative coordinate of the control point within the frame of the curved edge segment $e$, and $\bar{\mathcal{K}}^{e,0}$ denotes its initial value.
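The seam length regularizer in Eq. (28) can be approximated numerically as sketched below. The sketch assumes each seam edge is a quadratic Bézier curve given by two endpoints and one control point, which is an assumption about the curve degree; the sample count `n` and the function names are likewise illustrative.

```python
import numpy as np


def bezier_points(p0, c, p1, n=64):
    """Sample a quadratic Bezier curve at n + 1 uniformly spaced parameters."""
    p0, c, p1 = (np.asarray(v, dtype=float) for v in (p0, c, p1))
    t = np.linspace(0.0, 1.0, n + 1)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * c + t ** 2 * p1


def curve_length(p0, c, p1, n=64):
    """Arc length via finite-difference tangents summed as a Riemann sum.

    ||dP/dt|| * dt is approximated by ||P(t + dt) - P(t)|| on each interval.
    """
    pts = bezier_points(p0, c, p1, n)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())


def seam_length_loss(seam_pairs, n=64):
    """Mean absolute length difference over stitched curve pairs.

    seam_pairs -- list of ((p0, c, p1), (q0, d, q1)) control point tuples
    """
    diffs = [abs(curve_length(*a, n) - curve_length(*b, n)) for a, b in seam_pairs]
    return float(np.mean(diffs))
```

When implemented in an automatic differentiation framework, the same computation yields gradients with respect to the curve vertices and control points.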

6.4. Post-Iteration Processing

Occasionally, when two Bézier curves come close to each other—such as when forming a thin strip—the curves may penetrate one another after a parameter update in some iteration. This can lead to flipped triangles, causing the simulation to fail in the next iteration. To address this, we enforce a safeguard that modifies the geometry in-place to prevent such occurrences. Specifically, we optimize the negative triangle areas using a least-squares penalty after each iteration n𝑛nitalic_n:

$$\mathcal{L}_{\text{Flip}}(\mathcal{P},\mathcal{K})=\frac{1}{|F|}\sum_{f}\left(\epsilon-\min\{A_{f}(\bm{X}),\epsilon\}\right)+\lambda^{\text{Flip}}\|\bm{X}-\bm{X}^{n+1}\|_{1}, \tag{30}$$

where $|F|$ is the number of faces, $A_{f}$ is the signed area of triangle $f$, and $\bm{X}^{n+1}$ denotes the discretized garment vertices after the parameter update at iteration $n$. We optimize the above loss only if triangles are close to flipping.
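To make the structure of the flip safeguard concrete, the following is a minimal Python sketch that evaluates the hinge term on signed triangle areas plus the L1 tether to the post-update vertices. It works directly on 2D vertex positions rather than on the pattern parameters, and the function names, the values of `eps` and `lam_flip`, and the "close to flipping" test are illustrative assumptions, not the authors' implementation.

```python
def signed_area(a, b, c):
    """Signed area of the 2D triangle (a, b, c); positive if counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def flip_loss(X, faces, X_ref, eps=1e-3, lam_flip=1.0):
    """Hinge on small or negative signed areas plus an L1 tether to X_ref.

    X       -- current 2D vertex positions (list of (x, y) tuples)
    faces   -- list of vertex index triples
    X_ref   -- vertex positions right after the parameter update
    eps, lam_flip -- placeholder values (assumptions, not from the paper)
    """
    hinge = 0.0
    for i, j, k in faces:
        area = signed_area(X[i], X[j], X[k])
        hinge += eps - min(area, eps)  # vanishes once area >= eps
    hinge /= len(faces)
    l1 = sum(abs(p[d] - q[d]) for p, q in zip(X, X_ref) for d in range(2))
    return hinge + lam_flip * l1


def needs_flip_fix(X, faces, eps=1e-3):
    """Assumed trigger: some triangle has signed area below eps."""
    return any(signed_area(X[i], X[j], X[k]) < eps for i, j, k in faces)
```

With a well-oriented triangle the loss is zero and no correction is triggered; reversing the orientation makes the hinge term active.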

Figure 3. Sewing Pattern Remeshing. We perform automatic remeshing during optimization when ill-conditioned triangles are detected. To avoid penetration, we pull back the new discretization to the initial unoptimized stage and rerun the garment initialization to fit it onto the human.

6.5. Remeshing

During optimization, we use cage deformations defined by a fixed set of harmonic coordinates to deform a fixed number of interior vertices. The triangulation quality can degrade significantly in regions with large deformations, creating challenges for simulations. To address this, we introduce automatic remeshing during the optimization iterations when the mesh quality drops below a predefined threshold. While rerunning the discretization on updated sewing patterns is straightforward, directly remeshing the fitted garment state on the human body can lead to penetrations. This occurs because the underlying smoothly interpolated surface may intersect after re-triangulation, as the collision handling relies on the previous discretization. To resolve this, we propose a refitting procedure that sews and refits the garment patches onto the human body without causing penetrations.

Assume $\bm{\chi}^{0}$ is the initial garment sewing pattern in the continuous domain with triangulation $\mathcal{T}^{0}$. The sewing pattern optimization at step $n$ can be characterized by a map $\Phi^{n}$ from $\bm{\chi}^{0}$ to $\bm{\chi}^{n}$, where $\Phi^{n}$ is a piecewise linear map defined on the continuous domain. Observe that the initial fitting is sewn from the discretization of $\bm{\chi}^{0}$, where SewFormer provides reasonable transformations to position the panels around the human body. After generating a new triangulation $\mathcal{T}^{n}$ of $\bm{\chi}^{n}$, we pull $\mathcal{T}^{n}$ back to $\bm{\chi}^{0}$ as the new triangulation of $\bm{\chi}^{0}$: $\tilde{\mathcal{T}}^{0}\leftarrow[\Phi^{n}]^{-1}(\mathcal{T}^{n})$. We then apply the initial transformations to the updated discretization $\tilde{\mathcal{T}}^{0}$ to position the patches around the T-pose human body and execute the fitting procedure described in subsection 5.4. During this fitting process, we set the rest shape as $\tilde{\mathcal{T}}^{0}$. A relaxation process follows, using $\mathcal{T}^{n}$ as the rest shape. The newly fitted results are non-penetrating, and we set them as the initial state $\bm{x}^{n}_{0}$ for the differentiable simulation process. Finally, $\mathcal{T}^{0}$ is replaced with $\tilde{\mathcal{T}}^{0}$. This remeshing process is illustrated in Figure 3.
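The pull-back through the inverse of a piecewise linear map can be sketched with barycentric coordinates: each new vertex in the optimized pattern is located in a deformed triangle of the previous triangulation, and the same weights are applied to that triangle's rest positions. This is a simplified illustration under stated assumptions (2D panels, non-degenerate triangles, brute-force point location); the names `barycentric` and `pull_back` are hypothetical and not taken from the paper.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    w1 = ((p[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (p[1] - a[1])) / det
    w2 = ((b[0] - a[0]) * (p[1] - a[1]) - (p[0] - a[0]) * (b[1] - a[1])) / det
    return (1.0 - w1 - w2, w1, w2)


def pull_back(new_vertices, faces0, X0, Xn, tol=1e-9):
    """Map vertices of a new triangulation of chi^n back to chi^0.

    faces0 indexes both X0 (vertex positions in chi^0) and Xn (the same
    vertices after optimization, in chi^n). The map Phi^n is linear on each
    triangle, so its inverse reuses the barycentric weights found in chi^n.
    """
    pulled = []
    for p in new_vertices:
        for i, j, k in faces0:
            w = barycentric(p, Xn[i], Xn[j], Xn[k])
            if min(w) >= -tol:  # p lies inside (or on the edge of) this triangle
                pulled.append(tuple(
                    w[0] * X0[i][d] + w[1] * X0[j][d] + w[2] * X0[k][d]
                    for d in range(2)))
                break
        else:
            raise ValueError("vertex %r lies outside the deformed pattern" % (p,))
    return pulled
```

For example, if a unit-square panel has been stretched to twice its width, a new vertex at the center of the stretched panel is pulled back to the center of the original square. A production version would likely use a spatial index instead of the linear search over triangles.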

7. Post-Optimization Steps

7.1. Texture Generation

To complete our pipeline and deliver a fully textured garment directly from a single image input, tailored to the needs of the garment fabrication industry, we incorporate an additional texture generation module. Formulating texture creation as a reconstruction task is ill-posed because of sparse inputs, severe distortion, and occlusions caused by the human body and overlapping garment layers, so our module instead adopts generative methods to produce garment textures. This module employs two strategies for texture generation:

Tileable Texture Generation via FabricDiffusion

In this strategy, we assume that in real-world garment creation, clothing panels are typically cut from a single piece of fabric and sewn together, resulting in similar and tileable textures. Based on this assumption, given the front-view ground truth input image and its corresponding colored segmentation mask, we identify the largest uniform color square area within the segmentation mask for each garment component (e.g., top or bottom) as the captured texture region. This region may exhibit distortions and varying illumination caused by occlusions and poses in the input image. To address these issues, we process the captured texture region using FabricDiffusion [Zhang et al., 2024], which generates distortion-free and tileable texture maps. To determine the appropriate tiling scale for aligning the textures with the garment’s UV space (optimized in our pipeline), we assume consistent camera view parameters for the front view. This scale can be calculated by multiplying the derivative of the cropped region’s size by a constant factor.
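The text does not specify how the largest square region is located. One standard option, shown here as a hedged sketch, is the classic dynamic-programming search for the maximal square in a label grid; it interprets "uniform color" as "all pixels carry the same segmentation label". The function name and the toy mask below are illustrative assumptions.

```python
def largest_uniform_square(labels, target):
    """Return (top, left, side) of the largest axis-aligned square whose
    pixels all equal `target` in the 2D label grid `labels`.

    Returns (0, 0, 0) if the label does not occur.
    """
    rows, cols = len(labels), len(labels[0])
    # side[r][c] = side length of the largest all-target square whose
    # bottom-right corner is pixel (r, c)
    side = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != target:
                continue
            if r == 0 or c == 0:
                side[r][c] = 1
            else:
                side[r][c] = 1 + min(side[r - 1][c],
                                     side[r][c - 1],
                                     side[r - 1][c - 1])
            if side[r][c] > best[2]:
                s = side[r][c]
                best = (r - s + 1, c - s + 1, s)
    return best


# Toy segmentation mask: 0 = background, 1 = top, 2 = bottom (assumed labels)
mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 2],
    [0, 2, 2, 2],
]
crop = largest_uniform_square(mask, target=1)
```

The returned crop would then be passed to the texture normalization step; the search runs in time linear in the number of pixels.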

In-the-Wild Texture Generation via GPT-4o and FLUX

For generalized textures that do not fall into the above case, we utilize Vision-Language Models (VLMs) in collaboration with a Diffusion model. Specifically, we process the input image using the GPT-4o [Achiam et al., 2023] VLM to extract descriptive keywords for the textures of various components, such as "denim, dark blue, smooth fabric" and "argyle, grey and white, knitted fabric" through prompt-based querying. These extracted keywords are then fed into FLUX [Labs, 2023], which generates the corresponding textures.

7.2. Showcase under Human Motions

The reconstructed simulation-ready garment and human model can be used to generate realistic dynamic human motion in clothing using the robust CIPC physics-based simulator. However, IPC-based simulators require the initial configuration of the human model to be penetration-free. Moreover, because IPC-based simulators produce intersection-free results, self-penetrations of the human model during the given motion can cause solution failures when the human model interacts with garments. To address this issue, we replace the human model during motion with the nearest intersection-free human model by solving Injective Deformation Processing (IDP) [Fang et al., 2021] problems. To solve these IDP problems, we follow the approach of Li et al. [2025], who use PBD simulations to resolve collisions between garments and human models. In our case, we instead apply extended Position-Based Dynamics (XPBD) [Macklin et al., 2016] simulations to resolve self-penetrations in the human model by repelling colliding vertices and faces, while preserving natural deformations.

Figure 4. Qualitative Comparisons of Geometry Reconstruction. Our proposed method not only generates sewing patterns that seamlessly integrate into animation and simulation workflows but also achieves superior garment reconstruction accuracy compared to baseline methods.

Figure 5. Qualitative Comparison of Panel Shape Prediction. Neural Tailor [Korosteleva and Lee, 2022] takes ground-truth garment meshes as input, while SewFormer [Liu et al., 2023b] and our proposed method use single-view images as input. Extra unexpected panels and edges with significant errors are highlighted in red.

8. Implementation

Differentiable Simulation Layer

We implement the CIPC simulation in NVIDIA Warp [Macklin, 2022] to take advantage of its automatic differentiation support. The simulator is wrapped in a custom autograd.Function so that it can be integrated into the global computational graph.

Balancing between Losses

The training loss is a weighted sum of all rendering losses and geometric regularizers. We use $L_{\text{Mask}}$ as the dominant loss and set the relative weights for $L_{\text{RGB}}$ and $L_{\text{Normal}}$ to $\lambda_{\text{RGB}}=0.1$ and $\lambda_{\text{Normal}}=0.1$. For geometric regularizers, we do not have a rule of thumb to balance them. For all experiments, we use $\lambda_{\text{Lap}}=0.001$, $\lambda_{\text{Comfort}}=0.1$, $\lambda_{\text{AR}}=0.01$, $\lambda_{\text{SAC}}=0.01$, $\lambda_{\text{DC}}=0.001$, $\lambda_{\text{BC}}=0.001$, and $\lambda_{\text{SL}}=0.1$.
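As a small illustration of how these weights combine, the sketch below stores them in a dictionary and forms the weighted sum. The unit weight on the mask loss is an assumption (the text only calls it the dominant loss), and the names `WEIGHTS` and `total_loss` are hypothetical.

```python
# Loss weights as listed in the text. The weight on the mask loss is an
# assumption: the paper only states that it is the dominant term.
WEIGHTS = {
    "Mask": 1.0,      # assumed unit weight
    "RGB": 0.1,
    "Normal": 0.1,
    "Lap": 0.001,
    "Comfort": 0.1,
    "AR": 0.01,
    "SAC": 0.01,
    "DC": 0.001,
    "BC": 0.001,
    "SL": 0.1,
}


def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of named loss terms; unknown names raise KeyError."""
    unknown = set(terms) - set(weights)
    if unknown:
        raise KeyError("no weight configured for: %s" % sorted(unknown))
    return sum(weights[name] * value for name, value in terms.items())
```

In an automatic-differentiation setting the values in `terms` would be differentiable scalars rather than plain floats, and the same sum would be back-propagated through the simulator.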

Training Time

The pre-optimization steps require about 10 minutes. The garment optimization process can be finished within 2 hours on a single RTX 3090 with 24GB device memory.

9. Experiments

9.1. Geometry Reconstruction Comparison

We first conduct a comparison study to evaluate the reconstruction accuracy of baseline methods and our proposed approach.

Benchmark

We use the CloSe [Antić et al., 2024] and 4D-Dress [Wang et al., 2024] datasets for the comparison study. CloSe is a large-scale 3D clothing dataset featuring detailed segmentation across diverse clothing classes. 4D-Dress offers high-quality 4D textured scans of dynamic clothed human sequences. For evaluation, we carefully select examples encompassing a variety of human body shapes, poses, and their corresponding front-view images to establish a comprehensive benchmark.

Baselines

We compare our method with state-of-the-art single-view garment reconstruction methods, including BCNet [Jiang et al., 2020], ClothWild [Moon et al., 2022], GarmentRecovery [Li et al., 2024b], and SewFormer [Liu et al., 2023b]. Among these, BCNet and ClothWild are designed for clothed human reconstruction but are limited to tight-fitting clothing and not readily adaptable for downstream tasks such as animation and simulation. GarmentRecovery extends to loose-fitting garment reconstruction by deforming predicted rest shapes to align with input images. In contrast, SewFormer predicts corresponding sewing patterns directly from images, enabling seamless integration into animation pipelines and physical simulations. Our proposed method builds upon SewFormer and incorporates differentiable simulation to refine 2D panels and physical parameters. For SewFormer and our approach, we simulate the predicted sewing patterns and use the resulting 3D garments for quantitative comparisons.

Results

We evaluate the accuracy of baseline methods and our approach using two metrics: Chamfer Distance (CD) and Intersection over Union (IoU). CD quantifies the geometric similarity between reconstructed and ground-truth meshes, while IoU assesses the alignment between the garment mask of the rendered reconstruction and the input front-view images. The quantitative results for the CloSe and 4D-Dress datasets are presented in Table 1, and visualized qualitative comparisons are shown in Figure 4. BCNet and ClothWild tend to produce overly smooth garment meshes, lacking fine wrinkle details. GarmentRecovery improves geometric details but often results in interpenetrated reconstructions. SewFormer predicts sewing patterns that can be directly used for simulation, yet it neglects physical parameters, leading to simulated results that deviate significantly from the ground-truth mesh. In contrast, our method not only generates sewing patterns for seamless integration into downstream pipelines but also optimizes garment physical parameters, enabling accurate geometry reconstruction that closely aligns with ground truth.
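For reference, the two metrics can be sketched in a few lines of Python. Conventions for Chamfer Distance vary (squared or unsquared distances, sums or means); the version below uses the mean unsquared nearest-neighbor distance in both directions, which is an assumption and not necessarily the exact variant used in the evaluation. The brute-force search is only suitable for small point sets.

```python
import math


def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two point sets.

    Uses the mean nearest-neighbor Euclidean distance in each direction
    (one common convention; assumed here, not taken from the paper).
    """
    def one_way(P, Q):
        return sum(min(math.dist(p, q) for q in Q) for p in P) / len(P)
    return one_way(A, B) + one_way(B, A)


def mask_iou(M1, M2):
    """Intersection over Union of two binary masks of equal size."""
    inter = union = 0
    for row1, row2 in zip(M1, M2):
        for a, b in zip(row1, row2):
            inter += bool(a) and bool(b)
            union += bool(a) or bool(b)
    return inter / union if union else 1.0  # two empty masks agree trivially
```

In practice, the point sets would be sampled from the reconstructed and ground-truth meshes, and the masks would come from rendering the reconstruction and segmenting the input image.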

Table 1. Quantitative Comparisons of Geometry Reconstruction. We evaluate the performance of baseline methods and our approach on the CloSe and 4D-Dress datasets. Our proposed method achieves the highest reconstruction accuracy across both datasets.

9.2. Sewing Pattern Evaluation

We compare our method with two approaches that predict garment sewing patterns: Neural Tailor [Korosteleva and Lee, 2022] and SewFormer [Liu et al., 2023b]. Neural Tailor generates sewing patterns from garment point cloud inputs, while SewFormer and our method recover patterns directly from single-view image inputs. For comparison purposes, we sample points from garment meshes to serve as inputs for Neural Tailor. The qualitative results are shown in Figure 5. Neural Tailor is trained on garments draped over an average SMPL female body in a T-pose. Consequently, its predictions are usually unsatisfactory if the human pose deviates from the T-pose, and may produce unexpected additional panels. Furthermore, its reliance on garment point clouds as input significantly restricts its practical applicability. SewFormer, on the other hand, generates symmetric and organized panels. However, its predicted panel shapes often fail to align with the input image. For instance, it may predict long pant panels for an image with short pants. This is probably due to the small scale of the dataset used for its training. Nevertheless, collecting a large-scale dataset of real-world clothed human images paired with corresponding garment meshes and sewing patterns is a challenging and resource-intensive task. In contrast, our optimization-based method requires no additional training data. By leveraging differentiable simulation, it refines an initial estimate of the sewing patterns, achieving significantly more accurate results.

9.3. Textured Garment Reconstruction and Simulation

Figure 6. Qualitative Results of Textured Clothed Human. We showcase the generation capability of Dress-1-to-3 using in-the-wild test images from various sources, including both real-world and synthetic images. Our streamlined pipeline generates perfectly fitted 3D garments with visually plausible textures.

Test Images

To evaluate the generative capability of our method, we perform extensive tests on a variety of images from different sources including 4D-Dress [Wang et al., 2024], CloSe [Antić et al., 2024] and DeepFashion2 [Ge et al., 2019]. These images exhibit diverse quality levels and human poses, highlighting the robustness of our method. To further extend the generative capability of our method with text prompts, we employ FLUX [Labs, 2023] to generate input images using custom textual descriptions of clothing on a model. For instance, prompts such as "a female model wearing a blazer and pants" are used. To enhance the diversity of the generated results, we randomly incorporate detailed descriptions, including the shape and color of the clothing, as well as the pose and appearance of the model.

Textured Garment Reconstruction

As demonstrated in Figure 6, Dress-1-to-3 effectively reconstructs 3D garments that accurately fit human models in both real-world and synthetic images. Our method automatically retrieves visually plausible garment textures using image diffusion techniques. This streamlined process requires minimal human effort to reconstruct high-fidelity garments with sewing patterns and offers users the flexibility to easily adjust garment shape and texture.

Figure 7. Garment Simulation. We animate garment motion using various human sequences as moving boundary conditions. Our simulation-ready garments exhibit physically plausible dynamics.

Garment Simulation

The garments synthesized by our method are simulation-ready due to the accurate sewing, fitting, and optimization of garment patterns. The optimized 3D outfits align perfectly with the human body at steady state, avoiding artifacts such as self- or interpenetration. These garments can be seamlessly integrated into physics-based simulations, such as those used in video games. In Figure 1 and Figure 7, we visualize several simulated human motion sequences showcasing dynamic garment behavior.

9.4. Ablation Study

In Figure 8, we perform an ablation study for key individual components in Dress-1-to-3, using the same garment images as in Section 9.3. This study evaluates the contributions of each proposed component to the final garment reconstruction quality.

Patch Symmetrization

We first evaluate the effectiveness of the proposed patch symmetrization, designed to facilitate better capture of symmetrical geometry. As shown in Figure 8, removing symmetry enforcement results in visibly asymmetric outputs compared to the input garment image. This highlights the critical role of symmetry enforcement in preserving structural coherence and alignment, particularly for garments with strong symmetrical patterns, such as dresses or jackets. By aligning the reconstructed mesh to expected symmetrical features, this component ensures geometric fidelity.

Laplacian Loss

Laplacian loss $\mathcal{L}_{\text{Lap}}$ is applied to smooth out noise and irregular wrinkles in the reconstructed garment mesh. This loss minimizes high-frequency artifacts, enabling a cleaner and more aesthetically pleasing surface. The weight of $\mathcal{L}_{\text{Lap}}$ is a tunable parameter, allowing users to control the degree of smoothness based on their preferences. As shown in Figure 8, a higher weight results in smoother results but may slightly reduce detail, whereas a lower weight preserves intricate wrinkles but may retain noise.
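To illustrate why such a term suppresses high-frequency geometry, here is a minimal sketch of a uniform-weight Laplacian penalty: each vertex is compared with the centroid of its one-ring neighbors, and the squared offsets are averaged. The exact definition of the Laplacian loss is given elsewhere in the paper and may differ (for example, cotangent weights); the function name and the toy mesh are assumptions.

```python
def uniform_laplacian_loss(vertices, faces):
    """Mean squared distance between each vertex and its one-ring centroid.

    vertices -- list of (x, y, z) tuples
    faces    -- list of vertex index triples
    Uniform weights are an assumption made for this sketch.
    """
    neighbors = [set() for _ in vertices]
    for i, j, k in faces:
        neighbors[i].update((j, k))
        neighbors[j].update((i, k))
        neighbors[k].update((i, j))
    total = 0.0
    for i, x in enumerate(vertices):
        if not neighbors[i]:
            continue  # isolated vertices contribute nothing
        n = len(neighbors[i])
        centroid = [sum(vertices[j][d] for j in neighbors[i]) / n
                    for d in range(3)]
        total += sum((x[d] - centroid[d]) ** 2 for d in range(3))
    return total / len(vertices)


# A flat four-triangle fan, and the same fan with its center pushed up.
fan = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
flat = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
bumped = [(0, 0, 1)] + flat[1:]
```

Evaluating the loss on `bumped` gives a larger value than on `flat`, which is the behavior that penalizes isolated spikes and noisy wrinkles.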

Boundary Corner Regularizer

Figure 8. Ablation Study. We conduct ablation studies on our geometric regularizer to ensure that the sewing pattern maintains both reasonable 2D patterns and a plausible 3D fitted shape. We minimize irregularities such as asymmetry, sharp or acute angles, and inconsistent scaling of the 2D patterns while reducing noisy geometry and unrealistic wrinkles.

The boundary corner regularizer, $\mathcal{L}_{\text{BC}}$, mitigates the occurrence of sharp angles in the reconstructed sewing patterns. Sharp or acute angles can lead to practical difficulties during garment fabrication, as they introduce challenges in stitching and material handling. As demonstrated in Figure 8, optimization results obtained without $\mathcal{L}_{\text{BC}}$ often generate sewing patterns with acute or impractical corner geometries, whereas incorporating this regularizer results in smoother, more fabrication-friendly boundaries.

Comfort Loss

Comfort loss, $\mathcal{L}_{\text{Comfort}}$, ensures the reconstructed garment mesh adheres to an appropriate scale relative to the input image. This prevents the generation of sewing patterns that are too small or tight, which could compromise wearability. Without $\mathcal{L}_{\text{Comfort}}$, as shown in Figure 8, the reconstructed sewing patterns often exhibit significantly smaller dimensions than expected, leading to impractical or unrealistic results. Incorporating this loss ensures that the final garment size aligns with user expectations and real-world usability requirements.

Area Ratio Loss

To maintain realistic proportions between garment parts, the area ratio loss, $\mathcal{L}_{\text{AR}}$, is applied to ensure that the relative area of each patch remains consistent with the connected components, reflecting real-world fabrication principles. For instance, in a skirt, the front and back panels should have comparable areas to align with practical garment construction. As illustrated in Figure 8, omitting $\mathcal{L}_{\text{AR}}$ often results in disproportionate patch sizes, such as an overly large front skirt panel compared to the back, violating fabrication norms.

Seam Losses

Two seam losses are adopted: the length seam loss, $\mathcal{L}_{\text{SL}}$, which ensures that stitched curved edge pairs have the same shape to prevent undesired wrinkles near the seam, and the curvature seam loss, $\mathcal{L}_{\text{SC}}$, which enforces preservation of seam curvatures. As shown in Figure 8, the absence of $\mathcal{L}_{\text{SL}}$ leads to uneven sleeve seams, introducing visual artifacts and potential fabrication issues. Similarly, without $\mathcal{L}_{\text{SC}}$, the seams can become overly curved, deviating significantly from the intended design. Together, these losses contribute to producing smooth and realistic seams.

Figure 9. Comparisons between vertex color renderings and texture renderings.

Vertex Color Reconstruction

Vertex colors are optimized to assist garment optimization. However, due to the limited mesh resolution, the visual appearance synthesized with vertex colors tends to be overly smooth. Additionally, colors from adjacent parts can bleed into part boundaries. Comparisons between vertex color renderings and texture renderings are shown in Figure 9. This necessitates an additional module to generate textures that are not constrained by mesh resolution.

Figure 10. Our method tries to find a static garment fit that approximates a non-static garment snapshot or a fit under grasping forces. The input images are generated by GPT-4o.

Loose Garments

Our method optimizes garments in static fitting states under gravity and body supporting forces. For non-static states, such as a snapshot of flowing, or static states influenced by other external forces such as grasping, our garment optimization process attempts to approximate the non-static states by finding nearby static configurations (as shown in Figure 10). However, these static approximations may not reflect the garment’s true geometry. We acknowledge this as a limitation and leave the extension to dynamic states or broader boundary conditions as future work.

10. Conclusion

In this paper, we present a garment reconstruction pipeline, Dress-1-to-3, which takes a single-view image as input and reconstructs a posed human wearing textured garments, with both the human pose and garment shapes closely aligned with the input image. During optimization, we refine the sewing pattern shapes and physical material parameters by leveraging a differentiable CIPC simulator with accurate contact. The resulting garment assets are simulation-ready and can be seamlessly integrated into a physics-based simulator.

We benchmark our pipeline against baseline methods through two key experiments: a quantitative comparison of geometry reconstruction using existing garment datasets and a qualitative evaluation of sewing patterns. In both cases, Dress-1-to-3 significantly outperforms the baseline approaches.

To further assess Dress-1-to-3's robustness and performance, we test our textured garment reconstruction using in-the-wild real-world and synthetic images, validated together with animations of dressed humans. The high-quality results demonstrate the robustness and effectiveness of our approach.

Additionally, ablation studies underscore the importance of the patch symmetrization technique and the contributions of each regularization loss term, highlighting their critical role in optimizing the pipeline’s performance.

Limitations and Future Work

While our method provides consistent high-fidelity reconstruction and has been extensively tested with in-the-wild images, its generation ability is somewhat limited by the initial estimation of the sewing pattern. For instance, our method cannot predict new connected pattern components if they are not included in the initial estimation. Additionally, challenges arise with layered clothing, as SewFormer can only predict single-layer patterns, causing multi-layered garments to be fused into a single cloth component. It is worth noting that with a more versatile sewing pattern predictor capable of handling such cases, our method would also be able to process more complex garments. We leave this enhancement as future work.

Furthermore, some optimized sewing patterns may not fully adhere to conventional fashion design principles, as our supervision relies solely on ground-truth renderings of fitted garments. Incorporating regularizers based on design conventions in future work could help produce patterns that are more suitable for manufacturing.

Another limitation lies in the overly smoothed garment surfaces produced by our method. To enhance training robustness, we incorporate regularizers such as seam loss and Laplacian loss; however, these also suppress the formation of natural wrinkles. Another contributing factor to the lack of high-frequency detail is the inconsistency in multi-view normal maps generated by MagicMan, which further smooths geometry in detailed regions. Addressing the reconstruction of high-frequency geometric features remains an avenue for future work.

We also observe a gap between input images and generated textures. Since this is not our primary technical contribution, we use off-the-shelf tools for texture generation. Improving PBR texture generation is left for future work, as it warrants a standalone research effort.

Lastly, although our simulation layer supports differentiable dynamic simulation, we currently use it only for static fittings under gravity and body support forces. Extending the garment optimization to handle dynamic scenarios, such as reconstruction from monocular videos, or interaction-driven deformations like grasping, would be both interesting and practically valuable.

Ethical Concerns

We acknowledge that body shape biases exist in our input image datasets.

References