PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

Yandan Yang∗ Baoxiong Jia∗ Peiyuan Zhi Siyuan Huang
State Key Laboratory of General Artificial Intelligence,
Beijing Institute for General Artificial Intelligence (BIGAI)
https://physcene.github.io

Abstract

With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research.

Figure 1: Illustration of PhyScene, a physically interactable scene synthesis method that generates interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents.

∗ indicates equal contribution.

1 Introduction

The exploration of scene synthesis [14, 54, 11, 62, 67, 7, 45, 30, 16, 58] has constituted a persistent focus within the field of computer vision. Initially conceived to facilitate indoor design applications, scene synthesis aimed to create diverse 3D environments characterized by both realism and naturalness. However, with the advent of embodied artificial intelligence (EAI) [1, 27, 12, 25], the objectives of this task have taken on new dimensions. Simulated environments [33, 35, 50, 57, 9, 10], now supporting a plethora of intricate embodied tasks, have propelled the task of scene synthesis into an important data source that provides unlimited scenarios for agents to robustly learn skills like navigation [2, 34] and manipulation [18, 48, 31]. This trend underscores the growing importance of scene synthesis within the context of EAI research.

Nevertheless, achieving a seamless transition from conventional scene synthesis algorithms to those tailored for EAI presents significant challenges in scene generation. As many EAI tasks involve physics simulation [40, 36, 39, 37, 19, 65], the synthesized scenes must adhere to physical constraints while enabling a high degree of interactivity among objects (e.g., articulated objects or fluids) and scene layout (e.g., reachability of objects) to facilitate agent skill acquisition. These stringent interactivity requirements introduce several obstacles for scene synthesis algorithms. Limited by the quality of real-world scanned scenes [8, 4, 29], previous methods have primarily relied on manually created scenes [15, 14]. However, these datasets are designed with non-interactable objects, overlooking physical constraints, and are prone to violations of such constraints. Consequently, this poses a significant challenge for algorithms aiming to learn physically plausible arrangements of interactable objects. Beyond data-level hurdles, incorporating scene interactivity (e.g., maintaining sufficient workspace, ensuring object reachability and interactivity) introduces non-trivial challenges in designing optimizable objectives that reflect such abstract concepts. These challenges emphasize the need for an effective scene synthesis algorithm that integrates the naturalness and realism of conventional synthesis algorithms while ensuring the physical plausibility and interactivity of scenes.

To address these challenges, we propose PhyScene, a diffusion-based method embedded with physical commonsense for interactable scene synthesis. Specifically, our approach builds on the efficacy of guided diffusion models [23, 51, 3, 38] to effectively learn the scene distribution and guide the model in generating scenes that are both functionally interactive and physically plausible. To incorporate articulated objects into generated scenes, we utilize shape and geometry features, bridging rigid-body objects from training scenes with existing articulated object datasets. To model physical plausibility and interactivity accurately, we impose three key constraints on the generated scenes: (1) physical collision avoidance between objects to enable simulation, (2) object layouts constrained to the floor plan to avoid inter-room conflicts, and (3) the interactiveness and reachability of each object, assuming an embodied agent of proper size needs to navigate the scene. We convert these constraints into guidance functions that can be easily integrated into the guided diffusion model. We further propose metrics that reflect these constraints and use them to assess all existing models. Through carefully designed experiments, we demonstrate that PhyScene not only achieves state-of-the-art results on traditional scene synthesis metrics but also significantly enhances the physical plausibility and interactivity of generated scenes compared to existing methods. We hope this work takes a step forward in scalable indoor scene synthesis for EAI tasks, contributing to the broader landscape of EAI research.

In summary, our main contributions are:

• We propose PhyScene, a guided diffusion model for physically interactable scene synthesis with realistic layouts and interactable objects.
• Through well-crafted guidance functions, we incorporate constraints on collision avoidance, room layout, and reachability into PhyScene in a simple and effective way to ensure the physical plausibility and interactivity of the generated scenes.
• By comparing with competitive baseline models, we show that PhyScene not only achieves state-of-the-art results on traditional scene-synthesis metrics but also significantly outperforms existing methods for interactable scene synthesis on our carefully designed physical metrics, paving the way for new research topics bridging scene synthesis and EAI.

2 Related Work

Indoor Scene Synthesis

Indoor scene synthesis is formulated as a layout prediction problem, where each object is often represented by its 3D bounding box, semantic labels [14, 54], or shape features [51], which are used to retrieve corresponding meshes from 3D asset libraries and place them at the predicted locations. To properly model the layout of objects in training datasets, current methods usually represent the arrangement of objects as a scene graph [11, 62, 67, 7] and utilize scene priors such as the spatial relationship between objects [45] and object category (co-)occurrence frequency [16, 58] for approximating the scene layout distribution. When generating new scenes, these works leverage iterative sampling or optimization methods to reject scenes that violate the designed scene priors, synthesizing scenes with desired properties [16, 13, 45, 7]. However, such methods are often limited by the efficacy of sampling or optimization algorithms. More recent works try to learn scene layout distributions with deep neural networks [42, 41, 54, 26, 64, 44, 59] to improve the generation efficiency.

For the quality evaluation of generated scenes, common metrics measure model performance with perceptual quality scores (e.g., FID [22] and KID [5]). However, these realism metrics do not address the physical plausibility and interactivity of generated scenes, which are crucial for adapting scenes into simulated environments. In fact, a commonly used scene synthesis dataset, 3D-FRONT [14], frequently exhibits physically implausible layouts (as shown in Tab. 1). In addition, the interactivity of scenes for object manipulation and reachability is also understudied in prior works. ProcTHOR [10] has proposed a procedural generation pipeline for interactable scenes with rule-based constraints and statistical scene priors. Nonetheless, as pointed out by [32], these generated scenes are limited by the pre-defined priors, which leads to unrealistic scenes that are harmful to agent learning. To this end, we aim to bridge this gap with PhyScene, uniting efforts in scene synthesis and EAI to provide a pipeline that supports large-scale interactable scene synthesis while maintaining visual realism and naturalness.

Physical Plausibility and Interactivity in 3D Scenes

Producing physically plausible generations in 3D scenes has been a long-standing problem in computer vision, given the subtlety of properly converting physical constraints into optimizable objectives. To tackle this challenge, various optimization-based approaches have been proposed for tasks such as scene-conditioned pose [21] and motion generation [52]. However, physical plausibility in scene generation has been largely left untouched. Likewise, the interactivity of 3D scenes remains largely unmodeled in existing works and lacks a proper definition. Some works [53] aim to define the level of scene interactivity via human and robot preferences in a scene rearrangement setting. Nonetheless, because of their task-specific design, these optimization objectives are hard to generalize to other settings. Therefore, PhyScene aims to address these obstacles and makes a first attempt to provide reasonable definitions of physical plausibility and scene interactivity in the context of scene synthesis.

Guided Diffusion Models

Diffusion models [24, 49, 38] have shown promising results for generative AI [47, 28, 26] across various domains [60, 43, 63, 56, 55]. Through an iterative denoising process, diffusion models excel at handling high-dimensional distributions without mode collapse. Such an iterative process also offers flexible ways to provide conditions [46, 6] and guidance [23, 3] that can effectively affect model inference. For example, SceneDiffuser [26] integrates a physics-based objective as conditional guidance for physically plausible planning and motion generation. PhysDiff [61] proposes a physics-based motion projection module to instill the laws of physics into the denoising diffusion process for motion generation. PhyScene draws on these powerful techniques and integrates physical and interactivity guidance as conditional guidance for scene synthesis. Compared to constrained sampling methods such as Markov Chain Monte Carlo (MCMC) [45, 53], diffusion guidance runs more efficiently during the inference stage. Meanwhile, in contrast to models that take in constraints as a learnable objective [51], our guidance functions can more effectively ensure the satisfaction of constraints during inference. To the best of our knowledge, PhyScene makes the first attempt to integrate a conditional diffusion model with physical plausibility and interactivity guidance to effectively generate physically interactable 3D scenes.

Table 1: Interactivity evaluation of scenes in the 3D-FRONT dataset. These scenes exhibit a high rate of physical constraint violations, including collision, layout, and interactivity violations. Detailed definitions of the metrics are provided in Sec. 4.

3 PhyScene

Figure 2: Overview of PhyScene. We leverage diffusion models for capturing scene layout distributions and apply three distinct guidance functions for improving the physical plausibility and interactivity of generated scenes.

Physically interactable scene synthesis requires realistic layouts, articulated objects, and physical interactivity. However, integrating articulated objects into scenes trained solely with static objects presents data-level challenges. We outline our method for incorporating articulated objects into generated scenes in Sec. 3.1. In Sec. 3.2, we detail the model structure and training process of PhyScene, through which it learns prior layout knowledge from the dataset. To ensure physical interactivity, we consider collision avoidance, room layout constraint, and agent interactiveness as three key constraints, and provide details in Sec. 3.3 on transforming them into guidance functions for posterior optimization during the inference process.

3.1 Object Representation

The scene $\mathbf{x}$ is composed of $N$ objects, denoted as $\mathbf{x}=\{\mathbf{o}_{1},\ldots,\mathbf{o}_{N}\}$. Each object representation $\mathbf{o}_{i}=[\mathbf{c}_{i},\mathbf{s}_{i},\mathbf{r}_{i},\mathbf{t}_{i},\mathbf{f}_{i}]$ is composed of a semantic label $\mathbf{c}_{i}\in\mathbb{R}^{C}$ out of $C$ categories, size $\mathbf{s}_{i}\in\mathbb{R}^{3}$, orientation $\mathbf{r}_{i}=(\cos\theta_{i},\sin\theta_{i})\in\mathbb{R}^{2}$, location $\mathbf{t}_{i}\in\mathbb{R}^{3}$, and 3D feature $\mathbf{f}_{i}\in\mathbb{R}^{32}$ encoded from the shape of the object. Notably, common approaches for scene synthesis retrieve objects using the predicted size $\mathbf{s}_{i}$ and label $\mathbf{c}_{i}$. However, such methods cannot be applied across asset libraries. We therefore leverage the shape feature $\mathbf{f}_{i}$ as a critical indicator for object retrieval, especially considering that the objects in available articulated object datasets are largely different from those in scene synthesis datasets. Specifically, we follow [51] and utilize a variational auto-encoder to embed object geometric features, transforming each 3D furniture model into a latent shape feature. For generating scenes with interactable objects, we consider object assets from: 1) 3D-FUTURE [15], which contains the CAD models used in 3D-FRONT [14], and 2) GAPartNet [17], which includes various articulated objects. During inference, we use the latent encoded feature to find the best match of articulated objects in GAPartNet given the static objects in 3D-FRONT, thereby enabling the generation of scenes containing interactable objects.
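
The object representation and feature-based retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SceneObject` class, the `retrieve_asset` helper, and the choice of Euclidean distance between latent codes as the matching rule are all hypothetical, and the latent features here are random stand-ins for the outputs of a shape auto-encoder.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SceneObject:
    """One object o_i = [c_i, s_i, r_i, t_i, f_i] (hypothetical container)."""
    label: np.ndarray        # c_i: one-hot vector over C categories
    size: np.ndarray         # s_i in R^3
    orientation: np.ndarray  # r_i = (cos(theta_i), sin(theta_i)) in R^2
    location: np.ndarray     # t_i in R^3
    feature: np.ndarray      # f_i in R^32: latent shape code

    def to_vector(self) -> np.ndarray:
        """Concatenate all attributes into one flat vector of length C + 40."""
        return np.concatenate(
            [self.label, self.size, self.orientation, self.location, self.feature]
        )


def retrieve_asset(query_feature: np.ndarray, asset_features: np.ndarray) -> int:
    """Return the index of the library asset closest to the query feature.

    Assumption: nearest neighbor under L2 distance in latent space.
    asset_features has shape (num_assets, 32).
    """
    dists = np.linalg.norm(asset_features - query_feature, axis=1)
    return int(np.argmin(dists))
```

In this sketch, a generated static object would be swapped for an articulated one by calling `retrieve_asset(obj.feature, library)` with `library` holding the latent codes of the articulated assets. A practical matching rule would likely also account for category and size, which the sketch leaves out.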

3.2 Conditional Diffusion for Layout Modeling

With a data sample $\mathbf{x}_{0}$ representing the scene layout in the dataset, we gradually add Gaussian noise to $\mathbf{x}_{0}$ with a forward process $q(\mathbf{x}_{t+1}|\mathbf{x}_{t})$, converting it into Gaussian noise $\mathbf{x}_{T}$. Then a reverse denoising process $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})$ with learnable parameters $\theta$ is applied to recover the data from noise. Additionally, we use the floor plan $\mathcal{F}$ as a condition for incorporating the workspace and room layout constraints. In this case, we reconstruct $\mathbf{x}_{0}$ via:

$$
\begin{aligned}
p_{\theta}(\mathbf{x}_{0}|\mathcal{F}) &= p(\mathbf{x}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathcal{F}), \\
p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathcal{F}) &= \mathcal{N}(\mathbf{x}_{t-1};\mu_{\theta}(\mathbf{x}_{t},t,\mathcal{F}),\Sigma_{\theta}(\mathbf{x}_{t},t,\mathcal{F})),
\end{aligned}
$$

where $p_{\theta}(\mathbf{x}_{0}|\mathcal{F})$ denotes the probability of scene layout $\mathbf{x}_{0}$ given the conditional floor plan $\mathcal{F}$. As pointed out by previous works [24], this maximization of the conditional probability $p_{\theta}(\mathbf{x}_{0}|\mathcal{F})$ can be equivalently formulated as a simplified objective of estimating the noise $\epsilon$ through:

$$
\begin{aligned}
\mathcal{L}_{\theta}(\mathbf{x}_{0}|\mathcal{F}) &= \mathbb{E}_{t,\bm{\epsilon},\mathbf{x}_{0}}\left[\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\sqrt{\hat{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\hat{\alpha}_{t}}\bm{\epsilon},t,\mathcal{F})\|_{2}^{2}\right] \\
&= \mathbb{E}_{t,\bm{\epsilon},\mathbf{x}_{0}}\left[\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})\|_{2}^{2}\right],
\end{aligned}
\tag{1}
$$

where $\hat{\alpha}_{t}$ is a pre-defined function of $t$ in the forward process according to a noise schedule (see details in the supplementary). To learn this conditional model, we utilize a U-Net with attention blocks to model $\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})$, with the time embedding $t$ and floor plan embedding $\mathcal{F}$ added as conditions within every U-Net layer.
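
As an illustration of Eq. 1, the following sketch computes a single-sample Monte Carlo estimate of the noise-prediction objective. It rests on several assumptions that are not taken from the paper: a linear beta schedule (the actual schedule is deferred to the supplementary), zero-based timestep indexing, and a hypothetical callable `eps_model(x_t, t, floor_plan)` standing in for the conditional U-Net.

```python
import numpy as np


def make_alpha_hat(num_steps: int, beta_start: float = 1e-4,
                   beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0: np.ndarray, t: int, eps: np.ndarray,
             alpha_hat: np.ndarray) -> np.ndarray:
    """Noised sample x_t = sqrt(alpha_hat_t) * x_0 + sqrt(1 - alpha_hat_t) * eps."""
    return np.sqrt(alpha_hat[t]) * x0 + np.sqrt(1.0 - alpha_hat[t]) * eps


def denoising_loss(eps_model, x0: np.ndarray, floor_plan, alpha_hat: np.ndarray,
                   rng: np.random.Generator) -> float:
    """One-sample estimate of Eq. 1: squared L2 error of the predicted noise."""
    t = int(rng.integers(len(alpha_hat)))      # uniformly sampled timestep index
    eps = rng.standard_normal(x0.shape)        # target noise
    x_t = q_sample(x0, t, eps, alpha_hat)      # forward-process sample
    eps_pred = eps_model(x_t, t, floor_plan)   # conditional noise prediction
    return float(np.sum((eps - eps_pred) ** 2))
```

Averaging `denoising_loss` over many layouts, timesteps, and noise draws approximates the expectation in Eq. 1. In practice the model and the loss would be written in an automatic differentiation framework so that the parameters can be optimized.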

3.3 Guidance for Physical Interactivity

Considering the physical constraint violations in scenes from existing training data (as shown in Tab. 1), we ensure the physical plausibility and interactivity of generated scenes by guiding the conditional scene diffusion process with physics-based guidance functions. We first introduce guided sampling for diffusion models. Given a constraint function $\varphi(\mathbf{x},\mathcal{F})$, we formulate the guided inference problem as optimizing the probability of constraint satisfaction:

$$
\begin{aligned}
p(\mathbf{x}_{0}|\mathcal{F},O=1) &\propto p_{\theta}(\mathbf{x}_{0}|\mathcal{F})\,p(O=1|\mathbf{x}_{0},\mathcal{F}) \\
&\propto p_{\theta}(\mathbf{x}_{0}|\mathcal{F})\cdot\exp\left(\varphi(\mathbf{x}_{0},\mathcal{F})\right),
\end{aligned}
\tag{2}
$$

where $O$ is an optimality indicator checking if the conditionally generated output $\mathbf{x}_{t}$ at denoising step $t$ satisfies the constraints in $\varphi(\mathbf{x},\mathcal{F})$. Similar to [26], we use the first-order Taylor expansion around $\mathbf{x}_{t}=\bm{\mu}$ at timestep $t$ to estimate the optimal condition in Eq. 2 with:

$$
\begin{aligned}
\log p(O=1|\mathbf{x}_{t},\mathcal{F}) &\approx (\mathbf{x}_{t}-\bm{\mu})\mathbf{g}+C, \\
\mathbf{g} &= \nabla_{\mathbf{x}_{t}}\log p(O=1|\mathbf{x}_{t},\mathcal{F})|_{\mathbf{x}_{t}=\bm{\mu}} \\
&= \nabla_{\mathbf{x}_{t}}\varphi(\mathbf{x}_{t},\mathcal{F})|_{\mathbf{x}_{t}=\bm{\mu}},
\end{aligned}
\tag{3}
$$

where $\bm{\mu}=\bm{\mu}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})$, $\mathbf{g}$ is the first-order gradient estimate of $\log p(O=1|\mathbf{x}_{t},\mathcal{F})$ at $\mathbf{x}_{t}=\bm{\mu}$, and $C$ is a constant. Therefore, to generate a scene with constraints considered, we can modify the denoising process with a constraint-perturbed Gaussian transition:

$$p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathcal{F},O=1)=\mathcal{N}(\mathbf{x}_{t-1};\bm{\mu}+\lambda\bm{\Sigma}\mathbf{g},\bm{\Sigma}), \tag{4}$$

where $\bm{\Sigma}=\bm{\Sigma}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})$ and $\lambda$ is a scaling factor. Notably, the formulations in Eq. 2 and Eq. 4 leverage the predefined constraint functions $\varphi(\mathbf{x},\mathcal{F})$ as a tilting function on the original scene layout distribution to handle constraints.
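To see where the shifted mean in Eq. 4 comes from, the sketch below completes the square for the product of the unguided Gaussian transition and the linearized constraint term of Eq. 3. It assumes $\lambda=1$ and a symmetric $\bm{\Sigma}$, writes the variable being sampled as $\mathbf{x}_{t-1}$, and introduces the constant $C_{1}$ for illustration. This is the standard classifier-guidance argument, not a derivation stated in this paper.

```latex
\begin{aligned}
\log\big[p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathcal{F})\,p(O=1|\mathbf{x}_{t-1},\mathcal{F})\big]
&\approx -\tfrac{1}{2}(\mathbf{x}_{t-1}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\mathbf{x}_{t-1}-\bm{\mu})
   +(\mathbf{x}_{t-1}-\bm{\mu})^{\top}\mathbf{g}+C_{1}\\
&= -\tfrac{1}{2}(\mathbf{x}_{t-1}-\bm{\mu}-\bm{\Sigma}\mathbf{g})^{\top}\bm{\Sigma}^{-1}(\mathbf{x}_{t-1}-\bm{\mu}-\bm{\Sigma}\mathbf{g})
   +\tfrac{1}{2}\mathbf{g}^{\top}\bm{\Sigma}\mathbf{g}+C_{1}
\end{aligned}
```

The last line is, up to terms that do not depend on $\mathbf{x}_{t-1}$, the log-density of a Gaussian with mean $\bm{\mu}+\bm{\Sigma}\mathbf{g}$ and covariance $\bm{\Sigma}$; the factor $\lambda$ then scales how strongly the mean is shifted.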

Under this formulation, we can easily incorporate the constraint functions into both learning and inference. Following Eq. 1, we can reformulate the optimization objective with $\varphi(\mathbf{x},\mathcal{F})$ as:

$$\mathcal{L}_{\theta}(\mathbf{x}_{0}|\mathcal{F},O=1)=\mathbb{E}_{t,\bm{\epsilon},\mathbf{x}_{0}}\left[\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})-\lambda\bm{\Sigma}\mathbf{g}\|_{2}^{2}\right] \tag{5}$$

In scene synthesis, the guidance functions $\varphi(\mathbf{x}_{t},\mathcal{F})$ usually require real scene layouts to compute constraint violations. Therefore, instead of optimizing over $\mathbf{x}_{t}$, which might not correspond to a meaningful scene, we convert the guidance functions into $\varphi(\tilde{\mathbf{x}}_{0}^{t},\mathcal{F})$, where $\tilde{\mathbf{x}}_{0}^{t}$ is the scene layout predicted from $\mathbf{x}_{t}$. We summarize the guided learning and inference process of PhyScene in Algorithm 1.
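One common way to obtain a predicted clean layout from a noisy one is to invert the forward noising step using the predicted noise. The expression below assumes the standard DDPM noise-prediction parameterization and the forward process $\mathbf{x}_{t}=\sqrt{\hat{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\hat{\alpha}_{t}}\bm{\epsilon}$ used in Algorithm 1. The text does not spell out the estimator, so this is an assumption rather than the authors' stated formula.

```latex
\tilde{\mathbf{x}}_{0}^{t}
  =\frac{\mathbf{x}_{t}-\sqrt{1-\hat{\alpha}_{t}}\;\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})}{\sqrt{\hat{\alpha}_{t}}}
```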

Modules: Model $p_{\theta}(\cdot|\mathcal{F})$, guidance functions $\varphi(\cdot,\mathcal{F})=\{\varphi_{\text{coll}}(\cdot),\varphi_{\text{layout}}(\cdot,\mathcal{F}),\varphi_{\text{reach}}(\cdot,\mathcal{F})\}$.

// constraint-guided learning

Input: 3D scene layout $\mathbf{x}=\{\mathbf{o}_{1},\ldots,\mathbf{o}_{N}\}$ with floor plan $\mathcal{F}$, where $N$ is a fixed number of objects.

repeat

$\mathbf{x}_{0}\sim p(\mathbf{x}_{0}|\mathcal{F})$

$\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, $t\sim\mathcal{U}(\{1,\cdots,T\})$

$\mathbf{x}_{t}=\sqrt{\hat{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\hat{\alpha}_{t}}\bm{\epsilon}$, $\tilde{\mathbf{x}}_{0}^{t}\sim p_{\theta}(\cdot|\mathcal{F})$

$\theta=\theta-\eta\nabla_{\theta}\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})-\lambda\bm{\Sigma}\mathbf{g}\|_{2}^{2}$

until converged;

// one-step guided sampling

function sample($\mathbf{x}_{t}$, $\varphi$):

$\bm{\mu}=\bm{\mu}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})$, $\bm{\Sigma}=\bm{\Sigma}_{\theta}(\mathbf{x}_{t},t,\mathcal{F})$

$\varphi(\mathbf{x}_{t},\mathcal{F})=\gamma_{1}\varphi_{\text{coll}}(\mathbf{x}_{t})+\gamma_{2}\varphi_{\text{layout}}(\mathbf{x}_{t},\mathcal{F})+\gamma_{3}\varphi_{\text{reach}}(\mathbf{x}_{t},\mathcal{F})$

$\mathbf{x}_{t-1}\sim\mathcal{N}(\bm{\mu}+\lambda\bm{\Sigma}\nabla_{\mathbf{x}_{t}}\varphi(\mathbf{x}_{t},\mathcal{F})|_{\mathbf{x}_{t}=\bm{\mu}},\bm{\Sigma})$

return $\mathbf{x}_{t-1}$

// constraint-guided generation

Input: initial scene layout $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$

for $t=T,\cdots,1$ do

// sampling with optimization

$\mathbf{x}_{t-1}=\textbf{sample}(\mathbf{x}_{t},\varphi)$

end for

return $\mathbf{x}_{0}$

Algorithm 1 Learning and inference in PhyScene
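The one-step guided sampling routine of Algorithm 1 can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: it assumes a diagonal covariance, replaces automatic differentiation with central finite differences, and uses a toy guidance function. The names `numerical_grad`, `guided_mean`, `guided_step`, and `toy_phi` are hypothetical.

```python
import random


def numerical_grad(phi, x, h=1e-4):
    """Central finite-difference gradient of a scalar function phi at x.

    Assumption: used here only to keep the sketch dependency-free; a real
    model would differentiate phi with an autodiff framework.
    """
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((phi(x_plus) - phi(x_minus)) / (2.0 * h))
    return grad


def guided_mean(mu, sigma_diag, phi, lam=1.0):
    """Shifted mean mu + lam * Sigma * grad(phi), gradient taken at x = mu."""
    g = numerical_grad(phi, mu)
    return [m + lam * s * gi for m, s, gi in zip(mu, sigma_diag, g)]


def guided_step(mu, sigma_diag, phi, lam=1.0, rng=random):
    """Draw x_{t-1} ~ N(mu + lam * Sigma * g, Sigma) for a diagonal Sigma."""
    mean = guided_mean(mu, sigma_diag, phi, lam)
    return [rng.gauss(m, s ** 0.5) for m, s in zip(mean, sigma_diag)]


# Toy guidance (hypothetical): prefer layouts whose first coordinate is near 1.0.
toy_phi = lambda x: -(x[0] - 1.0) ** 2

# The gradient at mu = (0, 0) is (2, 0), so only the first coordinate moves.
shifted = guided_mean([0.0, 0.0], [0.5, 0.5], toy_phi, lam=1.0)
sample = guided_step([0.0, 0.0], [0.5, 0.5], toy_phi, lam=1.0)
```

In this toy case the mean of the first coordinate moves from 0 toward the preferred value, while the second coordinate, which the guidance function ignores, is left unchanged.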

Based on this formulation, we further propose three physics-based guidance functions, $\varphi_{\text{coll}}(\mathbf{x})$, $\varphi_{\text{layout}}(\mathbf{x},\mathcal{F})$, and $\varphi_{\text{reach}}(\mathbf{x},\mathcal{F})$, and integrate them into the inference process as illustrated in Algorithm 1. We detail the design of each guidance function as follows:

Collision Avoidance. We design a collision avoidance function to reduce object mesh collisions in the generated scene. Instead of computing mesh-level collisions between objects, we use the predicted bounding boxes and object centers as effective approximations for estimating the collision score of objects. Specifically, we use $\bm{b}_{i}=[\mathbf{t}_{i},\mathbf{r}_{i},\mathbf{s}_{i}]$ to denote the 3D bounding box of object $\bm{o}_{i}$, including its location $\mathbf{t}_{i}$, orientation $\mathbf{r}_{i}$, and size $\mathbf{s}_{i}$. We use 3D IoU [66] to calculate the collision guidance objective via:

$$\varphi_{\text{coll}}(\mathbf{x})=-\sum_{i,j,i\neq j}\mathbf{IoU}_{3D}(\bm{b}_{i},\bm{b}_{j}), \tag{6}$$

where $\mathbf{IoU}_{3D}$ represents the 3D IoU between object bounding boxes. We sum the IoU over each pair of objects in scene $\mathbf{x}$ and negate the sum to penalize object collisions.
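As a concrete illustration of Eq. 6, the sketch below evaluates the collision guidance for a handful of boxes. It is a simplification under stated assumptions: boxes are treated as axis-aligned (the orientation $\mathbf{r}_{i}$ is ignored), and each size is interpreted as full edge lengths. The function names `iou_3d_axis_aligned` and `phi_coll` are hypothetical.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (center_xyz, size_xyz).

    Assumption: orientation is ignored and sizes are full edge lengths.
    """
    (center_a, size_a), (center_b, size_b) = box_a, box_b
    inter = 1.0
    for k in range(3):
        lo = max(center_a[k] - size_a[k] / 2.0, center_b[k] - size_b[k] / 2.0)
        hi = min(center_a[k] + size_a[k] / 2.0, center_b[k] + size_b[k] / 2.0)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    vol_a = size_a[0] * size_a[1] * size_a[2]
    vol_b = size_b[0] * size_b[1] * size_b[2]
    union = vol_a + vol_b - inter
    return inter / union if union > 0.0 else 0.0


def phi_coll(boxes):
    """Negative sum of pairwise IoU over all ordered pairs (i, j) with i != j."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i != j:
                total += iou_3d_axis_aligned(boxes[i], boxes[j])
    return -total


# Two unit cubes offset by 0.5 along x overlap; the third cube is far away.
example_boxes = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((0.5, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((5.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
]
score = phi_coll(example_boxes)
```

The overlapping pair has intersection volume 0.5 and union volume 1.5, so its IoU is 1/3; because the sum runs over ordered pairs, it is counted twice and the score is -2/3. A collision-free layout attains the maximum value of 0.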

Room-layout guidance. An important goal of scalable scene synthesis is to generate interactable house-level scenes in which embodied agents can navigate and interact. To achieve this goal, we add a room-layout guidance that penalizes objects placed outside a given floor plan. To construct this guidance function, we first extract a polygon of the room boundary from the floor plan $\mathcal{F}$. We then derive a set of $W$ outside barriers for identifying the boundary, represented as bounding boxes of walls $\{\mathbf{b}^{\text{wall}}_{w}\}_{w=1}^{W}$ with infinite thickness. We use a similar IoU score between objects and walls for room-layout guidance:

$$\varphi_{\text{layout}}(\mathbf{x}|\mathcal{F})=-\sum_{i=1}^{N}\sum_{j=1}^{W}\mathbf{IoU}_{3D}(\bm{b}_{i},\bm{b}_{j}^{\text{wall}}). \tag{7}$$

Reachability guidance. For an embodied agent, the synthesized scene should allow it to traverse the entire room and interact with all objects successfully. Notably, in scenes with improper layouts, the synthesized room is often separated into several disjoint connected regions. Based on this key observation, we aim to adjust the locations of the objects that most significantly affect the connectivity between regions. More specifically, considering an embodied agent represented by its bounding box $\bm{b}^{\text{agent}}$, we first map the generated scene to a 2D room mask and calculate the walkable area in this scene considering the agent's size. Next, we place a Gaussian distribution on each positioned object in the scene to form a cost map for traversing the scene. Intuitively, points closer to objects have higher costs. With the cost map, we plan the shortest path between the centers of the two largest connected regions using the A* algorithm [20]. The resulting path indicates the least-effort path to traverse between these two regions. We then select $L$ agent positions on this shortest path with bounding boxes $\{\bm{b}^{\text{agent}}_{1},\ldots,\bm{b}^{\text{agent}}_{L}\}$ for applying the guidance function. The reachability guidance can therefore be calculated via:

$$\varphi_{\text{reach}}(\mathbf{x}|\mathcal{F})=-\sum_{i=1}^{N}\sum_{j=1}^{L}\mathbf{IoU}_{3D}(\bm{b}_{i},\bm{b}^{\text{agent}}_{j}). \tag{8}$$
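The path-planning stage that produces the agent positions in Eq. 8 can be sketched on a 2D grid. The code below is an illustrative simplification, not the authors' pipeline: the grid resolution, the Gaussian width and weight, the 4-connected moves, and the even spacing of sampled positions are all assumptions, and the function names are hypothetical.

```python
import heapq
import math


def gaussian_cost_map(height, width, object_centers, sigma=1.0, weight=10.0):
    """Cost grid in which cells near object centers are more expensive."""
    cost = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for obj_r, obj_c in object_centers:
                d2 = (r - obj_r) ** 2 + (c - obj_c) ** 2
                cost[r][c] += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    return cost


def astar(cost, start, goal):
    """Least-cost 4-connected path; each move costs 1 plus the entered cell's cost."""
    height, width = len(cost), len(cost[0])
    # Manhattan distance never overestimates because every move costs at least 1.
    heuristic = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    best = {start: 0.0}
    parent = {}
    frontier = [(heuristic(start), start)]
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < height and 0 <= nxt[1] < width:
                new_cost = best[cur] + 1.0 + cost[nxt[0]][nxt[1]]
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    parent[nxt] = cur
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), nxt))
    return None


def sample_agent_positions(path, num_positions):
    """Pick num_positions roughly evenly spaced cells along the path."""
    if num_positions >= len(path):
        return list(path)
    step = (len(path) - 1) / (num_positions - 1) if num_positions > 1 else 0
    return [path[round(k * step)] for k in range(num_positions)]


# One object in the middle of a 5x5 room; plan from the left edge to the right edge.
cost_map = gaussian_cost_map(5, 5, object_centers=[(2, 2)])
path = astar(cost_map, start=(2, 0), goal=(2, 4))
agent_cells = sample_agent_positions(path, num_positions=3)
```

Cells are never hard-blocked in this sketch, so a path always exists even between regions that objects separate. The sampled cells would then be turned into agent bounding boxes, and objects overlapping them are penalized through Eq. 8.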

Notably, with some simple modifications, we can extend the current method to incorporate interaction constraints that support articulated object interaction (e.g., grasping, opening) as well as complex rigid object interaction (e.g., sitting). More details are provided in the supplementary.

4 Experiment

Table 2: Quantitative comparison on unconditional scene synthesis trained on 3D-FRONT. We compare PhyScene with ATISS and DiffuScene on common perceptual quality scores FID, SCA, CKL, as well as physical plausibility measured in collision rate $\mathbf{Col}$.

Figure 3: Visualization of floor-plan conditioned scene synthesis between PhyScene, ATISS, and DiffuScene. The red, purple, and blue boxes highlight collisions between objects, objects outside the floor plan, and unreachable areas to the embodied agent, respectively.

Dataset

For experimental comparisons, we train our diffusion model on the 3D-FRONT dataset [14], which contains 6813 houses with 14629 rooms. Each room is manually decorated with high-quality furniture objects from the 3D-FUTURE dataset [15]. Following the setting of DiffuScene [51] and ATISS [42], we use 4041 bedrooms, 900 dining rooms, and 813 living rooms for training and testing. In addition, we use both the 3D-FUTURE dataset [15] and GAPartNet [17] for object retrieval. Among them, GAPartNet [17] has abundant interactive assets, containing 1166 articulated objects from 27 object categories. We utilize articulated objects in the table and storage furniture categories, such as wardrobe and table, to retrieve related objects in generated scenes, and provide the full object category mapping between datasets in the supplementary.

Baseline

We mainly consider two state-of-the-art scene synthesis methods as baselines: 1) ATISS [42], a transformer-based model that predicts 3D object bounding boxes in an autoregressive manner, and 2) DiffuScene [51], a diffusion-based model that learns 3D object layouts without floor plan constraints. We test these baselines in both unconditional and floor-plan-conditioned synthesis settings for comparison with our proposed PhyScene model.

Metric

To evaluate the realism and diversity of the synthesized scenes, we follow previous works and calculate Fréchet Inception Distance [22] (FID), Kernel Inception Distance [5] (KID $\times 0.001$), Scene Classification Accuracy (SCA), and Category KL divergence (CKL $\times 0.01$) on 1000 synthesized scenes. In addition, we check the collision rate between each pair of objects in the generated scene using their CAD models. We use $\mathbf{Col}_{\text{obj}}$ to denote the percentage of objects that collide with other objects in the generated scene, and $\mathbf{Col}_{\text{scene}}$ to denote the ratio of scenes that contain object collisions over all generated scenes. Since the CAD models in the 3D-FUTURE dataset are usually not watertight, we apply re-meshing to each object mesh before evaluation. To evaluate the violation of the floor plan layout, we denote the rate of objects outside the floor plan as $\mathbf{R}_{\text{out}}$. Finally, we calculate the average reachable rate of objects in the scene, $\mathbf{R}_{\text{reach}}$, starting from a random starting point on the floor plan. We also calculate the average ratio of the largest connected walkable area over all walkable areas in the room, denoted as $\mathbf{R}_{\text{walkable}}$, to evaluate the reachability and interactivity of the generated scenes.
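To make the collision and walkability metrics concrete, the sketch below computes them from precomputed inputs. It assumes that colliding object pairs have already been detected per scene (for example, from the re-meshed CAD models), that $\mathbf{Col}_{\text{obj}}$ pools objects across all scenes, and that the walkable area is a binary grid with 4-connectivity. These choices and the function names are illustrative assumptions, not details given in the paper.

```python
from collections import deque


def collision_rates(scenes):
    """Return (Col_obj, Col_scene).

    scenes: list of (num_objects, colliding_pairs), where colliding_pairs is a
    list of (i, j) object-index pairs detected as colliding in that scene.
    """
    total_objects = 0
    colliding_objects = 0
    colliding_scenes = 0
    for num_objects, pairs in scenes:
        involved = {idx for pair in pairs for idx in pair}
        total_objects += num_objects
        colliding_objects += len(involved)
        colliding_scenes += 1 if pairs else 0
    return colliding_objects / total_objects, colliding_scenes / len(scenes)


def walkable_ratio(walkable):
    """Largest 4-connected walkable region divided by all walkable cells."""
    height, width = len(walkable), len(walkable[0])
    seen = [[False] * width for _ in range(height)]
    total, largest = 0, 0
    for r in range(height):
        for c in range(width):
            if walkable[r][c] and not seen[r][c]:
                # Breadth-first flood fill of one connected region.
                size, queue = 0, deque([(r, c)])
                seen[r][c] = True
                while queue:
                    cr, cc = queue.popleft()
                    size += 1
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and walkable[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
                total += size
                largest = max(largest, size)
    return largest / total if total else 0.0


# Three toy scenes: 5 of 10 objects collide, and 2 of 3 scenes contain a collision.
example_scenes = [(4, [(0, 1)]), (3, []), (3, [(0, 2), (1, 2)])]
col_obj, col_scene = collision_rates(example_scenes)

# A column of obstacles splits the walkable cells into regions of size 4 and 2.
grid = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
ratio = walkable_ratio(grid)
```

Averaging `walkable_ratio` over all generated rooms would give a quantity in the spirit of $\mathbf{R}_{\text{walkable}}$; a value of 1 means the walkable area forms a single connected region.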

Table 3: Floor-conditioned Scene Synthesis. We compare PhyScene with ATISS and DiffuScene on common perceptual quality scores FID, KID, SCA, CKL, as well as physical plausibility metrics $\mathbf{Col}_{\text{obj}}$, $\mathbf{Col}_{\text{scene}}$, $\mathbf{R}_{\text{out}}$, $\mathbf{R}_{\text{reach}}$, $\mathbf{R}_{\text{walkable}}$.

Room Type	Method	FID ↓	KID ↓	SCA ↓	CKL ↓	$\mathbf{Col}_{\text{obj}}$ ↓	$\mathbf{Col}_{\text{scene}}$ ↓	$\mathbf{R}_{\text{out}}$ ↓	$\mathbf{R}_{\text{walkable}}$ ↑	$\mathbf{R}_{\text{reach}}$ ↑
Bedroom	ATISS	30.19	0.0010	49.14	0.0028	0.248	0.46	0.286	0.839	0.736
Bedroom	DiffuScene	25.00	0.0004	51.78	0.0031	0.228	0.43	0.272	0.827	0.755
Bedroom	PhyScene (Ours)	25.52	0.0006	50.10	0.0025	0.187	0.36	0.245	0.865	0.762
Living Room	ATISS	45.66	0.0035	51.64	0.0016	0.316	0.85	0.136	0.814	0.791
Living Room	DiffuScene	38.69	0.0012	54.06	0.0017	0.198	0.69	0.238	0.790	0.756
Living Room	PhyScene (Ours)	43.33	0.0031	53.50	0.0015	0.191	0.63	0.219	0.815	0.771
Dining Room	ATISS	41.66	0.0039	64.57	0.0040	0.591	0.96	0.132	0.874	0.848
Dining Room	DiffuScene	38.31	0.0020	60.19	0.0013	0.160	0.55	0.244	0.787	0.847
Dining Room	PhyScene (Ours)	39.90	0.0026	60.00	0.0013	0.151	0.53	0.217	0.852	0.789

4.1 Unconditioned Scene Synthesis

We provide quantitative evaluation results in Tab. 2. As shown in Tab. 2, PhyScene achieves state-of-the-art results on almost all metrics, with an especially significant improvement on physical plausibility metrics such as $\mathbf{Col}_{\text{obj}}$ and $\mathbf{Col}_{\text{scene}}$. This result shows quantitatively that PhyScene produces improved scene layouts with reduced collision rates while achieving better visual plausibility. Notably, diffusion-based scene synthesis models (i.e., DiffuScene and PhyScene) exhibit superior performance in collision avoidance compared to ATISS. This supports the use of diffusion models as the primary generative model for scene synthesis, given their robust performance and adaptability in integrating guidance functions. We provide qualitative results in Fig. 3, demonstrating that our model generates scenes with significantly fewer physical constraint violations due to object collisions while maintaining high levels of naturalness and diversity.

Figure 4: Generated scenes with articulated objects. We visualize the opening sequence of articulated objects (left) and the generated scenes with texture (right).

4.2 Floor-conditioned Scene Synthesis

We provide comparisons between PhyScene and baseline models in terms of both visual and physical metrics in Tab. 3. PhyScene surpasses baselines in collision metrics and the CKL score. Additionally, compared to DiffuScene, our model improves on nearly all physical interactability metrics (the one exception in Tab. 3 is $\mathbf{R}_{\text{reach}}$ in the Dining Room setting), highlighting the effectiveness of our physical guidance functions in enhancing the generation process of diffusion-based models with physical constraints. It is noteworthy that, except for the Bedroom setting, ATISS achieves favorable results on the floor plan violation metric ($\mathbf{R}_{\text{out}}$) and the reachability metric ($\mathbf{R}_{\text{reach}}$). We attribute this to its prioritization of floor plan constraints over collision avoidance within the scene. We provide qualitative visualizations of all models' generations in Fig. 3.

Table 4: Articulated Object Embedding. We compare PhyScene with ATISS and DiffuScene on physical plausibility.

4.3 Scene Synthesis with Articulated Objects

To generate scenes with articulated objects, we utilize the predicted scene layout along with object features to retrieve articulated objects. Recognizing the spatial requirements for interacting with articulated objects, we compute 3D bounding boxes for these objects with their joints manipulated to the fullest extent, and use these expanded bounding boxes for guidance calculation. We show the quantitative results of our guided substitution for articulated objects in the Living Room setting in Tab. 4. The results show that the collision rate with articulated objects is much higher than that with rigid objects (cf. the collision rates in Tab. 3), and that our model improves substantially over previous methods. We visualize the qualitative results of our guided substitution for articulated objects in Fig. 4 and leave more visualizations to the supplementary.

Table 5: Ablation study on the use of guidance functions. Our final result balances the effectiveness of three guidances.

4.4 Ablation Study on Guidance

We conduct ablation studies on our proposed guidance functions in the Living Room setting in Tab. 5. Given that these guidance functions serve different roles in layout optimization, they may conflict with each other. As shown in Tab. 5, the $\mathbf{Col}_{\text{obj}}$ and $\mathbf{R}_{\text{out}}$ metrics trade off against each other because the collision guidance pushes objects apart while the floor plan guidance pushes objects closer together to fit within the scene. However, we manage to strike a balance among these guidance functions, leading to improvements on each corresponding metric. We provide qualitative visualizations illustrating the effect of each guidance function in Fig. 5.

Figure 5: Ablation on Guidance. Results of different guidance with floor-plan conditions. For each ablation on guidance functions, we show four generated scenes (four columns) without guidance in the first row and mark the violation of constraints in red boxes. The second row shows the improvement after considering guidance functions in green boxes.

5 Conclusion

In this paper, we introduce PhyScene, a guided conditional diffusion model for physically interactable scene synthesis. To ensure the physical plausibility and interactivity of the generated scenes, we devise novel guidance functions converting constraints on object collision, room layout, and interactivity to guidance within each inference step in the diffusion process. Our experimental results demonstrate consistent performance improvement over state-of-the-art baseline models on physical plausibility and interactivity metrics, showcasing the effectiveness of our designed guidance functions and the generation pipeline.

Future work. Due to data limitations, PhyScene is presently restricted to a limited set of room types and does not incorporate small objects. This limitation poses a significant obstacle to the applicability of these scenes in embodied AI tasks, particularly those involving small object manipulation such as pick-and-place tasks. We leave this area as an important focus for future research.

Acknowledgement We thank Ms. Zhen Chen from BIGAI for refining the figures, and all colleagues from the BIGAI TongVerse project for fruitful discussions and help on simulation developments. We would also like to thank the anonymous reviewers for their constructive feedback.

References

Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning (CoRL), 2022.
Anderson et al. [2018] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Bansal et al. [2023] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Baruch et al. [2021] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2021.
Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Chang et al. [2014] Angel Chang, Manolis Savva, and Christopher D Manning. Learning spatial knowledge for text to 3d scene generation. In Proceedings of the conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Deitke et al. [2020] Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, et al. Robothor: An open simulation-to-real embodied ai platform. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Deitke et al. [2022] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
Dhamo et al. [2021] Helisa Dhamo, Fabian Manhardt, Nassir Navab, and Federico Tombari. Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs. In Proceedings of International Conference on Computer Vision (ICCV), 2021.
Driess et al. [2023] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In Proceedings of International Conference on Machine Learning (ICML), 2023.
Fisher et al. [2015] Matthew Fisher, Manolis Savva, Yangyan Li, Pat Hanrahan, and Matthias Nießner. Activity-centric scene synthesis for functional 3d scene modeling. ACM Transactions on Graphics (TOG), 34(6):1–13, 2015.
Fu et al. [2021a] Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of International Conference on Computer Vision (ICCV), 2021a.
Fu et al. [2021b] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision (IJCV), 129:3313–3337, 2021b.
Fu et al. [2017] Qiang Fu, Xiaowu Chen, Xiaotian Wang, Sijia Wen, Bin Zhou, and Hongbo Fu. Adaptive synthesis of indoor scenes via activity-associated object relation graphs. ACM Transactions on Graphics (TOG), 36(6):1–13, 2017.
Geng et al. [2023] Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Gong et al. [2023] Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. Arnold: A benchmark for language-grounded task learning with continuous states in realistic 3d scenes. In Proceedings of International Conference on Computer Vision (ICCV), 2023.
Gu et al. [2023] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.
Hart et al. [1968] Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
Hassan et al. [2019] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3d human pose ambiguities with 3d scene constraints. In Proceedings of International Conference on Computer Vision (ICCV), 2019.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.
Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2020.
Huang et al. [2023a] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023a.
Huang et al. [2023b] Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
Huang et al. [2022] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning (CoRL), 2022.
Inoue et al. [2023] Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. Layoutdm: Discrete diffusion model for controllable layout generation. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Jia et al. [2024] Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. arXiv preprint arXiv:2401.09340, 2024.
Jiang et al. [2018] Chenfanfu Jiang, Siyuan Qi, Yixin Zhu, Siyuan Huang, Jenny Lin, Lap-Fai Yu, Demetri Terzopoulos, and Song-Chun Zhu. Configurable 3d scene synthesis and 2d image rendering with per-pixel ground truth using stochastic grammars. International Journal of Computer Vision (IJCV), pages 920–941, 2018.
Jiang et al. [2022] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. arXiv, 2022.
Khanna et al. [2023] Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Schacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. arXiv preprint arXiv:2306.11290, 2023.
Kolve et al. [2017] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
Krantz et al. [2020] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In Proceedings of European Conference on Computer Vision (ECCV), 2020.
Li et al. [2021] Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021.
Li et al. [2023] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning (CoRL), 2023.
Lin et al. [2021] Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. In Conference on Robot Learning (CoRL), 2021.
Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
Mittal et al. [2023] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, et al. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 2023.
Mu et al. [2021] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021.
Nie et al. [2023] Yinyu Nie, Angela Dai, Xiaoguang Han, and Matthias Nießner. Learning 3d scene priors with 2d supervision. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Paschalidou et al. [2021] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. Atiss: Autoregressive transformers for indoor scene synthesis. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2021.
Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
Purkait et al. [2020] Pulak Purkait, Christopher Zach, and Ian Reid. Sg-vae: Scene grammar variational autoencoder to generate new indoor scenes. In Proceedings of European Conference on Computer Vision (ECCV), 2020.
Qi et al. [2018] Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, and Song-Chun Zhu. Human-centric indoor scene synthesis using stochastic grammar. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
Ruan et al. [2023] Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Shridhar et al. [2022] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning (CoRL), 2022.
Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Szot et al. [2021] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 34:251–266, 2021.
Tang et al. [2024] Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. Diffuscene: Denoising diffusion models for generative indoor scene synthesis. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Wang et al. [2021a] Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. Synthesizing long-term 3d human motion and interaction in 3d scenes. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2021a.
Wang et al. [2023] Weiqi Wang, Zihang Zhao, Ziyuan Jiao, Yixin Zhu, Song-Chun Zhu, and Hangxin Liu. Rearrange indoor scenes for human-robot co-activity. In Proceedings of International Conference on Robotics and Automation (ICRA), 2023.
Wang et al. [2021b] Xinpeng Wang, Chandan Yeshwanth, and Matthias Nießner. Sceneformer: Indoor scene generation with transformers. In Proceedings of International Conference on 3D Vision (3DV), 2021b.
Wang et al. [2024] Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Move as you say, interact as you can: Language-guided human motion generation with scene affordance. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of International Conference on Computer Vision (ICCV), 2023.
Xiang et al. [2020] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Xu et al. [2013] Kun Xu, Kang Chen, Hongbo Fu, Wei-Lun Sun, and Shi-Min Hu. Sketch2scene: Sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics (TOG), 32(4):1–15, 2013.
Yang et al. [2021] Ming-Jia Yang, Yu-Xiao Guo, Bin Zhou, and Xin Tong. Indoor scene generation from a collection of semantic-segmented depth images. In Proceedings of International Conference on Computer Vision (ICCV), 2021.
Yu et al. [2022] Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, and Ying Nian Wu. Latent diffusion energy-based model for interpretable text modeling. In Proceedings of International Conference on Machine Learning (ICML), 2022.
Yuan et al. [2023] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. In Proceedings of International Conference on Computer Vision (ICCV), 2023.
Zhai et al. [2024] Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graphs. Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2024.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of International Conference on Computer Vision (ICCV), 2023.
Zhang et al. [2020] Zaiwei Zhang, Zhenpei Yang, Chongyang Ma, Linjie Luo, Alexander Huth, Etienne Vouga, and Qixing Huang. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG), 39(2):1–21, 2020.
Zheng et al. [2022] Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, and Xin Wang. Vlmbench: A compositional benchmark for vision-and-language manipulation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
Zhou et al. [2019a] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. In Proceedings of International Conference on 3D Vision (3DV), 2019a.
Zhou et al. [2019b] Yang Zhou, Zachary While, and Evangelos Kalogerakis. Scenegraphnet: Neural message passing for 3d indoor scene augmentation. In Proceedings of International Conference on Computer Vision (ICCV), 2019b.

Supplementary Material

Appendix A Algorithm Details

A.1 Details of Parameters

We introduce the details of $\hat{\alpha}_t$ in Eq. 1. Given a data sample $\mathbf{x}_0$, we can define a forward diffusion process by adding noise. Each forward diffusion step adds Gaussian noise with variance $\beta_t$ to $\mathbf{x}_{t-1}$, resulting in a new variable $\mathbf{x}_t$ with distribution $q(\mathbf{x}_t|\mathbf{x}_{t-1})$. This process can be formulated as:

$$
q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t;\ \bm{\mu}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \bm{\Sigma}_t = \beta_t\mathbf{I}\right).
$$

Then we can formulate the diffusion process with

$$
q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}),
$$

where $q(\mathbf{x}_{1:T})$ means we apply $q$ repeatedly from timestep 1 to $T$. To simplify this process, we define $\alpha_t = 1-\beta_t$, $\hat{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and $\bm{\epsilon}, \bm{\epsilon}_0, \dots, \bm{\epsilon}_{t-1} \sim \mathcal{N}(0, \mathbf{I})$. After reparameterizing with $\hat{\alpha}_t$, we have:

$$
\begin{aligned}
\mathbf{x}_t &= \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\bm{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\bm{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\bm{\epsilon}_{t-2}\right) + \sqrt{1-\alpha_t}\,\bm{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bm{\epsilon} \\
&= \dots \\
&= \sqrt{\alpha_t\alpha_{t-1}\dots\alpha_1}\,\mathbf{x}_0 + \sqrt{1-\alpha_t\alpha_{t-1}\dots\alpha_1}\,\bm{\epsilon} \\
&= \sqrt{\hat{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\hat{\alpha}_t}\,\bm{\epsilon}.
\end{aligned}
$$

This gives the relation between $\mathbf{x}_t$ and $\mathbf{x}_0$ used in Eq. 1.
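
The step that merges two noise terms into a single $\bm{\epsilon}$ relies on the fact that a sum of independent zero-mean Gaussians is Gaussian with the summed variance: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$. The following minimal sketch checks this numerically by tracking the mean coefficient and noise variance step by step and comparing them with the closed form. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps and the function names are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def alpha_hat(betas):
    """Cumulative product: alpha_hat[t-1] = prod_{s=1..t} (1 - beta_s)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=float))

def stepwise_moments(betas):
    """Apply q(x_t | x_{t-1}) one step at a time, tracking the coefficient
    on x_0 and the variance of the accumulated Gaussian noise."""
    coef, var, out = 1.0, 0.0, []
    for beta in betas:
        coef = np.sqrt(1.0 - beta) * coef  # mean is scaled by sqrt(alpha_t)
        var = (1.0 - beta) * var + beta    # old noise is scaled, new noise adds beta_t
        out.append((coef, var))
    return np.array(out)

def q_sample(x0, t, betas, rng):
    """Draw x_t directly from x_0 with the closed form (t is 1-indexed)."""
    a_hat = alpha_hat(betas)[t - 1]
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(a_hat) * x0 + np.sqrt(1.0 - a_hat) * eps

# Assumed linear noise schedule, for illustration only.
betas = np.linspace(1e-4, 0.02, 1000)
moments = stepwise_moments(betas)
```

Under these assumptions, `moments[:, 0]` matches $\sqrt{\hat{\alpha}_t}$ and `moments[:, 1]` matches $1-\hat{\alpha}_t$ at every timestep, which is why `q_sample` can draw $\mathbf{x}_t$ in a single step instead of iterating.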

A.2 Details of Reachability Guidance

As mentioned in Sec. 3, we provide the detailed algorithm for calculating the reachability guidance in Algorithm 2.

Module: Reachability guidance function $\varphi_{\text{reach}}(\cdot|\mathcal{F})$, search algorithm $\mathbf{A}^{*}(\cdot)$, indicator function $\mathbbm{1}(\cdot)$.

Input: Floor plan $\mathcal{F}$, 3D object bboxes $\{\bm{b}_1, \dots, \bm{b}_N\}$ where $N$ is the number of objects, embodied agent's width $\bm{d}$.

// Generate Gaussian cost map

$W = \mathbbm{1}(\mathcal{F})$ // Init walkable area

$C = \neg\mathbbm{1}(\mathcal{F}) \cdot \text{MAX\_VALUE}$ // Init cost map

for $i = 1, \cdots, N$ do

$\bm{b}^{2D}_i = \text{MapTo2D}(\bm{b}_i)$

$W = W - \mathbbm{1}(\text{Dilate}(\bm{b}^{2D}_i, \bm{d}/2))$

// Add Gaussian cost for each object

$C = C + \text{Gaussian}(\bm{b}^{2D}_i)$

end for

// $A^{*}$ shortest path search

$\{\bm{c}_1, \dots, \bm{c}_M\} = \text{FindConnectedArea}(W)$

$\{\bm{p}_1, \dots, \bm{p}_M\} = \text{FindCenter}(\{\bm{c}_1, \dots, \bm{c}_M\})$

// Randomly choose $\bm{p}_{start}$ and $\bm{p}_{end}$

$\mathbf{Path}_{\text{shortest}} = \mathbf{A}^{*}(C, \bm{p}_{start}, \bm{p}_{end})$

$\{\bm{b}^{\text{agent}}_j\}_{j=1}^{L} = \text{GetAgentBox}(\mathbf{Path}_{\text{shortest}})$

// Reachability Guidance

$\varphi_{\text{reach}}(\mathbf{x}|\mathcal{F}) = -\sum_{i=1}^{N}\sum_{j=1}^{L} \mathbf{IoU}_{3D}(\bm{b}_i, \bm{b}^{\text{agent}}_j)$

return $\varphi_{\text{reach}}(\mathbf{x}|\mathcal{F})$

Algorithm 2 Reachability Guidance
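
The steps of Algorithm 2 up to the path search can be sketched on a rasterized floor plan as follows. This is a minimal illustration rather than the authors' implementation. The grid representation, the isotropic Gaussian bump (`sigma`), `max_value`, 4-connectivity, the step cost of one plus the cell cost, and the helper names `build_maps`, `connected_centers`, and `astar` are all assumptions. The final IoU term between object boxes and agent boxes is not covered.

```python
import heapq
import numpy as np

def build_maps(floor, boxes, agent_width, sigma=2.0, max_value=1e6):
    """floor: bool (H, W) grid, True inside the room.
    boxes: 2D footprints (r0, r1, c0, c1) in cells, half-open ranges.
    Returns the walkable mask W and the cost map C."""
    H, Wd = floor.shape
    walkable = floor.copy()
    cost = np.where(floor, 0.0, max_value)
    rr, cc = np.mgrid[0:H, 0:Wd]
    pad = int(np.ceil(agent_width / 2))
    for r0, r1, c0, c1 in boxes:
        # Dilate the footprint by half the agent width and remove it from W.
        walkable[max(r0 - pad, 0):r1 + pad, max(c0 - pad, 0):c1 + pad] = False
        # Assumed cost shape: Gaussian bump centred on the footprint.
        cr, cx = (r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2
        cost += np.exp(-((rr - cr) ** 2 + (cc - cx) ** 2) / (2 * sigma ** 2))
    return walkable, cost

def connected_centers(walkable):
    """Flood-fill 4-connected walkable regions; return for each region
    the member cell closest to its centroid."""
    H, Wd = walkable.shape
    seen = np.zeros((H, Wd), dtype=bool)
    centers = []
    for r, c in zip(*np.nonzero(walkable)):
        if seen[r, c]:
            continue
        stack, cells = [(int(r), int(c))], []
        seen[r, c] = True
        while stack:
            y, x = stack.pop()
            cells.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < Wd and walkable[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        cells = np.array(cells)
        nearest = np.argmin(((cells - cells.mean(axis=0)) ** 2).sum(axis=1))
        centers.append(tuple(int(v) for v in cells[nearest]))
    return centers

def astar(cost, start, goal):
    """A* on a 4-connected grid; each step costs 1 + cost[next cell].
    Manhattan distance is admissible because every step costs at least 1."""
    H, Wd = cost.shape
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0.0, start)]
    best, parent = {start: 0.0}, {}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        if g > best[cur]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < H and 0 <= nxt[1] < Wd):
                continue
            ng = g + 1.0 + cost[nxt]
            if ng < best.get(nxt, np.inf):
                best[nxt], parent[nxt] = ng, cur
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None

# Toy room: one object spanning the full height splits the walkable area in two.
floor = np.ones((12, 12), dtype=bool)
walkable, cost = build_maps(floor, [(0, 12, 5, 7)], agent_width=2)
centers = connected_centers(walkable)
path = astar(cost, centers[0], centers[1])
```

In this toy room the walkable mask has two disconnected regions, so any path between their centers must cross cells that are not walkable. Agent boxes placed along such a path overlap the blocking object, which is the overlap that the IoU term in Algorithm 2 penalizes.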

Figure A.1: Original 3D-FUTURE model vs. re-meshed model. We show examples of re-meshed models. Models on the left are the original CAD models in 3D-FUTURE, and those on the right are the re-meshed models. Despite the perceptual similarity, the re-meshed models fill in the hollow areas so that collisions can be calculated.

Figure A.2: Original GAPartNet model vs. sequential model. The original CAD models are always in the closed state. To simulate the interactive situation, we open the furniture and record the sequential process in an integrated mesh. The left model shows the original furniture, while the right one is the sequential model. We use the sequential model to compute the collision rate of articulated objects.

Figure A.3: Examples of articulated objects in GAPartNet dataset. We visualize some models of StorageFurniture and Table. The articulated models have various appearances and different joint types such as revolute and prismatic. Each piece of furniture has several joints for interaction.

Appendix B Data Processing

B.1 3D-FUTURE

The original 3D-FUTURE dataset contains object CAD models that are not watertight and therefore cannot be used directly for collision calculation. To evaluate physical collisions between objects, we re-mesh each object model in Blender and compute the collision rate on the re-meshed models. Some examples are shown in Fig. A.1, where the models on the left are the original CAD models in 3D-FUTURE and those on the right are the re-meshed models. Although the provided and re-meshed models look similar, most provided models contain hollow interiors that prevent collision calculation.

B.2 GAPartNet

To simulate the interaction between robots and articulated objects, we build upon the object CAD models and URDF files provided in GAPartNet. Specifically, we generate the articulated object's states from closed to open according to the URDF file and record the sequential process in an integrated mesh. Fig. A.2 shows the original object CAD model on the left and the integrated mesh covering the articulated object states on the right. In our experiments, we use the integrated mesh to compute both the collision rate between articulated objects and the opening size of articulated objects for guidance calculation.
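One way to picture the integrated mesh is as the union of a moving part's geometry over sampled joint states. The sketch below assumes a single prismatic joint (e.g., a drawer) described by a direction and a maximum travel; `sweep_prismatic` and `bounds` are hypothetical helpers, the drawer geometry is made up for illustration, and no URDF parsing is attempted. A revolute joint would rotate the part about its axis instead of translating it.

```python
from itertools import product
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def sweep_prismatic(part: List[Vec3], axis: Vec3, max_offset: float,
                    steps: int = 4) -> List[Vec3]:
    """Collect part vertices at evenly spaced joint offsets from closed to open."""
    swept: List[Vec3] = []
    for s in range(steps + 1):
        d = max_offset * s / steps
        swept.extend((x + d * axis[0], y + d * axis[1], z + d * axis[2])
                     for x, y, z in part)
    return swept


def bounds(vertices: List[Vec3]) -> Tuple[Vec3, Vec3]:
    """Axis-aligned bounding box (min corner, max corner) of a vertex set."""
    xs, ys, zs = zip(*vertices)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Illustrative drawer: a unit cube that slides 0.4 m along +y when opened.
drawer = [tuple(map(float, corner)) for corner in product((0, 1), repeat=3)]
integrated = sweep_prismatic(drawer, axis=(0.0, 1.0, 0.0), max_offset=0.4)

closed_min, closed_max = bounds(drawer)
open_min, open_max = bounds(integrated)
# Growth of the bounding box along the joint axis, used here as the opening size.
opening_size = open_max[1] - closed_max[1]
```

Using the swept extent rather than the closed extent reserves the space the part needs when it is fully opened.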

B.3 Retrieval Categories

As our method still primarily depends on retrieving object models to generate the final scene, we combine assets from the 3D-FUTURE and GAPartNet datasets for retrieval. Fig. A.4 shows the categories used from the 3D-FUTURE dataset with their corresponding asset numbers. We build a mapping between the 3D-FUTURE object assets and GAPartNet to align interactive categories between the two datasets; for example, wardrobe in 3D-FUTURE (shown in orange) maps to the StorageFurniture category in GAPartNet. Fig. A.5 shows the category distribution of GAPartNet models, where StorageFurniture and Table take the largest proportion of the dataset. For example, StorageFurniture accounts for 324 of the 1045 models in the dataset. The articulated models have various appearances and different joint types such as revolute and prismatic. Each piece of furniture has several joints for interaction. We visualize some models of StorageFurniture and Table in GAPartNet in Fig. A.3.

Figure A.4: Category distribution in 3D-FUTURE dataset. We show the categories used from the 3D-FUTURE dataset with their asset numbers. We choose interactive categories such as wardrobe, shown in orange, to retrieve GAPartNet models.

Figure A.5: Category distribution in GAPartNet dataset. We show the category distribution of GAPartNet models, where StorageFurniture and Table take the largest proportion of this dataset. These two categories, shown in orange, are used to compose interactable scenes with cross-dataset retrieval.

Appendix C Additional Results

C.1 Physically Implausible Scenes in 3D-FRONT

Following the brief discussion of Tab. 1, we provide further qualitative visualizations of physical-plausibility violations in the 3D-FRONT scene data in Fig. A.6. As the visualizations show, some of the scenes used for learning exhibit significant violations of physical plausibility, including object collisions and object-out-of-room scenarios.

Figure A.6: Visualization of physically implausible scenes in 3D-FRONT. We show original 3D-FRONT scenes with physical and interactive failure cases. The red, purple, and blue boxes respectively indicate collisions between objects, objects outside the floor plan, and areas unreachable to the embodied agent. Here we render the floor plan in gray without texture.

Figure A.7: Gradient scale varying with the denoising step.

Figure A.8: Visualization results of PhyScene on 3D-FRONT. The first two rows and the last two rows are the scene synthesis results of the Bedroom and Dining Room respectively.

Table A.1: Comparison against the original 3D-FRONT dataset on collision rate. Both ATISS and DiffuScene have higher collision rates than the 3D-FRONT dataset, while ours is lower than 3D-FRONT in most cases.

C.2 Guidance on Different Agent Size

The reachability guidance adapts to different agent sizes. We use agent sizes of 0.2, 0.3, and 0.5 meters. Fig. A.9 shows the guidance results with the corresponding walkable maps: each row shows the result of guidance with one agent size, and each column evaluates that guided result with a different agent size. Guidance with size 0.2 is not suitable for an agent of size 0.5, which can only reach half of the room. Guidance with size 0.5 expands the walkable area to suit an agent of size 0.5 and makes the whole room reachable.
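The dependence of the walkable map on agent size can be illustrated on a small occupancy grid. The sketch below is an assumption-laden toy: the grid, the square agent footprint with half-width `r` cells, and the helpers `walkable` and `reachable` are all hypothetical and do not reproduce the paper's walkable-map computation.

```python
from collections import deque
from typing import List, Set, Tuple

# Toy room: '#' is an obstacle, '.' is free floor. A partition leaves 1-cell gaps.
ROOM = [
    "...........",
    ".....#.....",
    ".....#.....",
    "...........",
    ".....#.....",
    ".....#.....",
    "...........",
]


def walkable(grid: List[str], r: int) -> List[List[bool]]:
    """Mark cells where a square agent of half-width r cells fits in free space."""
    h, w = len(grid), len(grid[0])

    def fits(i: int, j: int) -> bool:
        for di in range(-r, r + 1):
            for dj in range(-r, r + 1):
                y, x = i + di, j + dj
                if not (0 <= y < h and 0 <= x < w) or grid[y][x] == "#":
                    return False
        return True

    return [[fits(i, j) for j in range(w)] for i in range(h)]


def reachable(walk: List[List[bool]], start: Tuple[int, int]) -> Set[Tuple[int, int]]:
    """4-connected flood fill over walkable cells, starting from `start`."""
    h, w = len(walk), len(walk[0])
    if not walk[start[0]][start[1]]:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            y, x = i + di, j + dj
            if 0 <= y < h and 0 <= x < w and walk[y][x] and (y, x) not in seen:
                seen.add((y, x))
                queue.append((y, x))
    return seen


small_agent = reachable(walkable(ROOM, 0), (3, 1))  # 1x1 footprint
large_agent = reachable(walkable(ROOM, 1), (3, 1))  # 3x3 footprint
```

In this toy room the small agent passes through the gaps in the partition, while the larger agent stays confined to the left side, mirroring the half-room reachability observed when the layout is guided for a smaller agent than the one used for evaluation.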

Appendix D Comparison with 3D-FRONT

Tab. A.1 shows that models trained on the 3D-FRONT dataset cannot get rid of the collision prior present in the training data. Both ATISS and DiffuScene have higher collision rates than 3D-FRONT on all three types of rooms. In contrast, PhyScene achieves lower collision rates than 3D-FRONT in most cases. This result shows that posterior optimization, such as physical and interactive guidance, is necessary to remove unreasonable priors such as collisions.

Figure A.9: Reachability guidance results with different agent sizes. We show the effectiveness of reachability guidance and the influence of the agent size. We compare walkable maps for agent sizes of 0.2, 0.3, and 0.5 meters, used both in guidance and in evaluation.

Appendix E Guidance Details

We visualize the gradient scale at each denoising step in Fig. A.7. The gradient of $\mathbf{x}_{t}$ decreases continuously during the denoising process, while the gradient of $\mathbf{x}_{0}$ (predicted at each step) declines rapidly at the beginning and changes intensively in the middle stage. We also visualize the layout trajectory at each step and find that the layout shrinks to the vicinity of the floor plan in the beginning stage and changes from chaos to order in the middle stage. In the final steps, the layout fine-tunes itself with slight changes. Based on this observation, we add guidance in the final steps. The results also confirm that adding guidance in the final steps performs best.

When adding guidance to the data, our guidance is calculated on bounding boxes, which include object size, location, and angle. The purpose is to make the layout more physically plausible and interactable, so we only calculate the gradient with respect to location and angle, moving objects into more interactable positions. Note that guiding on size leads to rather small object sizes (thickness).
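The two choices described in this appendix, guiding only in the final steps and only on location and angle, can be sketched as follows. The dictionary layout (`loc`, `size`, `angle`), the helper names, and the assumption that the step index `t` counts down to 0 are all hypothetical; the gradient is passed in as given rather than computed.

```python
from typing import Dict, List

Obj = Dict[str, List[float]]

# Only these fields receive a guidance update; size is deliberately left out,
# since guiding on size tends to shrink objects.
GUIDED_KEYS = ("loc", "angle")


def guided_update(obj: Obj, grad: Obj, scale: float) -> Obj:
    """Take one ascent step on the guidance value, touching only location and angle."""
    out = {key: list(values) for key, values in obj.items()}
    for key in GUIDED_KEYS:
        out[key] = [v + scale * g for v, g in zip(obj[key], grad[key])]
    return out


def use_guidance(t: int, last_steps: int) -> bool:
    """Apply guidance only in the final denoising steps (t counts down to 0)."""
    return t < last_steps


obj = {"loc": [0.0, 0.0, 0.0], "size": [1.0, 1.0, 1.0], "angle": [0.0]}
grad = {"loc": [1.0, 0.0, 0.0], "size": [5.0, 5.0, 5.0], "angle": [0.5]}
updated = guided_update(obj, grad, scale=0.1)
```

Even though the size gradient in this example is large, `updated["size"]` is unchanged, so the guidance can only move or rotate an object and cannot shrink it to avoid a collision.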

Figure A.10: Comparison of different 3D representations in collision guidance.

Appendix F Collision with Finer 3D Representations

In the collision guidance, we calculate the guidance objective on 3D bounding boxes of objects in Eq. 6. We have also considered other finer representations (e.g., occupancy field). As the generation pipeline involves a non-differentiable object retrieval process from the generated object metadata (i.e., location, scale, etc.), using these finer 3D representations introduces non-trivial difficulty in model optimization. Nevertheless, we tried using bounding boxes as the representation for optimization while using occupancy-field collisions as indicators in the loss calculation, i.e., with the following guidance function:

$$\varphi_{\text{coll}}(\bm{x}) = -\sum_{i,j,\, i\neq j}\textbf{IoU}_{3D}(\bm{b}_{i}, \bm{b}_{j})\,\mathbbm{1}(\textbf{OF}(\bm{o}_{i}, \bm{o}_{j})),$$

where $\mathbbm{1}(\textbf{OF}(\bm{o}_{i}, \bm{o}_{j}))$ checks whether the occupancy fields of two objects collide. This objective penalizes bounding-box collisions only for objects that collide in their corresponding occupancy fields.
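A small sketch of this indicator-gated objective, assuming axis-aligned boxes and occupancy fields discretized into sets of integer voxel indices, is shown below. The names `box_iou` and `coll_guidance` and both data layouts are hypothetical, and the voxel-set intersection merely stands in for whatever occupancy-field test is actually used.

```python
from itertools import permutations
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float, float, float]  # (min xyz, max xyz)
Voxels = Set[Tuple[int, int, int]]                     # occupied voxel indices


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes."""
    dims = [min(a[k + 3], b[k + 3]) - max(a[k], b[k]) for k in range(3)]
    if min(dims) <= 0:
        return 0.0
    inter = dims[0] * dims[1] * dims[2]

    def volume(q: Box) -> float:
        return (q[3] - q[0]) * (q[4] - q[1]) * (q[5] - q[2])

    return inter / (volume(a) + volume(b) - inter)


def coll_guidance(boxes: List[Box], occ: List[Voxels]) -> float:
    """Negative box IoU summed over ordered pairs i != j whose voxels overlap."""
    total = 0.0
    for i, j in permutations(range(len(boxes)), 2):
        if occ[i] & occ[j]:  # indicator: the occupancy fields collide
            total += box_iou(boxes[i], boxes[j])
    return -total


boxes = [(0.0, 0.0, 0.0, 2.0, 2.0, 2.0), (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)]
# Boxes overlap, but the shapes inside them do not (e.g., a chair under a table).
no_contact = coll_guidance(boxes, [{(0, 0, 0)}, {(2, 0, 0)}])
# Same boxes, and the shapes share a voxel.
contact = coll_guidance(boxes, [{(1, 0, 0)}, {(1, 0, 0)}])
```

The first case contributes nothing to the objective despite the overlapping boxes, which is the finer granularity the indicator provides; the pairwise occupancy check is also where the extra computation comes from.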

As shown in Fig. A.10, using occupancy fields as indicators slightly improves the granularity of the collisions considered. However, because guidance must be calculated at multiple diffusion steps, computing the collision between two occupancy fields significantly increases the computational overhead (55 times slower). Therefore, we leave finding a better balance between speed and granularity with finer 3D representations as important future work.

Appendix G Agent Interaction

In the reachability guidance introduced in Algorithm 1, we only consider the walkable area, as it is hard to unify guidance functions for object interactions, especially with the various planners/modules required for different purposes (e.g., grasping, motion planning). However, as a preliminary attempt, we can extend the current pipeline to incorporate interaction constraints with proper simplifications. To support interaction with articulated objects, we can use the same reachability guidance function while 1) enlarging object bounding boxes to the maximum degree (fully opened) when recalculating the walkable map, 2) planning the shortest path from a walkable position to the end position of interactable object parts (e.g., drawer handles), and 3) applying the guidance to move obstacle objects off this path. Similarly, we can model other interactions with rigid objects (e.g., sitting) by planning the shortest path to the corresponding interactive areas (e.g., the space in front of a chair) in the guidance function.

With this simplified estimate, we improve the interactiveness rate (measured by whether robots can reach the end position of object parts when the parts are interacted with to the maximum extent) from 0.101 to 0.143. Given our flexible synthesize-with-guidance design, we believe more fine-grained and effective constraints could be seamlessly integrated into the generation pipeline, and we will continue to explore this topic in the future.

Appendix H Diffusion vs. Transformer

ATISS uses an autoregressive model with an end vector to stop predicting new furniture, but we find that the number of predicted objects can become very large, such as 33 objects in a bedroom. In contrast, the diffusion model uses a fixed number of vectors and generates the layout of all objects together. The predicted objects thus embed overall information about the entire scene as well as inter-object relationships.

Appendix I Additional Visualization

We provide additional qualitative visualization for the effectiveness of guidance functions in Fig. A.11. We also conduct experiments with basic floor plans (i.e., rectangles) in rooms from ProcTHOR and generate scenes with articulated objects. We provide the visualization of the generated results in Fig. A.12.

Figure A.11: Comparison of PhyScene synthesis without and with guidance. The first two columns and the last two columns are the scene synthesis results without and with guidance respectively.

Figure A.12: Generated scenes with articulated objects. We show scene synthesis results with diverse layouts and random floor textures. Each scene is embedded with several articulated objects.
