Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis (original) (raw)

Teaser

Code-as-Room brings diverse interactive 3D scenes from a single top-down view image. We design an agentic system with a structured execution harness and activate the MLLMs' ability to understand, design, and code the 3D rooms in Blender.

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

Method Overview

Method Pipeline

Overview of the Code-as-Room pipeline. A single top-down view image is progressively transformed into a fully renderable 3D scene through a sequence of specialized MLLM agent stages, organized into five phases: image-based scene structuring, layout code generation, layout-grounded object profiling, object-level code generation, and interior decoration code generation. Arrows denote data flow through the cross-stage memory system, wherein each stage reads upstream outputs and writes its own results as typed memory entries.

Video Demo

Your browser does not support the video tag.

Turntable Comparisons with VIGA

Comparison of our method with VIGA baseline on different scenes:

Input Gallery

Input - Gallery

Input Living Room

Input - Living Room

Input Dining Room

Input - Dining Room

Input Media Room

Input - Media Room

Model Comparisons on Benchmark

Turntable videos comparing different MLLM models (GPT-5.5, Gemini 3.1, Gemini 3 Flash) on various difficulty levels:

Input Simple Scene

Input - Simple

Input Medium Scene 1

Input - Medium 1

Input Medium Scene 2

Input - Medium 2

Input Special Scene 1

Input - Special 1

Input Special Scene 2

Input - Special 2

More Results

Walkthrough Videos

Kitchen - Farmhouse Style

Qualitative Results

Result 1

Result 2

Comparison with Baselines

Comparison

Re-rendering Results

Rerender Result

Citation

@article{yang2026codeasroom, title={Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis}, author={Yang, Yixuan and Luo, Zhen and Gan, Wanshui and Hao, Jinkun and Lu, Junru and Yan, Jinghao and Lyu, Zhaoyang and Xu, Xudong}, journal={arXiv preprint arXiv:2605.18451}, year={2026} }