View-on-Graph: Zero-Shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
Authors
- Yuanyuan Liu (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Haiyang Mei (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology; Show Lab, National University of Singapore)
- Dongyang Zhan (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Jiayue Zhao (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Dongsheng Zhou (Dalian University)
- Bo Dong (Cephia AI, Inc.)
- Xin Yang (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
DOI: https://doi.org/10.1609/aaai.v40i9.37677
Abstract
3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision–language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified-view renderings or video sequences with overlaid object markers. However, this VLM ⊕ SI paradigm yields entangled visual representations that compel the VLM to process all of the cluttered cues at once, making it hard to exploit spatial–semantic relationships effectively. In this work, we propose a new VLM ⊗ SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
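The abstract describes the method only at a high level. As a rough illustration of what "a VLM acting as an agent traversing a multi-modal scene graph" could look like, the minimal sketch below uses entirely hypothetical names (SceneNode, ground, vlm_step) and stands in for the VLM with a plain callable; it is one plausible reading of the paradigm, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of graph-based, step-by-step grounding.
# All names and structures here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One object in the scene graph: semantic label, 3D position, and a
    rendered view the VLM can inspect on demand."""
    node_id: int
    label: str
    center: tuple                                   # (x, y, z) centroid
    view_path: str                                  # rendered/cropped image of the object
    neighbors: dict = field(default_factory=dict)   # node_id -> spatial relation, e.g. "left of"

def ground(query: str, graph: dict, vlm_step) -> int:
    """Let the VLM walk the graph: at each step it sees only the current node
    and its neighbors, and either names the next node to visit or stops.
    `vlm_step(query, node, candidates)` returns a node id or None."""
    current = next(iter(graph))       # arbitrary entry node
    trace = []
    while current is not None:
        trace.append(current)         # transparent, step-by-step reasoning trace
        node = graph[current]
        nxt = vlm_step(query, node, [graph[i] for i in node.neighbors])
        if nxt is None or nxt in trace:
            break                     # VLM commits: current node is the answer
        current = nxt
    print("reasoning trace:", trace)
    return current

# Toy usage with a stub "VLM" that immediately accepts the entry node.
graph = {0: SceneNode(0, "chair", (1.0, 0.5, 0.0), "chair.png")}
print(ground("the chair next to the table", graph, lambda q, n, c: None))
```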
How to Cite
Liu, Y., Mei, H., Zhan, D., Zhao, J., Zhou, D., Dong, B., & Yang, X. (2026). View-on-Graph: Zero-Shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7386-7394. https://doi.org/10.1609/aaai.v40i9.37677
Section
AAAI Technical Track on Computer Vision VI