View-on-Graph: Zero-Shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
Authors
- Yuanyuan Liu (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Haiyang Mei (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology; Show Lab, National University of Singapore)
- Dongyang Zhan (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Jiayue Zhao (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
- Dongsheng Zhou (Dalian University)
- Bo Dong (Cephia AI, Inc.)
- Xin Yang (Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology)
DOI: https://doi.org/10.1609/aaai.v40i9.37677
Abstract
3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision–language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified-view renderings or video sequences with overlaid object markers. However, this VLM ⊕ SI paradigm yields entangled visual representations that compel the VLM to process all of the cluttered cues at once, making it hard to exploit spatial–semantic relationships effectively. In this work, we propose a new VLM ⊗ SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
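The abstract describes the method only at a high level. As a rough illustration of what "a VLM acting as an agent traversing a multi-modal scene graph" could look like, the minimal sketch below uses entirely hypothetical names (SceneNode, ground, vlm_step) and stands in for the VLM with a plain callable; it is one plausible reading of the paradigm, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of graph-based, step-by-step grounding.
# All names and structures here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One object in the scene graph: semantic label, 3D position, and a
    rendered view the VLM can inspect on demand."""
    node_id: int
    label: str
    center: tuple                                   # (x, y, z) centroid
    view_path: str                                  # rendered/cropped image of the object
    neighbors: dict = field(default_factory=dict)   # node_id -> spatial relation, e.g. "left of"

def ground(query: str, graph: dict, vlm_step) -> int:
    """Let the VLM walk the graph: at each step it sees only the current node
    and its neighbors, and either names the next node to visit or stops.
    `vlm_step(query, node, candidates)` returns a node id or None."""
    current = next(iter(graph))       # arbitrary entry node
    trace = []
    while current is not None:
        trace.append(current)         # transparent, step-by-step reasoning trace
        node = graph[current]
        nxt = vlm_step(query, node, [graph[i] for i in node.neighbors])
        if nxt is None or nxt in trace:
            break                     # VLM commits: current node is the answer
        current = nxt
    print("reasoning trace:", trace)
    return current

# Toy usage with a stub "VLM" that immediately accepts the entry node.
graph = {0: SceneNode(0, "chair", (1.0, 0.5, 0.0), "chair.png")}
print(ground("the chair next to the table", graph, lambda q, n, c: None))
```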
How to Cite
Liu, Y., Mei, H., Zhan, D., Zhao, J., Zhou, D., Dong, B., & Yang, X. (2026). View-on-Graph: Zero-Shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7386-7394. https://doi.org/10.1609/aaai.v40i9.37677
Section
AAAI Technical Track on Computer Vision VI