An Embodied Generalist Agent in 3D World

ICML 2024

Jiangyong Huang1,2✶, Silong Yong1,3✶, Xiaojian Ma1✶, Xiongkun Linghu1✶, Puhao Li1,4,
Yan Wang1, Qing Li1, Song-Chun Zhu1,2,4, Baoxiong Jia1, Siyuan Huang1

1Beijing Institute for General Artificial Intelligence (BIGAI)
2Peking University 3Carnegie Mellon University 4Tsinghua University

✶ indicates equal contribution

Abstract

Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models show notable successes in building generalist agents that exhibit the capability of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains as these models exhibit limited ability in understanding and interacting with the 3D world. We argue this limitation significantly hinders the current models from performing real-world tasks and further achieving general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset comprising object-level and scene-level multi-modal tasks with exceeding scale and complexity, necessitating a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.

Model

Scene representation. The scene point cloud is partitioned into object-centric point clouds (either ground truth or predicted proposals), which are then processed by the 3D encoder to obtain object-centric features. We also incorporate an optional 2D branch, where a 2D encoder processes the agent's ego-view observation to obtain ego-centric features.
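The partitioning step can be illustrated with a minimal sketch. It assumes every point carries an instance id (from ground truth or a proposal network); the names `partition_by_object` and `toy_object_feature` are hypothetical, and the feature function is only a stand-in for the learned 3D encoder, not LEO's actual encoder.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def partition_by_object(
    points: Sequence[Point], instance_ids: Sequence[int]
) -> Dict[int, List[Point]]:
    """Group scene points into object-centric point clouds by instance id."""
    if len(points) != len(instance_ids):
        raise ValueError("points and instance_ids must have the same length")
    objects: Dict[int, List[Point]] = defaultdict(list)
    for point, obj_id in zip(points, instance_ids):
        objects[obj_id].append(point)
    return dict(objects)


def toy_object_feature(obj_points: Sequence[Point]) -> List[float]:
    """Placeholder for the 3D encoder (assumption): centroid, then box size."""
    n = len(obj_points)
    centroid = [sum(p[i] for p in obj_points) / n for i in range(3)]
    size = [
        max(p[i] for p in obj_points) - min(p[i] for p in obj_points)
        for i in range(3)
    ]
    return centroid + size


# Toy scene: four points belonging to two objects (ids 0 and 1).
scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 2.0, 1.0), (4.0, 4.0, 1.0)]
ids = [0, 0, 1, 1]

objects = partition_by_object(scene, ids)
# One feature vector ("object token") per object-centric point cloud.
object_tokens = {obj_id: toy_object_feature(pts) for obj_id, pts in objects.items()}
```

In a real pipeline the placeholder feature would be replaced by the output of a trained point cloud encoder, and the optional 2D branch would contribute additional ego-centric image features.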

Unified sequence and objective. The sequence begins with a system message that tells the agent its role and situation. Subsequent 2D image tokens and 3D object tokens provide the perceived scene information. Next, an instruction specifies the task or context and prompts for the final response. The learning objective is a simple auto-regressive loss.
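A rough sketch of this sequence layout and objective follows. Several things here are assumptions made for illustration: image and object tokens are shown as integer placeholder ids (in the model they would be feature vectors), the loss is computed on response positions only, and the `IGNORE` label value and both function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

IGNORE = -100  # label for positions excluded from the loss (assumed convention)


def build_sequence(
    system: List[int],
    image_tokens: List[int],
    object_tokens: List[int],
    instruction: List[int],
    response: List[int],
) -> Tuple[List[int], List[int]]:
    """Concatenate the segments in order; supervise only the response."""
    prefix = system + image_tokens + object_tokens + instruction
    input_ids = prefix + response
    labels = [IGNORE] * len(prefix) + response
    return input_ids, labels


def autoregressive_loss(
    logits: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Mean next-token cross-entropy: logits[t] predicts labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy example with a 4-token vocabulary and uniform (all-zero) logits.
input_ids, labels = build_sequence([0], [1], [2], [3], [1, 2])
logits = [[0.0] * 4 for _ in input_ids]
loss = autoregressive_loss(logits, labels)
```

With uniform logits the loss equals the log of the vocabulary size, which is a quick sanity check for any implementation of this objective.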

Data

Two-stage scheme: alignment & instruction tuning. We combine existing datasets and LLM-prompted data to create LEO-align and LEO-instruct.

Demo

Selected 3D VL capabilities: captioning, reasoning, dialogue, and planning.

Robotic Manipulation

Embodied Navigation

BibTeX

@inproceedings{huang2024embodied,
  title={An Embodied Generalist Agent in 3D World},
  author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
  booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
  year={2024}
}