Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
Abstract
Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of the world knowledge and experience needed to guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address these challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph (HDKG) that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool (AMEP) that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, we construct a multimodal, multimodular agent in Minecraft, Optimus-1, with a dedicated Knowledge-Guided Planner and Experience-Driven Reflector, which contribute to better planning and reflection in the face of long-horizon tasks. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on a challenging long-horizon task benchmark and exhibits near human-level performance on many tasks. In addition, we use various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1, and the experimental results show that Optimus-1 exhibits strong generalisation with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks. Overall, the results show that Optimus-1 takes a major step towards a general agent with a human-like level of performance.
Demos
Wooden Group
Craft a crafting table
- mine 1 log
- craft 4 planks
- craft a crafting table
Craft a wooden pickaxe
- mine 3 logs
- craft 9 planks
- craft 2 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
Craft a wooden sword
- mine 3 logs
- craft 8 planks
- craft 1 sticks
- craft 1 crafting table
- craft 1 wooden sword
Stone Group
Craft a torch
- mine 3 logs
- craft 10 planks
- craft 3 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and mine 1 coal
- craft 1 torch
Craft a stone pickaxe
- mine 2 logs
- craft 6 planks
- craft 2 sticks
- craft 1 crafting table
- mine 1 log
- craft 1 planks
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and mine 3 stone
- craft 1 stone pickaxe
Craft a stone sword
- mine 4 logs
- craft 15 planks
- craft 3 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and mine 2 cobblestone
- craft 1 stone sword
Iron Group
Craft an iron pickaxe
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
Craft an iron sword
- mine 7 logs
- craft 23 planks
- craft 6 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 11 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 2 iron ore
- smelt 2 iron ore
- craft 1 iron sword
Craft rails
- mine 7 logs
- craft 21 planks
- craft 5 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and mine 11 stone
- craft 1 stone pickaxe
- dig down and mine 6 iron ore
- craft 1 furnace
- smelt 6 iron ore into iron ingots
- craft 1 rail
Golden Group
Craft golden axe
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 3 gold
- smelt 3 gold
- craft 1 golden axe
Craft golden sword
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 2 gold
- smelt 2 gold
- craft 1 golden sword
Craft golden shovel
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 1 gold
- smelt 1 gold
- craft 1 golden shovel
Diamond Group
Craft a diamond axe
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- explore 1 to find a good spot to dig down
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 3 diamond
- craft 1 diamond axe
Craft a diamond pickaxe
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 3 diamond
- craft 1 diamond pickaxe
Craft a diamond hoe
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- dig down and mine 2 diamond
- craft 1 diamond hoe
Armor Group
Craft golden boots
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- equip 1 iron pickaxe
- dig down and mine 4 gold
- smelt 4 gold
- craft 1 golden boots
Craft iron leggings
- mine 9 logs
- craft 36 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 chest
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- craft 1 furnace
- equip 1 stone pickaxe
- dig down and break down 5 iron ore
- smelt 5 iron ore
- craft 1 iron helmet
Craft a diamond helmet
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- explore 1 to find a good spot to dig down
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- equip 1 stone pickaxe
- craft 1 furnace
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- equip 1 iron pickaxe
- dig down and mine 5 diamond
- craft 1 diamond helmet
Redstone Group
Craft piston
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- craft 1 furnace
- equip 1 stone pickaxe
- dig down and break down 4 iron ore
- smelt 4 iron ore
- craft 1 iron pickaxe
- equip 1 iron pickaxe
- dig down and break down 1 redstone
- craft 1 piston
Craft redstone torch
- mine 10 logs
- craft 38 planks
- craft 8 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 12 cobblestone
- craft 1 stone pickaxe
- craft 1 furnace
- equip 1 stone pickaxe
- dig down and break down 3 iron ore
- smelt 3 iron ore
- craft 1 iron pickaxe
- equip 1 iron pickaxe
- dig down and break down 1 redstone
- craft 1 redstone torch
Craft activator rail
- mine 10 logs
- craft 38 planks
- craft 10 sticks
- craft 1 crafting table
- craft 1 wooden pickaxe
- equip 1 wooden pickaxe
- dig down and break down 11 cobblestone
- craft 1 stone pickaxe
- craft 1 furnace
- equip 1 stone pickaxe
- dig down and break down 9 iron ore
- smelt 9 iron ore
- craft 1 iron pickaxe
- equip 1 iron pickaxe
- dig down and break down 1 redstone
- craft 1 redstone torch
- craft 1 activator rail
Overview of the Optimus-1 framework
Optimus-1 consists of three components: the Knowledge-Guided Planner, the Experience-Driven Reflector, and the Action Controller. Given a game environment and a long-horizon task, the Knowledge-Guided Planner senses the environment, retrieves knowledge from the HDKG, and decomposes the task into executable sub-goals. The Action Controller then executes these sub-goals sequentially. During execution, the Experience-Driven Reflector is activated periodically and uses historical experience from the AMEP to assess whether Optimus-1 can complete the current sub-goal. If not, it instructs the Knowledge-Guided Planner to revise its plan. Through iterative interaction with the environment, Optimus-1 ultimately completes the task.
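The plan, act, reflect, replan cycle described above can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: the names `run_task`, `LoopResult`, and `max_replans` are hypothetical, the planner, controller, and reflector are passed in as plain callables, and the reflector is simplified to run once after each sub-goal rather than periodically during execution.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    """Outcome of one task attempt (hypothetical container for this sketch)."""
    completed: List[str] = field(default_factory=list)
    replans: int = 0
    success: bool = False


def run_task(
    task: str,
    planner: Callable[[str, List[str]], List[str]],
    controller: Callable[[str], None],
    reflector: Callable[[str], bool],
    max_replans: int = 3,
) -> LoopResult:
    """Run a simplified plan -> act -> reflect -> replan loop.

    planner(task, completed) returns the remaining sub-goals in order.
    controller(sub_goal) acts in the environment.
    reflector(sub_goal) returns True if the sub-goal is judged achieved.
    """
    result = LoopResult()
    sub_goals = list(planner(task, result.completed))
    while sub_goals:
        goal = sub_goals.pop(0)
        controller(goal)
        if reflector(goal):
            result.completed.append(goal)
            continue
        # Reflection failed: give up once the replan budget is spent,
        # otherwise ask the planner for a revised plan.
        if result.replans >= max_replans:
            return result
        result.replans += 1
        sub_goals = list(planner(task, result.completed))
    result.success = True
    return result
```

The `max_replans` budget is an assumption added so the loop always terminates; the source does not state how many plan revisions are allowed.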
Hybrid Multimodal Memory
(a) Extraction process of multimodal experience. The frames are filtered through a video buffer and an image buffer; MineCLIP is then used to compute the visual and sub-goal similarities, and the resulting experience is stored in the Abstracted Multimodal Experience Pool. (b) Overview of the Hierarchical Directed Knowledge Graph. Knowledge is stored as a directed graph whose nodes represent objects and whose directed edges point from an object to the items that can be crafted from it.
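To make the graph description in (b) concrete, the sketch below stores a few crafting dependencies as directed edges and orders the prerequisites of a goal item with a topological sort. It is a toy illustration under stated assumptions: the edge list, the function names (`prerequisites`, `retrieve_subgraph`, `plan_order`), and the use of Python's standard `graphlib` module are choices made for this example and are not taken from the Optimus-1 code. Quantities are deliberately omitted.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Toy edge list (an assumption for illustration): (material, product)
# means "material is needed to obtain product".
EDGES: List[Tuple[str, str]] = [
    ("log", "planks"),
    ("planks", "stick"),
    ("planks", "crafting table"),
    ("planks", "wooden pickaxe"),
    ("stick", "wooden pickaxe"),
    ("crafting table", "wooden pickaxe"),
    ("wooden pickaxe", "cobblestone"),
    ("cobblestone", "stone pickaxe"),
    ("stick", "stone pickaxe"),
    ("crafting table", "stone pickaxe"),
]


def prerequisites(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Invert the edges into a map: item -> set of direct prerequisites."""
    prereq: Dict[str, Set[str]] = {}
    for material, product in edges:
        prereq.setdefault(product, set()).add(material)
        prereq.setdefault(material, set())
    return prereq


def retrieve_subgraph(prereq: Dict[str, Set[str]], goal: str) -> Dict[str, Set[str]]:
    """Collect the goal and everything it transitively depends on."""
    needed: Dict[str, Set[str]] = {}
    stack = [goal]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed[node] = prereq.get(node, set())
        stack.extend(needed[node])
    return needed


def plan_order(edges: List[Tuple[str, str]], goal: str) -> List[str]:
    """Return the goal's prerequisites in an order that respects every edge."""
    sub = retrieve_subgraph(prerequisites(edges), goal)
    return list(TopologicalSorter(sub).static_order())
```

For a goal such as "stone pickaxe", `plan_order(EDGES, "stone pickaxe")` yields an item order in which every material precedes the items built from it; the relative order of independent items (for example, stick versus crafting table) is not fixed by the graph.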
Experiment
Main results of Optimus-1 on the long-horizon task benchmark.
We report the average success rate (SR), average number of steps (AS), and average time (AT) for each task group; the results for each individual task can be found in the experiment section of the Appendix. Lower AS and AT values mean that the agent completes the task more efficiently, while ∞ indicates that the agent is unable to complete the task. Overall represents the average result across the five groups: Iron, Gold, Diamond, Redstone, and Armor.
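As a rough illustration of how these three metrics could be computed from episode logs, consider the sketch below. The `Episode` record and `summarize` function are hypothetical, and the choice to average AS and AT over successful episodes only is an assumption made here so that ∞ naturally arises when no episode succeeds; the source does not specify the exact averaging protocol.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One evaluation run of a task (hypothetical record for this sketch)."""
    success: bool
    steps: int
    seconds: float


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Compute SR, AS, and AT for a list of episodes.

    Assumption: AS and AT are averaged over successful episodes only,
    and are reported as infinity when no episode succeeds.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    wins = [e for e in episodes if e.success]
    sr = len(wins) / len(episodes)
    if not wins:
        return {"SR": 0.0, "AS": math.inf, "AT": math.inf}
    return {
        "SR": sr,
        "AS": sum(e.steps for e in wins) / len(wins),
        "AT": sum(e.seconds for e in wins) / len(wins),
    }
```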
Generalisation and Self-Evolution
(a) With the help of Hybrid Multimodal Memory, Optimus-1 variants built on various MLLMs show a 2 to 6 times performance improvement. (b) Illustration of the change in the success rate of Optimus-1 on the unseen task over 4 epochs.
Conclusion
In this paper, we propose the Hybrid Multimodal Memory module, which is inspired by the major influence of the human long-term memory system on the completion of long-horizon tasks. The Hybrid Multimodal Memory module consists of two parts: HDKG and AMEP. HDKG provides the world knowledge needed in the agent's planning phase, and AMEP provides refined historical experience for the agent's reflection phase. On top of the Hybrid Multimodal Memory, we construct the multimodal and multimodular agent Optimus-1 in Minecraft. Extensive experimental results show that Optimus-1 outperforms all existing agents on long-horizon tasks. Furthermore, we validate that a general-purpose MLLM equipped with the proposed Hybrid Multimodal Memory, without additional parameter updates, can exceed the powerful GPT-4V baseline. This self-evolution approach provides novel insights and directions for the study of general-purpose agents.
BibTeX
@inproceedings{li2024optimus,
title={Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks},
author={Li, Zaijing and Xie, Yuquan and Shao, Rui and Chen, Gongwei and Jiang, Dongmei and Nie, Liqiang},
booktitle={NeurIPS},
year={2024}
}