JARVIS-1: Open-Ended Multi-task Agents with Memory-Augmented Multimodal Language Models
Abstract
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks is potentially infinite, and they lack the ability to progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the most general Minecraft agent to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks.
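The flow described in the abstract (retrieve experience from memory, plan, dispatch sub-goals to a controller, store the outcome) can be pictured with a minimal sketch. This is not the JARVIS-1 implementation: `Memory`, `run_task`, `stub_planner`, and `stub_controller` are hypothetical names, the planner and controller are stubs standing in for the multimodal language model and the goal-conditioned controllers, and retrieval uses plain word overlap on task text rather than multimodal matching.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryEntry:
    task: str
    plan: List[str]
    success: bool


@dataclass
class Memory:
    """Text-only stand-in for a multimodal memory (hypothetical)."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 2) -> List[MemoryEntry]:
        """Return up to k successful entries ranked by word overlap with the task."""
        query = set(task.lower().split())
        scored = []
        for entry in self.entries:
            overlap = len(query & set(entry.task.lower().split()))
            if entry.success and overlap > 0:
                scored.append((overlap, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:k]]


Planner = Callable[[str, str, List[MemoryEntry]], List[str]]
Controller = Callable[[str], bool]


def run_task(task: str, observation: str, planner: Planner,
             controller: Controller, memory: Memory) -> bool:
    """Plan with retrieved experience, execute sub-goals, record the outcome."""
    references = memory.retrieve(task)
    plan = planner(task, observation, references)
    # all() stops at the first sub-goal the controller fails to complete.
    success = all(controller(goal) for goal in plan)
    memory.entries.append(MemoryEntry(task, plan, success))
    return success


def stub_planner(task: str, observation: str,
                 references: List[MemoryEntry]) -> List[str]:
    """Reuse the best remembered plan if one exists, else a fixed default."""
    if references:
        return list(references[0].plan)
    return ["mine 1 log", "craft 4 planks"]


def stub_controller(goal: str) -> bool:
    """Pretend every sub-goal succeeds."""
    return True


memory = Memory()
run_task("craft planks", "forest", stub_planner, stub_controller, memory)
print(len(memory.retrieve("craft wooden planks")))  # prints 1
```

Both successful and failed attempts are stored, but only successful ones are retrieved as references in this sketch; that filtering is a design choice of the example, not a claim about the original system.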
Self-Improving JARVIS-1
JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to its growing multimodal memory, sparking a more general intelligence and improved autonomy. Below, we demonstrate the performance of JARVIS-1 at different learning stages when completing the same task. (One epoch means that every task in the task pool has been executed by JARVIS-1 in the environment once, regardless of success or failure.)
Epoch 1:
- mine 3 logs
- craft 12 planks
- craft 1 crafting_table
- craft 4 stick
- craft 1 wooden_pickaxe
- mine 3 cobblestone
- craft 1 stone_pickaxe
- mine 2 iron_ore
- smelt 2 iron_ingot
- craft 1 shears
(Lack of furnace as tool)
Epoch 2:
- mine 3 logs
- craft 12 planks
- craft 1 crafting_table
- craft 4 stick
- craft 1 wooden_pickaxe
- mine 8 cobblestone
- craft 1 furnace
- mine 3 cobblestone
- craft 1 stone_pickaxe
- mine 2 iron_ore
- smelt 2 iron_ingot
- craft 1 shears
(Lack of fuel sometimes)
Epoch 3:
- mine 4 logs (One more as fuel)
- craft 12 planks
- craft 1 crafting_table
- craft 4 stick
- craft 1 wooden_pickaxe
- mine 11 cobblestone
- craft 1 furnace
- craft 1 stone_pickaxe
- mine 2 iron_ore
- smelt 2 iron_ingot
- craft 1 shears
(More accurate and efficient!)
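The difference between the Epoch 1 and Epoch 3 plans can be checked with a small inventory simulation. The sketch below is an illustration only and is not part of JARVIS-1: `RECIPES`, `MINING_TOOL`, and `check_plan` are hypothetical names, the recipe table is a simplified assumption, tools are never consumed, and furnace fuel is not modeled (so the "lack of fuel" issue from Epoch 2 is outside its scope).

```python
from collections import Counter

# Simplified recipe table (an assumption for illustration):
# item -> (units produced per craft, ingredients consumed, tool required)
RECIPES = {
    "planks": (4, {"logs": 1}, None),
    "crafting_table": (1, {"planks": 4}, None),
    "stick": (4, {"planks": 2}, None),
    "wooden_pickaxe": (1, {"planks": 3, "stick": 2}, "crafting_table"),
    "furnace": (1, {"cobblestone": 8}, "crafting_table"),
    "stone_pickaxe": (1, {"cobblestone": 3, "stick": 2}, "crafting_table"),
    "iron_ingot": (1, {"iron_ore": 1}, "furnace"),
    "shears": (1, {"iron_ingot": 2}, None),
}

# Tool that must be in the inventory before a resource can be mined.
MINING_TOOL = {"logs": None, "cobblestone": "wooden_pickaxe",
               "iron_ore": "stone_pickaxe"}


def check_plan(plan):
    """Return None if every step is feasible, else (step_number, reason)."""
    inventory = Counter()
    for number, (verb, count, item) in enumerate(plan, start=1):
        if verb == "mine":
            tool = MINING_TOOL[item]
            if tool is not None and inventory[tool] == 0:
                return number, f"missing tool: {tool}"
            inventory[item] += count
            continue
        # "craft" and "smelt" are both looked up in the recipe table.
        per_craft, ingredients, tool = RECIPES[item]
        if tool is not None and inventory[tool] == 0:
            return number, f"missing tool: {tool}"
        batches = -(-count // per_craft)  # ceiling division
        for name, amount in ingredients.items():
            if inventory[name] < amount * batches:
                return number, f"not enough {name}"
        for name, amount in ingredients.items():
            inventory[name] -= amount * batches
        inventory[item] += per_craft * batches
    return None


EPOCH_1 = [
    ("mine", 3, "logs"), ("craft", 12, "planks"),
    ("craft", 1, "crafting_table"), ("craft", 4, "stick"),
    ("craft", 1, "wooden_pickaxe"), ("mine", 3, "cobblestone"),
    ("craft", 1, "stone_pickaxe"), ("mine", 2, "iron_ore"),
    ("smelt", 2, "iron_ingot"), ("craft", 1, "shears"),
]

EPOCH_3 = [
    ("mine", 4, "logs"), ("craft", 12, "planks"),
    ("craft", 1, "crafting_table"), ("craft", 4, "stick"),
    ("craft", 1, "wooden_pickaxe"), ("mine", 11, "cobblestone"),
    ("craft", 1, "furnace"), ("craft", 1, "stone_pickaxe"),
    ("mine", 2, "iron_ore"), ("smelt", 2, "iron_ingot"),
    ("craft", 1, "shears"),
]

print(check_plan(EPOCH_1))  # (9, 'missing tool: furnace')
print(check_plan(EPOCH_3))  # None
```

Under these assumptions the Epoch 1 plan stalls at the smelting step because no furnace was crafted, while the Epoch 3 plan mines 11 cobblestone in one pass (8 for the furnace, 3 for the stone pickaxe) and runs to completion.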
Instruction-Following in Diverse Biomes
JARVIS-1 can execute human instructions in diverse environments. We illustrate executions in different biomes below.
Execution in Birch Forest:
More Results
Below we share some additional results of JARVIS-1 on Minecraft.
Wood Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Stone Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Iron Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Gold Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Diamond Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Redstone Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Armor Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Decoration Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Food Group
Generated Language Plan (Click to view relevant part of execution!):
Language Plan Execution _(1.5x Speed)_:
Related Projects
Check out some of our related projects below!
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos. This work proposes to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations, and implements the agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers.
- Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. DEPS is an interactive planning approach based on Large Language Models (LLMs) for open-ended multi-task agents. It improves error correction from feedback during long-haul planning, while also bringing a sense of proximity via a goal Selector, a learnable module that ranks parallel sub-goals based on the estimated steps to completion and improves the original plan accordingly.
- MCU: A Task-centric Framework for Open-ended Agent Evaluation in Minecraft. MCU is an open-ended Minecraft agent evaluation framework that can generate infinite tasks and reveal the difficulty of tasks. In MCU, a "task" is a structured data object. MCU leverages "atom tasks" as building blocks to compose complex tasks. Each task is measured with six distinct difficulty scores, which offer a multi-dimensional assessment of a task from different angles. We also maintain a unified benchmark, namely SkillForge, which comprises representative tasks under the MCU framework. Researchers can filter tasks with certain properties or attributes from SkillForge to test their agents.
- Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction. This paper studies the problem of learning goal-conditioned policies in Minecraft. It first identifies two main challenges of learning such policies and then proposes to combine a goal-sensitive backbone and an adaptive horizon prediction module to tackle these challenges.
BibTeX
@article{wang2023jarvis1,
title = {JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models},
author = {Zihao Wang and Shaofei Cai and Anji Liu and Yonggang Jin and Jinbing Hou and Bowei Zhang and Haowei Lin and Zhaofeng He and Zilong Zheng and Yaodong Yang and Xiaojian Ma and Yitao Liang},
year = {2023},
journal = {arXiv preprint arXiv:2311.05997}
}