JARVIS-1: Open-Ended Multi-task Agents with Memory-Augmented Multimodal Language Models

Abstract

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks is potentially infinite, and they cannot progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the most general Minecraft agent to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by a factor of 5 and can successfully complete longer-horizon and more challenging tasks.
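The data flow described in the abstract (instruction and observation in, plan out, sub-goals dispatched to a controller) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not JARVIS-1's actual code: `Planner`, `Controller`, `run_task`, and both stubs are hypothetical names, and the observation is simplified to a string where the real agent receives visual input.

```python
from typing import Callable, List

# Hypothetical interfaces; the names are illustrative, not from JARVIS-1.
# A planner maps (instruction, observation, retrieved memories) to sub-goals.
Planner = Callable[[str, str, List[List[str]]], List[str]]
# A controller tries to achieve one sub-goal and reports success.
Controller = Callable[[str], bool]


def run_task(instruction: str, observation: str, memories: List[List[str]],
             planner: Planner, controller: Controller) -> List[str]:
    """Plan once, then dispatch sub-goals in order; return the completed ones."""
    completed = []
    for sub_goal in planner(instruction, observation, memories):
        if not controller(sub_goal):
            break  # stop at the first sub-goal the controller cannot achieve
        completed.append(sub_goal)
    return completed


def stub_planner(instruction, observation, memories):
    # Stand-in for the multimodal language model: reuse a remembered plan
    # if one exists, otherwise fall back to a fixed guess.
    return memories[0] if memories else ["mine 1 log", "craft 4 planks"]


def stub_controller(sub_goal):
    # Stand-in for the goal-conditioned controller: pretend mining and
    # crafting succeed and everything else fails.
    return sub_goal.startswith(("mine", "craft"))
```

Keeping the planner and controller behind plain callables makes it easy to swap either stub for a real model without touching the dispatch loop.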

Self-Improving JARVIS-1

JARVIS-1 can self-improve under a life-long learning paradigm thanks to its growing multimodal memory, which leads to more general intelligence and improved autonomy. Below, we show the performance of JARVIS-1 at different learning stages on the same task. (One epoch means that JARVIS-1 has executed every task in the task pool once in the environment, regardless of success or failure.)
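The epoch definition above can be made concrete with a small loop. The sketch below is illustrative only: `run_epochs` and `toy_attempt` are hypothetical names, and storing every attempt (including failures) in memory is an assumption made for the example, not a statement about how JARVIS-1 writes to its memory.

```python
from collections import defaultdict


def run_epochs(task_pool, attempt, num_epochs):
    """Run every task once per epoch and record the per-epoch success rate.

    `attempt(task, history)` is a hypothetical callback that tries a task
    given earlier records for it and returns (succeeded, plan).
    """
    memory = defaultdict(list)  # task -> list of (plan, succeeded) records
    success_rates = []
    for _ in range(num_epochs):
        successes = 0
        for task in task_pool:  # each task runs once, succeed or fail
            ok, plan = attempt(task, memory[task])
            # Assumption: failures are remembered too, not only successes.
            memory[task].append((plan, ok))
            successes += ok
        success_rates.append(successes / len(task_pool))
    return success_rates, dict(memory)


def toy_attempt(task, history):
    # Toy stand-in for planning and control: a task succeeds once it has
    # been attempted at least `difficulty` times before.
    difficulty = {"mine logs": 0, "craft shears": 2}[task]
    ok = len(history) >= difficulty
    return ok, f"plan v{len(history) + 1} for {task}"


rates, memory = run_epochs(["mine logs", "craft shears"], toy_attempt, 3)
print(rates)  # [0.5, 0.5, 1.0]
```

In this toy setup the harder task only succeeds in the third epoch, which mirrors the pattern of the three runs shown below.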

Epoch 1:

mine 3 logs
craft 12 planks
craft 1 crafting_table
craft 4 stick
craft 1 wooden_pickaxe
mine 3 cobblestone
craft 1 stone_pickaxe
mine 2 iron_ore
~~smelt 2 iron_ingot~~
~~craft 1 shears~~
(Failed: no furnace available as a tool)

Epoch 2:

mine 3 logs
craft 12 planks
craft 1 crafting_table
craft 4 stick
craft 1 wooden_pickaxe
mine 8 cobblestone
craft 1 furnace
mine 3 cobblestone
craft 1 stone_pickaxe
mine 2 iron_ore
smelt 2 iron_ingot
craft 1 shears
(Sometimes fails for lack of fuel)

Epoch 3:

mine 4 logs (One more as fuel)
craft 12 planks
craft 1 crafting_table
craft 4 stick
craft 1 wooden_pickaxe
mine 11 cobblestone
craft 1 furnace
craft 1 stone_pickaxe
mine 2 iron_ore
smelt 2 iron_ingot
craft 1 shears
(More accurate and efficient!)
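The difference between the Epoch 1 and Epoch 3 plans can be checked by simulating the inventory step by step. The sketch below is a hypothetical helper, not part of JARVIS-1: `RECIPES`, `TOOLS`, and `first_failure` are names invented for this example. The recipe quantities are the standard Minecraft ones (for instance, a furnace takes 8 cobblestone and a stone pickaxe takes 3, which accounts for the 11 cobblestone mined in Epoch 3). Fuel and crafting-table placement are deliberately ignored, and "logs" is written as `log` so item names match across steps.

```python
# Recipes: output item -> (ingredients per craft, items produced per craft).
RECIPES = {
    "planks": ({"log": 1}, 4),
    "crafting_table": ({"planks": 4}, 1),
    "stick": ({"planks": 2}, 4),
    "wooden_pickaxe": ({"planks": 3, "stick": 2}, 1),
    "furnace": ({"cobblestone": 8}, 1),
    "stone_pickaxe": ({"cobblestone": 3, "stick": 2}, 1),
    "iron_ingot": ({"iron_ore": 1}, 1),  # smelting, needs a furnace
    "shears": ({"iron_ingot": 2}, 1),
}
# Tool that must be in the inventory before an item can be obtained.
TOOLS = {"cobblestone": "wooden_pickaxe", "iron_ore": "stone_pickaxe",
         "iron_ingot": "furnace"}


def first_failure(plan):
    """Simulate a plan; return the first step that cannot run, or None."""
    inventory = {}
    for verb, count, item in plan:
        tool = TOOLS.get(item)
        if tool and inventory.get(tool, 0) < 1:
            return (verb, count, item)  # required tool is missing
        if verb == "mine":
            inventory[item] = inventory.get(item, 0) + count
            continue
        ingredients, per_craft = RECIPES[item]
        crafts = -(-count // per_craft)  # ceiling division
        for name, amount in ingredients.items():
            if inventory.get(name, 0) < amount * crafts:
                return (verb, count, item)  # not enough ingredients
        for name, amount in ingredients.items():
            inventory[name] -= amount * crafts
        inventory[item] = inventory.get(item, 0) + crafts * per_craft
    return None


COMMON = [("craft", 12, "planks"), ("craft", 1, "crafting_table"),
          ("craft", 4, "stick"), ("craft", 1, "wooden_pickaxe")]
EPOCH_1 = [("mine", 3, "log")] + COMMON + [
    ("mine", 3, "cobblestone"), ("craft", 1, "stone_pickaxe"),
    ("mine", 2, "iron_ore"), ("smelt", 2, "iron_ingot"),
    ("craft", 1, "shears")]
EPOCH_3 = [("mine", 4, "log")] + COMMON + [
    ("mine", 11, "cobblestone"), ("craft", 1, "furnace"),
    ("craft", 1, "stone_pickaxe"), ("mine", 2, "iron_ore"),
    ("smelt", 2, "iron_ingot"), ("craft", 1, "shears")]

print(first_failure(EPOCH_1))  # ('smelt', 2, 'iron_ingot')
print(first_failure(EPOCH_3))  # None
```

Under these simplifications the Epoch 1 plan stops at the smelting step because no furnace was crafted, while the Epoch 3 plan runs to completion.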


Instruction-Following in Diverse Biomes

JARVIS-1 can execute human instructions in diverse environments. We illustrate executions in different biomes below.

Execution in Birch Forest:

More Results

Below we share some additional results of JARVIS-1 on Minecraft.

Result categories: Wood, Stone, Iron, Gold, Diamond, Redstone, Blocks, Armor, Decoration, Food