FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
1National University of Singapore, 2Peking University
Abstract
We aim to develop a model-based planning framework with world models that scales with increasing model and data budgets for general-purpose manipulation tasks, using only language and vision inputs.
To this end, we present FLow centrIc generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP can synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution.
Experiments on diverse benchmarks demonstrate that FLIP improves both the success rates and quality of long-horizon video plan synthesis and exhibits the interactive world model property, opening up wider applications for future work.
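As a rough illustration of how these three modules could be composed, here is a minimal planning-loop sketch. The function names (propose_flows, simulate_video, value_fn), the beam-style search, and all hyperparameters are placeholders assumed for the example, not the released implementation.

```python
def plan(initial_image, language, propose_flows, simulate_video, value_fn,
         horizon=8, num_candidates=16, beam_size=4, gamma=0.99):
    """Illustrative FLIP-style planning in visual space:
    propose candidate flows, roll out videos, and keep the plans
    with the highest discounted return."""
    # Each beam entry: (latest observation, flow plan so far, video plan so far, return)
    beams = [(initial_image, [], [], 0.0)]
    for t in range(horizon):
        candidates = []
        for last_obs, flow_plan, video_plan, ret in beams:
            for _ in range(num_candidates):
                flow = propose_flows(last_obs, language)          # action proposal module
                clip = simulate_video(last_obs, flow, language)   # dynamics module rollout
                reward = value_fn(clip, language)                 # value module as progress signal
                candidates.append((clip[-1],
                                   flow_plan + [flow],
                                   video_plan + [clip],
                                   ret + (gamma ** t) * reward))
        # Keep only the top-scoring partial plans before extending them further.
        candidates.sort(key=lambda c: c[-1], reverse=True)
        beams = candidates[:beam_size]
    best = beams[0]
    return best[1], best[2]  # flow plan and video plan, e.g. to guide a low-level policy
```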
Three Modules of FLIP
Left: the tokenization process for the different modalities in the training data.
Middle Left: we use a conditional VAE to generate flows as actions. It separately generates the delta scale and direction at each query point for flow reconstruction (a minimal decoder sketch follows this caption).
Middle Right: we use a DiT model with a spatial-temporal attention mechanism for flow-conditioned video generation. Flows (and the observation history) are conditioned via cross-attention, while the language instruction and diffusion timestep are conditioned via AdaLN-Zero.
Right: the value module of FLIP. We follow the idea of LIV and use time-contrastive learning for the vision-language representation, but we treat each video clip (rather than each frame) as a state (a sketch of such an objective also follows below). The fine-tuned value curves of LIV and of our module are shown at the bottom.
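Below is a rough PyTorch-style sketch of the "delta scale and direction" decoding described in the caption. The latent, conditioning, and hidden dimensions, and the way the context is broadcast to the query points, are assumptions made for the example, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowCVAEDecoder(nn.Module):
    """Illustrative CVAE decoder head that predicts a per-query-point
    delta scale and unit direction, then reconstructs the flow step."""
    def __init__(self, latent_dim=64, cond_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.scale_head = nn.Linear(hidden, 1)  # magnitude of each point's motion
        self.dir_head = nn.Linear(hidden, 2)    # 2D direction, normalized below

    def forward(self, z, cond, query_points):
        # z: (B, latent_dim); cond: (B, cond_dim); query_points: (B, N, 2) in [0, 1]
        B, N, _ = query_points.shape
        ctx = torch.cat([z, cond], dim=-1).unsqueeze(1).expand(B, N, -1)
        h = self.net(torch.cat([ctx, query_points], dim=-1))
        scale = F.softplus(self.scale_head(h))             # non-negative delta scale
        direction = F.normalize(self.dir_head(h), dim=-1)  # unit direction per query point
        return query_points + scale * direction            # next positions of the query points
```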
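The clip-level time-contrastive idea can likewise be sketched as follows. This is one plausible instantiation (cosine similarity to the language goal, with later clips as positives), assumed for illustration rather than the exact loss used in FLIP or LIV.

```python
import torch
import torch.nn.functional as F

def clip_time_contrastive_loss(clip_embs, lang_emb, temperature=0.1):
    """Time-contrastive objective over video clips (not single frames).

    clip_embs: (T, D) embeddings of T consecutive clips from one trajectory.
    lang_emb:  (D,)   embedding of the language goal.
    Later clips should score higher against the goal than earlier ones.
    """
    sims = F.cosine_similarity(clip_embs, lang_emb.unsqueeze(0), dim=-1) / temperature
    loss = 0.0
    for t in range(len(sims) - 1):
        # For anchor clip t, the next clip is the positive (index 0 in logits)
        # and all clips up to t are negatives.
        logits = torch.cat([sims[t + 1 : t + 2], sims[: t + 1]]).unsqueeze(0)
        loss = loss + F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
    return loss / max(len(sims) - 1, 1)
```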
LIBERO-LONG Planning Results
We train FLIP on LIBERO-LONG, a suite of long-horizon tabletop manipulation tasks with 50 demonstrations for each task.
The flows, videos, and value curves are all generated by FLIP.
LIBERO-LONG Low-Level Policy Results
We use the flow and video plans from FLIP to train a low-level conditional diffusion policy on the LIBERO-LONG tasks and achieve better results than previous methods. Among our variants, Ours-FV achieves the best results, showing the advantage of using both flow and video conditions.
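For intuition, here is a rough sketch of what a flow- and video-conditioned diffusion-style policy head could look like. The encoders, feature dimensions, and the simplified noise schedule are illustrative assumptions, not the Ours-FV implementation.

```python
import torch
import torch.nn as nn

class FlowVideoConditionedPolicy(nn.Module):
    """Toy denoiser conditioned on flow-plan and video-plan features."""
    def __init__(self, action_dim=7, flow_dim=128, video_dim=256, hidden=512):
        super().__init__()
        self.flow_enc = nn.Linear(flow_dim, hidden)
        self.video_enc = nn.Linear(video_dim, hidden)
        self.denoiser = nn.Sequential(
            nn.Linear(action_dim + 2 * hidden + 1, hidden), nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, t, flow_feat, video_feat):
        # noisy_action: (B, action_dim); t: (B, 1) diffusion timestep in [0, 1]
        cond = torch.cat([self.flow_enc(flow_feat), self.video_enc(video_feat)], dim=-1)
        return self.denoiser(torch.cat([noisy_action, cond, t], dim=-1))

def training_step(policy, action, flow_feat, video_feat):
    """One DDPM-style regression step: predict the injected noise."""
    t = torch.rand(action.shape[0], 1)
    noise = torch.randn_like(action)
    noisy = torch.sqrt(1 - t) * action + torch.sqrt(t) * noise  # simplified, illustrative schedule
    pred = policy(noisy, t, flow_feat, video_feat)
    return nn.functional.mse_loss(pred, noise)
```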
Real World Policy Results: Tea Scooping
Real World Policy Results: Cloth Unfolding
Interactive World Model Results
We manually specify image flows for several LIBERO-LONG tasks to demonstrate the interactive property of FLIP.
Note that these flows are different from the flows in the training dataset.
Resisting Visual Distractions
We show how FLIP performs when we add visual distractions to the initial image. In both cases, we manually add an image of an apple as the visual distraction. If the apple is placed on the target object (left case), the model-based planning fails within several steps; however, if the apple is placed on the background (right case), the model-based planning can continue successfully for a long horizon, showing that our model can resist certain kinds of visual distractions.
It is worth noting that the generated flows are more stable than the generated videos, showing the advantage of using flows as the action representation.
Failure Cases
We show failure cases of FLIP. These are caused by accumulated errors during planning.
Aloha Real Results
Aloha Sim Results
FMB Benchmark Results
Cube Results
Egg Peeling Results
Folding Results
Unfolding Results
Pen Spin Results
Tying Plastic Bag Results
Fruit Peel Results
BibTeX
@article{flip,
author = {Gao, Chongkai and Zhang, Haozhuo and Xu, Zhixuan and Cai, Zhehao and Shao, Lin},
title = {FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks},
journal = {arXiv preprint arXiv:2412.08261},
year = {2024},
url = {https://arxiv.org/abs/2412.08261},
}