SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Yaohui Wang1*, Lingjun Zhang2,1‡, Shaobin Zhuang1,3, Xin Ma4,1‡, Jiashuo Yu1, Yali Wang1, Dahua Lin1†, Yu Qiao1†, Ziwei Liu6†
1Shanghai Artificial Intelligence Laboratory 2East China Normal University 3Shanghai Jiao Tong University 4Monash University 5Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences 6S-Lab, Nanyang Technological University
*Equal contribution †Corresponding authors
‡Work done during internship at Shanghai AI Laboratory
Image-to-Video Generation
Input image with prompt: "Cars on the snowy ground of Doomsday Highway."
Input image with prompt: "Frozen City with crowded cars on the snowy ground."
Note: Our model performs best at a width of 512 pixels; for larger results, a super-resolution algorithm is applied.
Transition Results
Scene 1 → Scene 2: "Spiderman becomes a sand sculpture."
Scene 1 → Scene 2: "Flying through the clouds, a landscape appears."
Scene 1 → Scene 2: "A cat transitions from sitting on the couch to lying on the sand."
Diverse Results for Transition
Example 1 (reference scenes)
Example 2 (reference scenes)
Auto-regressive Video Prediction Results
Ironman flying in the sky.
A beautiful coastal beach in spring, waves lapping on sand by Vincent van Gogh.
A raccoon dressed in suit playing the trumpet, stage background, 4k, high resolution.
A teddy bear washing the dishes.
A panda playing on a swing set.
Long Video Demo
The red boxes mark transitions generated by our model, while the blue boxes (at the end of the video) mark long-shot videos generated through prediction.
Abstract
Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents SEINE, a short-to-long video diffusion model that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes, along with shot-level videos of varying lengths. Specifically, we propose a random-mask video diffusion model that automatically generates transitions based on textual descriptions. Given images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model readily extends to tasks such as image-to-video animation and autoregressive video prediction. To comprehensively evaluate this new generative task, we propose three assessment criteria for smooth and creative transitions: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos.
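The random-mask formulation can be pictured concretely: frames supplied as conditions (e.g. the last frame of scene 1 and the first frame of scene 2) stay visible, and the masked intermediate frames are what the diffusion model denoises under text guidance. Below is a minimal PyTorch sketch of just this masking logic; the function names, the 16-frame length, and the 0.5 keep probability are illustrative assumptions, not the paper's actual interface.

```python
import torch

def build_transition_conditioning(scene1_frame, scene2_frame, num_frames=16):
    """Assemble the masked conditioning input for transition generation.

    scene1_frame, scene2_frame: (C, H, W) tensors holding the last frame of
    scene 1 and the first frame of scene 2. Returns a (T, C, H, W) conditioning
    video and a (T,) binary mask (1 = observed frame, 0 = frame to generate).
    """
    c, h, w = scene1_frame.shape
    cond = torch.zeros(num_frames, c, h, w)
    mask = torch.zeros(num_frames)
    cond[0], mask[0] = scene1_frame, 1.0    # visible endpoint from scene 1
    cond[-1], mask[-1] = scene2_frame, 1.0  # visible endpoint from scene 2
    return cond, mask

def random_training_mask(num_frames=16):
    """Sample a random observation mask during training, so a single model
    can serve transition, image animation, and prediction at inference."""
    mask = (torch.rand(num_frames) < 0.5).float()
    mask[torch.randint(num_frames, (1,))] = 1.0  # keep at least one observed frame
    return mask
```

With such a scheme, image-to-video animation corresponds to observing only the first frame, autoregressive prediction to observing a leading block of frames, and transition to observing both endpoints.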
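To make the three criteria concrete, one plausible way to score them is from per-frame embeddings produced by an image-text encoder such as CLIP: consecutive-frame similarity for temporal consistency, frame-to-reference-scene similarity for semantic similarity, and frame-to-prompt similarity for video-text alignment. The sketch below assumes precomputed embeddings; the exact feature extractor and aggregation used in the paper may differ, and every function name here is hypothetical.

```python
import torch
import torch.nn.functional as F

def temporal_consistency(frame_embs: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between embeddings of consecutive frames.
    frame_embs: (T, D), e.g. CLIP image embeddings of each generated frame."""
    return F.cosine_similarity(frame_embs[:-1], frame_embs[1:], dim=-1).mean()

def semantic_similarity(frame_embs: torch.Tensor, scene_embs: torch.Tensor) -> torch.Tensor:
    """How close each generated frame stays to the two reference scenes.
    scene_embs: (2, D) embeddings of the endpoint scene images; each frame is
    scored against its best-matching scene, then averaged over frames."""
    sims = F.cosine_similarity(frame_embs.unsqueeze(1), scene_embs.unsqueeze(0), dim=-1)  # (T, 2)
    return sims.max(dim=1).values.mean()

def video_text_alignment(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between frame embeddings and the prompt embedding.
    text_emb: (D,) embedding of the transition text description."""
    return F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1).mean()
```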