LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Yaohui Wang1*, Xinyuan Chen1*, Xin Ma1,4*‡, Shangchen Zhou2, Ziqi Huang2, Yi Wang1, Ceyuan Yang1, Yinan He1, Jiashuo Yu1, Peiqing Yang2, Yuwei Guo1,3, Tianxing Wu2, Chenyang Si2, Yuming Jiang2, Cunjian Chen4, Chen Change Loy2, Bo Dai1, Dahua Lin1,3†, Yu Qiao1†, Ziwei Liu2,1†

1Shanghai Artificial Intelligence Laboratory 2S-Lab, Nanyang Technological University 3The Chinese University of Hong Kong 4Monash University

*Equal contribution †Corresponding author

‡Work done during internship at Shanghai AI Laboratory

Text-to-Video Generation

Cinematic shot of Van Gogh's selfie, Van Gogh style

A corgi’s head depicted as an explosion of a nebula, high quality

A panda drinking coffee in a cafe in Paris

Iron Man flying in the sky

A jellyfish floating through the ocean, with bioluminescent tentacles

A Mars rover moving on Mars

The bund Shanghai, oil painting

A fantasy landscape, trending on artstation, 4k, high resolution

A space shuttle launching into orbit, with flames and smoke billowing out from the engines

A super cool giant robot in Cyberpunk city, artstation

A tropical beach at sunrise, with palm trees and crystal-clear water in the foreground

A robot dj is playing the turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy, intrica

Gwen Stacy reading a book

A future where humans have achieved teleportation technology

A steam train moving on a mountainside by Vincent van Gogh

Yoda playing guitar on the stage

A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo

A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background

A cat eating food out of a bowl, in style of Van Gogh

A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh

Vincent van Gogh is painting in the room

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with relative positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
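The first insight in the abstract, temporal self-attention combined with relative positional information, can be illustrated with a toy sketch. This is not the LaVie implementation: the function name `temporal_self_attention`, the nested-list tensor layout, the identity query/key/value projections, and the additive relative-position bias table `rel_bias` are all assumptions chosen for brevity, and the paper's actual positional encoding may take a different form. The sketch only shows the core idea that each spatial location attends across frames, with scores adjusted by the frame offset.

```python
import math


def temporal_self_attention(video, rel_bias):
    """Toy temporal self-attention (illustrative only, not LaVie's code).

    video:    list of T frames; each frame is a list of N spatial tokens;
              each token is a list of D floats.
    rel_bias: hypothetical dict mapping a frame offset (j - i) to an
              additive bias on the attention score; missing offsets get 0.
    Returns a nested list with the same T x N x D layout.
    """
    T = len(video)
    N = len(video[0])
    D = len(video[0][0])
    scale = 1.0 / math.sqrt(D)
    out = [[None] * N for _ in range(T)]

    for n in range(N):
        # Tokens at one spatial location form a sequence over time;
        # attention never mixes different spatial locations.
        seq = [video[t][n] for t in range(T)]
        for i in range(T):
            # Scaled dot-product score plus a bias that depends only on
            # the relative frame offset (assumed identity Q/K/V projections).
            scores = [
                scale * sum(a * b for a, b in zip(seq[i], seq[j]))
                + rel_bias.get(j - i, 0.0)
                for j in range(T)
            ]
            # Numerically stable softmax over the time axis.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            out[i][n] = [
                sum(weights[j] * seq[j][d] for j in range(T)) / z
                for d in range(D)
            ]
    return out
```

Because the bias depends only on `j - i`, the same table applies at every frame position. In a real model the queries, keys, and values would come from learned projections, and such a layer would typically be interleaved with the spatial layers of the pre-trained T2I backbone.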

BibTeX

@article{wang2023lavie,
  title={LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models},
  author={Wang, Yaohui and Chen, Xinyuan and Ma, Xin and Zhou, Shangchen and Huang, Ziqi and Wang, Yi and Yang, Ceyuan and He, Yinan and Yu, Jiashuo and Yang, Peiqing and others},
  journal={arXiv preprint arXiv:2309.15103},
  year={2023}
}