VEnhancer

Generative Space-Time Enhancement for Video Generation

1The Chinese University of Hong Kong, 2Shanghai Artificial Intelligence Laboratory,

3S-Lab, Nanyang Technological University

†Corresponding authors

Clown fish swimming through the coral reef.

A fat rabbit wearing a purple robe walking through a fantasy landscape.

A cat wearing sunglasses at a pool.

A cute raccoon playing guitar in a boat on the ocean.

Gwen Stacy reading a book, black and white.

A storm trooper vacuuming the beach.

a teddy bear is swimming in the ocean.

Iron Man flying in the sky.

The bund Shanghai by Hokusai, in the style of Ukiyo.

An astronaut is riding a horse in the space in a photorealistic style.

A cute happy Corgi playing in park, sunset, in cyberpunk style.

A teddy bear is playing drum kit in NYC Times Square.

An astronaut flying in space, featuring a steady and smooth perspective.

An epic tornado attacking above a glowing city at night, the tornado is made of smoke.

Abstract

We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding detail in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously, with arbitrary up-sampling scales in space and time, through a unified video diffusion model. Furthermore, VEnhancer effectively removes spatial artifacts and temporal flickering in generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate, low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Benefiting from these designs, VEnhancer is stable during training and supports an elegant end-to-end training scheme. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches the top of the video generation benchmark, VBench.
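The space-time data augmentation mentioned above pairs a high-quality clip with a degraded condition video at a sampled spatial scale and frame rate. A minimal sketch of that idea is below; the scale range, stride choices, and bilinear interpolation are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F


def space_time_augment(video, s_range=(1.0, 8.0), t_strides=(1, 2, 4, 8)):
    """Degrade a high-quality clip into a low-resolution, low-frame-rate
    condition video, returning the sampled scales alongside it.

    video: tensor of shape (T, C, H, W).
    Returns (condition, s, t_stride): the degraded key frames, the sampled
    spatial downscaling factor s, and the temporal stride.
    """
    T, C, H, W = video.shape
    # Sample a spatial downscaling factor s and a temporal stride at random,
    # so the model sees arbitrary space-time scales during training.
    s = torch.empty(1).uniform_(*s_range).item()
    t_stride = t_strides[torch.randint(len(t_strides), (1,)).item()]
    # Temporal subsampling: keep every t_stride-th frame (low frame rate).
    key_frames = video[::t_stride]
    # Spatial downscaling by factor s (bilinear is an assumed default here).
    h, w = max(1, round(H / s)), max(1, round(W / s))
    condition = F.interpolate(key_frames, size=(h, w), mode="bilinear",
                              align_corners=False)
    return condition, s, t_stride
```

Because `s` and `t_stride` are returned along with the degraded clip, they can also be fed to the model as conditions, which is the role the downscaling factor plays in the architecture description below.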

Method

The architecture of VEnhancer. It follows ControlNet and copies the architectures and weights of the multi-frame encoder and middle block of a pretrained video diffusion model to build a trainable condition network. This video ControlNet accepts low-resolution key frames as well as full frames of noisy latents as inputs. In addition, the noise level $\sigma$ used for noise augmentation and the downscaling factor $s$ serve as extra network conditions, alongside the timestep $t$ and the prompt $c_{text}$.
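To make the conditioning flow concrete, here is a toy stand-in for the control branch, assuming simplified shapes: the copied encoder and middle block are replaced by a single convolution, the low-resolution key frames are assumed to be already repeated to the full frame count, and the scalar conditions $t$, $\sigma$, $s$ are mapped through a small MLP. The zero-initialized output projection, so the control branch starts as a no-op, follows the standard ControlNet recipe; everything else is an illustrative sketch, not VEnhancer's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def zero_module(m):
    # Zero-init so the control branch contributes nothing at the start
    # of training, as in ControlNet.
    for p in m.parameters():
        nn.init.zeros_(p)
    return m


class ToyVideoControlNet(nn.Module):
    """Minimal stand-in for the trainable condition network: takes full-frame
    noisy latents plus low-resolution key frames, conditioned on scalars
    (t, sigma, s) and a text embedding."""

    def __init__(self, latent_ch=4, cond_ch=3, hidden=64, text_dim=77):
        super().__init__()
        # Scalar conditions (t, sigma, s) -> one embedding vector.
        self.scalar_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        self.text_proj = nn.Linear(text_dim, hidden)
        # Stands in for the copied multi-frame encoder + middle block.
        self.encoder = nn.Conv2d(latent_ch + cond_ch, hidden, 3, padding=1)
        # Zero-initialized projection: the control signal injected into the
        # frozen base model is exactly zero at initialization.
        self.zero_out = zero_module(nn.Conv2d(hidden, latent_ch, 1))

    def forward(self, noisy_latents, lr_keyframes, t, sigma, s, text_emb):
        # noisy_latents: (B, T, C, H, W); lr_keyframes: (B, T, c, h, w),
        # assumed here to be pre-repeated to T frames for simplicity.
        B, T, C, H, W = noisy_latents.shape
        x = noisy_latents.flatten(0, 1)                      # (B*T, C, H, W)
        cond = F.interpolate(lr_keyframes.flatten(0, 1), size=(H, W),
                             mode="bilinear", align_corners=False)
        # Fuse scalar and text conditions into one embedding per sample.
        emb = self.scalar_mlp(torch.stack([t, sigma, s], dim=-1))
        emb = emb + self.text_proj(text_emb)                 # (B, hidden)
        h = self.encoder(torch.cat([x, cond], dim=1))
        h = h + emb.repeat_interleave(T, dim=0)[..., None, None]
        # Residual control signal, shaped like the latents.
        return self.zero_out(h).view(B, T, C, H, W)
```

Because of the zero-initialized projection, adding this branch to a pretrained diffusion model leaves its outputs unchanged at step zero, which is what makes this style of training stable.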

VEnhancer Demo

BibTeX

@article{he2024venhancer,
  title={VEnhancer: Generative Space-Time Enhancement for Video Generation},
  author={He, Jingwen and Xue, Tianfan and Liu, Dongyang and Lin, Xinqi and Gao, Peng and Lin, Dahua and Qiao, Yu and Ouyang, Wanli and Liu, Ziwei},
  journal={arXiv preprint arXiv:2407.07667},
  year={2024}
}