[CVPR 2024] VideoBooth: Diffusion-based Video Generation with Image Prompts

VideoBooth


This repository contains the implementation of the following paper:

VideoBooth: Diffusion-based Video Generation with Image Prompts
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

From MMLab@NTU, affiliated with S-Lab, Nanyang Technological University, and Shanghai AI Laboratory.

Overview

Our VideoBooth generates videos with the subjects specified in the image prompts. (Figure: overall structure.)

Installation

  1. Clone the repository.

git clone https://github.com/Vchitect/VideoBooth.git
cd VideoBooth

  2. Install the environment.

conda env create -f environment.yml
conda activate videobooth

  3. Download the pretrained models (Stable Diffusion v1.4, VideoBooth) and put them under the folder ./pretrained_models/.
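A small sketch for sanity-checking the download step. The exact folder names under ./pretrained_models/ are assumptions for illustration; use the actual files linked in this repository.

```python
from pathlib import Path

# Hypothetical layout: the exact subfolder names are assumptions,
# not the repository's verified structure.
expected = [
    "pretrained_models/stable-diffusion-v1-4",
    "pretrained_models/videobooth",
]

def check_pretrained(root: str = ".") -> list:
    """Return the expected model paths that are missing under `root`."""
    return [p for p in expected if not (Path(root) / p).exists()]

missing = check_pretrained()
if missing:
    print("Missing pretrained models:", missing)
```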

Inference

Here we provide one example of how to run inference:

python sample_scripts/sample.py --config sample_scripts/configs/panda.yaml

If you want to use your own image, you need to segment the object first. We use Grounded-SAM to segment the subject from images.

Training

VideoBooth is trained in a coarse-to-fine manner.

Stage 1: Coarse Stage Training

srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage1.py \
    --model TAVU \
    --num-frames 16 \
    --dataset WebVideoImageStage1 \
    --frame-interval 4 \
    --ckpt-every 1000 \
    --clip-max-norm 0.1 \
    --global-batch-size 16 \
    --reg-text-weight 0 \
    --results-dir ./results \
    --pretrained-t2v-model path-to-t2v-model \
    --global-mapper-path path-to-elite-global-model

Stage 2: Fine Stage Training

srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage2.py \
    --model TAVU \
    --num-frames 16 \
    --dataset WebVideoImageStage2 \
    --frame-interval 4 \
    --ckpt-every 1000 \
    --clip-max-norm 0.1 \
    --global-batch-size 16 \
    --reg-text-weight 0 \
    --results-dir ./results \
    --pretrained-t2v-model path-to-t2v-model \
    --global-mapper-path path-to-stage1-model
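In both commands, --global-batch-size 16 is split across 1 node x 8 processes, so each GPU sees 2 samples per step. A small helper to sanity-check this arithmetic (my own illustration, not part of the training scripts):

```python
def per_gpu_batch_size(global_batch_size: int, nnodes: int,
                       nproc_per_node: int) -> int:
    """Samples each process handles per step under data parallelism."""
    world_size = nnodes * nproc_per_node
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must divide evenly across processes")
    return global_batch_size // world_size

# For the commands above: 16 samples over 1 node x 8 GPUs.
print(per_gpu_batch_size(16, 1, 8))  # → 2
```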

Dataset Preparation

You can download our proposed dataset from HuggingFace.

# Merge the split zip files
zip -F webvid_parsing_2M_split.zip --out single-archive.zip

# Unzip, then replace path-to-webvid-parsing with this path
unzip single-archive.zip

# Unzip, then replace path-to-videobooth-subset with this path
unzip webvid_parsing_videobooth_subset.zip
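Since the dataset arrives as split archives, it is worth verifying the merged zip before unzipping and training on it. A minimal sketch using Python's standard zipfile module:

```python
import zipfile

def archive_is_intact(path: str) -> bool:
    """Return True if every member of the zip passes its CRC check."""
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        return zf.testzip() is None
```

For example, `archive_is_intact("single-archive.zip")` should return True after a successful `zip -F` merge.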

Citation

If you find our repo useful for your research, please consider citing our paper:

@article{jiang2023videobooth,
  author = {Jiang, Yuming and Wu, Tianxing and Yang, Shuai and Si, Chenyang and Lin, Dahua and Qiao, Yu and Loy, Chen Change and Liu, Ziwei},
  title  = {VideoBooth: Diffusion-based Video Generation with Image Prompts},
  year   = {2023}
}