Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Shanghai Artificial Intelligence Laboratory

🔥 Update and News

[2025.03.17] 🔥 Our Vchitect-T2V-Dataverse is released.
[2025.01.25] Our paper is released.
[2024.09.14] Inference code and checkpoint are released.

😲 Gallery

Installation

1. Create a conda environment and install PyTorch

Note: You may want to adjust the CUDA version according to your driver version.

conda create -n VchitectXL -y
conda activate VchitectXL
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y

2. Install dependencies

pip install -r requirements.txt
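
After both installation steps, it can be useful to confirm that the pinned PyTorch packages actually ended up in the environment. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and it assumes the conda packages expose the distribution names `torch`, `torchvision`, and `torchaudio` to `importlib.metadata`.

```python
from importlib import metadata

# Versions pinned in the conda install command above.
EXPECTED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def find_mismatches(expected=EXPECTED, lookup=metadata.version):
    """Return {package: installed_version_or_None} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Builds may carry a local suffix such as "+cu121"; compare the public part only.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches()
    print("environment OK" if not problems else f"version mismatches: {problems}")
```

If you deliberately changed the CUDA or PyTorch version to match your driver, adjust `EXPECTED` accordingly.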

Inference

First download the checkpoint.

save_dir=$1
ckpt_path=$2

python inference.py --test_file assets/test.txt --save_dir "${save_dir}" --ckpt_path "${ckpt_path}"
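
If you prefer to launch inference from Python rather than a shell script, the same command can be assembled as an argument list. This is a sketch under assumptions: the function name `build_inference_command` and the example paths are hypothetical, and only the three flags shown above are used.

```python
import shlex


def build_inference_command(save_dir, ckpt_path, test_file="assets/test.txt"):
    """Return the argv list for the inference call shown above."""
    return [
        "python", "inference.py",
        "--test_file", test_file,
        "--save_dir", save_dir,
        "--ckpt_path", ckpt_path,
    ]


if __name__ == "__main__":
    # Hypothetical paths, used only for illustration.
    cmd = build_inference_command("outputs", "checkpoints/Vchitect-2.0")
    print(shlex.join(cmd))
    # To actually launch it from the repository root:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when a path contains spaces.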

The following arguments in inference.py control inference:

num_inference_steps: Denoising steps, default is 100
guidance_scale: CFG scale to use, default is 7.5
width: The width of the output video, default is 768
height: The height of the output video, default is 432
frames: The number of frames, default is 40
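
To keep these settings together when experimenting, one option is a small container that records the defaults listed above and rejects obviously invalid values. The class `InferenceSettings` is hypothetical and does not exist in the repository; how the values are passed on to the model depends on inference.py itself.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceSettings:  # hypothetical container, not part of the repository
    num_inference_steps: int = 100  # denoising steps
    guidance_scale: float = 7.5     # CFG scale
    width: int = 768                # output width in pixels
    height: int = 432               # output height in pixels
    frames: int = 40                # number of frames

    def __post_init__(self):
        # Step count, size, and frame count only make sense as positive numbers.
        for name in ("num_inference_steps", "width", "height", "frames"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


defaults = InferenceSettings()
print(asdict(defaults))
```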

The results below were generated using the example prompt.

The base T2V model supports generating videos at resolutions up to 720x480 and 8 fps. VEnhancer is then used to upscale the resolution to 2K and interpolate the frame rate to 24 fps.
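
The frame-rate figures translate into simple arithmetic. The sketch below assumes that the default 40-frame clip is played back at 8 fps; the function names are illustrative only, and the exact number of frames VEnhancer produces is not stated here, so it is not computed.

```python
BASE_FPS = 8         # frame rate of the base T2V model
ENHANCED_FPS = 24    # frame rate after VEnhancer interpolation
DEFAULT_FRAMES = 40  # default value of `frames` in inference.py


def clip_seconds(frames, fps):
    """Duration of a clip in seconds."""
    return frames / fps


def interpolation_factor(base_fps, target_fps):
    """How many times the frame rate is multiplied by interpolation."""
    return target_fps / base_fps


print(clip_seconds(DEFAULT_FRAMES, BASE_FPS))        # 5.0 seconds
print(interpolation_factor(BASE_FPS, ENHANCED_FPS))  # 3.0x frame rate
```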

BibTeX

@article{fan2025vchitect,
  title={Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models},
  author={Fan, Weichen and Si, Chenyang and Song, Junhao and Yang, Zhenyu and He, Yinan and Zhuo, Long and Huang, Ziqi and Dong, Ziyue and He, Jingwen and Pan, Dongwei and others},
  journal={arXiv preprint arXiv:2501.08453},
  year={2025}
}

🔑 License

This code is licensed under Apache-2.0. The framework is fully open for academic research and also allows free commercial usage.

Disclaimer

We disclaim responsibility for user-generated content. The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities. Using the model to generate pornographic, violent, or gory content, or content that is demeaning or harmful to people or their environment, culture, religion, etc., is prohibited. Users are solely liable for their actions. The project contributors are not legally affiliated with users, nor accountable for their behavior. Use the generative model responsibly, adhering to ethical and legal standards.