THUDM/CogVideo: text- and image-to-video generation with CogVideoX (2024) and CogVideo (ICLR 2023)

CogVideo & CogVideoX

Read in Chinese

Read in Japanese

Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space

📚 View the paper and user guide

👋 Join our WeChat and Discord

📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.

Project Updates


Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models like GLM-4 (or comparable products such as GPT-4) to optimize the prompt. This is crucial because the model is trained on long prompts, and a good prompt directly impacts the quality of the generated video.
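The optimization itself amounts to having a capable LLM rewrite a short idea into the long, detailed style the model was trained on. Below is a minimal sketch assuming an OpenAI-compatible client and a hypothetical system prompt; the guide linked above contains the full, recommended instructions.

```python
# Minimal prompt-optimization sketch.
# Assumptions: an OpenAI-compatible API and a simplified, hypothetical system prompt;
# the repository's guide provides a much more detailed one.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Rewrite the user's short video idea into a single, richly detailed paragraph "
    "describing the subjects, actions, camera movement, lighting, and atmosphere, "
    "suitable as input to a text-to-video diffusion model."
)

def optimize_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # or GLM-4 via a compatible endpoint
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(optimize_prompt("a cat playing piano"))
```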

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive.

Follow the instructions in sat_demo, which contains the inference and fine-tuning code for the SAT weights. Building on the CogVideoX model structure from this code base is recommended, as it lets researchers iterate and develop rapidly.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive.

pip install -r requirements.txt

Then follow diffusers_demo for a more detailed explanation of the inference code, including the significance of common parameters.
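For orientation, a minimal text-to-video call with the CogVideoX pipeline in Diffusers might look like the sketch below (assuming diffusers >= 0.30, which includes CogVideoXPipeline, and a CUDA GPU); diffusers_demo remains the fully documented version.

```python
# Minimal CogVideoX text-to-video sketch (assumes diffusers >= 0.30 and a CUDA GPU).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()  # keep VRAM low; slower, but fits small GPUs
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A panda playing guitar by a quiet stream, cinematic lighting.",
    num_frames=49,               # CogVideoX-5B expects 8N + 1 frames
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator(device="cpu").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```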

For more details on quantized inference, please refer to diffusers-torchao. With Diffusers and TorchAO, quantized inference reduces memory usage and, when the model is compiled, can also be faster in some cases. A full list of memory and time benchmarks with various settings on A100 and H100 GPUs has been published at diffusers-torchao.
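As one possible starting point, int8 weight-only quantization of the CogVideoX transformer with TorchAO might look like the sketch below (assuming torchao >= 0.4); refer to diffusers-torchao for the tested recipes and the exact settings behind the published benchmarks.

```python
# Sketch: int8 weight-only quantization of the CogVideoX transformer with TorchAO.
# Assumptions: torchao >= 0.4 and diffusers >= 0.30; see diffusers-torchao for tested recipes.
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

quantize_(pipe.transformer, int8_weight_only())      # quantize only the DiT weights
pipe.transformer = torch.compile(pipe.transformer)   # optional: speedup in some cases
pipe.to("cuda")

video = pipe(prompt="A lighthouse on a stormy coast at dusk.", num_frames=49).frames[0]
```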

CogVideoX-5B

(Gallery: eight sample videos generated by CogVideoX-5B, 5b_1.mp4 through 5b_8.mp4.)

CogVideoX-2B

To view the prompts corresponding to the gallery videos, please click here

Model Introduction

CogVideoX is the open-source version of the video generation model originating from QingYing. The table below lists the video generation models we currently offer, along with their basic specifications.

| Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|---|---|
| Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
| Video Resolution | 1360 * 768 | Min(W, H) = 768, 768 ≤ Max(W, H) ≤ 1360, Max(W, H) % 16 = 0 | 720 * 480 | 720 * 480 | 720 * 480 |
| Number of Frames | 16N + 1, N ≤ 10 (default 81) | 16N + 1, N ≤ 10 (default 81) | 8N + 1, N ≤ 6 (default 49) | 8N + 1, N ≤ 6 (default 49) | 8N + 1, N ≤ 6 (default 49) |
| Inference Precision | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single GPU Memory Usage | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT FP16: 18GB; diffusers FP16: from 4GB*; diffusers INT8 (torchao): from 3.6GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* |
| Multi-GPU Memory Usage | BF16: 24GB* (diffusers) | BF16: 24GB* (diffusers) | FP16: 10GB* (diffusers) | BF16: 15GB* (diffusers) | BF16: 15GB* (diffusers) |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~1000 s, Single H100: ~550 s (5-second video) | Single A100: ~1000 s, Single H100: ~550 s (5-second video) | Single A100: ~90 s, Single H100: ~45 s | Single A100: ~180 s, Single H100: ~90 s | Single A100: ~180 s, Single H100: ~90 s |
| Prompt Language | English* | English* | English* | English* | English* |
| Prompt Token Limit | 224 tokens | 224 tokens | 226 tokens | 226 tokens | 226 tokens |
| Video Length | 5 or 10 seconds | 5 or 10 seconds | 6 seconds | 6 seconds | 6 seconds |
| Frame Rate | 16 frames / second | 16 frames / second | 8 frames / second | 8 frames / second | 8 frames / second |
| Position Encoding | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| Download Link (Diffusers) | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel |
| Download Link (SAT) | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | 🤗 HuggingFace, 🤖 ModelScope, 🟣 WiseModel | SAT | SAT | SAT |

Data Explanation

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
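These three calls are the main memory-saving switches in Diffusers: enable_sequential_cpu_offload() keeps model components in CPU memory and moves each onto the GPU only while it is needed, which sharply reduces peak VRAM at the cost of speed, while enable_slicing() and enable_tiling() let the VAE decode the video latent in slices and spatial tiles rather than all at once. The memory figures marked with * in the table above assume optimizations of this kind are enabled; without them, peak usage is considerably higher.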

We warmly welcome contributions from the community and actively contribute to the open-source ecosystem ourselves. The following works have already been adapted for CogVideoX, and we invite everyone to use them:

Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the CogVideoX open-source model.

Quick Start with Colab

Here are three projects that can be run directly on free Colab T4 instances:

Inference

finetune

sat

Tools

This folder contains tools for model conversion, caption generation, and similar tasks.

CogVideo (ICLR'23)

The official repo for the paper CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch.

**CogVideo is able to generate relatively high-frame-rate videos.** A 4-second clip of 32 frames is shown below.

(High-frame-rate sample and intro images: cogvideo.mp4)

The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.

Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Model-License

The code in this repository is released under the Apache 2.0 License.

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module, including both I2V and T2V) is released under the CogVideoX LICENSE.