
Open-Sora: Democratizing Efficient Video Production for All

We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.

🎬 For a professional AI video-generation product, try Video Ocean, powered by a superior model.

📰 News

πŸ“ Since Open-Sora is under active development, we remain different branches for different versions. The latest version is main. Old versions include: v1.0, v1.1, v1.2, v1.3.

🎥 Latest Demo

Demos are presented in compressed GIF format for convenience. For original quality samples and their corresponding prompts, please visit our Gallery.

[Demo videos: 5s 1024×576, 5s 576×1024, 5s 576×1024]

OpenSora 1.3 Demo

[Demo videos: 5s 720×1280, 5s 720×1280, 5s 720×1280]

OpenSora 1.2 Demo

[Demo videos: 4s 720×1280, 4s 720×1280, 4s 720×1280]

OpenSora 1.1 Demo

[Demo videos: 2s 240×426, 2s 240×426, 2s 426×240, 4s 480×854, 16s 320×320, 16s 224×448, 2s 426×240]

OpenSora 1.0 Demo

[Demo videos: 2s 512×512, 2s 512×512, 2s 512×512]
- A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop.
- A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff.
- The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall.
- A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...]
- The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...]
- A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...]

Videos are downsampled to .gif for display. Click for the original videos. Prompts are trimmed for display; see here for the full prompts.

🔆 Reports

πŸ“ Since Open-Sora is under active development, we remain different branches for different versions. The latest version is main. Old versions include: v1.0, v1.1, v1.2, v1.3.

Quickstart

Installation

Create a virtual environment and activate it (conda as an example)

conda create -n opensora python=3.10
conda activate opensora

Download the repository

git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora

Ensure torch >= 2.4.0, then install the package

pip install -v .  # for development mode, use: pip install -v -e .
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121  # install xformers according to your CUDA version
pip install flash-attn --no-build-isolation
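To quickly confirm that the torch requirement is satisfied, a minimal sanity check (not part of the official instructions) is:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect a version >= 2.4.0 and True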

Optionally, you can install Flash Attention 3 for faster inference.

git clone https://github.com/Dao-AILab/flash-attention  # commit 4f0640d5
cd flash-attention/hopper
python setup.py install

Model Download

Our 11B model supports 256px and 768px resolutions, and a single model handles both T2V and I2V. It is available on 🤗 Hugging Face and 🤖 ModelScope.

Download from huggingface:

pip install "huggingface_hub[cli]" huggingface-cli download hpcai-tech/Open-Sora-v2 --local-dir ./ckpts

Download from ModelScope:

pip install modelscope
modelscope download hpcai-tech/Open-Sora-v2 --local_dir ./ckpts

Text-to-Video Generation

Our model is optimized for image-to-video generation, but it can also be used for text-to-video generation. To generate high-quality videos, we build a text-to-image-to-video pipeline with the help of the Flux text-to-image model. For 256x256 resolution:

Generate a video from a single prompt

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea"

Save memory with offloading

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --offload True

Generation with a CSV file of prompts

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --dataset.data-path assets/texts/example.csv
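The exact column layout expected by --dataset.data-path is defined by assets/texts/example.csv in the repository and is not reproduced here; a hypothetical sketch, assuming a single text column (the column name is an assumption), might look like:

text
"raining, sea"
"A bustling city street at night, filled with the glow of car headlights."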

For 768x768 resolution:

One GPU

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_768px.py --save-dir samples --prompt "raining, sea"

Multi-GPU with ColossalAI sequence parallelism

torchrun --nproc_per_node 8 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_768px.py --save-dir samples --prompt "raining, sea"

You can adjust the aspect ratio with --aspect_ratio and the video length with --num_frames. Candidate values for aspect_ratio include 16:9, 9:16, 1:1, and 2.39:1. Candidate values for num_frames must be of the form 4k+1 and less than 129.
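For example, combining the flags described above, you could generate a 9:16 portrait clip with 97 frames (a value of the form 4k+1 below 129) like this:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --aspect_ratio 9:16 --num_frames 97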

You can also run direct text-to-video by:

One GPU for 256px

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/256px.py --prompt "raining, sea"

Multi-GPU for 768px

torchrun --nproc_per_node 8 --standalone scripts/diffusion/inference.py configs/diffusion/inference/768px.py --prompt "raining, sea"

Image-to-Video Generation

Given a prompt and a reference image, you can generate a video with the following command:

256px

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/256px.py --cond_type i2v_head --prompt "A plump pig wallows in a muddy pond on a rustic farm, its pink snout poking out as it snorts contentedly. The camera captures the pig's playful splashes, sending ripples through the water under the midday sun. Wooden fences and a red barn stand in the background, framed by rolling green hills. The pig's muddy coat glistens in the sunlight, showcasing the simple pleasures of its carefree life." --ref assets/texts/i2v.png

256px with CSV

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/256px.py --cond_type i2v_head --dataset.data-path assets/texts/i2v.csv

Multi-GPU 768px

torchrun --nproc_per_node 8 --standalone scripts/diffusion/inference.py configs/diffusion/inference/768px.py --cond_type i2v_head --dataset.data-path assets/texts/i2v.csv

Advanced Usage

Motion Score

During training, we include a motion score in the text prompt. During inference, you can use the following command to generate videos with a given motion score (the default is 4):

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --motion-score 4

We also provide a dynamic motion score evaluator. After setting your OpenAI API key, you can use the following command to evaluate the motion score of a video:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --motion-score dynamic

[Example videos generated with motion scores 1, 4, and 7]

Prompt Refine

We use ChatGPT to refine the prompt. You can enable this with the following command; the feature is available for both text-to-video and image-to-video generation.

export OPENAI_API_KEY=sk-xxxx
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --refine-prompt True

Reproducibility

To make the results reproducible, you can set the random seed by:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --sampling_option.seed 42 --seed 42

Use --num-sample k to generate k samples for each prompt.
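For example, to draw three samples of the same prompt in a single reproducible run, the flags above can be combined as:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --sampling_option.seed 42 --seed 42 --num-sample 3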

Computational Efficiency

We test the computational efficiency of text-to-video generation on H100/H800 GPUs. For 256x256, we use ColossalAI's tensor parallelism with --offload True. For 768x768, we use ColossalAI's sequence parallelism. All runs use 50 sampling steps. Results are reported as total time (s) / peak GPU memory (GB).

Resolution   1x GPU         2x GPUs        4x GPUs        8x GPUs
256x256      60 / 52.5      40 / 44.3      34 / 44.3      -
768x768      1656 / 60.3    863 / 48.3     466 / 44.3     276 / 44.3

Evaluation

On VBench, Open-Sora 2.0 significantly narrows the gap with OpenAI's Sora, reducing it from 4.52% (Open-Sora 1.2) to 0.69%.

[Figure: VBench score comparison]

Human preference results show our model is on par with HunyuanVideo 11B and Step-Video 30B.

[Figure: Human preference win rates]

With strong performance, Open-Sora 2.0 is cost-effective.

[Figure: Cost comparison]

Contribution

Thanks goes to these wonderful contributors:

If you wish to contribute to this project, please refer to the Contribution Guideline.

Acknowledgement

Here we only list a few of the projects. For other works and datasets, please refer to our report.

Citation

@article{opensora,
  title={Open-sora: Democratizing efficient video production for all},
  author={Zheng, Zangwei and Peng, Xiangyu and Yang, Tianji and Shen, Chenhui and Li, Shenggui and Liu, Hongxin and Zhou, Yukun and Li, Tianyi and You, Yang},
  journal={arXiv preprint arXiv:2412.20404},
  year={2024}
}

@article{opensora2,
  title={Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k},
  author={Xiangyu Peng and Zangwei Zheng and Chenhui Shen and Tom Young and Xinying Guo and Binluo Wang and Hang Xu and Hongxin Liu and Mingyan Jiang and Wenjun Li and Yuhui Wang and Anbang Ye and Gang Ren and Qianran Ma and Wanying Liang and Xiang Lian and Xiwen Wu and Yuting Zhong and Zhuangyan Li and Chaoyu Gong and Guojun Lei and Leijun Cheng and Limin Zhang and Minghao Li and Ruijie Zhang and Silan Hu and Shijie Huang and Xiaokang Wang and Yuanheng Zhao and Yuqi Wang and Ziang Wei and Yang You},
  journal={arXiv preprint arXiv:2503.09642},
  year={2025}
}

Star History

[Star History Chart]