Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [ICML 2025]


Accelerate Video Generation with High Pixel-level Fidelity

| Blog | Paper | Twitter/X |

🔥 News 🔥

📚 About

Sparse VideoGen (SVG) is a training-free framework that accelerates video diffusion transformers by exploiting the inherent spatial and temporal sparsity of their 3D full-attention operations: attention heads tend to focus either on tokens within the same frame (spatial heads) or on tokens at the same spatial location across frames (temporal heads), so most of the dense attention computation can be skipped with negligible loss in pixel-level fidelity.
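Concretely, a spatial head mostly attends within individual frames, which corresponds to a block-diagonal attention mask over the flattened (frames x tokens-per-frame) sequence, while a temporal head mostly attends to the same spatial location across frames, which corresponds to a strided mask. The toy snippet below only illustrates these two mask shapes; the tensor sizes and function names are made up for illustration and are not taken from this repository.

# Toy illustration of spatial vs. temporal attention masks (NOT the repo's implementation).
# Assumes a video latent flattened into a sequence of F frames x T tokens per frame.
import torch

def spatial_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Block-diagonal mask: each token attends only to tokens in its own frame."""
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Strided mask: each token attends only to the same spatial location in every frame."""
    pos_id = torch.arange(tokens_per_frame).repeat(num_frames)
    return pos_id[:, None] == pos_id[None, :]

if __name__ == "__main__":
    F_, T_ = 4, 6                     # tiny example: 4 frames, 6 tokens per frame
    s, t = spatial_mask(F_, T_), temporal_mask(F_, T_)
    print("sequence length:", F_ * T_)
    print("spatial mask density:", s.float().mean().item())   # 1 / F_
    print("temporal mask density:", t.float().mean().item())  # 1 / T_

In SVG, the choice between such patterns is made per head at inference time via lightweight online profiling, so no retraining is required.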

🎥 Demo

πŸ› οΈ Installation

Begin by cloning the repository:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/svg-project/Sparse-VideoGen.git  # skip the LFS demo assets, otherwise the clone is too large
cd Sparse-VideoGen

We recommend CUDA 12.4 / 12.8 with PyTorch 2.5.1 / 2.6.0.

1. Create and activate conda environment

conda create -n SVG python==3.10.9
conda activate SVG

2. Install PyTorch

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
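After this step, you can optionally confirm that the installed PyTorch and CUDA versions match the recommendation above; the check below is generic and not part of the repository.

# Quick sanity check of the installed PyTorch / CUDA versions.
import torch

print("torch:", torch.__version__)    # expect 2.5.1 or 2.6.0
print("cuda:", torch.version.cuda)    # expect 12.4 or 12.8
print("gpu available:", torch.cuda.is_available())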

3. Install pip dependencies from CogVideoX and HunyuanVideo

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

4. (Optional) Install customized kernels for maximum speedup. (You may need to upgrade your CMake and CUDA versions.)

git submodule update --init --recursive
cd svg/kernels
bash setup.sh

🚀 Inference Examples

Wan 2.1

We support running Wan 2.1 inference using diffusers. Please make sure to install the latest version of diffusers.

pip install git+https://github.com/huggingface/diffusers
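For reference, plain Wan 2.1 Text-to-Video inference through diffusers (without SVG's sparse attention) looks roughly like the sketch below; the scripts in this repository wrap such a pipeline and add the SVG-specific options. The model id and generation settings here are assumptions based on the public diffusers examples, not values taken from this repository.

# Minimal sketch: vanilla Wan 2.1 T2V via diffusers (no SVG sparse attention).
# Model id and generation settings are assumptions; adjust to your setup.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=720,
    width=1280,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_t2v_baseline.mp4", fps=16)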

We support Text-to-Video and Image-to-Video inference for the Wan 2.1 model. The running scripts are:

Text-to-Video

bash scripts/wan_t2v_inference.sh

Image-to-Video

bash scripts/wan_i2v_inference.sh

Command Line:

Text-to-Video

python wan_t2v_inference.py \
--prompt "${prompt}" \
--height 720 \
--width 1280 \
--pattern "SVG" \
--num_sampled_rows 64 \
--sparsity 0.25 \
--first_times_fp 0.025 \
--first_layers_fp 0.075

Image-to-Video

python wan_i2v_inference.py \
--prompt "$prompt" \
--image_path "$image_path" \
--seed 0 \
--num_inference_steps 40 \
--pattern "SVG" \
--num_sampled_rows 64 \
--sparsity 0.25 \
--first_times_fp 0.025 \
--first_layers_fp 0.075

If you want to run 480p video generation, please change the height and width arguments to 480 and 832, respectively.
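The SVG-related flags shared by these commands control the sparse attention pattern: --sparsity sets the target sparsity level, --num_sampled_rows sets how many query rows are sampled when profiling each attention head, and --first_times_fp / --first_layers_fp appear to keep the earliest fraction of denoising steps and transformer layers in full (dense) attention; please check the inference scripts for the exact semantics. The snippet below is only a conceptual sketch of how row-sampled profiling can choose between a spatial and a temporal mask for one head; all names are illustrative and none of this is the repository's API.

# Conceptual sketch of per-head online profiling (illustrative only, not the repo's code).
# Sample a few query rows, measure how much of the dense attention mass each sparse
# mask would retain, and pick the better mask for this head.
import torch

def profile_head(q, k, mask_spatial, mask_temporal, num_sampled_rows=64):
    # q, k: (seq_len, head_dim); masks: boolean (seq_len, seq_len)
    seq_len, head_dim = q.shape
    rows = torch.randperm(seq_len)[:num_sampled_rows]
    probs = ((q[rows] @ k.T) / head_dim ** 0.5).softmax(dim=-1)   # (rows, seq_len)
    recall_spatial = (probs * mask_spatial[rows]).sum(dim=-1).mean()
    recall_temporal = (probs * mask_temporal[rows]).sum(dim=-1).mean()
    return "spatial" if recall_spatial >= recall_temporal else "temporal"

Because this decision is made repeatedly during sampling, only a small number of rows (64 in the commands above) are profiled, keeping the overhead negligible.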

HunyuanVideo

To run HunyuanVideo Text-to-Video inference examples, you first need to download the checkpoints under ckpts following the official guide. Then, run

bash scripts/hyvideo_inference.sh

Command line:

python3 hyvideo_inference.py \
--video-size 720 1280 \
--video-length 129 \
--infer-steps 50 \
--seed 0 \
--prompt "A cat walks on the grass, realistic style." \
--embedded-cfg-scale 6.0 \
--flow-shift 7.0 \
--flow-reverse \
--use-cpu-offload \
--output_path ./output.mp4 \
--pattern "SVG" \
--num_sampled_rows 64 \
--sparsity 0.2 \
--first_times_fp 0.055 \
--first_layers_fp 0.025

On a single H100, the generation should take around 14 minutes.

CogVideoX v1.5

To run the CogVideoX v1.5 Image-to-Video inference examples, run

bash scripts/cog_inference.sh

Command line:

python3 cog_inference.py \
--prompt "A bright yellow water taxi glides smoothly across the choppy waters, creating gentle ripples in its wake. The iconic Brooklyn Bridge looms majestically in the background, its intricate web of cables and towering stone arches standing out against the city skyline. The boat, bustling with passengers, offers a lively contrast to the serene, expansive sky dotted with fluffy clouds. As it cruises forward, the vibrant cityscape of New York unfolds, with towering skyscrapers and historic buildings lining the waterfront, capturing the dynamic essence of urban life." \
--image_path "examples/cog/img/boat.jpg" \
--output_path "output.mp4"

On a single H100, the generation should take around 4 minutes.

📑 Open-source Plan

🔗 BibTeX

If you find Sparse VideoGen useful or interesting for your research and applications, please cite our work using the following BibTeX:

@article{xi2025sparse,
  title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and others},
  journal={arXiv preprint arXiv:2502.01776},
  year={2025}
}

@article{yang2025sparse,
  title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
  author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
  journal={arXiv preprint arXiv:2505.18875},
  year={2025}
}