GitHub - Tencent-Hunyuan/HunyuanVideo-Avatar

HunyuanVideo-Avatar 🌅

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

🔥🔥🔥 News!!

Jun 06, 2025: 🔥 HunyuanVideo-Avatar supports Single GPU with only 10GB VRAM, with TeaCache included, HUGE THANKS to Wan2GP
May 28, 2025: 🔥 HunyuanVideo-Avatar is available in Cloud-Native-Build (CNB) HunyuanVideo-Avatar.
May 28, 2025: 👋 We release the inference code and model weights of HunyuanVideo-Avatar. Download.

📑 Open-source Plan

HunyuanVideo-Avatar
- Inference
- Checkpoints
- ComfyUI

Abstract

Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios. The source code and model weights will be released publicly.

HunyuanVideo-Avatar Overall Architecture

We propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of generating dynamic, emotion-controllable, and multi-character dialogue videos.

🎉 HunyuanVideo-Avatar Key Features

High-Dynamic and Emotion-Controllable Video Generation

HunyuanVideo-Avatar animates any input avatar image into a high-dynamic, emotion-controllable video driven by simple audio conditions. It accepts multi-style avatar images at arbitrary scales and resolutions, including photorealistic, cartoon, 3D-rendered, and anthropomorphic characters, and supports multi-scale generation spanning portrait, upper-body, and full-body framing. It generates videos with highly dynamic foregrounds and backgrounds, achieving superior realism and naturalness. In addition, the system supports controlling the characters' facial emotions conditioned on the input audio.

Various Applications

HunyuanVideo-Avatar supports various downstream tasks and applications. For instance, the system generates talking avatar videos, which can be applied to e-commerce, online streaming, social media video production, and similar scenarios. In addition, its multi-character animation feature extends its use to areas such as video content creation and editing.

📜 Requirements

An NVIDIA GPU with CUDA support is required.
- The model is tested on a machine with 8 GPUs.
- Minimum: 24GB of GPU memory for 704px × 768px × 129 frames, though generation is very slow.
- Recommended: We recommend using a GPU with 96GB of memory for better generation quality.
- Tips: If OOM occurs when using a GPU with 80GB of memory, try reducing the image resolution.
Tested operating system: Linux
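
The memory figures above can be turned into a quick pre-check. The sketch below is not part of the repository: `recommend_mode` and `first_gpu_mib` are hypothetical helper names, and the mapping is a simplification that uses only the thresholds stated in this README (96GB recommended, 24GB minimum, 10GB for the low-VRAM routes described later).

```shell
# Hypothetical helper: map total GPU memory in MiB (as reported by nvidia-smi)
# to the tier described in this README.
recommend_mode() {
  local mib=$1
  # Round to the nearest GiB so a "24GB" card reporting about 24564 MiB counts as 24.
  local gb=$(( (mib + 512) / 1024 ))
  if   [ "$gb" -ge 96 ]; then echo "recommended"
  elif [ "$gb" -ge 24 ]; then echo "minimum"
  elif [ "$gb" -ge 10 ]; then echo "low-vram"
  else echo "undocumented"   # below any figure this README mentions
  fi
}

# Hypothetical helper: total memory of the first GPU in MiB
# (requires the NVIDIA driver tools to be installed).
first_gpu_mib() {
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1
}

# Usage: recommend_mode "$(first_gpu_mib)"
```

A result of `minimum` or `low-vram` suggests starting with the single-GPU or low-VRAM commands rather than the multi-GPU one.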

🛠️ Dependencies and Installation

Begin by cloning the repository:

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar.git
cd HunyuanVideo-Avatar

Installation Guide for Linux

We recommend CUDA versions 12.4 or 11.8 for the manual installation.

Conda's installation instructions are available here.

1. Create conda environment

conda create -n HunyuanVideo-Avatar python==3.10.9

2. Activate the environment

conda activate HunyuanVideo-Avatar

3. Install PyTorch and other dependencies using conda

For CUDA 11.8

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia

For CUDA 12.4

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

4. Install pip dependencies

python -m pip install -r requirements.txt

5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)

python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
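
Once the steps above finish, it may be worth confirming that the key packages import cleanly in the active environment. This is a minimal sketch, not a script shipped with the repository: `has_python_module` and `report_modules` are hypothetical names, and the module list in the usage line is assumed from steps 3 and 5 (`flash_attn` is the import name of the flash-attention package).

```shell
# Hypothetical helper: succeed only if the given Python module can be imported.
# Set PYTHON to point at a specific interpreter; it defaults to `python`.
has_python_module() {
  "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

# Hypothetical helper: print one status line per module name passed in.
report_modules() {
  local m
  for m in "$@"; do
    if has_python_module "$m"; then
      echo "ok       $m"
    else
      echo "missing  $m"
    fi
  done
}

# Usage: report_modules torch torchvision torchaudio flash_attn
```

A `missing` line only says the import failed; rerunning the import by hand shows the underlying error.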

If you run into a floating point exception (core dump) on specific GPU types, you may try the following solutions:

Option 1: Make sure you have installed CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 (or simply use our CUDA 12 docker image).

pip install nvidia-cublas-cu12==12.4.5.8
# adjust the path below if your environment uses a different Python version or install prefix
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

Option 2: Force the use of the CUDA 11.8 compiled version of PyTorch and all the other packages.

pip uninstall -r requirements.txt  # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Alternatively, you can use the HunyuanVideo Docker image. Use the following commands to pull and run the docker image.

For CUDA 12.4 (updated to avoid the floating point exception)

docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

For CUDA 11.8

docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

🧱 Download Pretrained Models

Details on downloading the pretrained models are shown here.

🚀 Parallel Inference on Multiple GPUs

For example, to generate a video with 8 GPUs, you can use the following command:

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./
export MODEL_BASE="./weights"
OUTPUT_BASEPATH=./results  # assumed output directory; change as needed
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt

torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH}
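
Before launching a long multi-GPU job, it can help to confirm that the checkpoint and the input CSV are actually in place. The following is a minimal sketch under the assumption that both are ordinary files on disk; `require_files` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical helper: report every path that does not exist and
# return non-zero if at least one is missing.
require_files() {
  local status=0 f
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "missing: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage, with the variables defined for the command above:
#   require_files "${checkpoint_path}" assets/test.csv && torchrun ...
```

Chaining the launch with `&&` means a missing file stops the run immediately instead of failing after the processes have started.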

🔑 Single-GPU Inference

For example, to generate a video with 1 GPU, you can use the following command:

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./

export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-single
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

export DISABLE_SP=1
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH} \
    --use-fp8 \
    --infer-min

Run with very low VRAM

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./

export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-poor

checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

export CPU_OFFLOAD=1
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH} \
    --use-fp8 \
    --cpu-offload \
    --infer-min
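
The multi-GPU, single-GPU, and very-low-VRAM commands share the same sampling flags and differ only in the launcher, one environment variable, the checkpoint file, and a few trailing flags. The sketch below summarizes those differences as read from the commands in this README; the function names and the mode labels (`multi-gpu`, `single-gpu`, `low-vram`) are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: extra command-line flags each mode appends.
mode_flags() {
  case "$1" in
    multi-gpu)  echo "" ;;
    single-gpu) echo "--use-fp8 --infer-min" ;;
    low-vram)   echo "--use-fp8 --cpu-offload --infer-min" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: environment variable each mode exports before launching.
mode_env() {
  case "$1" in
    multi-gpu)  echo "" ;;
    single-gpu) echo "DISABLE_SP=1" ;;
    low-vram)   echo "CPU_OFFLOAD=1" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: checkpoint file name each mode loads.
mode_checkpoint() {
  case "$1" in
    multi-gpu)           echo "mp_rank_00_model_states.pt" ;;
    single-gpu|low-vram) echo "mp_rank_00_model_states_fp8.pt" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For example, `mode_flags low-vram` prints the three flags that distinguish the CPU-offload run from the multi-GPU one.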

Run with 10GB VRAM GPU (TeaCache supported)

Thanks to Wan2GP, HunyuanVideo-Avatar now supports single GPU mode with even lower VRAM (10GB) without quality degradation. Check out this great repo.

Run a Gradio Server

cd HunyuanVideo-Avatar

bash ./scripts/run_gradio.sh

🔗 BibTeX

If you find HunyuanVideo-Avatar useful for your research and applications, please cite using this BibTeX:

@misc{hu2025HunyuanVideo-Avatar,
      title={HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters},
      author={Yi Chen and Sen Liang and Zixiang Zhou and Ziyao Huang and Yifeng Ma and Junshu Tang and Qin Lin and Yuan Zhou and Qinglin Lu},
      year={2025},
      eprint={2505.20156},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/pdf/2505.20156},
}

Acknowledgements

We would like to thank the contributors to the HunyuanVideo, SD3, FLUX, Llama, LLaVA, Xtuner, diffusers and HuggingFace repositories, for their open research and exploration.
