HunyuanVideo-1.5

HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.

You can find all the original HunyuanVideo checkpoints under the Tencent organization.

Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.

The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.

Refer to the Reduce memory usage guide for more details about the various memory saving techniques.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)

# offload idle components to the CPU and decode the latents in tiles to reduce peak memory usage
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

Notes

Refer to the Attention backends guide for more details about using a different backend.

```py
pipe.transformer.set_attention_backend("flash_hub")
```

You can check the default guider configuration using pipe.guider:

```
pipe.guider
ClassifierFreeGuidance {
  "_class_name": "ClassifierFreeGuidance",
  "_diffusers_version": "0.36.0.dev0",
  "enabled": true,
  "guidance_rescale": 0.0,
  "guidance_scale": 6.0,
  "start": 0.0,
  "stop": 1.0,
  "use_original_formulation": false
}

State:
  step: None
  num_inference_steps: None
  timestep: None
  count_prepared: 0
  enabled: True
  num_conditions: 2
```

To update the guider configuration, create a new guider from the existing one with pipe.guider.new(...):

```py
pipe.guider = pipe.guider.new(guidance_scale=5.0)
```

Refer to the Guider documentation for more details.
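
As a recap, the sketch below combines the notes above: it swaps the attention backend and lowers the guidance scale before running generation. It assumes the same checkpoint as the earlier example and that a kernel for the flash_hub backend is available in your environment.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

# use a different attention backend (requires the corresponding kernel to be installed)
pipe.transformer.set_attention_backend("flash_hub")

# lower the guidance scale from the default 6.0
pipe.guider = pipe.guider.new(guidance_scale=5.0)

prompt = "A sailboat drifts across a calm lake at sunset."
video = pipe(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```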

HunyuanVideo15Pipeline

```py
class diffusers.HunyuanVideo15Pipeline(
    text_encoder: Qwen2_5_VLTextModel,
    tokenizer: Qwen2Tokenizer,
    transformer: HunyuanVideo15Transformer3DModel,
    vae: AutoencoderKLHunyuanVideo15,
    scheduler: FlowMatchEulerDiscreteScheduler,
    text_encoder_2: T5EncoderModel,
    tokenizer_2: ByT5Tokenizer,
    guider: ClassifierFreeGuidance,
)
```

Pipeline for text-to-video generation using HunyuanVideo1.5.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
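
Because the pipeline inherits from DiffusionPipeline, the generic methods apply directly. A minimal sketch of saving, reloading, and device placement follows; the checkpoint id and local path are illustrative assumptions.

```py
import torch
from diffusers import HunyuanVideo15Pipeline

pipe = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
)

# save the pipeline locally (illustrative path) and reload it from disk
pipe.save_pretrained("./hunyuanvideo-1.5-local")
pipe = HunyuanVideo15Pipeline.from_pretrained("./hunyuanvideo-1.5-local", torch_dtype=torch.bfloat16)

# move all components to a particular device
pipe.to("cuda")
```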

__call__

```py
__call__(
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str]] = None,
    height: typing.Optional[int] = None,
    width: typing.Optional[int] = None,
    num_frames: int = 121,
    num_inference_steps: int = 50,
    sigmas: typing.List[float] = None,
    num_videos_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'np',
    return_dict: bool = True,
    attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
) → ~HunyuanVideo15PipelineOutput or tuple
```

Returns

~HunyuanVideo15PipelineOutput or tuple

If return_dict is True, HunyuanVideo15PipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.

The call function to the pipeline for generation.

Examples:

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v"
pipe = HunyuanVideo15Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.vae.enable_tiling()
pipe.to("cuda")

output = pipe(
    prompt="A cat walks on the grass, realistic",
    num_inference_steps=50,
).frames[0]
export_to_video(output, "output.mp4", fps=15)
```
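
If return_dict=False, the pipeline returns a plain tuple instead of HunyuanVideo15PipelineOutput; per the Returns section above, the first element is the list of generated videos. A short follow-up sketch, reusing the pipe loaded in the example above:

```py
# index the tuple directly: [0] is the list of generated videos, [0][0] is the first video
frames = pipe(
    prompt="A cat walks on the grass, realistic",
    num_inference_steps=50,
    return_dict=False,
)[0][0]
export_to_video(frames, "output.mp4", fps=15)
```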

encode_prompt

```py
encode_prompt(
    prompt: typing.Union[str, typing.List[str]],
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None,
    batch_size: int = 1,
    num_videos_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
)
```

prepare_cond_latents_and_mask

```py
prepare_cond_latents_and_mask(
    latents,
    dtype: typing.Optional[torch.dtype],
    device: typing.Optional[torch.device],
) → tuple
```

Returns

(cond_latents_concat, mask_concat) - both are zero tensors for t2v.

Prepare conditional latents and mask for t2v generation.

HunyuanVideo15ImageToVideoPipeline

```py
class diffusers.HunyuanVideo15ImageToVideoPipeline(
    text_encoder: Qwen2_5_VLTextModel,
    tokenizer: Qwen2Tokenizer,
    transformer: HunyuanVideo15Transformer3DModel,
    vae: AutoencoderKLHunyuanVideo15,
    scheduler: FlowMatchEulerDiscreteScheduler,
    text_encoder_2: T5EncoderModel,
    tokenizer_2: ByT5Tokenizer,
    guider: ClassifierFreeGuidance,
    image_encoder: SiglipVisionModel,
    feature_extractor: SiglipImageProcessor,
)
```

Pipeline for image-to-video generation using HunyuanVideo1.5.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

```py
__call__(
    image: Image,
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str]] = None,
    num_frames: int = 121,
    num_inference_steps: int = 50,
    sigmas: typing.List[float] = None,
    num_videos_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'np',
    return_dict: bool = True,
    attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
) → ~HunyuanVideo15PipelineOutput or tuple
```

Returns

~HunyuanVideo15PipelineOutput or tuple

If return_dict is True, HunyuanVideo15PipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.

The call function to the pipeline for generation.

Examples:

```py
import torch
from diffusers import HunyuanVideo15ImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "hunyuanvideo-community/HunyuanVideo-1.5-480p_i2v"
pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.vae.enable_tiling()
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)
prompt = (
    "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. "
    "The fluffy-furred feline gazes directly at the camera with a relaxed expression. "
    "Blurred beach scenery forms the background featuring crystal-clear waters, distant "
    "green hills, and a blue sky dotted with white clouds. The cat assumes a naturally "
    "relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot "
    "highlights the feline's intricate details and the refreshing atmosphere of the seaside."
)

output = pipe(
    prompt=prompt,
    image=image,
    num_inference_steps=50,
).frames[0]
export_to_video(output, "output.mp4", fps=24)
```
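
The memory-saving techniques shown for text-to-video apply here as well; a hedged variant that offloads idle components and tiles the VAE instead of moving the whole pipeline to the GPU:

```py
# assumes `pipe` is the HunyuanVideo15ImageToVideoPipeline loaded above;
# use these instead of pipe.to("cuda") to lower peak memory
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
```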

encode_prompt

```py
encode_prompt(
    prompt: typing.Union[str, typing.List[str]],
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None,
    batch_size: int = 1,
    num_videos_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    prompt_embeds_2: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None,
)
```

prepare_cond_latents_and_mask

```py
prepare_cond_latents_and_mask(
    latents: Tensor,
    image: Image,
    batch_size: int,
    height: int,
    width: int,
    dtype: dtype,
    device: device,
) → tuple
```

Returns

(cond_latents_concat, mask_concat) - the image-conditioned latents and the corresponding conditioning mask.

Prepare conditional latents and mask for i2v generation.

HunyuanVideo15PipelineOutput

```py
class diffusers.pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput(
    frames: Tensor
)
```

Output class for HunyuanVideo1.5 pipelines.
