Framepack

FramePack was proposed in Packing Input Frame Context in Next-Frame Prediction Models for Video Generation by Lvmin Zhang and Maneesh Agrawala. The abstract from the paper is:

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
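To see why the context length stays fixed, consider a toy packing schedule. The snippet below is only a conceptual sketch (the function name, token counts, and geometric compression schedule are made up for illustration and are not the model's actual packing rule): older frames contribute geometrically fewer tokens, so the total context converges to a constant regardless of video length.

```python
# A purely illustrative sketch of the FramePack idea (the schedule and numbers are
# hypothetical, not the model's actual ones): each older frame is compressed to fewer
# context tokens, so the total context length stays bounded as the video grows.
def packed_context_length(num_past_frames: int, tokens_per_frame: int = 1536) -> int:
    total = 0
    for age in range(num_past_frames):  # age 0 = most recent frame
        total += tokens_per_frame // (2 ** age)  # hypothetical geometric compression
    return total

# The geometric series keeps the total under 2 * tokens_per_frame no matter how
# many frames have already been generated.
for num_frames in (1, 4, 16, 64):
    print(num_frames, packed_context_length(num_frames))
```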

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Available models

| Model name | Description |
|---|---|
| `lllyasviel/FramePackI2V_HY` | Trained with the "inverted anti-drifting" strategy described in the paper. Inference requires setting `sampling_type="inverted_anti_drifting"` when running the pipeline. |
| `lllyasviel/FramePack_F1_I2V_HY_20250503` | Trained with a novel anti-drifting strategy, but inference is performed with the "vanilla" strategy described in the paper. Inference requires setting `sampling_type="vanilla"` when running the pipeline. |

Usage

Refer to the pipeline documentation for basic usage examples. The following section contains examples of offloading, different sampling methods, quantization, and more.

First and last frame to video

The following example shows how to use FramePack with start and end image controls, using the inverted anti-drifting sampling model.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Offload model components to the CPU between forward passes and decode the VAE in
# tiles to keep peak VRAM usage low.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
output = pipe(
    image=first_image,
    last_image=last_image,
    prompt=prompt,
    height=512,
    width=512,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="inverted_anti_drifting",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

Vanilla sampling

The following example shows how to use FramePack with the F1 model, which was trained with vanilla sampling and a new regulation approach for anti-drifting.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

Group offloading

Group offloading (apply_group_offloading()) provides aggressive memory optimizations for offloading internal parts of any model to the CPU, with possibly no additional overhead to generation time. If you have very low VRAM available, this approach may be suitable for you depending on the amount of CPU RAM available.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Apply leaf-level group offloading to the text encoders and transformer; keep the
# comparatively small image encoder and VAE on the GPU.
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
for module in (pipe.text_encoder, pipe.text_encoder_2, pipe.transformer):
    apply_group_offloading(
        module,
        onload_device,
        offload_device,
        offload_type="leaf_level",
        use_stream=True,
        low_cpu_mem_usage=True,
    )
pipe.image_encoder.to(onload_device)
pipe.vae.to(onload_device)
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
print(f"Max memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f} GB")
export_to_video(output, "output.mp4", fps=30)
```
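Quantization

The transformer is the largest component of the pipeline and can also be quantized to reduce memory further, at some cost to speed and possibly quality. The following is a minimal sketch rather than a tuned recipe: it assumes the bitsandbytes backend is installed and loads the transformer in 4-bit via diffusers' BitsAndBytesConfig, leaving the rest of the pipeline unchanged. Once loaded, the quantized transformer is used exactly like the full-precision one in the examples above.

```python
import torch
from diffusers import (
    BitsAndBytesConfig,
    HunyuanVideoFramepackPipeline,
    HunyuanVideoFramepackTransformer3DModel,
)
from transformers import SiglipImageProcessor, SiglipVisionModel

# Assumes `bitsandbytes` is installed; loads the transformer weights in 4-bit.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
```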

HunyuanVideoFramepackPipeline

class diffusers.HunyuanVideoFramepackPipeline

```
(
    text_encoder: LlamaModel,
    tokenizer: LlamaTokenizerFast,
    transformer: HunyuanVideoFramepackTransformer3DModel,
    vae: AutoencoderKLHunyuanVideo,
    scheduler: FlowMatchEulerDiscreteScheduler,
    text_encoder_2: CLIPTextModel,
    tokenizer_2: CLIPTokenizer,
    image_encoder: SiglipVisionModel,
    feature_extractor: SiglipImageProcessor,
)
```

Pipeline for image-to-video generation using HunyuanVideo with FramePack.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

```
(
    image: Union[PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]],
    last_image: Optional[Union[PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]]] = None,
    prompt: Union[str, List[str]] = None,
    prompt_2: Union[str, List[str]] = None,
    negative_prompt: Union[str, List[str]] = None,
    negative_prompt_2: Union[str, List[str]] = None,
    height: int = 720,
    width: int = 1280,
    num_frames: int = 129,
    latent_window_size: int = 9,
    num_inference_steps: int = 50,
    sigmas: List[float] = None,
    true_cfg_scale: float = 1.0,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: Optional[int] = 1,
    generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
    image_latents: Optional[torch.Tensor] = None,
    last_image_latents: Optional[torch.Tensor] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    pooled_prompt_embeds: Optional[torch.Tensor] = None,
    prompt_attention_mask: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_attention_mask: Optional[torch.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Union[Callable[[int, int, Dict], None], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    prompt_template: Dict[str, Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95},
    max_sequence_length: int = 256,
    sampling_type: FramepackSamplingType = FramepackSamplingType.INVERTED_ANTI_DRIFTING,
) → ~HunyuanVideoFramepackPipelineOutput or tuple
```

Returns

~HunyuanVideoFramepackPipelineOutput or tuple

If return_dict is True, HunyuanVideoFramepackPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated video frames.

The call function to the pipeline for generation.

Examples:

Image-to-Video

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="inverted_anti_drifting",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

First and Last Image-to-Video

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)
pipe.to("cuda")

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
output = pipe(
    image=first_image,
    last_image=last_image,
    prompt=prompt,
    height=512,
    width=512,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="inverted_anti_drifting",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

disable_vae_slicing

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_vae_tiling

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_vae_slicing

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_vae_tiling

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
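As a quick reference, these toggles are called directly on the pipeline instance; a minimal usage sketch (assuming `pipe` is an already-loaded HunyuanVideoFramepackPipeline):

```python
# Assumes `pipe` is an already-loaded HunyuanVideoFramepackPipeline.
pipe.enable_vae_tiling()    # decode in tiles to cut peak memory for large frames
pipe.enable_vae_slicing()   # decode the batch one slice at a time

# ... run the pipeline ...

pipe.disable_vae_tiling()   # return to single-pass (untiled) decoding
pipe.disable_vae_slicing()  # return to decoding the whole batch at once
```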

HunyuanVideoPipelineOutput

class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

( frames: Tensor )

Output class for HunyuanVideo pipelines.
