I2VGen-XL (original) (raw)

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou.

The abstract from the paper is:

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video’s details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at this https URL.

The original codebase can be found here. The model checkpoints can be found here.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the [“Reduce memory usage”] section here.

Sample output with I2VGenXL:

library.
library

Notes

I2VGenXLPipeline

class diffusers.I2VGenXLPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer image_encoder: CLIPVisionModelWithProjection feature_extractor: CLIPImageProcessor unet: I2VGenXLUNet scheduler: DDIMScheduler )

Parameters

Pipeline for image-to-video generation as proposed in I2VGenXL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = 704 width: typing.Optional[int] = 1280 target_fps: typing.Optional[int] = 16 num_frames: int = 16 num_inference_steps: int = 50 guidance_scale: float = 9.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None eta: float = 0.0 num_videos_per_prompt: typing.Optional[int] = 1 decode_chunk_size: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = 1 ) → pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput or tuple

Parameters

If return_dict is True, pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for image-to-video generation with I2VGenXLPipeline.

Examples:

import torch from diffusers import I2VGenXLPipeline from diffusers.utils import export_to_gif, load_image

pipeline = I2VGenXLPipeline.from_pretrained( ... "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16" ... ) pipeline.enable_model_cpu_offload()

image_url = ( ... "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" ... ) image = load_image(image_url).convert("RGB")

prompt = "Papers were floating in the air on a table in the library" negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" generator = torch.manual_seed(8888)

frames = pipeline( ... prompt=prompt, ... image=image, ... num_inference_steps=50, ... negative_prompt=negative_prompt, ... guidance_scale=9.0, ... generator=generator, ... ).frames[0] video_path = export_to_gif(frames, "i2v.gif")

encode_prompt

< source >

( prompt device num_videos_per_prompt negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

I2VGenXLPipelineOutput

class diffusers.pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput

< source >

( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

Parameters

Output class for image-to-video pipeline.

PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape(batch_size, num_frames, channels, height, width)

< > Update on GitHub