Latte

latte text-to-video

Latte: Latent Diffusion Transformer for Video Generation is from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.

The abstract from the paper is:

We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Highlights: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - FaceForensics, SkyTimelapse, UCF101 and Taichi-HD. To prepare and download the datasets for evaluation, please refer to this https URL.

This pipeline was contributed by maxin-cn. The original codebase can be found here. The original weights can be found under hf.co/maxin-cn.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
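As a quick illustration of both ideas, the sketch below swaps the scheduler with from_config and then builds a second pipeline from the first one's already-loaded components. The choice of DPMSolverMultistepScheduler is only an example, and whether a given scheduler suits Latte is exactly the speed/quality tradeoff the Schedulers guide discusses.

import torch
from diffusers import DPMSolverMultistepScheduler, LattePipeline

pipeline = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")

# Swap the scheduler without reloading the rest of the pipeline.
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

# Reuse the already-loaded components in another pipeline instead of loading them twice.
pipeline_2 = LattePipeline(**pipeline.components)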

Inference

Use torch.compile to reduce the inference latency.

First, load the pipeline:

import torch
from diffusers import LattePipeline

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")

Then change the memory layout of the pipeline's transformer and vae components to torch.channels_last:

pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

Finally, compile the components and run inference:

pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
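The returned frames can then be written to disk, for example with the export_to_gif utility from diffusers.utils (export_to_video is the analogous helper for video files):

from diffusers.utils import export_to_gif

export_to_gif(video, "latte.gif")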

The benchmark results on an 80GB A100 machine are:

Without torch.compile(): Average inference time: 16.246 seconds.
With torch.compile(): Average inference time: 14.573 seconds.

Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized LattePipeline for inference with bitsandbytes.

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LatteTransformer3DModel, LattePipeline
from diffusers.utils import export_to_gif
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LatteTransformer3DModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A small cactus with a happy face in the Sahara desert."
video = pipeline(prompt).frames[0]
export_to_gif(video, "latte.gif")

LattePipeline

class diffusers.LattePipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKL transformer: LatteTransformer3DModel scheduler: KarrasDiffusionSchedulers )

Parameters

Pipeline for text-to-video generation using Latte.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
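For instance, the inherited helpers can be used directly on a LattePipeline (a small sketch of the generic DiffusionPipeline methods, not Latte-specific behavior):

import torch
from diffusers import LattePipeline

pipeline = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)  # download / load
pipeline.to("cuda")                          # run on a particular device
pipeline.save_pretrained("./latte-local")    # save all components for a later from_pretrained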

__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None guidance_scale: float = 7.5 num_images_per_prompt: int = 1 video_length: int = 16 height: int = 512 width: int = 512 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: str = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] clean_caption: bool = True mask_feature: bool = True enable_temporal_attentions: bool = True decode_chunk_size: int = 14 ) → LattePipelineOutput or tuple

Parameters

Returns

LattePipelineOutput or tuple

If return_dict is True, LattePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images

Function invoked when calling the pipeline for generation.

Examples:

import torch
from diffusers import LattePipeline
from diffusers.utils import export_to_gif

pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)

pipe.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
videos = pipe(prompt).frames[0]
export_to_gif(videos, "latte.gif")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False mask_feature: bool = True dtype = None )

Parameters

Encodes the prompt into text encoder hidden states.
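In comparable diffusers pipelines, encode_prompt returns a (prompt_embeds, negative_prompt_embeds) pair that can be fed back into the pipeline call through the prompt_embeds and negative_prompt_embeds arguments. Assuming the same contract here, a sketch of precomputing the embeddings once and reusing them could look like this:

import torch
from diffusers import LattePipeline

pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")

# Assumed return value: (prompt_embeds, negative_prompt_embeds), as in similar pipelines.
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
    "A small cactus with a happy face in the Sahara desert.",
    do_classifier_free_guidance=True,
    negative_prompt="",
)

# Reuse the precomputed embeddings instead of re-encoding the prompt on every call.
video = pipe(
    prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
).frames[0]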
