Text-to-video

🧪 This pipeline is for research purposes only.


ModelScope Text-to-Video Technical Report is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang.

The abstract from the paper is:

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.

You can find additional information about Text-to-Video on the project page and in the original codebase, and you can try it out in a demo. Official checkpoints can be found at damo-vilab and cerspense.

Usage example

text-to-video-ms-1.7b

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
    pipe = pipe.to("cuda")

    prompt = "Spiderman is surfing"
    video_frames = pipe(prompt).frames[0]
    video_path = export_to_video(video_frames)
    video_path
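The relationship between frame count and clip length quoted above (16 frames, 2s at 8 fps) is simple arithmetic. The sketch below is only an illustration: the helper names are hypothetical and are not part of Diffusers, and the 8 fps default is taken from the text rather than from any library setting.

```python
def clip_duration_seconds(num_frames: int, fps: int = 8) -> float:
    """Return the playback length of a clip in seconds."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def frames_for_duration(seconds: float, fps: int = 8) -> int:
    """Return the number of frames needed for a clip of the given length."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return round(seconds * fps)


print(clip_duration_seconds(16))  # 2.0
print(frames_for_duration(8))     # 64
```

The second call gives the value you would pass as num_frames for an 8-second clip at the same frame rate.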

Diffusers supports different optimization techniques to improve the latency and memory footprint of a pipeline. Since videos are often more memory-heavy than images, we can enable CPU offloading and VAE slicing to keep the memory footprint at bay.

Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
    # Memory optimizations: CPU offloading and VAE slicing
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_slicing()

    prompt = "Darth Vader surfing a wave"
    video_frames = pipe(prompt, num_frames=64).frames[0]
    video_path = export_to_video(video_frames)
    video_path

It takes just 7 GB of GPU memory to generate the 64 video frames using PyTorch 2.0, "fp16" precision, and the techniques mentioned above.

We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion:

    import torch
    from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
    # Swap in a different scheduler, reusing the existing scheduler's config
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()

    prompt = "Spiderman is surfing"
    video_frames = pipe(prompt, num_inference_steps=25).frames[0]
    video_path = export_to_video(video_frames)
    video_path


cerspense/zeroscope_v2_576w & cerspense/zeroscope_v2_XL

The Zeroscope models are watermark-free and were trained on specific resolutions such as 576x320 and 1024x576. First generate a video with the lower-resolution checkpoint cerspense/zeroscope_v2_576w using TextToVideoSDPipeline, then upscale it with VideoToVideoSDPipeline and cerspense/zeroscope_v2_XL.

    import torch
    from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
    from diffusers.utils import export_to_video
    from PIL import Image

    pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()

    # Memory optimizations
    pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
    pipe.enable_vae_slicing()

    prompt = "Darth Vader surfing a wave"
    video_frames = pipe(prompt, num_frames=24).frames[0]
    video_path = export_to_video(video_frames)
    video_path

Now the video can be upscaled:

    pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()

    # Memory optimizations
    pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
    pipe.enable_vae_slicing()

    # Resize the low-resolution frames to the upscaler's resolution
    video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

    video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
    video_path = export_to_video(video_frames)
    video_path
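The strength argument controls how much of the denoising schedule is applied to the input video. The sketch below assumes that VideoToVideoSDPipeline follows the same convention as the Stable Diffusion image-to-image pipeline, where only the last `int(num_inference_steps * strength)` steps are run. The helper name is hypothetical, and the exact behavior should be checked against the pipeline source.

```python
def denoising_window(num_inference_steps: int, strength: float) -> tuple:
    """Return (first_step_index, steps_actually_run) for a given strength.

    Assumes the img2img-style convention: a strength of 1.0 runs the full
    schedule, while a strength of 0.0 leaves the input untouched.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step = max(num_inference_steps - steps_run, 0)
    return first_step, steps_run


print(denoising_window(50, 0.6))  # (20, 30)
```

Under that assumption, the default of 50 steps with strength=0.6 skips the first 20 steps and runs the remaining 30, which preserves more of the low-resolution video's structure than a higher strength would.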

Here are some sample outputs:

Darth Vader surfing in waves.

Tips

Video generation is memory-intensive. One way to reduce memory usage is to call enable_forward_chunking on the pipeline's UNet so that the entire feed-forward layer is not run at once. Processing the input in chunks in a loop is more memory-efficient.
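The idea behind forward chunking can be illustrated without any model. The sketch below is conceptual and is not the Diffusers implementation: `feed_forward` is a hypothetical stand-in for a layer that treats each row independently, which is what makes it safe to process the rows a few at a time.

```python
def feed_forward(rows):
    """Stand-in for a feed-forward layer that acts on each row independently."""
    return [[2.0 * x + 1.0 for x in row] for row in rows]


def chunked_feed_forward(rows, chunk_size):
    """Apply feed_forward to chunk_size rows at a time and join the results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    output = []
    for start in range(0, len(rows), chunk_size):
        # Only chunk_size rows are passed through the layer at any moment
        output.extend(feed_forward(rows[start:start + chunk_size]))
    return output


frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
assert chunked_feed_forward(frames, chunk_size=1) == feed_forward(frames)
```

The result is identical to a single pass. Only the size of the intermediate activations held at once changes, usually at the cost of some extra run time.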

Check out the Text or image-to-video guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

TextToVideoSDPipeline

Pipeline for text-to-video generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__


( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_frames: int = 16 num_inference_steps: int = 50 guidance_scale: float = 9.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → TextToVideoSDPipelineOutput or tuple

Parameters

If return_dict is True, TextToVideoSDPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.
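Because the return type depends on return_dict, code that accepts either form has to branch on it. The sketch below uses a hypothetical stand-in class (`FakeVideoOutput`) in place of TextToVideoSDPipelineOutput so that it runs without loading a model. It assumes, as stated above, that the frames are the first element of the tuple.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FakeVideoOutput:
    """Hypothetical stand-in for TextToVideoSDPipelineOutput."""
    frames: List[List[Any]]


def first_video(result):
    """Return the frames of the first generated video from either return form."""
    if isinstance(result, tuple):
        frames = result[0]       # return_dict=False: frames are the first element
    else:
        frames = result.frames   # return_dict=True: frames are an attribute
    return frames[0]


batch = [["frame0", "frame1"], ["other0", "other1"]]
print(first_video(FakeVideoOutput(frames=batch)))  # ['frame0', 'frame1']
print(first_video((batch,)))                       # ['frame0', 'frame1']
```

This mirrors the `.frames[0]` indexing used in the examples on this page, which selects the first video in the batch.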

Examples:

    import torch
    from diffusers import TextToVideoSDPipeline
    from diffusers.utils import export_to_video

    pipe = TextToVideoSDPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.enable_model_cpu_offload()

    prompt = "Spiderman is surfing"
    video_frames = pipe(prompt).frames[0]
    video_path = export_to_video(video_frames)
    video_path

encode_prompt


( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.
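The do_classifier_free_guidance flag relates to the guidance_scale parameter of __call__. As a rough illustration, the sketch below shows the standard classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. It assumes the pipeline applies this common formula, uses plain Python lists in place of tensors, and the function name is hypothetical.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale=9.0):
    """Combine two noise predictions with classifier-free guidance.

    A guidance_scale of 1.0 returns the text-conditioned prediction unchanged.
    Larger values push the result further in the direction of the prompt.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


print(apply_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=9.0))  # [9.0, 1.0]
```

Where the two predictions agree (the second element), guidance has no effect. Where they differ, the difference is amplified by the scale.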

VideoToVideoSDPipeline

Pipeline for text-guided video-to-video generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__


( prompt: typing.Union[str, typing.List[str]] = None video: typing.Union[typing.List[numpy.ndarray], torch.Tensor] = None strength: float = 0.6 num_inference_steps: int = 50 guidance_scale: float = 15.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → TextToVideoSDPipelineOutput or tuple

Parameters

If return_dict is True, TextToVideoSDPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.
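The callback and callback_steps parameters in the signature allow progress to be observed during generation. The sketch below shows a callable matching the (step, timestep, latents) signature, together with a hypothetical `simulate_loop` that imitates how a denoising loop might invoke it. The "every callback_steps iterations" cadence is an assumption about the loop, not something confirmed on this page.

```python
class StepRecorder:
    """Callable matching the (step, timestep, latents) callback signature."""

    def __init__(self):
        self.calls = []

    def __call__(self, step, timestep, latents):
        # Record progress only; the latents are not modified
        self.calls.append((step, timestep))


def simulate_loop(timesteps, callback=None, callback_steps=1):
    """Hypothetical stand-in for a denoising loop that fires the callback."""
    for i, t in enumerate(timesteps):
        latents = None  # placeholder for the tensor a real pipeline would pass
        if callback is not None and i % callback_steps == 0:
            callback(i, t, latents)


recorder = StepRecorder()
simulate_loop([900, 600, 300, 0], callback=recorder, callback_steps=2)
print(recorder.calls)  # [(0, 900), (2, 300)]
```

With a real pipeline, an object like `recorder` would be passed as the callback argument alongside callback_steps.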

Examples:

    import torch
    from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
    from diffusers.utils import export_to_video
    from PIL import Image  # required for the resize step below

    pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.to("cuda")

    prompt = "spiderman running in the desert"
    video_frames = pipe(prompt, num_inference_steps=40, height=320, width=576, num_frames=24).frames[0]

    # Save the low-resolution video
    video_path = export_to_video(video_frames, output_video_path="./video_576_spiderman.mp4")

    # Move the first pipeline off the GPU before loading the upscaler
    pipe.to("cpu")

    pipe = DiffusionPipeline.from_pretrained(
        "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16, revision="refs/pr/15"
    )
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()

    # Decode the VAE output one slice at a time to save memory
    pipe.vae.enable_slicing()

    # Resize the low-resolution frames to the upscaler's resolution
    video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

    # Upscale and save the high-resolution video
    video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
    video_path = export_to_video(video_frames, output_video_path="./video_1024_spiderman.mp4")
    video_path

encode_prompt


( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

TextToVideoSDPipelineOutput

class diffusers.pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput


( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

Parameters

Output class for text-to-video pipelines.

PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
