Stable Video Diffusion

Stable Video Diffusion (SVD) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.

This guide will show you how to use SVD to generate short videos from images.

Before you begin, make sure you have the following libraries installed:

```py
!pip install -q -U diffusers transformers accelerate
```

There are two variants of this model, SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames.

You’ll use the SVD-XT checkpoint for this guide.
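If you'd rather generate the shorter 14-frame clips, only the checkpoint id changes. The snippet below is a minimal sketch, assuming you want the stabilityai/stable-video-diffusion-img2vid checkpoint with its fp16 variant.

```py
import torch
from diffusers import StableVideoDiffusionPipeline

# Base SVD checkpoint (14 frames); the rest of the guide uses the SVD-XT checkpoint (25 frames)
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",  # assumed repository id for the 14-frame variant
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()
```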

```py
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image and resize it to the expected resolution
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```

"source image of a rocket"

"generated video from source image"

torch.compile

You can gain a 20-25% speedup at the expense of slightly increased memory by compiling the UNet.
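A minimal sketch of what that can look like is below. It assumes the pipeline from the example above and keeps the model on the GPU, since torch.compile is typically used without CPU offloading.

```py
# Keep the pipeline on the GPU instead of enabling CPU offloading
pipe.to("cuda")

# Compile the UNet; the first call is slower while compilation happens
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```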

Reduce memory usage

Video generation is very memory intensive because you’re essentially generating num_frames all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for a lower memory requirement:

- enable model offloading: each component of the pipeline is offloaded to the CPU once it’s not needed anymore.
- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
- reduce decode_chunk_size: the VAE decodes frames in chunks instead of decoding them all together. Setting decode_chunk_size=1 decodes one frame at a time and uses the least amount of memory (tune this value based on your GPU memory), but the video might have some flickering.

Using all these tricks together should lower the memory requirement to less than 8GB VRAM.
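A minimal sketch combining these options is shown below. It assumes the pipe, image, and generator objects from the example above, and the decode_chunk_size value is just an illustration you can tune for your GPU.

```py
pipe.enable_model_cpu_offload()       # offload pipeline components to the CPU when they're not in use
pipe.unet.enable_forward_chunking()   # run the feed-forward layers in a loop instead of one large batch

# Decode only a couple of frames at a time; smaller values use less memory but may cause flickering
frames = pipe(image, decode_chunk_size=2, generator=generator).frames[0]
```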

Micro-conditioning

Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:

- fps: the frames per second of the generated video.
- motion_bucket_id: the motion bucket id to use for the generated video. Increasing the motion bucket id increases the motion of the generated video.
- noise_aug_strength: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image; increasing it also increases the motion of the generated video.

For example, to generate a video with more motion, use the motion_bucket_id and noise_aug_strength micro-conditioning parameters:

```py
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
