[refactor] CogVideoX followups + tiled decoding support by a-r-r-o-w · Pull Request #9150 · huggingface/diffusers
What does this PR do?
- CogVideoX followups from #9082 (Add CogVideoX text-to-video generation model)
- Support for tiled decoding

Code:
```python
import gc

import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler
from diffusers.utils import export_to_video


def reset_memory():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_accumulated_memory_stats()
    torch.cuda.reset_peak_memory_stats()


def print_memory():
    memory = round(torch.cuda.memory_allocated() / 1024**3, 2)
    max_memory = round(torch.cuda.max_memory_allocated() / 1024**3, 2)
    max_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
    print(f"{memory=} GB")
    print(f"{max_memory=} GB")
    print(f"{max_reserved=} GB")


prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
pipe.enable_model_cpu_offload()

# Run 1: normal VAE decoding
reset_memory()
video = pipe(
    prompt=prompt,
    num_frames=48,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator().manual_seed(42),
).frames[0]
print_memory()
export_to_video(video, "output.mp4", fps=8)

# Run 2: tiled VAE decoding
pipe.vae.enable_tiling()
reset_memory()
video = pipe(
    prompt=prompt,
    num_frames=48,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator().manual_seed(42),
).frames[0]
print_memory()
export_to_video(video, "output_tiling.mp4", fps=8)
```
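For context on what `enable_tiling` does: instead of decoding the whole latent video at once, the VAE decodes overlapping spatial tiles one at a time and blends the overlapping edges so no seams appear in the output frames. Below is a minimal sketch of the horizontal blend step, with illustrative names (`blend_h`, `left`, `right` are hypothetical here; the actual implementation lives in `AutoencoderKLCogVideoX`):

```python
import torch

# Illustrative sketch: linearly blend the overlapping edge between two
# decoded tiles so the seam is invisible. Not the exact diffusers internals.
def blend_h(left: torch.Tensor, right: torch.Tensor, blend_extent: int) -> torch.Tensor:
    # Interpolate the last `blend_extent` columns of the left tile into the
    # first `blend_extent` columns of the right tile.
    blend_extent = min(left.shape[-1], right.shape[-1], blend_extent)
    for x in range(blend_extent):
        weight = x / blend_extent
        right[..., x] = left[..., -blend_extent + x] * (1 - weight) + right[..., x] * weight
    return right
```

Because only one tile is resident on the GPU at a time, peak decoding memory scales with the tile size rather than the full frame size, which is where the savings in the numbers below come from.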
Memory usage:
```
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 4.51it/s]
Loading pipeline components...:  40%|████      | 2/5 [00:00<00:00, 3.29it/s]
The config attributes {'mid_block_add_attention': True, 'sample_size': 256} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 100%|██████████| 5/5 [00:01<00:00, 4.41it/s]
100%|██████████| 50/50 [02:44<00:00, 3.28s/it]
```
CPU offloading, normal VAE decoding:

```
memory=0.01 GB
max_memory=12.39 GB
max_reserved=20.39 GB
```

CPU offloading, tiled VAE decoding:

```
100%|██████████| 50/50 [02:35<00:00, 3.11s/it]
memory=0.01 GB
max_memory=10.81 GB
max_reserved=10.83 GB
```
Results:
| Normal | Tiled |
| --- | --- |
| output.webm | output_tiling.webm |
Note that you will need to install accelerate:main from source for this to work and to reproduce the numbers above. If you're using the stable version of accelerate, you might see an additional 5-7 GB of memory usage.
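Installing accelerate from source is typically done with:

```
pip install git+https://github.com/huggingface/accelerate
```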
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.