[core] Support VideoToVideo with CogVideoX by a-r-r-o-w · Pull Request #9333 · huggingface/diffusers

num_frames can't be defined after the latest patch.

Yes, this parameter was removed to keep the API consistent with our other video-to-video pipelines. Since the pipeline takes an input video, it would be confusing if the length of the list of video frames did not match num_frames, and in that case the pipeline would simply raise an error. So it is expected that the input video already has the correct number of frames.
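
For reference, here is a minimal sketch of how the new video-to-video path might be invoked; the `load_video` helper plus the `video` and `strength` arguments are assumptions based on the other diffusers video-to-video pipelines, and the input path is hypothetical:

```python
import torch
from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Assumed pipeline class / argument names, mirroring other video-to-video pipelines.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# The input video itself determines the number of output frames,
# which is why a separate `num_frames` argument is no longer accepted.
input_video = load_video("input.mp4")  # hypothetical input path

video = pipe(
    prompt="A panda strumming a tiny guitar in a bamboo forest",
    video=input_video,
    strength=0.8,
    guidance_scale=6,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "edited.mp4", fps=8)
```

The number of generated frames is taken from the length of the input video, so no num_frames argument is passed.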

Are there any quantized checkpoints for CogVideoX? And how do you load them?

We're currently working on hosting the quantized checkpoints. A pre-quantized checkpoint isn't necessary though, since quantization can also be done on the fly. You could follow the torchao quantized-inference guides here for more detailed examples. A concise example would look something like this:

```python
import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int4_weight,
    int8_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Either "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"

# 1. Quantize models
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# 2. Create pipeline
pipe = CogVideoXPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

# 3. Inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    guidance_scale=6,
    use_dynamic_cfg=True,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

You would need torchao installed from source and a PyTorch nightly build for this to work until the next release. A full table of benchmarks is available here.