[training] CogVideoX Lora by a-r-r-o-w · Pull Request #9302 · huggingface/diffusers
What does this PR do?
Adds LoRA training and loading support for CogVideoX.
This is a rough draft and incomplete conversion from CogVideoX SAT.
```bash
#!/bin/bash

export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1

GPU_IDS="3"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir \
  --instance_data_root \
  --caption_column \
  --video_column \
  --id_token \
  --validation_prompt " A black and white animated scene unfolds, featuring a bulldog in overalls and a hat, standing on a ship's deck. The bulldog assumes various poses, then walks towards a dockside with two ducks and a cow. A wooden platform reads 'PODUNK LANDING,' while a building marked 'BOAT TICKETS' and scattered barrels hint at a destination. The bulldog and ducks move purposefully, possibly heading towards a food stand or boating services, amidst a monochromatic backdrop with no noticeable changes in environment or lighting:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 40 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb
```
The above assumes a 50-video dataset (2000 training steps in total).
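As a rough illustration of what `--rank 64 --lora_alpha 64` control (a minimal numpy sketch of the standard LoRA formulation, not the training script's actual implementation): LoRA learns a low-rank update `B @ A` that is added to the frozen weight, scaled by `lora_alpha / rank`, so rank 64 with alpha 64 gives an effective scale of 1.0.

```python
import numpy as np

# Minimal sketch of a LoRA weight update (illustrative only).
# A frozen weight W of shape (out, in) gets a low-rank update B @ A,
# scaled by lora_alpha / rank. With --rank 64 --lora_alpha 64 the
# scale is exactly 1.0.
rng = np.random.default_rng(0)
d_out, d_in, rank, lora_alpha = 128, 128, 64, 64

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable "down" projection
B = np.zeros((d_out, rank))                   # trainable "up" projection, init 0

scale = lora_alpha / rank                     # 64 / 64 = 1.0
W_eff = W + scale * (B @ A)                   # effective weight at inference

# With B initialized to zero, the adapted model starts identical to the base.
assert np.allclose(W_eff, W)
```

Choosing `lora_alpha` equal to `rank` keeps the update at unit scale, which is why the two flags are set to the same value here.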
TODO:
- Implement tiled encoding (currently OOMs for CogVideoX-5B but works for CogVideoX-2B)
- Test with Prodigy optimizer
- Determine the best data preparation format and clean up the process
- Prepare dummy test data repository for others to test (Edit: Available internally on our org. No public release from diffusers team on this at the moment)
- Remove unnecessary parameters
- Verify outputs against the SAT implementation (Edit: outputs don't match 1:1, possibly for many reasons)
- Add LoRA tests
- Docs
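The tiled-encoding TODO can be sketched as follows (a hypothetical numpy illustration of the general idea, not the diffusers VAE code): instead of encoding a full frame at once, split it into spatial tiles, encode each tile separately, and stitch the results back together, trading a little compute for a much lower peak memory. The `encode` function below is a stand-in placeholder.

```python
import numpy as np

# Hypothetical sketch of tiled encoding (not the actual diffusers VAE):
# process a frame tile-by-tile so peak memory scales with the tile size
# rather than the full frame. `encode` is a stand-in elementwise op; a
# real VAE needs overlapping tiles plus blending to hide seams.
def encode(x: np.ndarray) -> np.ndarray:
    return x * 0.5 + 1.0  # placeholder for an expensive encoder


def tiled_encode(frame: np.ndarray, tile: int = 240) -> np.ndarray:
    h, w = frame.shape
    out = np.empty_like(frame)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y + tile, x:x + tile] = encode(frame[y:y + tile, x:x + tile])
    return out


frame = np.random.default_rng(1).standard_normal((480, 720))
# For a purely per-pixel op, tiled and full-frame results match exactly.
assert np.allclose(tiled_encode(frame), encode(frame))
```

With a real convolutional encoder the tiles would need to overlap and be blended at the seams, which is what the existing `--enable_tiling` VAE option does for decoding.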
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.