[training] CogVideoX Lora by a-r-r-o-w · Pull Request #9302 · huggingface/diffusers
What does this PR do?
Adds LoRA training and loading support for CogVideoX.
This is a rough draft and incomplete conversion from CogVideoX SAT.
```bash
#!/bin/bash

export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1

GPU_IDS="3"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir \
  --instance_data_root \
  --caption_column \
  --video_column \
  --id_token \
  --validation_prompt " A black and white animated scene unfolds, featuring a bulldog in overalls and a hat, standing on a ship's deck. The bulldog assumes various poses, then walks towards a dockside with two ducks and a cow. A wooden platform reads 'PODUNK LANDING,' while a building marked 'BOAT TICKETS' and scattered barrels hint at a destination. The bulldog and ducks move purposefully, possibly heading towards a food stand or boating services, amidst a monochromatic backdrop with no noticeable changes in environment or lighting:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 40 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb
```
The above assumes a 50-video dataset (2000 training steps in total).
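As a rough illustration of what `--rank 64 --lora_alpha 64` control (a minimal numpy sketch of the standard LoRA formulation, not the training script's actual implementation): LoRA learns a low-rank update `B @ A` that is added to the frozen weight, scaled by `lora_alpha / rank`, so rank 64 with alpha 64 gives an effective scale of 1.0.

```python
import numpy as np

# Minimal sketch of a LoRA weight update (illustrative only).
# A frozen weight W of shape (out, in) gets a low-rank update B @ A,
# scaled by lora_alpha / rank. With --rank 64 --lora_alpha 64 the
# scale is exactly 1.0.
rng = np.random.default_rng(0)
d_out, d_in, rank, lora_alpha = 128, 128, 64, 64

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable "down" projection
B = np.zeros((d_out, rank))                   # trainable "up" projection, init 0

scale = lora_alpha / rank                     # 64 / 64 = 1.0
W_eff = W + scale * (B @ A)                   # effective weight at inference

# With B initialized to zero, the adapted model starts identical to the base.
assert np.allclose(W_eff, W)
```

Choosing `lora_alpha` equal to `rank` keeps the update at unit scale, which is why the two flags are set to the same value here.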
TODO:
- Implement tiled encoding (currently OOMs for CogVideoX-5B but works for CogVideoX-2B)
- Test with Prodigy optimizer
- Determine the best data preparation format and clean up the process
- Prepare dummy test data repository for others to test (Edit: Available internally on our org. No public release from diffusers team on this at the moment)
- Remove unnecessary parameters
- Verify outputs against the SAT implementation (Edit: outputs don't match 1:1, possibly for many reasons)
- Add LoRA tests
- Docs
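The tiled-encoding TODO can be sketched as follows (a hypothetical numpy illustration of the general idea, not the diffusers VAE code): instead of encoding a full frame at once, split it into spatial tiles, encode each tile separately, and stitch the results back together, trading a little compute for a much lower peak memory. The `encode` function below is a stand-in placeholder.

```python
import numpy as np

# Hypothetical sketch of tiled encoding (not the actual diffusers VAE):
# process a frame tile-by-tile so peak memory scales with the tile size
# rather than the full frame. `encode` is a stand-in elementwise op; a
# real VAE needs overlapping tiles plus blending to hide seams.
def encode(x: np.ndarray) -> np.ndarray:
    return x * 0.5 + 1.0  # placeholder for an expensive encoder


def tiled_encode(frame: np.ndarray, tile: int = 240) -> np.ndarray:
    h, w = frame.shape
    out = np.empty_like(frame)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y + tile, x:x + tile] = encode(frame[y:y + tile, x:x + tile])
    return out


frame = np.random.default_rng(1).standard_normal((480, 720))
# For a purely per-pixel op, tiled and full-frame results match exactly.
assert np.allclose(tiled_encode(frame), encode(frame))
```

With a real convolutional encoder the tiles would need to overlap and be blended at the seams, which is what the existing `--enable_tiling` VAE option does for decoding.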
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.