[refactor] move positional embeddings to patch embed layer for CogVideoX by a-r-r-o-w · Pull Request #9263 · huggingface/diffusers
Conversation
What does this PR do?
- removes the 49-frame limit, since CogVideoX-5B generalizes better than 2B and is able to generate more frames
- moves the positional embedding creation logic to the patch embed layer (similar to how rotary embeddings are handled)
As a side effect of this PR, one can generate more than 49 frames with CogVideoX-2b, which will produce bad results, but we can add a recommendation about this.
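The mechanics can be sketched in a few lines: sincos positional embeddings can be generated on the fly for whatever frame count the user requests, so nothing needs to be precomputed against a fixed 49-frame cap. This is a minimal numpy sketch, not the diffusers implementation; `sincos_pos_embed` and the 64-dim example are illustrative only. (CogVideoX's VAE compresses time by 4, so 49 pixel frames map to 13 latent frames.)

```python
import numpy as np

def sincos_pos_embed(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal embeddings, built on the fly for any sequence length."""
    assert dim % 2 == 0
    positions = np.arange(num_positions)[:, None]                  # (N, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))  # (dim/2,)
    angles = positions * freqs[None, :]                            # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (N, dim)

# Temporal VAE compression of 4: latent_frames = (num_frames - 1) // 4 + 1
latent_frames = (89 - 1) // 4 + 1  # 23 latent frames for an 89-frame request
emb = sincos_pos_embed(latent_frames, 64)
print(emb.shape)  # (23, 64)
```

Because the embedding is a pure function of the requested length, the only remaining limit is how far the model generalizes, not a hardcoded buffer size.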
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
a-r-r-o-w changed the title
[refactor] removes the frame limititation in CogVideoX [refactor] removes the frame limitation in CogVideoX
a-r-r-o-w changed the title
[refactor] removes the frame limitation in CogVideoX [refactor] remove the frame limitation in CogVideoX
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @zRzRzRzRzRzRzR for visibility
The changes here should not affect the quality of final outputs in any way
- The results before and after this PR for CogVideoX-2B still match 1:1 as expected.
- This PR also should not affect CogVideoX-5B.
I've verified these, but a second look from others would be appreciated.
Pushing num_frames to the maximum (89 frames), beyond which results start to get much worse, here are the results:
cogvideox-2b-89frames.webm | cogvideox-5b-89frames.webm
Is the resolution hard-coded for this model? (720x480) And everything else will break?
You can generate at other resolutions as well, just like any other model. The recommendation is 720x480 because that's the resolution of most of the training dataset, I believe. I haven't experimented extensively yet, but I've gotten good results at 512x512, 512x768, 768x768, and 640x640 with the 5B model. I have not tried different resolutions on the 2B model, but if you face any issues, we could work together on fixing it.
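As a rough sanity check when trying non-default resolutions: the VAE downsamples height/width by 8 and the transformer patchifies with patch size 2, so requested dimensions should be divisible by 16 to map cleanly onto the latent patch grid. The helper below is a hypothetical illustration, not a diffusers API; the divisibility requirement is an assumption derived from those two factors.

```python
# Hypothetical helper: VAE spatial downsample (8) x transformer patch size (2)
# means height and width should be divisible by 16 to patchify cleanly.
def is_patch_aligned(height: int, width: int, multiple: int = 8 * 2) -> bool:
    return height % multiple == 0 and width % multiple == 0

for h, w in [(480, 720), (512, 512), (768, 768), (640, 640), (500, 500)]:
    print(h, w, is_patch_aligned(h, w))  # all True except (500, 500)
```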
The 5B model is yet to be officially released. It will be out in a day or two. The 2B model does not use rotary positional embeddings and uses basic positional embeddings, so the hardcoded values that you see are irrelevant there.
For the rotary embeds, the embedding grid needs to be appropriately resized/rescaled based on user input. It's generally done by using a fixed base latent resolution (usually chosen to be the resolution that is most frequent in the dataset). In this case, that happens to be 720 // (8 * 2) x 480 // (8 * 2). You can see another example of this here for HunyuanDiT.
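Concretely, the base latent grid works out as follows. This is a sketch under the assumption of an 8x VAE spatial downsample and patch size 2; `latent_grid` is an illustrative name, not a diffusers function:

```python
VAE_SCALE, PATCH_SIZE = 8, 2

def latent_grid(height: int, width: int) -> tuple:
    """Size of the patchified latent grid for a given pixel resolution."""
    return height // (VAE_SCALE * PATCH_SIZE), width // (VAE_SCALE * PATCH_SIZE)

print(latent_grid(480, 720))  # (30, 45): the fixed base grid for RoPE
print(latent_grid(512, 512))  # (32, 32): a user grid; the RoPE frequencies are
                              # computed over a crop/rescale of the 30x45 base
```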
@a-r-r-o-w Btw, I'm getting this message when running 2b via Diffusers: The config attributes {'mid_block_add_attention': True, 'sample_size': 256} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
@a-r-r-o-w This is 2b doing 512x512:
-1872826932_A_rollercoaster_car_is_launched_from_a.mp4
Ah yes, these parameters were removed from the implementation since they were not relevant/correct. It's a harmless warning. I'll open a PR to the 2B repo in a bit to remove these from the VAE config.json
@a-r-r-o-w This is 2b doing 512x512:
Keep in mind that normal positional encoding, as used in 2B, is not great at generalizing to different resolutions when the training data does not include an ample amount of multi-resolution examples. RoPE, as used in 5B, can somewhat deal with these issues with fewer examples; however, further multi-resolution finetuning is required in Cog-2B/5B to properly address these concerns (something that the community will hopefully pick up fairly soon). Here's a 5B 512x512 result:
cog-5b-512.webm
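One common mitigation for the resolution sensitivity of fixed/learned positional encodings is to resample the embedding grid to the target size, as is often done for ViT-style models. The sketch below is purely illustrative of that general technique and is not what CogVideoX does; it uses nearest-neighbor resampling for brevity, where real implementations typically use bilinear or bicubic interpolation.

```python
import numpy as np

def resample_pos_grid(pos: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Resample an (H, W, C) positional-embedding grid to (new_h, new_w, C).

    Nearest-neighbor for brevity; real code would interpolate bilinearly.
    """
    h, w, _ = pos.shape
    ys = np.round(np.linspace(0, h - 1, new_h)).astype(int)
    xs = np.round(np.linspace(0, w - 1, new_w)).astype(int)
    return pos[ys][:, xs]

grid = np.random.rand(30, 45, 64)          # base-resolution grid (30x45)
resized = resample_pos_grid(grid, 32, 32)  # user grid (32x32)
print(resized.shape)  # (32, 32, 64)
```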
@a-r-r-o-w PM me, if you need someone to test Diffusers 5b locally on a RTX 4090 - on Windows - before the release.
thanks!
I'm aware there is some discrepancy between the rotary and sincos embeddings, but moving it to the pipeline will create more discrepancy with the rest of the codebase, so I think we can either leave it as it is for now, or maybe consider adding a positional embedding in the transformer for the rotary embedding too (and we can apply the same pattern across all relevant pipelines in a follow-up PR).
let me know what you think!
I left a comment!
a-r-r-o-w changed the title
[refactor] remove the frame limitation in CogVideoX [refactor] move positional embeddings to patch embed layer for CogVideoX
yiyixuxu (Collaborator) left a comment
thanks for the PR!
cc @zRzRzRzRzRzRzR for visibility. Should only impact CogVideoX-2b in terms of implementation, but should have no effect on final outputs.
a-r-r-o-w deleted the cogvideox/pipeline-followups branch
a-r-r-o-w added a commit that referenced this pull request
…eoX (#9263)
remove frame limit in cogvideox
remove debug prints
Update src/diffusers/models/transformers/cogvideox_transformer_3d.py
revert pipeline; remove frame limitation
revert transformer changes
address review comments
add error message
apply suggestions from review
sayakpaul pushed a commit that referenced this pull request