Cogvideox-5B Model adapter change by zRzRzRzRzRzRzR · Pull Request #9203 · huggingface/diffusers
@yiyixuxu @zRzRzRzRzRzRzR Would it be okay to remove the following limit in the follow-up PR?
```python
if num_frames > 49:
    raise ValueError(
        "The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation."
    )
```
I tested the 5B model with generations of 57, 65, and 73 frames and they all turn out well - maybe the RoPE embeddings help the model generalize better. For the 2B model, the outputs are bad for the above values, probably due to limitations of the normal positional embeddings. We could add a recommendation in the docs mentioning that 49 frames or fewer is the good setting for the 2B model.
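For reference, a minimal sketch of how the longer generations were tested; the model id, prompt, and parameter values here are assumptions for illustration, not the exact settings used:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# With the 49-frame guard removed, longer generations can be requested directly.
video = pipe(
    prompt="A panda playing guitar in a bamboo forest",
    num_frames=57,  # also tried 65 and 73
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```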
In the refactor, I'd also like to create the normal positional embeddings in the pipeline instead of the transformer, similar to the RoPE embeds, because it does not make sense to create them for the 5B model (currently they are created and saved to the module with a call to `register_buffer` regardless of whether the model is 2B or 5B).
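A toy sketch of the idea (not the actual module code; class and argument names are made up for illustration): the current pattern registers the sinusoidal buffer unconditionally, while the proposed pattern lets the pipeline compute and pass the embeddings in, so the RoPE-based 5B model never materializes them.

```python
import torch
import torch.nn as nn


class ToyTransformer(nn.Module):
    def __init__(self, embed_dim: int, max_seq_len: int, use_rotary: bool):
        super().__init__()
        self.use_rotary = use_rotary
        # Current pattern (simplified): the buffer is always created, even when unused.
        pos = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)
        dim = torch.arange(embed_dim, dtype=torch.float32).unsqueeze(0)
        pe = torch.sin(pos / 10000 ** (dim / embed_dim))  # placeholder sinusoid
        self.register_buffer("pos_embedding", pe, persistent=False)

    def forward(self, hidden_states, pos_embeds=None):
        # Proposed pattern: accept embeddings computed in the pipeline,
        # falling back to the buffer only for the non-rotary (2B-style) model.
        if pos_embeds is not None:
            hidden_states = hidden_states + pos_embeds
        elif not self.use_rotary:
            hidden_states = hidden_states + self.pos_embedding[: hidden_states.shape[1]]
        return hidden_states
```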