I2VGenXLPipeline - missing components? · huggingface/diffusers · Discussion #7952

Hi everyone,

I was playing with I2VGenXLPipeline (here is the corresponding Hugging Face implementation). I noticed some discrepancies between the method described in the paper and this implementation. Can someone help me check whether my understanding is correct?

In the paper, they have the following diagram:

According to this diagram, the base stage has D.Enc. and G.Enc.; however, I only see CLIP in the implementation here.
Similarly, in the implementation, text embeddings are passed to the LDM of the base stage (this line), whereas according to the diagram, text is passed only in the refinement stage.
The refinement stage in the diagram contains an LDM, but in the implementation the low-dimensional video latent is passed to the VAE decoder to generate the high-dimensional video; I do not see any reverse diffusion process there.

Can anyone tell me whether my reading of this code is correct? I would also like to access the intermediate low-dimensional video that comes out at the end of the base stage, but I don't know how to get it. Any pointers on how to access that representation would be appreciated.
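
For reference, here is a minimal sketch of what I have in mind for getting the latents. It assumes the pipeline accepts `output_type="latent"` and then returns the denoised latents in the `.frames` field of its output instead of decoded frames; I have not confirmed that this matches the "base stage" output in the paper's diagram. The helper name `get_base_stage_latents` is my own and is not part of diffusers.

```python
from typing import Any


def get_base_stage_latents(pipe: Any, image: Any, prompt: str, **call_kwargs: Any) -> Any:
    """Run the pipeline but skip VAE decoding, returning the video latents.

    Assumes `pipe` behaves like diffusers' I2VGenXLPipeline: it is callable
    with `prompt`, `image`, and `output_type`, and returns an object that
    exposes the result as `.frames`.
    """
    # Assumption: output_type="latent" makes the pipeline return the latents
    # from the denoising loop rather than frames decoded by the VAE.
    output = pipe(prompt=prompt, image=image, output_type="latent", **call_kwargs)
    return output.frames
```

With the real pipeline, I would expect to call it roughly as `get_base_stage_latents(pipe, image, "a prompt", num_inference_steps=50)`, where `pipe` comes from `I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl")`. Extra keyword arguments are forwarded to the pipeline call unchanged.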