Fix Flux CLIP prompt embeds repeat for num_images_per_prompt > 1 by DN6 · Pull Request #9280 · huggingface/diffusers
So the effect can be subtle for num_images_per_prompt > 1. For example, in the current implementation, with this snippet:
import torch

# `pipe` is assumed to be an already-loaded Flux pipeline.
prompt = ["a cat", "a dog"]
height = 1024
width = 768
images = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_inference_steps=20,
    num_images_per_prompt=2,
    generator=torch.Generator("cpu").manual_seed(1),
    height=height,
    width=width,
).images
The CLIP prompt embeds alternate between cat/dog embeddings
tensor([[-0.5312, 0.2305, 0.1787, ..., -0.3242, 0.5781, -0.7578],
[-0.4375, 0.5430, -0.2891, ..., -0.1133, 0.0938, -0.3477],
[-0.5312, 0.2305, 0.1787, ..., -0.3242, 0.5781, -0.7578],
[-0.4375, 0.5430, -0.2891, ..., -0.1133, 0.0938, -0.3477]],
device='cuda:0', dtype=torch.bfloat16)
but it should be (2 cat embeddings, 2 dog embeddings)
tensor([[-0.5312, 0.2305, 0.1787, ..., -0.3242, 0.5781, -0.7578],
[-0.5312, 0.2305, 0.1787, ..., -0.3242, 0.5781, -0.7578],
[-0.4375, 0.5430, -0.2891, ..., -0.1133, 0.0938, -0.3477],
[-0.4375, 0.5430, -0.2891, ..., -0.1133, 0.0938, -0.3477]],
device='cuda:0', dtype=torch.bfloat16)
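The two orderings above can be reproduced with toy tensors. This is a minimal sketch, not the pipeline's actual code: the names `pooled`, `tiled`, and `grouped` are illustrative, and the 3-dimensional embeddings are stand-ins for the real pooled CLIP output.

```python
import torch

# Toy stand-ins for pooled CLIP embeddings: row 0 = "a cat", row 1 = "a dog".
pooled = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
batch_size, dim = pooled.shape
num_images_per_prompt = 2

# Tiling the batch axis repeats the whole batch: cat, dog, cat, dog.
tiled = pooled.repeat(num_images_per_prompt, 1)

# Repeating along the feature axis and then reshaping keeps each prompt's
# copies adjacent: cat, cat, dog, dog.
grouped = pooled.repeat(1, num_images_per_prompt).view(
    batch_size * num_images_per_prompt, dim
)

# repeat_interleave along dim 0 yields the same grouped order.
assert torch.equal(grouped, pooled.repeat_interleave(num_images_per_prompt, dim=0))

print(tiled[:, 0].tolist())    # [0.0, 1.0, 0.0, 1.0]
print(grouped[:, 0].tolist())  # [0.0, 0.0, 1.0, 1.0]
```

Both results have the same shape, which is part of why the problem is easy to miss: only the row order differs.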
Since the T5 embeddings are in the right order, the CLIP and T5 embeddings end up mismatched for the 2nd and 3rd images:
1st image (should be cat) - uses CLIP cat embedding + T5 cat embedding
2nd image (should be cat) - uses CLIP dog embedding + T5 cat embedding
3rd image (should be dog) - uses CLIP cat embedding + T5 dog embedding
4th image (should be dog) - uses CLIP dog embedding + T5 dog embedding
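The pairing listed above can be checked with plain Python. This is only an illustrative sketch: the strings stand in for embeddings, and `clip_order` models the tiled order observed before the fix rather than any real pipeline call.

```python
prompts = ["cat", "dog"]
num_images_per_prompt = 2

# T5 order (correct): each prompt's copies are adjacent.
t5_order = [p for p in prompts for _ in range(num_images_per_prompt)]
# CLIP order before the fix: the whole prompt list is tiled.
clip_order = prompts * num_images_per_prompt

# Pair up what each generated image actually receives.
pairs = list(zip(clip_order, t5_order))
mismatched = [i for i, (clip, t5) in enumerate(pairs, start=1) if clip != t5]

for i, (clip, t5) in enumerate(pairs, start=1):
    print(f"image {i}: CLIP {clip} + T5 {t5}")
print("mismatched images:", mismatched)  # [2, 3]
```

With two prompts and two images per prompt, half of the batch is conditioned on conflicting text embeddings.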