So the effect can be subtle for num_images_per_prompt > 1. For example, in the current implementation, with this snippet:

prompt = ["a cat", "a dog"]

height = 1024
width = 768

images = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_inference_steps=20,
    num_images_per_prompt=2,
    generator=torch.Generator("cpu").manual_seed(1),
    height=height,
    width=width,
).images

The CLIP prompt embeds alternate between the cat and dog embeddings:

tensor([[-0.5312,  0.2305,  0.1787,  ..., -0.3242,  0.5781, -0.7578],
        [-0.4375,  0.5430, -0.2891,  ..., -0.1133,  0.0938, -0.3477],
        [-0.5312,  0.2305,  0.1787,  ..., -0.3242,  0.5781, -0.7578],
        [-0.4375,  0.5430, -0.2891,  ..., -0.1133,  0.0938, -0.3477]],
       device='cuda:0', dtype=torch.bfloat16)

but they should be grouped per prompt (2 cat embeddings, then 2 dog embeddings):

tensor([[-0.5312,  0.2305,  0.1787,  ..., -0.3242,  0.5781, -0.7578],
        [-0.5312,  0.2305,  0.1787,  ..., -0.3242,  0.5781, -0.7578],
        [-0.4375,  0.5430, -0.2891,  ..., -0.1133,  0.0938, -0.3477],
        [-0.4375,  0.5430, -0.2891,  ..., -0.1133,  0.0938, -0.3477]],
       device='cuda:0', dtype=torch.bfloat16)
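
The two orderings can be illustrated without a model. This is only a sketch: plain strings stand in for the CLIP embedding rows, and the helper names `tile_batch` and `repeat_each` are hypothetical, not part of diffusers.

```python
def tile_batch(items, n):
    """Repeat the whole batch n times: [a, b] -> [a, b, a, b]."""
    return list(items) * n


def repeat_each(items, n):
    """Repeat each item n times in place: [a, b] -> [a, a, b, b]."""
    return [item for item in items for _ in range(n)]


prompts = ["cat", "dog"]  # stand-ins for the two embedding rows

print(tile_batch(prompts, 2))   # alternating order, as in the first tensor
print(repeat_each(prompts, 2))  # grouped order, as in the second tensor
```

On a 2D PyTorch tensor, the same two layouts would come from `Tensor.repeat(n, 1)` and `Tensor.repeat_interleave(n, dim=0)` respectively. Whether the pipeline uses exactly these calls is not something this sketch claims.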

Since the T5 embeddings are in the right order, what happens in this case is:

1st image (should be cat) - uses CLIP cat embedding + T5 cat embedding
2nd image (should be cat) - uses CLIP dog embedding + T5 cat embedding
3rd image (should be dog) - uses CLIP cat embedding + T5 dog embedding
4th image (should be dog) - uses CLIP dog embedding + T5 dog embedding
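
The pairing above generalizes. With B prompts and n images per prompt, an alternating order gives image i the prompt at index i % B, while the grouped order gives i // n. The sketch below lists which images end up with embeddings from two different prompts, assuming the CLIP embeddings alternate and the T5 embeddings are grouped as described. The helper `mismatched_images` is hypothetical and exists only for this illustration.

```python
def mismatched_images(num_prompts, num_images_per_prompt):
    """Return 0-based indices of images whose CLIP and T5 prompts differ.

    Assumes CLIP embeddings alternate (prompt index = i % num_prompts) and
    T5 embeddings are grouped (prompt index = i // num_images_per_prompt).
    """
    total = num_prompts * num_images_per_prompt
    return [
        i
        for i in range(total)
        if i % num_prompts != i // num_images_per_prompt
    ]


print(mismatched_images(2, 2))  # [1, 2]: the 2nd and 3rd images
print(mismatched_images(2, 1))  # []: one image per prompt, nothing to mix up
print(mismatched_images(1, 4))  # []: a single prompt, nothing to mix up
```

This is why the problem only shows up when both the batch has more than one prompt and num_images_per_prompt > 1.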