Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline by alirezafarashah · Pull Request #12531 · huggingface/diffusers

What does this PR do?

This PR fixes a small inconsistency in the output dimension of the _get_t5_prompt_embeds function in the Stable Diffusion 3 pipeline.

Previously, when self.text_encoder_3 was None, the function returned a tensor (torch.zeros) with a sequence length of self.tokenizer_max_length (77), which corresponds to the CLIP encoder. However, the T5 text encoder used in SD3 has a different maximum sequence length (256).

As a result, when text_encoder_3 was available, the combined prompt embeddings had a sequence length of 333 (256 from T5 + 77 from CLIP), but when it was not, the concatenated result had a sequence length of only 154 (77 from CLIP + 77 from the zeros placeholder), leading to inconsistent output dimensions in encode_prompt.
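
For context, a minimal sketch of the shape mismatch described above (the tensor names and the 4096 feature dimension are illustrative placeholders, not the pipeline's actual code):

```python
import torch

# Illustrative shapes only; 4096 stands in for SD3's joint attention dimension.
clip_prompt_embeds = torch.zeros(1, 77, 4096)   # CLIP branch: 77 tokens

# With text_encoder_3 available: 77 + 256 = 333 tokens.
t5_prompt_embeds = torch.zeros(1, 256, 4096)
print(torch.cat([clip_prompt_embeds, t5_prompt_embeds], dim=-2).shape)  # (1, 333, 4096)

# Before this fix, the text_encoder_3-is-None fallback returned a 77-token
# zeros tensor, giving 77 + 77 = 154 tokens instead.
t5_fallback_old = torch.zeros(1, 77, 4096)
print(torch.cat([clip_prompt_embeds, t5_fallback_old], dim=-2).shape)  # (1, 154, 4096)
```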

Motivation and Context

This change ensures consistent tensor shapes across different encoder availability conditions in the SD3 pipeline.
It prevents dimension mismatches and potential runtime errors when text_encoder_3 is None.

Previously, the zeros tensor was sized with self.tokenizer_max_length, which corresponds to CLIP's maximum sequence length (77), rather than T5's longer maximum sequence length (256).
This mismatch led to inconsistent embedding dimensions when the CLIP outputs were concatenated with the T5 placeholder in encode_prompt.

Changes Made
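
The zeros fallback in _get_t5_prompt_embeds is now sized with the T5 maximum sequence length (256) instead of self.tokenizer_max_length (77), so both code paths produce embeddings of the same shape. A hypothetical, self-contained sketch of the corrected fallback shape (the function name and the 4096 joint attention dimension are illustrative, not the pipeline's code):

```python
import torch

def t5_fallback_embeds(
    batch_size: int,
    num_images_per_prompt: int = 1,
    max_sequence_length: int = 256,   # T5's maximum sequence length, no longer CLIP's 77
    joint_attention_dim: int = 4096,  # assumed feature dimension for SD3
    device=None,
    dtype=None,
):
    """Hypothetical standalone version of the zeros placeholder returned when
    text_encoder_3 is None inside _get_t5_prompt_embeds."""
    return torch.zeros(
        (batch_size * num_images_per_prompt, max_sequence_length, joint_attention_dim),
        device=device,
        dtype=dtype,
    )

# 77 (CLIP) + 256 (placeholder) = 333, matching the path where text_encoder_3 is loaded.
print(t5_fallback_embeds(batch_size=1).shape)  # torch.Size([1, 256, 4096])
```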

Before submitting

Who can review?