Add Support for Z-Image Series by JerryWu-code · Pull Request #12703 · huggingface/diffusers
@yiyixuxu
By the way, while testing the `_flash_3` and `_flash_varlen_3` backends, we noticed that the current implementation in `attention_dispatch.py` is incompatible with the latest Flash Attention 3 API.
The recent FA3 commit (Dao-AILab/flash-attention@203b9b3) introduced a `return_attn_probs` argument and changed the default return behavior: the functions now return a single output tensor by default (instead of a tuple), which breaks the current tuple-unpacking logic in diffusers:
```python
out, lse, *_ = flash_attn_3_varlen_func(
    q=query_packed,
    k=key_packed,
    v=value_packed,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_q,
    max_seqlen_k=max_seqlen_k,
    softmax_scale=scale,
    causal=is_causal,
)

out, lse, *_ = flash_attn_3_func(
    q=q,
    k=k,
    v=v,
    softmax_scale=softmax_scale,
    causal=causal,
    qv=qv,
    q_descale=q_descale,
    k_descale=k_descale,
    v_descale=v_descale,
    window_size=window_size,
    attention_chunk=attention_chunk,
    softcap=softcap,
    num_splits=num_splits,
    pack_gqa=pack_gqa,
    deterministic=deterministic,
    sm_margin=sm_margin,
)
```
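For context, here is a minimal illustration (not diffusers code, and the tensor shape is just a stand-in) of what that unpacking does once the call returns a single tensor under the new default:

```python
import torch

# Stand-in for the new default return value: a single output tensor
# rather than a tuple of (out, lse, ...). The shape here is arbitrary.
result = torch.randn(2, 8, 16, 64)

# The existing unpacking pattern then iterates the tensor along dim 0,
# so `out` and `lse` become slices of the attention output (or the
# statement raises if dim 0 has fewer than two entries).
out, lse, *_ = result
print(out.shape, lse.shape)  # torch.Size([8, 16, 64]) torch.Size([8, 16, 64])
```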
We have implemented a fix that handles the new single-tensor return while remaining backward compatible with older FA3 builds:
JerryWu-code@de4c6f1#diff-b027e126a86a26981384b125714e0f3bd9923eaa8322f1ae5f6b53fe3e3481c2
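For reference, a minimal sketch of the general shape such a compatibility shim can take (the helper name `_unpack_fa3_output` is illustrative only; the actual change is in the commit linked above):

```python
import torch


def _unpack_fa3_output(result):
    """Normalize Flash Attention 3 return values across API versions.

    Older FA3 builds return a tuple of (out, lse, ...); newer builds return
    only the output tensor by default. This (illustrative) helper lets the
    tuple-unpacking call sites tolerate both.
    """
    if torch.is_tensor(result):
        # New default: a single output tensor, no log-sum-exp available.
        return result, None
    # Old behavior: unpack the output and log-sum-exp, drop anything else.
    out, lse, *_ = result
    return out, lse


# Call sites would then read, e.g.:
#   out, lse = _unpack_fa3_output(flash_attn_3_func(q=q, k=k, v=v, ...))
```

Returning `None` for `lse` in the single-tensor case is only one option; code paths that actually need the log-sum-exp would presumably have to request it explicitly via the new `return_attn_probs` argument instead.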
Should we include this fix in the current PR, or would you prefer us to open a separate PR for it?