Improve pos embed for Flux.1 inference on Ascend NPU by gameofdimension · Pull Request #12534 · huggingface/diffusers

What does this PR do?

Moving the pos_embed computation from the NPU to the CPU yields roughly a 1.4x speedup in Flux.1's end-to-end latency (see the sketch after the table below).

| model | pos_embed device | e2e latency |
| --- | --- | --- |
| FLUX.1-dev | npu | 31s |
| FLUX.1-dev | cpu | 21s |
| FLUX.1-Fill-dev | npu | 64s |
| FLUX.1-Fill-dev | cpu | 46s |
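
For context, here is a minimal sketch of the idea, not the exact diffusers patch. The helper name `rope_freqs` and its signature are illustrative only; the point is that the rotary-embedding trig tables are computed on CPU and only the finished tensors are moved to the accelerator, since small elementwise trig kernels are comparatively slow on the NPU.

```python
import torch

def rope_freqs(pos: torch.Tensor, dim: int, theta: float = 10000.0):
    # pos: 1D tensor of token positions, possibly living on the NPU.
    # Force the trig computation onto CPU ...
    pos_cpu = pos.to("cpu", torch.float32)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(pos_cpu, freqs)
    cos, sin = angles.cos(), angles.sin()
    # ... and move only the finished tables back to the original device.
    return cos.to(pos.device), sin.to(pos.device)
```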

Tested hardware:

Ascend 910B2C

Repro Code

FLUX.1-dev

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev.png")
```

FLUX.1-Fill-dev

```python
import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("npu")
image = pipe(
    prompt="a white paper cup",
    image=image,
    mask_image=mask,
    height=1632,
    width=1232,
    guidance_scale=30,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-fill-dev.png")
```
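
For anyone reproducing the latency numbers, a small timing harness along these lines could be used. It assumes the `torch_npu` plugin is installed, which exposes the `torch.npu` namespace; the function `time_pipeline` is illustrative, not part of diffusers.

```python
import time
import torch
import torch_npu  # Ascend NPU plugin; registers the "npu" device and torch.npu namespace

def time_pipeline(pipe, **kwargs):
    # Synchronize so previously queued NPU work does not leak into the measurement.
    torch.npu.synchronize()
    start = time.perf_counter()
    result = pipe(**kwargs)
    torch.npu.synchronize()
    return result, time.perf_counter() - start
```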

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.