Remove Qwen Image Redundant RoPE Cache by dg845 · Pull Request #12452 · huggingface/diffusers (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

What does this PR do?

This PR removes self.rope_cache in QwenEmbedRope, so that RoPE frequency caching is only done once through the functools.lru_cache decorator on _compute_video_freqs. It also changes the maxsize argument for lru_cache from None to 128 to prevent the cache from causing OOM errors.

Fixes #12401

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@naykun
@chenxiao111222

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@naykun could you also give it a look?

One thing to note is that the current PR implementation (in 52cf252) isn't quite equivalent to the implementation in main when not compiling since frame is part of the cache key for the lru_cache, but not part of the cache key for self.rope_cache.

@naykun do you think changing the cache key to include frame makes sense? If frame is the same for most calls to _compute_video_freqs, then I think the two implementations are approximately equivalent.

I think the use of a frame based notation is present to allow extension to videos easily. For images, this shouldn't matter at all because the idx is always a constant and since img_shapes is supplemented just once from the pipeline:

img_shapes: Optional[List[Tuple[int, int, int]]] = None,

(example)

do you think changing the cache key to include frame makes sense? If frame is the same for most calls to _compute_video_freqs, then I think the two implementations are approximately equivalent.

For the rope_cache, I am not sure how much of a readability compromise that would be, though.

Gentle ping @naykun to take a look :).

Hi everyone!
Apologies for missing the notifications earlier.

As @sayakpaul mentioned, we’re currently keeping the frame parameter in place to maintain future compatibility with video-related features. For text-to-image and image editing scenarios, the frame will always be 1, and the idx will vary depending on the input conditions or output images. The changes in this PR won’t affect that logic or impact caching efficiency.

That said, I’d like to double-check with @sayakpaul: using lru_cache won’t interfere with compilation in any edge cases, right?

Running

RUN_SLOW=1 RUN_COMPILE=1 pytest tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests

gives no errors (at least locally on DGX).