[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines by bghira · Pull Request #4015 · huggingface/diffusers



Conversation


bghira

What does this PR do?

Fixes #4003

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

bghira added 4 commits

July 9, 2023 15:20

…rst_inference_step for img2img

@HuggingFaceDocBuilderDev

The documentation is not available anymore as the PR was closed or merged.

sayakpaul

@patrickvonplaten

Hey @bghira,

Thanks for the PR, could you maybe post a couple of examples with diffusers code that show how this approach is better? Or I guess the workflow is as stated here: #4003 (comment) ?

Copy-pasting it here:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, max_inference_steps=10).images

# If you have low vram.
del pipe
import gc
gc.collect()
torch.cuda.empty_cache()
torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, first_inference_step=10, add_noise=False).images

@patrickvonplaten

Hey @bghira,

Very cool PR, I think I've understood the full workflow now. So essentially what we're doing here is using the SD-XL base to denoise the image at high-noise timesteps and the refiner to denoise it at low-noise timesteps.

In that sense your approach here is very similar to the idea of eDiffi - it's very cool that you made it work for SD-XL!

From the eDiffi paper:

Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored, and the task changes to producing outputs of high visual fidelity.

=> So in a sense "base" is used here mostly to generate text-aligned content and the refiner is used for high visual fidelity results.

Think we need to add tests and docs here as well :-)

patrickvonplaten

@@ -619,6 +638,8 @@ def __call__(
] = None,
strength: float = 0.3,
num_inference_steps: int = 50,
max_inference_steps: Optional[int] = None,


So if you use percentages, the precision lost when the calculation is converted to an integer step count means you might misalign timesteps, which results in artifacting.


Fair point!


I have an idea: I will implement a helper method to return a pair of values from two inputs.


Hmm can you explain how you might accidentally misalign?


Actually I think this can't really happen if we always round, e.g.:

max_inference_steps = int(round(denoising_end * num_inference_steps))
first_inference_steps = int(round(denoising_start * num_inference_steps))

denoising_end should always be equal to 1 - denoising_start and in this case we will always have:

num_inference_steps = max_inference_steps + first_inference_steps

no?


It happens when you have precise values for strength that leave remainders which are discarded by the int() cast. I have tested it:

In this test, num_inference_steps is set to 27 and strength is set to 0.73. The calculation says the model should run for about 19.71 steps, but since we can't run a fractional step, int() truncates this to 19, while the value for begin_inference_step is 7.29, which truncates to 7.

This results in a misalignment of one step, which is, apparently, enough to disrupt the output altogether.
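
To make the arithmetic concrete, here is a minimal sketch of the truncation described above (plain Python, not pipeline code; the variable names are only illustrative):

num_inference_steps = 27
strength = 0.73

# Steps the refiner runs, as derived from strength (int() discards the remainder).
refiner_steps = int(num_inference_steps * strength)               # int(19.71) -> 19
# Step index at which it is supposed to pick up from the base model.
begin_inference_step = int(num_inference_steps * (1 - strength))  # int(7.29)  -> 7

# 7 + 19 = 26, not 27: one scheduled timestep is silently dropped, which is
# the one-step misalignment that produces the artifacting described above.
print(begin_inference_step, refiner_steps, begin_inference_step + refiner_steps)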

We can use round(), but instead, I have opted for what I've considered a more robust approach.

Furthermore, there are some prompts/seeds/subject material where these artifacts prove useful in making the output more variable and creative. Having the option to set those and override our best judgement seems like one of the things that makes Diffusers great, and I wanted to retain that.


+1 on @patrickvonplaten's suggestion of using a percentage.

It seems to me we will just have to set denoising_end = denoising_start, and default it to (1 - strength)?

Then:
final_inference_steps = int(denoising_end * num_inference_steps)
begin_inference_steps = int(denoising_start * num_inference_steps)

So in the example with num_inference_steps = 27 and strength = 0.73, we will have final_inference_steps = begin_inference_steps = 7, i.e. run 7 steps in the base model and 20 in the refiner - is this what's intended?


I've tested this, and while it seems like it should work, it has really broken the results for me overall. I am digging into it.


@patrickvonplaten

Very excited about this PR. This makes a lot of sense. From a user perspective I think it would be nicer to provide fractions/%'s, such as denoising_end=0.6 and denoising_start=0.4.

I think this has two advantages:

So the API could look as follows:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, denoising_end=0.5).images

# If you have low vram.
del pipe
import gc
gc.collect()
torch.cuda.empty_cache()
torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, denoising_start=0.5).images

@bghira

Here is an updated example:

Example:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 50% of them using either model.
total_num_steps = 20
refiner_strength = 0.5
# These values can be manipulated manually to get different results.
final_inference_step, begin_inference_step = pipe.timesteps_from_strength(refiner_strength, total_num_steps)
image = pipe(prompt=prompt, output_type="latent",
            num_inference_steps=total_num_steps, final_inference_step=final_inference_step).images

# If you have low vram.
# del pipe
# import gc
# gc.collect()
# torch.clear_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

images = pipe(prompt=prompt, image=image,
            num_inference_steps=total_num_steps, begin_inference_step=begin_inference_step).images
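
For context, a helper along those lines could compute the split roughly as follows. This is only a sketch of the idea, assuming the two values should always sum to the scheduled total; the actual timesteps_from_strength added by the PR may differ in its details:

def timesteps_from_strength(refiner_strength, total_num_steps):
    # Hypothetical sketch: number of steps the base model completes before
    # handing its latents to the refiner.
    final_inference_step = int(round(total_num_steps * (1.0 - refiner_strength)))
    # The refiner picks up exactly where the base stopped, so no scheduled
    # timestep is dropped between the two pipelines.
    begin_inference_step = final_inference_step
    return final_inference_step, begin_inference_step

# e.g. 20 scheduled steps at 50% refiner strength -> (10, 10)
print(timesteps_from_strength(0.5, 20))

Deriving both values from a single rounding operation is what keeps the base and refiner schedules aligned, which is the point of having a helper at all.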

@bghira

I took a crack at that today, for a few hours. I couldn't make the percentages work reliably; the risk of artifacting is pretty great. If anyone else would like to take a swing, I welcome more commits and attention on this work.

Here's a comparison grid of the demo prompt with the fully denoised base model samples on the left column, and the partially denoised base model samples on the right column.

The image at the top centre is what the base model did on its own.

The results are using DDIM with 40 steps divided between the models from 10% up to 90% refiner strength.

https://tripleback.net/public/partial-diffusion-comparison.png

bghira and others added 2 commits

July 10, 2023 19:06

@patrickvonplaten

sayakpaul


Looks fantastic!

May I also have a brief explanation of the two separate test cases being added?

IIUC, their purpose is to test for:


@patrickvonplaten @sayakpaul

Co-authored-by: Sayak Paul spsayakpaul@gmail.com

@patrickvonplaten

Failing test is unrelated - merging

@n00mkrad

@patrickvonplaten @bghira

The docs for this seem to be wrong.
The docs say that denoising_end and denoising_start should be set to high_noise_frac, but this results in extremely smooth denoised images with no detail.

It should actually be n_steps * high_noise_frac for this value. However, when using this, the progress bar incorrectly shows about 10x too many steps, but the resulting image is perfect.
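
For reference, the documented base-plus-refiner workflow being discussed passes the fraction directly to both pipelines. This is a sketch following the merged denoising_end/denoising_start API; the SDXL 1.0 model IDs, the 40-step count, and the 0.8 fraction are illustrative values, not taken from this thread:

from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
refiner = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")

prompt = "An astronaut riding a green horse"
n_steps = 40
high_noise_frac = 0.8

# The base model denoises the first 80% of the schedule and returns latents.
image = base(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images
# The refiner picks up at the same fraction and denoises the remaining 20%.
image = refiner(prompt=prompt, image=image, num_inference_steps=n_steps, denoising_start=high_noise_frac).images[0]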

@n00mkrad

I was wrong. However I don't see the point of this method.

Here is fraction 0.7 (top) vs 1.0 (bottom):

[comparison image]

Frac 1.0 also runs faster than 0.7.

I feel like I'm missing something here. Frac 1.0 should skip the refiner entirely, right? But then why is my result so, so much sharper than the 70/30 split of base/refiner?

@n00mkrad

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

@AmericanPresidentJimmyCarter

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

Yes, I already opened an issue and I told them about the issue when they were implementing it, but it was still merged.

#4060

@bghira

The Karras sigmas pattern is generally less tested and sees more issues like that. I have ceased using them for a month or more now, in favour of the rescaled DDIM timesteps.

@n00mkrad

DPM++ 2M (without karras) is also very washed out and less detailed than with the regular method.

@bghira

Can you open a new issue report if you're seeing issues you'd like resolved? They're unlikely to receive the proper attention in a comment thread on a merged PR.

@n00mkrad

I guess we're done here as #4060 already mentions improvements to this approach.

@patrickvonplaten

Let's maybe continue the discussion in #4115 :-)

orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request

Aug 1, 2023

…uggingface#4015)

Co-authored-by: Pedro Cuenca pedro@huggingface.co

Co-authored-by: Sayak Paul spsayakpaul@gmail.com


Co-authored-by: bghira bghira@users.github.com
Co-authored-by: Patrick von Platen patrick.v.platen@gmail.com
Co-authored-by: Pedro Cuenca pedro@huggingface.co
Co-authored-by: Sayak Paul spsayakpaul@gmail.com


@bghira bghira deleted the partial-diffusion branch

September 29, 2023 16:35

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request

Dec 25, 2023

…uggingface#4015)

Co-authored-by: Pedro Cuenca pedro@huggingface.co

Co-authored-by: Sayak Paul spsayakpaul@gmail.com


Co-authored-by: bghira bghira@users.github.com
Co-authored-by: Patrick von Platen patrick.v.platen@gmail.com
Co-authored-by: Pedro Cuenca pedro@huggingface.co
Co-authored-by: Sayak Paul spsayakpaul@gmail.com

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request

Apr 26, 2024

…uggingface#4015)

Co-authored-by: Pedro Cuenca pedro@huggingface.co

Co-authored-by: Sayak Paul spsayakpaul@gmail.com


Co-authored-by: bghira bghira@users.github.com
Co-authored-by: Patrick von Platen patrick.v.platen@gmail.com
Co-authored-by: Pedro Cuenca pedro@huggingface.co
Co-authored-by: Sayak Paul spsayakpaul@gmail.com