[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines by bghira · Pull Request #4015 · huggingface/diffusers
What does this PR do?
Fixes #4003
Before submitting
- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline?
- Did you read our philosophy doc (important for complex PRs)?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
- General functionalities: @patrickvonplaten and @sayakpaul
bghira added 4 commits
…rst_inference_step for img2img
The documentation is not available anymore as the PR was closed or merged.
Hey @bghira,
Thanks for the PR, could you maybe post a couple of examples with `diffusers` code that show how this approach is better? Or I guess the workflow is as stated here: #4003 (comment)?
Copy-pasting it here:
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, max_inference_steps=10).images

# If you have low vram:
del pipe
import gc
gc.collect()
torch.cuda.empty_cache()
torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, first_inference_step=10, add_noise=False).images
```
Hey @bghira,
Very cool PR, I think I've understood the full workflow now. So essentially what we're doing here is that we use SD-XL base to denoise the image for high noise timesteps and we use the refiner to denoise the images for low noise timesteps.
In that sense your approach here is very similar to the idea of eDiffi - it's very cool that you made it work for SD-XL!
From the eDiffi paper:
> Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored, and the task changes to producing outputs of high visual fidelity.
=> So in a sense "base" is used here mostly to generate text-aligned content and the refiner is used for high visual fidelity results.
Think we need to add tests and docs here as well :-)
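To make the split concrete, here is a toy illustration (plain Python, using the example's 20-step schedule and this PR's `max_inference_steps` / `first_inference_step` parameters; it only sketches the idea, not the pipeline internals):

```python
# Toy illustration: partition a 20-step schedule so the base model
# handles the high-noise steps and the refiner the low-noise steps.
num_inference_steps = 20
max_inference_steps = 10    # base pipeline stops after step 10
first_inference_step = 10   # refiner pipeline resumes at step 10

steps = list(range(num_inference_steps))
base_steps = steps[:max_inference_steps]      # steps 0-9 (high noise)
refiner_steps = steps[first_inference_step:]  # steps 10-19 (low noise)

# Every scheduled step is executed exactly once across the two pipelines.
assert base_steps + refiner_steps == steps
```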
```python
@@ -619,6 +638,8 @@ def __call__(
        ] = None,
        strength: float = 0.3,
        num_inference_steps: int = 50,
        max_inference_steps: Optional[int] = None,
```
So if you use percentages, the rounding error when converting the calculation to an integer means you might misalign timesteps, which results in artifacting.
Fair point!
I have an idea: I will implement a helper method to return a pair of values from two inputs.
Hmm can you explain how you might accidentally misalign?
Actually I think this can't really happen if we always round, e.g.:

```python
max_inference_steps = int(round(denoising_end * num_inference_steps))
first_inference_steps = int(round(denoising_start * num_inference_steps))
```

`denoising_end` should always be equal to `1 - denoising_start`, and in this case we will always have:

`num_inference_steps = max_inference_steps + first_inference_steps`

no?
It happens when you have precise values for `strength` that result in remainders that are discarded by the `int()` cast. I have tested it:

In this test, `num_inference_steps` is set to `27` and `strength` is set to `0.73` - the calculation says that the model should run for about `19.71` steps, but since we can't run a fractional step, this is truncated by `int()` to `19`; however, the value for `begin_inference_step` is `7.29`, which truncates to `7`.
This results in a misalignment of one step, which is, apparently, enough to disrupt the output altogether.
We can use `round()`, but instead I have opted for what I consider a more robust approach.
Furthermore, there are some prompts/seeds/subject material where these artifacts prove useful in making the output more variable and creative. Having the option to set those and override our best judgement seems like one of the things that makes Diffusers great, and I wanted to retain that.
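To make the arithmetic concrete, here is a quick sketch of the off-by-one (plain Python, using the numbers from the test above):

```python
num_inference_steps = 27
strength = 0.73

# Plain int() truncation: the two halves no longer cover the full schedule.
base_steps = int(num_inference_steps * strength)           # int(19.71) -> 19
refiner_start = int(num_inference_steps * (1 - strength))  # int(7.29)  -> 7
print(base_steps + refiner_start)  # 26: one step short of 27

# round() repairs this case (20 + 7 = 27), though exact .5 remainders
# can still round the same way on both sides and break the identity.
print(round(num_inference_steps * strength)
      + round(num_inference_steps * (1 - strength)))  # 27
```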
+1 on @patrickvonplaten's suggestion of using a percentage.

It seems to me we will just have to set `denoising_end = denoising_start`, and default it to be `(1 - strength)`?

Then, `final_inference_steps = int(denoising_end * num_inference_steps)` and `begin_inference_steps = int(denoising_start * num_inference_steps)`.

So in the example `num_inference_steps = 27`, `strength = 0.73`, we will have `final_inference_steps = begin_inference_steps = 7`, i.e. run `7` steps in the base model and `20` in the refiner - is this what's intended?
I've tested this, and while it seems like it should work, it has really broken the results for me overall. I am digging into it.
Very excited about this PR. This makes a lot of sense. From a user perspective I think it would be nicer to provide fractions/%'s, such as `denoising_end=0.6` and `denoising_start=0.4`.

I think this has two advantages:

- 1.) IMO it's easier to grasp the concept, e.g. "For 60% of the denoising process we use the base model" and "For the final 40% we use the refiner".
- 2.) It should create a better workflow, as the user can change `num_inference_steps` just like before without having to change `denoising_start` or `denoising_end`.

So this API could look as follows:
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, denoising_end=0.5).images

# If you have low vram:
del pipe
import gc
gc.collect()
torch.cuda.empty_cache()
torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, denoising_start=0.5).images
```
Here is an updated example:
Example:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 50% of them using either model.
total_num_steps = 20
refiner_strength = 0.5

# These values can be manipulated manually to get different results.
final_inference_step, begin_inference_step = pipe.timesteps_from_strength(refiner_strength, total_num_steps)

image = pipe(prompt=prompt, output_type="latent",
             num_inference_steps=total_num_steps, final_inference_step=final_inference_step).images

# If you have low vram:
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")  # OR pipe.enable_sequential_cpu_offload() OR pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

images = pipe(prompt=prompt, image=image,
              num_inference_steps=total_num_steps, begin_inference_step=begin_inference_step).images
```
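For reference, a sketch of what a helper like this might compute (hypothetical, not necessarily the PR's actual implementation): compute the split point once so base and refiner always agree, avoiding the truncation mismatch discussed above.

```python
def timesteps_from_strength(strength: float, num_inference_steps: int):
    # Hypothetical sketch, not the PR's actual implementation.
    # Compute the split point once so the base model's last step and the
    # refiner's first step always agree, whatever the rounding remainder.
    split = round(num_inference_steps * (1.0 - strength))
    final_inference_step = split   # base pipeline stops here
    begin_inference_step = split   # refiner pipeline resumes here
    return final_inference_step, begin_inference_step

# e.g. refiner_strength=0.5 with 20 steps -> (10, 10): an even split.
```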
I took a crack at that today, for a few hours. I couldn't make the percentages work reliably; the risk of artifacting is pretty great. If anyone else would like to take a swing, I welcome more commits and attention on this work.
Here's a comparison grid of the demo prompt with the fully denoised base model samples on the left column, and the partially denoised base model samples on the right column.
The image at the top centre is what the base model did on its own.
The results are using DDIM with 40 steps divided between the models from 10% up to 90% refiner strength.
https://tripleback.net/public/partial-diffusion-comparison.png
bghira and others added 2 commits
Looks fantastic!
May I also have a brief explanation of the two separate test cases being added?

IIUC, their purpose is to test for:

- Make sure this feature works for a set of schedulers
- Make sure `denoising_end` and `denoising_start` behaviour is robust
- Make sure the internal computations for steps within the schedulers are as expected (see the sketch below)
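For illustration, a minimal sketch of what a step-computation test could look like (hypothetical names and logic, not the tests actually added in this PR):

```python
import numpy as np
import pytest

@pytest.mark.parametrize("num_steps,strength", [(20, 0.5), (27, 0.73), (40, 0.9)])
def test_partial_diffusion_partitions_timesteps(num_steps, strength):
    # Toy linear schedule standing in for a scheduler's timesteps.
    timesteps = np.linspace(999, 0, num_steps).round().astype(int)

    # Compute the split point once so base and refiner always agree.
    split = round(num_steps * (1.0 - strength))
    base_part, refiner_part = timesteps[:split], timesteps[split:]

    # No timestep may be dropped or duplicated across the two pipelines.
    assert np.concatenate([base_part, refiner_part]).tolist() == timesteps.tolist()
```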
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Failing test is unrelated - merging
The docs for this seem to be wrong.

The docs say that `denoising_end` and `denoising_start` should be set to `high_noise_frac`, but this results in extremely smooth denoised images with no detail. It should actually be `n_steps * high_noise_frac` for this value. However, when using this, the progress bar incorrectly shows about 10x too many steps, but the resulting image is perfect.
I was wrong. However, I don't see the point of this method.
Here is fraction 0.7 (top) vs 1.0 (bottom):
Frac 1.0 also runs faster than 0.7.
I feel like I'm missing something here. Frac 1.0 should skip the refiner entirely, right? But then why is my result so, so much sharper than the 70/30 split of base/refiner?
After hours of debugging I found the issue:
THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!
Should either be fixed (if possible) or mentioned in the docs.
> After hours of debugging I found the issue:
> THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!
> Should either be fixed (if possible) or mentioned in the docs.
Yes, I already opened an issue, and I told them about it while they were implementing this, but it was still merged.
The Karras sigmas pattern is generally less tested and sees more issues like that. I have ceased using them for a month or more now, in favour of the rescaled DDIM timesteps.
DPM++ 2M (without Karras) is also very washed out and less detailed than with the regular method.
Can you open a new issue report if you're seeing issues you'd like resolved? They're unlikely to receive the proper attention in a comment thread on a merged PR.
I guess we're done here as #4060 already mentions improvements to this approach.
Let's maybe continue the discussion in #4115 :-)
orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request
diffusers#4003 - initial implementation of max_inference_steps
diffusers#4003 - initial implementation of max_inference_steps and first_inference_step for img2img
diffusers#4003 - use first_inference_step as an input arg for get_timestamps in img2img
diffusers#4003 Do not add noise during img2img when we have a defined first timestep
diffusers#4003 Mild updates after revert
diffusers#4003 Missing change
Show implementation with denoising_start and end
Apply suggestions from code review
Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
move to 0.19.0dev
Apply suggestions from code review
add exhaustive tests
add docs
finish
Apply suggestions from code review
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
- make style
Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request
orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request
bghira deleted the partial-diffusion branch
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request