DeepFloyd IF (original) (raw)
Overview
DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. The model is a modular composed of a frozen text encoder and three cascaded pixel diffusion modules:
- Stage 1: a base model that generates 64x64 px image based on text prompt,
- Stage 2: a 64x64 px => 256x256 px super-resolution model, and
- Stage 3: a 256x256 px => 1024x1024 px super-resolution model Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. Stage 3 is Stability AI’s x4 Upscaling model. The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
Usage
Before you can use IF, you need to accept its usage conditions. To do so:
- Make sure to have a Hugging Face account and be logged in.
- Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card will auto accept for the other IF models.
- Make sure to login locally. Install
huggingface_hub
:
pip install huggingface_hub --upgrade
run the login function in a Python shell:
from huggingface_hub import login
login()
and enter your Hugging Face Hub access token.
Next we install diffusers
and dependencies:
pip install -q diffusers accelerate transformers
The following sections give more in-detail examples of how to use IF. Specifically:
- Text-to-Image Generation
- Image-to-Image Generation
- Inpainting
- Reusing model weights
- Speed optimization
- Memory optimization
Available checkpoints
- Stage-1
- Stage-2
- Stage-3
Text-to-Image Generation
By default diffusers makes use of model cpu offloading to run the whole IF pipeline with as little as 14 GB of VRAM.
from diffusers import DiffusionPipeline from diffusers.utils import pt_to_pil, make_image_grid import torch
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()
stage_2 = DiffusionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()
safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' generator = torch.manual_seed(1)
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
stage_1_output = stage_1( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt" ).images
stage_2_output = stage_2( image=stage_1_output, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=3)
Text Guided Image-to-Image Generation
The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFImg2ImgPipeline and IFImg2ImgSuperResolutionPipeline pipelines.
Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the components argument as explained here.
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil, load_image, make_image_grid import torch
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" original_image = load_image(url) original_image = original_image.resize((768, 512))
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()
safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()
prompt = "A fantasy landscape in style minecraft" generator = torch.manual_seed(1)
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
stage_1_output = stage_1( image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images
stage_2_output = stage_2( image=stage_1_output, original_image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=4)
Text Guided Inpainting Generation
The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.
Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the ~DiffusionPipeline.components()
function as explained here.
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil, load_image, make_image_grid import torch
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" original_image = load_image(url)
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" mask_image = load_image(url)
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()
safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()
prompt = "blue sunglasses" generator = torch.manual_seed(1)
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
stage_1_output = stage_1( image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images
stage_2_output = stage_2( image=stage_1_output, original_image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=5)
Converting between different pipelines
In addition to being loaded with from_pretrained
, Pipelines can also be loaded directly from each other.
from diffusers import IFPipeline, IFSuperResolutionPipeline
pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0") pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline
pipe_1 = IFImg2ImgPipeline(**pipe_1.components) pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline
pipe_1 = IFInpaintingPipeline(**pipe_1.components) pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
Optimizing for speed
The simplest optimization to run IF faster is to move all model components to the GPU.
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")
You can also run the diffusion process for a shorter number of timesteps.
This can either be done with the num_inference_steps
argument:
pipe("", num_inference_steps=30)
Or with the timesteps
argument:
from diffusers.pipelines.deepfloyd_if import fast27_timesteps
pipe("", timesteps=fast27_timesteps)
When doing image variation or inpainting, you can also decrease the number of timesteps with the strength argument. The strength argument is the amount of noise to add to the input image which also determines how many steps to run in the denoising process. A smaller number will vary the image less but run faster.
pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")
image = pipe(image=image, prompt="", strength=0.3).images
You can also use torch.compile. Note that we have not exhaustively tested torch.compile
with IF and it might not give expected results.
from diffusers import DiffusionPipeline import torch
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")
pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True) pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
Optimizing for memory
When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.
Either the model based CPU offloading,
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()
or the more aggressive layer based CPU offloading.
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_sequential_cpu_offload()
Additionally, T5 can be loaded in 8bit precision
from transformers import T5EncoderModel
text_encoder = T5EncoderModel.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" )
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0",
text_encoder=text_encoder,
unet=None,
device_map="auto",
)
prompt_embeds, negative_embeds = pipe.encode_prompt("")
For CPU RAM constrained machines like Google Colab free tier where we can’t load all model components to the CPU at once, we can manually only load the pipeline with the text encoder or UNet when the respective model components are needed.
from diffusers import IFPipeline, IFSuperResolutionPipeline import torch import gc from transformers import T5EncoderModel from diffusers.utils import pt_to_pil, make_image_grid
text_encoder = T5EncoderModel.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" )
pipe = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0",
text_encoder=text_encoder,
unet=None,
device_map="auto",
)
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
del text_encoder del pipe gc.collect() torch.cuda.empty_cache()
pipe = IFPipeline.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto" )
generator = torch.Generator().manual_seed(0) stage_1_output = pipe( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images
del pipe gc.collect() torch.cuda.empty_cache()
pipe = IFSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto" )
generator = torch.Generator().manual_seed(0) stage_2_output = pipe( image=stage_1_output, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, rows=2)
Available Pipelines:
Pipeline | Tasks | Colab |
---|---|---|
pipeline_if.py | Text-to-Image Generation | - |
pipeline_if_superresolution.py | Text-to-Image Generation | - |
pipeline_if_img2img.py | Image-to-Image Generation | - |
pipeline_if_img2img_superresolution.py | Image-to-Image Generation | - |
pipeline_if_inpainting.py | Image-to-Image Generation | - |
pipeline_if_inpainting_superresolution.py | Image-to-Image Generation | - |
IFPipeline
class diffusers.IFPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = None width: typing.Optional[int] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - num_inference_steps (
int
, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - height (
int
, optional, defaults to self.unet.config.sample_size) — The height in pixels of the generated image. - width (
int
, optional, defaults to self.unet.config.sample_size) — The width in pixels of the generated image. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch
pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt" ... ).images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
safety_modules = { ... "feature_extractor": pipe.feature_extractor, ... "safety_checker": pipe.safety_checker, ... "watermarker": pipe.watermarker, ... } super_res_2_pipe = DiffusionPipeline.from_pretrained( ... "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ... ) super_res_2_pipe.enable_model_cpu_offload()
image = super_res_2_pipe( ... prompt=prompt, ... image=image, ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.
IFSuperResolutionPipeline
class diffusers.IFSuperResolutionPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( prompt: typing.Union[str, typing.List[str]] = None height: int = None width: int = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - height (
int
, optional, defaults to None) — The height in pixels of the generated image. - width (
int
, optional, defaults to None) — The width in pixels of the generated image. - image (
PIL.Image.Image
,np.ndarray
,torch.Tensor
) — The image to be upscaled. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional, defaults to None) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor. - noise_level (
int
, optional, defaults to 250) — The amount of noise to add to the upscaled image. Must be in the range[0, 1000)
- clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch
pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.
IFImg2ImgPipeline
class diffusers.IFImg2ImgPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.7 num_inference_steps: int = 80 timesteps: typing.List[int] = None guidance_scale: float = 10.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - image (
torch.Tensor
orPIL.Image.Image
) —Image
, or tensor representing an image batch, that will be used as the starting point for the process. - strength (
float
, optional, defaults to 0.7) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. - num_inference_steps (
int
, optional, defaults to 80) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 10.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))
pipe = IFImg2ImgPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", ... variant="fp16", ... torch_dtype=torch.float16, ... ) pipe.enable_model_cpu_offload()
prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
image = pipe( ... image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", ... text_encoder=None, ... variant="fp16", ... torch_dtype=torch.float16, ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.
IFImg2ImgSuperResolutionPipeline
class diffusers.IFImg2ImgSuperResolutionPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- image (
torch.Tensor
orPIL.Image.Image
) —Image
, or tensor representing an image batch, that will be used as the starting point for the process. - original_image (
torch.Tensor
orPIL.Image.Image
) — The original image thatimage
was varied from. - strength (
float
, optional, defaults to 0.8) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. - prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor. - noise_level (
int
, optional, defaults to 250) — The amount of noise to add to the upscaled image. Must be in the range[0, 1000)
- clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))
pipe = IFImg2ImgPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", ... variant="fp16", ... torch_dtype=torch.float16, ... ) pipe.enable_model_cpu_offload()
prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
image = pipe( ... image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", ... text_encoder=None, ... variant="fp16", ... torch_dtype=torch.float16, ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.
IFInpaintingPipeline
class diffusers.IFInpaintingPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 1.0 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - image (
torch.Tensor
orPIL.Image.Image
) —Image
, or tensor representing an image batch, that will be used as the starting point for the process. - mask_image (
PIL.Image.Image
) —Image
, or tensor representing an image batch, to maskimage
. White pixels in the mask will be repainted, while black pixels will be preserved. Ifmask_image
is a PIL image, it will be converted to a single channel (luminance) before use. If it’s a tensor, it should contain one color channel (L) instead of 3, so the expected shape would be(B, H, W, 1)
. - strength (
float
, optional, defaults to 1.0) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. - num_inference_steps (
int
, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image
pipe = IFInpaintingPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16 ... ) pipe.enable_model_cpu_offload()
prompt = "blue sunglasses" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
image = pipe( ... image=original_image, ... mask_image=mask_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, ... mask_image=mask_image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.
IFInpaintingSuperResolutionPipeline
class diffusers.IFInpaintingSuperResolutionPipeline
( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )
__call__
( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 0 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput
or tuple
Parameters
- image (
torch.Tensor
orPIL.Image.Image
) —Image
, or tensor representing an image batch, that will be used as the starting point for the process. - original_image (
torch.Tensor
orPIL.Image.Image
) — The original image thatimage
was varied from. - mask_image (
PIL.Image.Image
) —Image
, or tensor representing an image batch, to maskimage
. White pixels in the mask will be repainted, while black pixels will be preserved. Ifmask_image
is a PIL image, it will be converted to a single channel (luminance) before use. If it’s a tensor, it should contain one color channel (L) instead of 3, so the expected shape would be(B, H, W, 1)
. - strength (
float
, optional, defaults to 0.8) — Conceptually, indicates how much to transform the referenceimage
. Must be between 0 and 1.image
will be used as a starting point, adding more noise to it the larger thestrength
. The number of denoising steps depends on the amount of noise initially added. Whenstrength
is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified innum_inference_steps
. A value of 1, therefore, essentially ignoresimage
. - prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - num_inference_steps (
int
, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - timesteps (
List[int]
, optional) — Custom timesteps to use for the denoising process. If not defined, equal spacednum_inference_steps
timesteps are used. Must be in descending order. - guidance_scale (
float
, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - num_images_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.Generator
orList[torch.Generator]
, optional) — One or a list of torch generator(s)to make generation deterministic. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generate image. Choose betweenPIL:PIL.Image.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a~pipelines.stable_diffusion.IFPipelineOutput
instead of a plain tuple. - callback (
Callable
, optional) — A function that will be called everycallback_steps
steps during inference. The function will be called with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function will be called. If not specified, the callback will be called at every step. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
indiffusers.models.attention_processor. - noise_level (
int
, optional, defaults to 0) — The amount of noise to add to the upscaled image. Must be in the range[0, 1000)
- clean_caption (
bool
, optional, defaults toTrue
) — Whether or not to clean the caption before creating embeddings. Requiresbeautifulsoup4
andftfy
to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
Returns
~pipelines.stable_diffusion.IFPipelineOutput
or tuple
~pipelines.stable_diffusion.IFPipelineOutput
if return_dict
is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of
bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the
safety_checker`.
Function invoked when calling the pipeline for generation.
Examples:
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image
pipe = IFInpaintingPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16 ... ) pipe.enable_model_cpu_offload()
prompt = "blue sunglasses"
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt) image = pipe( ... image=original_image, ... mask_image=mask_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images
pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")
super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()
image = super_res_1_pipe( ... image=image, ... mask_image=mask_image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")
encode_prompt
( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )
Parameters
- prompt (
str
orList[str]
, optional) — prompt to be encoded - do_classifier_free_guidance (
bool
, optional, defaults toTrue
) — whether to use classifier free guidance or not - num_images_per_prompt (
int
, optional, defaults to 1) — number of images that should be generated per prompt - device — (
torch.device
, optional): torch device to place the resulting embeddings on - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
. instead. If not defined, one has to passnegative_prompt_embeds
. instead. Ignored when not using guidance (i.e., ignored ifguidance_scale
is less than1
). - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.Tensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - clean_caption (bool, defaults to
False
) — IfTrue
, the function will preprocess and clean the provided caption before encoding.
Encodes the prompt into text encoder hidden states.