IF (original) (raw)

Overview

DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. The model is a modular composed of a frozen text encoder and three cascaded pixel diffusion modules:

Usage

Before you can use IF, you need to accept its usage conditions. To do so:

  1. Make sure to have a Hugging Face account and be logged in
  2. Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card will auto accept for the other IF models.
  3. Make sure to login locally. Install huggingface_hub

pip install huggingface_hub --upgrade

run the login function in a Python shell

from huggingface_hub import login

login()

and enter your Hugging Face Hub access token.

Next we install diffusers and dependencies:

pip install diffusers accelerate transformers safetensors

The following sections give more in-detail examples of how to use IF. Specifically:

Available checkpoints

Demo Hugging Face Spaces

Google Colab Open In Colab

Text-to-Image Generation

By default diffusers makes use of model cpu offloadingto run the whole IF pipeline with as little as 14 GB of VRAM.

from diffusers import DiffusionPipeline from diffusers.utils import pt_to_pil import torch

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()

stage_2 = DiffusionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()

safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt" ).images pt_to_pil(image)[0].save("./if_stage_I.png")

image = stage_2( image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images pt_to_pil(image)[0].save("./if_stage_II.png")

image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images image[0].save("./if_stage_III.png")

Text Guided Image-to-Image Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the ~DiffusionPipeline.components() function as explained here.

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil

import torch

from PIL import Image import requests from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))

stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()

stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()

safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1( image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images pt_to_pil(image)[0].save("./if_stage_I.png")

image = stage_2( image=image, original_image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images pt_to_pil(image)[0].save("./if_stage_II.png")

image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images image[0].save("./if_stage_III.png")

Text Guided Inpainting Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the ~DiffusionPipeline.components() function as explained here.

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch

from PIL import Image import requests from io import BytesIO

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image

stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) stage_1.enable_model_cpu_offload()

stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ) stage_2.enable_model_cpu_offload()

safety_modules = { "feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker, } stage_3 = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ) stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses" generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1( image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images pt_to_pil(image)[0].save("./if_stage_I.png")

image = stage_2( image=image, original_image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images pt_to_pil(image)[0].save("./if_stage_II.png")

image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images image[0].save("./if_stage_III.png")

Converting between different pipelines

In addition to being loaded with from_pretrained, Pipelines can also be loaded directly from each other.

from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0") pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components) pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components) pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)

Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")

You can also run the diffusion process for a shorter number of timesteps.

This can either be done with the num_inference_steps argument

pipe("", num_inference_steps=30)

Or with the timesteps argument

from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("", timesteps=fast27_timesteps)

When doing image variation or inpainting, you can also decrease the number of timesteps with the strength argument. The strength argument is the amount of noise to add to the input image which also determines how many steps to run in the denoising process. A smaller number will vary the image less but run faster.

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")

image = pipe(image=image, prompt="", strength=0.3).images

You can also use torch.compile. Note that we have not exhaustively tested torch.compilewith IF and it might not give expected results.

import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder) pipe.unet = torch.compile(pipe.unet)

Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers cpu offloading APIs.

Either the model based CPU offloading,

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()

or the more aggressive layer based CPU offloading.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_sequential_cpu_offload()

Additionally, T5 can be loaded in 8bit precision

from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" )

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder,
unet=None, device_map="auto", )

prompt_embeds, negative_embeds = pipe.encode_prompt("")

For CPU RAM constrained machines like google colab free tier where we can’t load all model components to the CPU at once, we can manually only load the pipeline with the text encoder or unet when the respective model components are needed.

from diffusers import IFPipeline, IFSuperResolutionPipeline import torch import gc from transformers import T5EncoderModel from diffusers.utils import pt_to_pil

text_encoder = T5EncoderModel.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" )

pipe = DiffusionPipeline.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder,
unet=None, device_map="auto", )

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

del text_encoder del pipe gc.collect() torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto" )

generator = torch.Generator().manual_seed(0) image = pipe( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images

pt_to_pil(image)[0].save("./if_stage_I.png")

del pipe gc.collect() torch.cuda.empty_cache()

pipe = IFSuperResolutionPipeline.from_pretrained( "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto" )

generator = torch.Generator().manual_seed(0) image = pipe( image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images

pt_to_pil(image)[0].save("./if_stage_II.png")

Available Pipelines:

Pipeline Tasks Colab
pipeline_if.py Text-to-Image Generation -
pipeline_if_superresolution.py Text-to-Image Generation -
pipeline_if_img2img.py Image-to-Image Generation -
pipeline_if_img2img_superresolution.py Image-to-Image Generation -
pipeline_if_inpainting.py Image-to-Image Generation -
pipeline_if_inpainting_superresolution.py Image-to-Image Generation -

IFPipeline

class diffusers.IFPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = None width: typing.Optional[int] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch

pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt" ... ).images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

safety_modules = { ... "feature_extractor": pipe.feature_extractor, ... "safety_checker": pipe.safety_checker, ... "watermarker": pipe.watermarker, ... } super_res_2_pipe = DiffusionPipeline.from_pretrained( ... "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 ... ) super_res_2_pipe.enable_model_cpu_offload()

image = super_res_2_pipe( ... prompt=prompt, ... image=image, ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

IFSuperResolutionPipeline

class diffusers.IFSuperResolutionPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: int = None width: int = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch

pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

IFImg2ImgPipeline

class diffusers.IFImg2ImgPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.7 num_inference_steps: int = 80 timesteps: typing.List[int] = None guidance_scale: float = 10.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))

pipe = IFImg2ImgPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", ... variant="fp16", ... torch_dtype=torch.float16, ... ) pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe( ... image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", ... text_encoder=None, ... variant="fp16", ... torch_dtype=torch.float16, ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

IFImg2ImgSuperResolutionPipeline

class diffusers.IFImg2ImgSuperResolutionPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))

pipe = IFImg2ImgPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", ... variant="fp16", ... torch_dtype=torch.float16, ... ) pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe( ... image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", ... text_encoder=None, ... variant="fp16", ... torch_dtype=torch.float16, ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

IFInpaintingPipeline

class diffusers.IFInpaintingPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 1.0 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image

pipe = IFInpaintingPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16 ... ) pipe.enable_model_cpu_offload()

prompt = "blue sunglasses" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe( ... image=original_image, ... mask_image=mask_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, ... mask_image=mask_image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

IFInpaintingSuperResolutionPipeline

class diffusers.IFInpaintingSuperResolutionPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

< source >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None noise_level: int = 0 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline from diffusers.utils import pt_to_pil import torch from PIL import Image import requests from io import BytesIO

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image

pipe = IFInpaintingPipeline.from_pretrained( ... "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16 ... ) pipe.enable_model_cpu_offload()

prompt = "blue sunglasses"

prompt_embeds, negative_embeds = pipe.encode_prompt(prompt) image = pipe( ... image=original_image, ... mask_image=mask_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... output_type="pt", ... ).images

pil_image = pt_to_pil(image) pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained( ... "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 ... ) super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe( ... image=image, ... mask_image=mask_image, ... original_image=original_image, ... prompt_embeds=prompt_embeds, ... negative_prompt_embeds=negative_embeds, ... ).images image[0].save("./if_stage_II.png")

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forwardmethod is called, and the model remains in GPU until the next model runs. Memory savings are lower than withenable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

< source >

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward` method called.

encode_prompt

< source >

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

device: (torch.device, optional): torch device to place the resulting embeddings on num_images_per_prompt (int, optional, defaults to 1): number of images that should be generated per prompt do_classifier_free_guidance (bool, optional, defaults to True): whether to use classifier free guidance or not negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds. instead. If not defined, one has to pass negative_prompt_embeds. instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument. negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.