DeepFloyd IF

Overview

DeepFloyd IF is a state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. It is a modular system composed of a frozen text encoder and three cascaded pixel diffusion modules: a stage 1 base model that generates a 64x64 px image from the text prompt, a stage 2 super-resolution model that upscales it to 256x256 px, and a stage 3 super-resolution model that upscales it to 1024x1024 px.

Usage

Before you can use IF, you need to accept its usage conditions. To do so:

  1. Make sure to have a Hugging Face account and be logged in.
  2. Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage 1 model card auto-accepts it for the other IF models.
  3. Log in locally. Install huggingface_hub:

pip install huggingface_hub --upgrade

run the login function in a Python shell:

from huggingface_hub import login

login()

and enter your Hugging Face Hub access token.
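Alternatively, you can log in from a terminal with the huggingface-cli tool that is installed with huggingface_hub:

huggingface-cli login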

Next we install diffusers and dependencies:

pip install -q diffusers accelerate transformers

The following sections give more detailed examples of how to use IF.

Available checkpoints

The examples in this guide use the following checkpoints:

  - Stage 1: DeepFloyd/IF-I-XL-v1.0
  - Stage 2: DeepFloyd/IF-II-L-v1.0
  - Stage 3: stabilityai/stable-diffusion-x4-upscaler

Text-to-Image Generation

By default, diffusers makes use of model CPU offloading to run the whole IF pipeline with as little as 14 GB of VRAM.

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil, make_image_grid
import torch

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

stage_1_output = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images

stage_2_output = stage_2(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images

make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3)

Text Guided Image-to-Image Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFImg2ImgPipeline and IFImg2ImgSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the components attribute, as shown in the sketch below.
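For example, here is a minimal sketch of such a conversion, assuming the stage_1 and stage_2 pipelines from the text-to-image example above are already loaded:

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

# Reuse the already-loaded components; nothing is downloaded or loaded twice.
stage_1 = IFImg2ImgPipeline(**stage_1.components)
stage_2 = IFImg2ImgSuperResolutionPipeline(**stage_2.components)

The same pattern applies to the inpainting pipelines; see Converting between different pipelines below.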

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" original_image = load_image(url) original_image = original_image.resize((768, 512))

stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

stage_1_output = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images

make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4)

Text Guided Inpainting Generation

The same IF model weights can be used for text-guided inpainting. In this case just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines without loading them twice by making use of the components attribute, as shown in the conversion sketch above.

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" original_image = load_image(url)

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" mask_image = load_image(url)

stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses" generator = torch.manual_seed(1)

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

stage_1_output = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images

make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5)

Converting between different pipelines

In addition to being loaded with from_pretrained, pipelines can also be constructed directly from each other. Because the converted pipeline reuses the same component objects, no additional memory is required.

from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)

Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

You can also run the diffusion process for fewer timesteps.

This can either be done with the num_inference_steps argument:

pipe("", num_inference_steps=30)

Or with the timesteps argument:

from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("", timesteps=fast27_timesteps)

When doing image variation or inpainting, you can also decrease the number of timesteps with the strength argument. The strength argument is the amount of noise added to the input image, and it also determines how many steps to run in the denoising process. A smaller value varies the image less but runs faster. For example, strength=0.3 with the default 80 inference steps of IFImg2ImgPipeline runs roughly 24 denoising steps.

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(image=image, prompt="", strength=0.3).images

You can also use torch.compile. Note that we have not exhaustively tested torch.compile with IF and it might not give expected results.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.

Either the model-based CPU offloading,

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

or the more aggressive, layer-based CPU offloading, which trades speed for the lowest memory footprint.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()

Additionally, the T5 text encoder can be loaded in 8-bit precision:

from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None, device_map="auto"
)

prompt_embeds, negative_embeds = pipe.encode_prompt("")

For CPU-RAM-constrained machines like the Google Colab free tier, where we can't load all model components to the CPU at once, we can manually load the pipeline with only the text encoder or only the UNet when the respective component is needed.

from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
from diffusers.utils import pt_to_pil, make_image_grid
from transformers import T5EncoderModel
import torch
import gc

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None, device_map="auto"
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_1_output = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_2_output = pipe(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)

Available Pipelines:

| Pipeline | Tasks | Colab |
|---|---|---|
| pipeline_if.py | Text-to-Image Generation | - |
| pipeline_if_superresolution.py | Text-to-Image Generation | - |
| pipeline_if_img2img.py | Image-to-Image Generation | - |
| pipeline_if_img2img_superresolution.py | Image-to-Image Generation | - |
| pipeline_if_inpainting.py | Image-to-Image Generation | - |
| pipeline_if_inpainting_superresolution.py | Image-to-Image Generation | - |

IFPipeline

class diffusers.IFPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = None width: typing.Optional[int] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_II.png")

safety_modules = {
    "feature_extractor": pipe.feature_extractor,
    "safety_checker": pipe.safety_checker,
    "watermarker": pipe.watermarker,
}
super_res_2_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
super_res_2_pipe.enable_model_cpu_offload()

image = super_res_2_pipe(
    prompt=prompt,
    image=image,
).images
image[0].save("./if_stage_III.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.
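For example, given the signature above, a prompt and a negative prompt can be encoded together; the prompt strings here are purely illustrative:

prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
    "a photo of an astronaut",
    negative_prompt="blurry, low quality",
)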

IFSuperResolutionPipeline

class diffusers.IFSuperResolutionPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: int = None width: int = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFPipeline, IFSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds
).images
image[0].save("./if_stage_II.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

IFImg2ImgPipeline

class diffusers.IFImg2ImgPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.7 num_inference_steps: int = 80 timesteps: typing.List[int] = None guidance_scale: float = 10.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))

pipe = IFImg2ImgPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
).images
image[0].save("./if_stage_II.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

IFImg2ImgSuperResolutionPipeline

class diffusers.IFImg2ImgSuperResolutionPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 250 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image.resize((768, 512))

pipe = IFImg2ImgPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
).images
image[0].save("./if_stage_II.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

IFInpaintingPipeline

class diffusers.IFInpaintingPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 1.0 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image

pipe = IFInpaintingPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "blue sunglasses" prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image,
    mask_image=mask_image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
).images
image[0].save("./if_stage_II.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.

IFInpaintingSuperResolutionPipeline

class diffusers.IFInpaintingSuperResolutionPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: UNet2DConditionModel scheduler: DDPMScheduler image_noising_scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.deepfloyd_if.safety_checker.IFSafetyChecker] feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] watermarker: typing.Optional[diffusers.pipelines.deepfloyd_if.watermark.IFWatermarker] requires_safety_checker: bool = True )

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor] original_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None mask_image: typing.Union[PIL.Image.Image, torch.Tensor, numpy.ndarray, typing.List[PIL.Image.Image], typing.List[torch.Tensor], typing.List[numpy.ndarray]] = None strength: float = 0.8 prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None noise_level: int = 0 clean_caption: bool = True ) → ~pipelines.stable_diffusion.IFPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.IFPipelineOutput or tuple

~pipelines.stable_diffusion.IFPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) or watermarked content, according to the `safety_checker`.

Function invoked when calling the pipeline for generation.

Examples:

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" response = requests.get(url) original_image = Image.open(BytesIO(response.content)).convert("RGB") original_image = original_image

url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" response = requests.get(url) mask_image = Image.open(BytesIO(response.content)) mask_image = mask_image

pipe = IFInpaintingPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "blue sunglasses"

prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

image = pipe(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images

pil_image = pt_to_pil(image)
pil_image[0].save("./if_stage_I.png")

super_res_1_pipe = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
super_res_1_pipe.enable_model_cpu_offload()

image = super_res_1_pipe(
    image=image,
    mask_image=mask_image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
).images
image[0].save("./if_stage_II.png")

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None clean_caption: bool = False )

Parameters

Encodes the prompt into text encoder hidden states.
