DeepFloyd IF

Overview

DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:

Stage 1: a base model that generates a 64x64 px image from a text prompt,
Stage 2: a 64x64 px => 256x256 px super-resolution model, and
Stage 3: a 256x256 px => 1024x1024 px super-resolution model.

Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. Stage 3 is Stability AI’s x4 Upscaling model. The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
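As a quick illustration of the cascade described above, the following sketch walks the image resolution through the three stages. The names BASE_RESOLUTION, UPSCALE_FACTORS, and cascade_resolutions are made up for this example and are not part of diffusers; only the 64, 256, and 1024 px sizes come from the description above.

```python
# Illustrative only: these names are not part of diffusers.
BASE_RESOLUTION = 64          # Stage 1 output side length in pixels
UPSCALE_FACTORS = (4, 4)      # Stage 2: 64 -> 256, Stage 3: 256 -> 1024


def cascade_resolutions(base=BASE_RESOLUTION, factors=UPSCALE_FACTORS):
    """Return the square side length produced by each stage, in order."""
    sizes = [base]
    for factor in factors:
        sizes.append(sizes[-1] * factor)
    return sizes


for stage, side in enumerate(cascade_resolutions(), start=1):
    print(f"Stage {stage}: {side}x{side} px")
```

Each super-resolution stage multiplies the side length by four, so the pixel count grows sixteenfold per stage.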

Usage

Before you can use IF, you need to accept its usage conditions. To do so:

Make sure to have a Hugging Face account and be logged in.
Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the Stage 1 model card automatically accepts it for the other IF models.
Make sure to log in locally. Install huggingface_hub:

pip install huggingface_hub --upgrade

Run the login function in a Python shell:

from huggingface_hub import login

login()

and enter your Hugging Face Hub access token.

Next we install diffusers and dependencies:

pip install -q diffusers accelerate transformers

The following sections give more detailed examples of how to use IF. Specifically:

Text-to-Image Generation
Image-to-Image Generation
Inpainting
Reusing model weights
Speed optimization
Memory optimization

Available checkpoints

Stage-1
- DeepFloyd/IF-I-XL-v1.0
- DeepFloyd/IF-I-L-v1.0
- DeepFloyd/IF-I-M-v1.0
Stage-2
- DeepFloyd/IF-II-L-v1.0
- DeepFloyd/IF-II-M-v1.0
Stage-3
- stabilityai/stable-diffusion-x4-upscaler

Text-to-Image Generation

The example below uses model CPU offloading to run the whole IF pipeline with as little as 14 GB of VRAM.

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil, make_image_grid
import torch

# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2 (reuses the prompt embeddings from stage 1, so no text encoder is loaded)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

# text embeddings
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images

# show the output of all three stages side by side
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3)

Text Guided Image-to-Image Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case just make sure to load the weights using the IFImg2ImgPipeline and IFImg2ImgSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the components property, as explained in the section "Converting between different pipelines".

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3: the x4 upscaler, loaded as in the text-to-image example
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)

# text embeddings
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images

# show the input image and the output of each stage side by side
make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4)

Text Guided Inpainting Generation

The same IF model weights can be used for text-guided inpainting. In this case just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.

Note: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines without loading them twice by making use of the components property, as explained in the section "Converting between different pipelines".

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
original_image = load_image(url)

# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
mask_image = load_image(url)

# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3: the x4 upscaler, loaded as in the text-to-image example
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses"
generator = torch.manual_seed(1)

# text embeddings
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images

# show the input image, the mask, and the output of each stage side by side
make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5)

Converting between different pipelines

In addition to being loaded with from_pretrained, pipelines can also be created directly from each other's components, which avoids loading the weights a second time.

from diffusers import IFPipeline, IFSuperResolutionPipeline

# load the text-to-image pipelines once
pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

# reuse the loaded components for image-to-image
pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

# reuse the loaded components for inpainting
pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
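To make the idea behind the components pattern concrete, here is a minimal sketch that runs without downloading any weights. ToyTextToImage and ToyImg2Img are toy stand-ins invented for this example, not diffusers classes; they only mimic how a dictionary of already-loaded parts can be unpacked into another constructor.

```python
# Toy stand-ins, NOT the diffusers classes: they only mimic the
# "components" pattern used above.
class ToyTextToImage:
    def __init__(self, text_encoder, unet):
        self.text_encoder = text_encoder
        self.unet = unet

    @property
    def components(self):
        # Map constructor argument names to the objects already held in memory.
        return {"text_encoder": self.text_encoder, "unet": self.unet}


class ToyImg2Img(ToyTextToImage):
    """Same constructor arguments, different task."""


text_to_image = ToyTextToImage(text_encoder=object(), unet=object())
img2img = ToyImg2Img(**text_to_image.components)

# Both pipelines reference the very same objects; nothing was copied or reloaded.
print(img2img.unet is text_to_image.unet)
```

Because the new pipeline holds references to the same objects, any change to a shared component, such as moving it to another device, is visible from both pipelines.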

Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

You can also run the diffusion process for fewer timesteps.

This can either be done with the num_inference_steps argument:

pipe("", num_inference_steps=30)

Or with the timesteps argument:

from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("", timesteps=fast27_timesteps)
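If none of the predefined schedules fits, a custom schedule can be built by hand. The sketch below produces an evenly spaced, descending list of integer timesteps. linear_timesteps is a hypothetical helper, not a diffusers function, and the default of 1000 training timesteps is an assumption that should be checked against the scheduler configuration of the loaded pipeline.

```python
def linear_timesteps(num_steps, num_train_timesteps=1000):
    """Evenly spaced integer timesteps from num_train_timesteps - 1 down to 0.

    Hypothetical helper; num_train_timesteps=1000 is an assumed default.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    last = num_train_timesteps - 1
    # Round each evenly spaced point to the nearest integer timestep.
    steps = [round(last * i / (num_steps - 1)) for i in range(num_steps)]
    # Reverse so that denoising starts at the noisiest timestep.
    return steps[::-1]


timesteps = linear_timesteps(27)
print(len(timesteps), timesteps[0], timesteps[-1])
```

The resulting list could then be passed as pipe("", timesteps=timesteps), assuming the scheduler in use accepts a custom descending schedule.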

When doing image variation or inpainting, you can also decrease the number of timesteps with the strength argument. The strength argument sets the amount of noise added to the input image, which also determines how many steps run in the denoising process. A smaller value varies the image less but runs faster.

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(image=image, prompt="", strength=0.3).images
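The relationship between strength and the number of denoising steps can be sketched with a few lines of arithmetic. The helper steps_for_strength is hypothetical and follows a convention commonly used by image-to-image pipelines in diffusers, where the first part of the schedule is skipped; the exact behavior of a given pipeline version should be confirmed in its source.

```python
def steps_for_strength(num_inference_steps, strength):
    """Return how many denoising steps actually run for a given strength.

    Hypothetical helper illustrating an assumed convention, not a diffusers API.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Number of schedule steps' worth of noise added to the input image.
    init_steps = min(int(num_inference_steps * strength), num_inference_steps)
    # Steps skipped at the start of the schedule.
    t_start = max(num_inference_steps - init_steps, 0)
    return num_inference_steps - t_start


for strength in (0.25, 0.5, 1.0):
    print(strength, steps_for_strength(80, strength))
```

Under this assumption, halving the strength roughly halves the number of UNet evaluations, which is where the speedup comes from.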

You can also use torch.compile. Note that we have not exhaustively tested torch.compile with IF and it might not give expected results.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

# compile the two largest components
pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.

Either the model based CPU offloading,

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

or the more aggressive layer based CPU offloading.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()

Additionally, T5 can be loaded in 8-bit precision:

from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8-bit text encoder
    unet=None,  # skip loading the UNet; only the text encoder is needed here
    device_map="auto",
)

prompt_embeds, negative_embeds = pipe.encode_prompt("")
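To see why a lower precision helps, the memory taken by the weights alone can be estimated from the parameter count and the bits per parameter. The sketch below is only back-of-the-envelope arithmetic: weight_memory_gib is a hypothetical helper, the parameter count is a made-up round number rather than the real size of T5, and activations and quantization overhead are ignored.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


# Hypothetical parameter count chosen for round numbers, not T5's real size.
example_parameters = 2**30

for label, bits in (("fp32", 32), ("fp16", 16), ("8bit", 8)):
    print(f"{label}: {weight_memory_gib(example_parameters, bits):.1f} GiB")
```

Going from 16-bit to 8-bit weights halves this estimate, which is the saving the 8-bit text encoder aims for.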

On machines with limited CPU RAM, such as the Google Colab free tier, not all model components can be loaded to the CPU at once. In that case, we can manually load the pipeline with only the text encoder or only the UNet at the point where the respective component is needed.

from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil, make_image_grid

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

# text to image
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8-bit text encoder
    unet=None,  # skip loading the UNet; only the text encoder is needed here
    device_map="auto",
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

# remove the pipeline so we can re-load it with the UNet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

# stage 1, without the text encoder
pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_1_output = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

# remove the pipeline so we can load the super-resolution stage
del pipe
gc.collect()
torch.cuda.empty_cache()

# stage 2, the first super-resolution stage
pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_2_output = pipe(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

# show the output of both stages side by side
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)
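The cleanup sequence of gc.collect() followed by torch.cuda.empty_cache() appears after each stage above, so it could be wrapped in a small helper. free_memory is a hypothetical name introduced for this sketch; it falls back to plain garbage collection when PyTorch or CUDA is not available.

```python
import gc


def free_memory():
    """Run garbage collection and, if PyTorch with CUDA is present, empty its cache.

    Hypothetical convenience helper, not part of diffusers.
    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected  # PyTorch not installed; nothing else to release
    if torch.cuda.is_available():
        # Hand cached, unused GPU blocks back to the driver.
        torch.cuda.empty_cache()
    return collected
```

The del statements still have to stay in the calling code, because a helper cannot remove names from the caller's scope; a typical use would be del pipe followed by free_memory().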

Available Pipelines:

Pipeline	Tasks	Colab
pipeline_if.py	Text-to-Image Generation	-
pipeline_if_superresolution.py	Text-to-Image Generation	-
pipeline_if_img2img.py	Image-to-Image Generation	-
pipeline_if_img2img_superresolution.py	Image-to-Image Generation	-
pipeline_if_inpainting.py	Inpainting	-
pipeline_if_inpainting_superresolution.py	Inpainting	-

IFPipeline

class diffusers.IFPipeline

__call__

encode_prompt

IFSuperResolutionPipeline

class diffusers.IFSuperResolutionPipeline

__call__

encode_prompt

IFImg2ImgPipeline

class diffusers.IFImg2ImgPipeline

__call__

encode_prompt

IFImg2ImgSuperResolutionPipeline

class diffusers.IFImg2ImgSuperResolutionPipeline

__call__

encode_prompt

IFInpaintingPipeline

class diffusers.IFInpaintingPipeline

__call__

encode_prompt

IFInpaintingSuperResolutionPipeline

class diffusers.IFInpaintingSuperResolutionPipeline

__call__

encode_prompt
