InstructPix2Pix (original) (raw)

LoRA

InstructPix2Pix: Learning to Follow Image Editing Instructions is by Tim Brooks, Aleksander Holynski and Alexei A. Efros.

The abstract from the paper is:

We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models — a language model (GPT-3) and a text-to-image model (Stable Diffusion) — to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.

You can find additional information about InstructPix2Pix on the project page, original codebase, and try it out in a demo.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

StableDiffusionInstructPix2PixPipeline

class diffusers.StableDiffusionInstructPix2PixPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None requires_safety_checker: bool = True )

Parameters

Pipeline for pixel-level image editing by following text instructions (based on Stable Diffusion).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None num_inference_steps: int = 100 guidance_scale: float = 7.5 image_guidance_scale: float = 1.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None **kwargs ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import PIL import requests import torch from io import BytesIO

from diffusers import StableDiffusionInstructPix2PixPipeline

def download_image(url): ... response = requests.get(url) ... return PIL.Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"

image = download_image(img_url).resize((512, 512))

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained( ... "timbrooks/instruct-pix2pix", torch_dtype=torch.float16 ... ) pipe = pipe.to("cuda")

prompt = "make the mountains snowy" image = pipe(prompt=prompt, image=image).images[0]

load_textual_inversion

< source >

( pretrained_model_name_or_path: typing.Union[str, typing.List[str], typing.Dict[str, torch.Tensor], typing.List[typing.Dict[str, torch.Tensor]]] token: typing.Union[typing.List[str], str, NoneType] = None tokenizer: typing.Optional[ForwardRef('PreTrainedTokenizer')] = None text_encoder: typing.Optional[ForwardRef('PreTrainedModel')] = None **kwargs )

Parameters

Load Textual Inversion embeddings into the text encoder of StableDiffusionPipeline (both 🤗 Diffusers and Automatic1111 formats are supported).

Example:

To load a Textual Inversion embedding vector in 🤗 Diffusers format:

from diffusers import StableDiffusionPipeline import torch

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

pipe.load_textual_inversion("sd-concepts-library/cat-toy")

prompt = "A backpack"

image = pipe(prompt, num_inference_steps=50).images[0] image.save("cat-backpack.png")

To load a Textual Inversion embedding vector in Automatic1111 format, make sure to download the vector first (for example from civitAI) and then load the vector

locally:

from diffusers import StableDiffusionPipeline import torch

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

pipe.load_textual_inversion("./charturnerv2.pt", token="charturnerv2")

prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround of a woman wearing a black jacket and red shirt, best quality, intricate details."

image = pipe(prompt, num_inference_steps=50).images[0] image.save("character.png")

load_lora_weights

< source >

( pretrained_model_name_or_path_or_dict: typing.Union[str, typing.Dict[str, torch.Tensor]] adapter_name = None hotswap: bool = False **kwargs )

Parameters

Load LoRA weights specified in pretrained_model_name_or_path_or_dict into self.unet andself.text_encoder.

All kwargs are forwarded to self.lora_state_dict.

See lora_state_dict() for more details on how the state dict is loaded.

See load_lora_into_unet() for more details on how the state dict is loaded into self.unet.

See load_lora_into_text_encoder() for more details on how the state dict is loaded into self.text_encoder.

save_lora_weights

< source >

( save_directory: typing.Union[str, os.PathLike] unet_lora_layers: typing.Dict[str, typing.Union[torch.nn.modules.module.Module, torch.Tensor]] = None text_encoder_lora_layers: typing.Dict[str, torch.nn.modules.module.Module] = None is_main_process: bool = True weight_name: str = None save_function: typing.Callable = None safe_serialization: bool = True )

Parameters

Save the LoRA parameters corresponding to the UNet and text encoder.

StableDiffusionXLInstructPix2PixPipeline

class diffusers.StableDiffusionXLInstructPix2PixPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None is_cosxl_edit: typing.Optional[bool] = False )

Parameters

Pipeline for pixel-level image editing by following text instructions. Based on Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 100 denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 image_guidance_scale: float = 1.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None ) → ~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput if return_dict is True, otherwise atuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import StableDiffusionXLInstructPix2PixPipeline from diffusers.utils import load_image

resolution = 768 image = load_image( ... "https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" ... ).resize((resolution, resolution)) edit_instruction = "Turn sky into a cloudy one"

pipe = StableDiffusionXLInstructPix2PixPipeline.from_pretrained( ... "diffusers/sdxl-instructpix2pix-768", torch_dtype=torch.float16 ... ).to("cuda")

edited_image = pipe( ... prompt=edit_instruction, ... image=image, ... height=resolution, ... width=resolution, ... guidance_scale=3.0, ... image_guidance_scale=1.5, ... num_inference_steps=30, ... ).images[0] edited_image

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None )

Parameters

Encodes the prompt into text encoder hidden states.

< > Update on GitHub