T2I-Adapter (original)

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.

Using the pretrained adapter models, we can provide a control image (for example, a depth map) to guide Stable Diffusion text-to-image generation so that it follows the structure of the control image and fills in the details.
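For instance, a depth-conditioned run looks like the following. This is a minimal sketch: the depth adapter and base model checkpoint names are assumptions (any Stable Diffusion 1.x checkpoint with a matching adapter works), and the depth map path is a placeholder.

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# Assumed checkpoint names for illustration; substitute the adapter that matches
# the kind of control image you have (depth, sketch, color, ...).
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder path: a depth map of the scene you want to reproduce.
depth_map = load_image("./depth_map.png")

# The prompt fills in appearance; the depth map constrains the layout.
image = pipe("a cozy living room, photorealistic", image=depth_map).images[0]
```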

The abstract of the paper is the following:

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to “dig out” the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.

This model was contributed by the community contributor HimariO ❤️.

StableDiffusionAdapterPipeline

class diffusers.StableDiffusionAdapterPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

Parameters

Pipeline for text-to-image generation using Stable Diffusion augmented with T2I-Adapter (https://arxiv.org/abs/2302.08453).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
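Because the adapter argument accepts a MultiAdapter (or a list of T2IAdapters), several control signals can be composed in a single pipeline. A minimal sketch, assuming the TencentARC sketch and depth adapter checkpoints and two matching control images on disk (both names and paths are illustrative):

```python
import torch
from diffusers import MultiAdapter, StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# Assumed adapter checkpoints; any set of compatible SD 1.x adapters can be combined.
adapters = MultiAdapter(
    [
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_sketch_sd15v2", torch_dtype=torch.float16),
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd15v2", torch_dtype=torch.float16),
    ]
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    adapter=adapters,
    torch_dtype=torch.float16,
).to("cuda")

# One control image per adapter, with a per-adapter conditioning weight.
image = pipe(
    "a futuristic city at dusk",
    image=[load_image("./sketch.png"), load_image("./depth.png")],
    adapter_conditioning_scale=[0.8, 0.8],
).images[0]
```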

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[PIL.Image.Image]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.

Function invoked when calling the pipeline for generation.

Examples:

```python
from PIL import Image
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

image = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png"
)

color_palette = image.resize((8, 8))
color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    adapter=adapter,
    torch_dtype=torch.float16,
)

pipe.to("cuda")

out_image = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
).images[0]
```
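The call above returns a StableDiffusionAdapterPipelineOutput. With return_dict=False the same call yields a plain tuple instead; a short sketch reusing pipe and color_palette from above (the nsfw_content_detected attribute name follows the standard Stable Diffusion output class):

```python
# Output object: generated images plus per-image NSFW flags.
result = pipe("At night, glowing cubes in front of the beach", image=color_palette)
image, flagged = result.images[0], result.nsfw_content_detected[0]

# Plain tuple: (images, nsfw_content_detected), in that order.
images, nsfw_flags = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    return_dict=False,
)
```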

enable_attention_slicing

< source >

( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor in slices to compute attention in several steps. For more than one attention head, the computation is performed sequentially over each head. This is useful to save some memory in exchange for a small speed decrease.

⚠️ Don’t enable attention slicing if you’re already using scaled_dot_product_attention (SDPA) from PyTorch 2.0 or xFormers. These attention computations are already very memory efficient so you won’t need to enable this function. If you enable attention slicing with SDPA or xFormers, it can lead to serious slow downs!

Examples:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```

disable_attention_slicing

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.
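Both VAE-slicing toggles are plain method calls on the pipeline. A short sketch, reusing pipe and color_palette from the T2I-Adapter example at the top of this section:

```python
# Decode the latents in slices to lower peak memory when generating several images.
pipe.enable_vae_slicing()
images = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    num_images_per_prompt=4,
).images

# Revert to single-step decoding.
pipe.disable_vae_slicing()
```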

enable_xformers_memory_efficient_attention

< source >

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedent.

Examples:

```python
import torch
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)

# Workaround: the VAE attention shapes are not supported by the flash-attention op,
# so fall back to the default op for the VAE.
pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
```

disable_xformers_memory_efficient_attention

< source >

( )

Disable memory efficient attention from xFormers.

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.
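Precomputing the embeddings lets you reuse or edit them before generation. A minimal sketch, reusing pipe and color_palette from the example above and assuming the usual (prompt_embeds, negative_prompt_embeds) return pair:

```python
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
    prompt="At night, glowing cubes in front of the beach",
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
    negative_prompt="low quality, blurry",
)

# Pass the precomputed embeddings instead of the raw prompt strings.
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    image=color_palette,
).images[0]
```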

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Returns

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
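The method maps a batch of guidance-scale values w to fixed-size embedding vectors; it is mainly relevant for guidance-distilled UNets that take a timestep_cond input. A usage sketch, assuming pipe is an instantiated pipeline:

```python
import torch

# One guidance scale per image in the batch.
w = torch.tensor([7.5, 7.5])
w_embedding = pipe.get_guidance_scale_embedding(w, embedding_dim=256, dtype=torch.float16)
print(w_embedding.shape)  # torch.Size([2, 256])
```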

StableDiffusionXLAdapterPipeline

class diffusers.StableDiffusionXLAdapterPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL augmented with T2I-Adapter (https://arxiv.org/abs/2302.08453).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the standard loading methods for textual inversion embeddings, LoRA weights, single-file checkpoints, and IP-Adapters; see the corresponding mixin documentation for details.

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 adapter_conditioning_factor: float = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

```python
import torch
from diffusers import T2IAdapter, StableDiffusionXLAdapterPipeline, DDPMScheduler
from diffusers.utils import load_image

sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

adapter = T2IAdapter.from_pretrained(
    "Adapter/t2iadapter",
    subfolder="sketch_sdxl_1.0",
    torch_dtype=torch.float16,
    adapter_type="full_adapter_xl",
)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    model_id, adapter=adapter, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
).to("cuda")

generator = torch.manual_seed(42)
sketch_image_out = pipe(
    prompt="a photo of a dog in real world, high quality",
    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
    image=sketch_image,
    generator=generator,
    guidance_scale=7.5,
).images[0]
```
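Two SDXL-specific knobs are adapter_conditioning_scale, which weights how strongly the adapter features are added to the UNet, and adapter_conditioning_factor, the fraction of timesteps during which the adapter is applied. A hedged variation of the call above that loosens the sketch constraint:

```python
# Weaker adapter features, applied only during the first half of the denoising
# steps so that later steps can refine freely.
sketch_image_out = pipe(
    prompt="a photo of a dog in real world, high quality",
    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
    image=sketch_image,
    generator=generator,
    guidance_scale=7.5,
    adapter_conditioning_scale=0.8,
    adapter_conditioning_factor=0.5,
).images[0]
```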

enable_attention_slicing

< source >

( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor in slices to compute attention in several steps. For more than one attention head, the computation is performed sequentially over each head. This is useful to save some memory in exchange for a small speed decrease.

⚠️ Don’t enable attention slicing if you’re already using scaled_dot_product_attention (SDPA) from PyTorch 2.0 or xFormers. These attention computations are already very memory efficient so you won’t need to enable this function. If you enable attention slicing with SDPA or xFormers, it can lead to serious slow downs!

Examples:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```

disable_attention_slicing

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

enable_xformers_memory_efficient_attention

< source >

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedent.

Examples:

```python
import torch
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)

# Workaround: the VAE attention shapes are not supported by the flash-attention op,
# so fall back to the default op for the VAE.
pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
```

disable_xformers_memory_efficient_attention

< source >

( )

Disable memory efficient attention from xFormers.

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.
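Unlike the Stable Diffusion 1.x variant, the SDXL version also returns pooled embeddings. A minimal sketch, reusing pipe and sketch_image from the example above and assuming the standard four-tensor return order of the SDXL pipelines:

```python
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="a photo of a dog in real world, high quality",
    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
    do_classifier_free_guidance=True,
)

# All four tensors are passed back in when skipping on-the-fly text encoding.
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    image=sketch_image,
).images[0]
```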

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Returns

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
