Perturbed-Attention Guidance (original) (raw)

LoRA

Perturbed-Attention Guidance (PAG) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.

PAG was introduced in Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim.

The abstract from the paper is:

Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms’ ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.

PAG can be used by specifying the pag_applied_layers as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers.

Since RegEx is supported as a way for matching layer identifiers, it is crucial to use it correctly otherwise there might be unexpected behaviour. The recommended way to use PAG is by specifying layers as blocks.{layer_index} and blocks.({layer_index_1|layer_index_2|...}). Using it in any other way, while doable, may bypass our basic validation checks and give you unexpected results.

AnimateDiffPAGPipeline

class diffusers.AnimateDiffPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] motion_adapter: MotionAdapter scheduler: KarrasDiffusionSchedulers feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid_block.*attn1' )

Parameters

Pipeline for text-to-video generation usingAnimateDiff and Perturbed Attention Guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str], NoneType] = None num_frames: typing.Optional[int] = 16 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] decode_chunk_size: int = 16 pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → AnimateDiffPipelineOutput or tuple

Parameters

If return_dict is True, AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.

Examples:

import torch from diffusers import AnimateDiffPAGPipeline, MotionAdapter, DDIMScheduler from diffusers.utils import export_to_gif

model_id = "SG161222/Realistic_Vision_V5.1_noVAE" motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-2" motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id) scheduler = DDIMScheduler.from_pretrained( ... model_id, subfolder="scheduler", beta_schedule="linear", steps_offset=1, clip_sample=False ... ) pipe = AnimateDiffPAGPipeline.from_pretrained( ... model_id, ... motion_adapter=motion_adapter, ... scheduler=scheduler, ... pag_applied_layers=["mid"], ... torch_dtype=torch.float16, ... ).to("cuda")

video = pipe( ... prompt="car, futuristic cityscape with neon lights, street, no human", ... negative_prompt="low quality, bad quality", ... num_inference_steps=25, ... guidance_scale=6.0, ... pag_scale=3.0, ... generator=torch.Generator().manual_seed(42), ... ).frames[0]

export_to_gif(video, "animatediff_pag.gif")

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

HunyuanDiTPAGPipeline

class diffusers.HunyuanDiTPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: BertModel tokenizer: BertTokenizer transformer: HunyuanDiT2DModel scheduler: DDPMScheduler safety_checker: typing.Optional[diffusers.pipelines.stable_diffusion.safety_checker.StableDiffusionSafetyChecker] = None feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] = None requires_safety_checker: bool = True text_encoder_2: typing.Optional[transformers.models.t5.modeling_t5.T5EncoderModel] = None tokenizer_2: typing.Optional[transformers.models.mt5.tokenization_mt5.MT5Tokenizer] = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'blocks.1' )

Parameters

Pipeline for English/Chinese-to-image generation using HunyuanDiT and Perturbed Attention Guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

HunyuanDiT uses two text encoders: mT5 and [bilingual CLIP](fine-tuned by ourselves)

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None prompt_attention_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = (1024, 1024) target_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) use_resolution_binning: bool = True pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation with HunyuanDiT.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", ... torch_dtype=torch.float16, ... enable_pag=True, ... pag_applied_layers=[14], ... ).to("cuda")

prompt = "一个宇航员在骑马" image = pipe(prompt, guidance_scale=4, pag_scale=3).images[0]

encode_prompt

< source >

( prompt: str device: device = None dtype: dtype = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: typing.Optional[int] = None text_encoder_index: int = 0 )

Parameters

Encodes the prompt into text encoder hidden states.

KolorsPAGPipeline

class diffusers.KolorsPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: ChatGLMModel tokenizer: ChatGLMTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = False pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Kolors.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 max_sequence_length: int = 256 ) → ~pipelines.kolors.KolorsPipelineOutput or tuple

Parameters

Returns

~pipelines.kolors.KolorsPipelineOutput or tuple

~pipelines.kolors.KolorsPipelineOutput ifreturn_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "Kwai-Kolors/Kolors-diffusers", ... variant="fp16", ... torch_dtype=torch.float16, ... enable_pag=True, ... pag_applied_layers=["down.block_2.attentions_1", "up.block_0.attentions_1"], ... ) pipe = pipe.to("cuda")

prompt = ( ... "A photo of a ladybug, macro, zoom, high quality, film, holding a wooden sign with the text 'KOLORS'" ... ) image = pipe(prompt, guidance_scale=5.5, pag_scale=1.5).images[0]

encode_prompt

< source >

( prompt device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 256 )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionPAGInpaintPipeline

class diffusers.StableDiffusionPAGInpaintPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: Tensor = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 0.9999 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained( ... "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, enable_pag=True ... ) pipe = pipe.to("cuda") img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" init_image = load_image(img_url).convert("RGB") mask_image = load_image(mask_url).convert("RGB") prompt = "A majestic tiger sitting on a bench" image = pipe( ... prompt=prompt, ... image=init_image, ... mask_image=mask_image, ... strength=0.8, ... num_inference_steps=50, ... guidance_scale=guidance_scale, ... generator=generator, ... pag_scale=pag_scale, ... ).images[0]

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionPAGPipeline

class diffusers.StableDiffusionPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, enable_pag=True ... ) pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars" image = pipe(prompt, pag_scale=0.3).images[0]

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionPAGImg2ImgPipeline

class diffusers.StableDiffusionPAGImg2ImgPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-guided image-to-image generation using Stable Diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.8 num_inference_steps: typing.Optional[int] = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: typing.Optional[float] = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: int = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForImage2Image from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained( ... "runwayml/stable-diffusion-v1-5", ... torch_dtype=torch.float16, ... enable_pag=True, ... ) pipe = pipe.to("cuda") url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"

init_image = load_image(url).convert("RGB") prompt = "a photo of an astronaut riding a horse on mars" image = pipe(prompt, image=init_image, pag_scale=0.3).images[0]

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionControlNetPAGPipeline

class diffusers.StableDiffusionControlNetPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionControlNetPAGInpaintPipeline

class diffusers.StableDiffusionControlNetPAGInpaintPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection = None requires_safety_checker: bool = True pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for image inpainting using Stable Diffusion with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

This pipeline can be used with checkpoints that have been specifically fine-tuned for inpainting (runwayml/stable-diffusion-inpainting) as well as default text-to-image Stable Diffusion checkpoints (runwayml/stable-diffusion-v1-5). Default text-to-image Stable Diffusion checkpoints might be preferable for ControlNets that have been fine-tuned on those, such aslllyasviel/control_v11p_sd15_inpaint.

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None control_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 1.0 num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 0.5 control_guidance_start: typing.Union[float, typing.List[float]] = 0.0 control_guidance_end: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import cv2 from diffusers import AutoPipelineForInpainting, ControlNetModel, DDIMScheduler from diffusers.utils import load_image import numpy as np from PIL import Image import torch

init_image = load_image( ... "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/stable_diffusion_inpaint/boy.png" ... ) init_image = init_image.resize((512, 512))

generator = torch.Generator(device="cpu").manual_seed(1)

mask_image = load_image( ... "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/stable_diffusion_inpaint/boy_mask.png" ... ) mask_image = mask_image.resize((512, 512))

def make_canny_condition(image): ... image = np.array(image) ... image = cv2.Canny(image, 100, 200) ... image = image[:, :, None] ... image = np.concatenate([image, image, image], axis=2) ... image = Image.fromarray(image) ... return image

control_image = make_canny_condition(init_image)

controlnet = ControlNetModel.from_pretrained( ... "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16 ... ) pipe = AutoPipelineForInpainting.from_pretrained( ... "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, enable_pag=True ... )

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload()

image = pipe( ... "a handsome man with ray-ban sunglasses", ... num_inference_steps=20, ... generator=generator, ... eta=1.0, ... image=init_image, ... mask_image=mask_image, ... control_image=control_image, ... pag_scale=0.3, ... ).images[0]

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionXLPAGPipeline

class diffusers.StableDiffusionXLPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput if return_dict is True, otherwise atuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "stabilityai/stable-diffusion-xl-base-1.0", ... torch_dtype=torch.float16, ... enable_pag=True, ... ) pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars" image = pipe(prompt, pag_scale=0.3).images[0]

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionXLPAGImg2ImgPipeline

class diffusers.StableDiffusionXLPAGImg2ImgPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None requires_aesthetics_score: bool = False force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.3 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForImage2Image from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained( ... "stabilityai/stable-diffusion-xl-refiner-1.0", ... torch_dtype=torch.float16, ... enable_pag=True, ... ) pipe = pipe.to("cuda") url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"

init_image = load_image(url).convert("RGB") prompt = "a photo of an astronaut riding a horse on mars" image = pipe(prompt, image=init_image, pag_scale=0.3).images[0]

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionXLPAGInpaintPipeline

class diffusers.StableDiffusionXLPAGInpaintPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None requires_aesthetics_score: bool = False force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: Tensor = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 0.9999 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise atuple. tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForInpainting from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained( ... "stabilityai/stable-diffusion-xl-base-1.0", ... torch_dtype=torch.float16, ... variant="fp16", ... enable_pag=True, ... ) pipe.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB") mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench" image = pipe( ... prompt=prompt, ... image=init_image, ... mask_image=mask_image, ... num_inference_steps=50, ... strength=0.80, ... pag_scale=0.3, ... ).images[0]

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionXLControlNetPAGPipeline

class diffusers.StableDiffusionXLControlNetPAGPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 control_guidance_start: typing.Union[float, typing.List[float]] = 0.0 control_guidance_end: typing.Union[float, typing.List[float]] = 1.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned containing the output images.

The call function to the pipeline for generation.

Examples:

from diffusers import AutoPipelineForText2Image, ControlNetModel, AutoencoderKL from diffusers.utils import load_image import numpy as np import torch

import cv2 from PIL import Image

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" negative_prompt = "low quality, bad quality, sketches"

image = load_image( ... "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" ... )

controlnet_conditioning_scale = 0.5
controlnet = ControlNetModel.from_pretrained( ... "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16 ... ) vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16) pipe = AutoPipelineForText2Image.from_pretrained( ... "stabilityai/stable-diffusion-xl-base-1.0", ... controlnet=controlnet, ... vae=vae, ... torch_dtype=torch.float16, ... enable_pag=True, ... ) pipe.enable_model_cpu_offload()

image = np.array(image) image = cv2.Canny(image, 100, 200) image = image[:, :, None] image = np.concatenate([image, image, image], axis=2) canny_image = Image.fromarray(image)

image = pipe( ... prompt, controlnet_conditioning_scale=controlnet_conditioning_scale, image=canny_image, pag_scale=0.3 ... ).images[0]

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

StableDiffusionXLControlNetPAGImg2ImgPipeline

class diffusers.StableDiffusionXLControlNetPAGImg2ImgPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: KarrasDiffusionSchedulers requires_aesthetics_score: bool = False force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None pag_applied_layers: typing.Union[str, typing.List[str]] = 'mid' )

Parameters

Pipeline for image-to-image generation using Stable Diffusion XL with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None control_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None strength: float = 0.8 num_inference_steps: int = 50 guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 0.8 guess_mode: bool = False control_guidance_start: typing.Union[float, typing.List[float]] = 0.0 control_guidance_end: typing.Union[float, typing.List[float]] = 1.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise atuple containing the output images.

Function invoked when calling the pipeline for generation.

Examples:

import torch import numpy as np from PIL import Image

from transformers import DPTFeatureExtractor, DPTForDepthEstimation from diffusers import ControlNetModel, StableDiffusionXLControlNetPAGImg2ImgPipeline, AutoencoderKL from diffusers.utils import load_image

depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda") feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas") controlnet = ControlNetModel.from_pretrained( ... "diffusers/controlnet-depth-sdxl-1.0-small", ... variant="fp16", ... use_safetensors="True", ... torch_dtype=torch.float16, ... ) vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16) pipe = StableDiffusionXLControlNetPAGImg2ImgPipeline.from_pretrained( ... "stabilityai/stable-diffusion-xl-base-1.0", ... controlnet=controlnet, ... vae=vae, ... variant="fp16", ... use_safetensors=True, ... torch_dtype=torch.float16, ... enable_pag=True, ... ) pipe.enable_model_cpu_offload()

def get_depth_map(image): ... image = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda") ... with torch.no_grad(), torch.autocast("cuda"): ... depth_map = depth_estimator(image).predicted_depth

... depth_map = torch.nn.functional.interpolate( ... depth_map.unsqueeze(1), ... size=(1024, 1024), ... mode="bicubic", ... align_corners=False, ... ) ... depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True) ... depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True) ... depth_map = (depth_map - depth_min) / (depth_max - depth_min) ... image = torch.cat([depth_map] * 3, dim=1) ... image = image.permute(0, 2, 3, 1).cpu().numpy()[0] ... image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8)) ... return image

prompt = "A robot, 4k photo" image = load_image( ... "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" ... "/kandinsky/cat.png" ... ).resize((1024, 1024)) controlnet_conditioning_scale = 0.5
depth_image = get_depth_map(image)

images = pipe( ... prompt, ... image=image, ... control_image=depth_image, ... strength=0.99, ... num_inference_steps=50, ... controlnet_conditioning_scale=controlnet_conditioning_scale, ... ).images images[0].save(f"robot_cat.png")

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

StableDiffusion3PAGPipeline

class diffusers.StableDiffusion3PAGPipeline

< source >

( transformer: SD3Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_encoder_2: CLIPTextModelWithProjection tokenizer_2: CLIPTokenizer text_encoder_3: T5EncoderModel tokenizer_3: T5TokenizerFast pag_applied_layers: typing.Union[str, typing.List[str]] = 'blocks.1' )

Parameters

PAG pipeline for text-to-image generation using Stable Diffusion 3.

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 28 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput if return_dict is True, otherwise atuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "stabilityai/stable-diffusion-3-medium-diffusers", ... torch_dtype=torch.float16, ... enable_pag=True, ... pag_applied_layers=["blocks.13"], ... ) pipe.to("cuda") prompt = "A cat holding a sign that says hello world" image = pipe(prompt, guidance_scale=5.0, pag_scale=0.7).images[0] image.save("sd3_pag.png")

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] prompt_2: typing.Union[str, typing.List[str]] prompt_3: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None clip_skip: typing.Optional[int] = None max_sequence_length: int = 256 lora_scale: typing.Optional[float] = None )

Parameters

StableDiffusion3PAGImg2ImgPipeline

class diffusers.StableDiffusion3PAGImg2ImgPipeline

< source >

( transformer: SD3Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_encoder_2: CLIPTextModelWithProjection tokenizer_2: CLIPTokenizer text_encoder_3: T5EncoderModel tokenizer_3: T5TokenizerFast pag_applied_layers: typing.Union[str, typing.List[str]] = 'blocks.1' )

Parameters

PAG pipeline for image-to-image generation using Stable Diffusion 3.

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.6 num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput if return_dict is True, otherwise atuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import StableDiffusion3PAGImg2ImgPipeline from diffusers.utils import load_image

pipe = StableDiffusion3PAGImg2ImgPipeline.from_pretrained( ... "stabilityai/stable-diffusion-3-medium-diffusers", ... torch_dtype=torch.float16, ... pag_applied_layers=["blocks.13"], ... ) pipe.to("cuda") prompt = "a photo of an astronaut riding a horse on mars" url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png" init_image = load_image(url).convert("RGB") image = pipe(prompt, image=init_image, guidance_scale=5.0, pag_scale=0.7).images[0]

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] prompt_2: typing.Union[str, typing.List[str]] prompt_3: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None clip_skip: typing.Optional[int] = None max_sequence_length: int = 256 lora_scale: typing.Optional[float] = None )

Parameters

PixArtSigmaPAGPipeline

class diffusers.PixArtSigmaPAGPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKL transformer: PixArtTransformer2DModel scheduler: KarrasDiffusionSchedulers pag_applied_layers: typing.Union[str, typing.List[str]] = 'blocks.1' )

PAG pipeline for text-to-image generation using PixArt-Sigma.

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = None width: typing.Optional[int] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True use_resolution_binning: bool = True max_sequence_length: int = 300 pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ImagePipelineOutput or tuple

Parameters

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images

Function invoked when calling the pipeline for generation.

Examples:

import torch from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained( ... "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", ... torch_dtype=torch.float16, ... pag_applied_layers=["blocks.14"], ... enable_pag=True, ... ) pipe = pipe.to("cuda")

prompt = "A small cactus with a happy face in the Sahara desert" image = pipe(prompt, pag_scale=4.0, guidance_scale=1.0).images[0]

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 300 **kwargs )

Parameters

Encodes the prompt into text encoder hidden states.

< > Update on GitHub