Self-Attention Guidance

Improving Sample Quality of Diffusion Models Using Self-Attention Guidance is by Susung Hong et al.

The abstract from the paper is:

Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.
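The two ideas in the abstract, blurring only the attended regions and then guiding away from the prediction on that degraded input, can be illustrated with a toy one-dimensional sketch. This is not the paper's or the library's implementation. The box blur stands in for a Gaussian blur, the attention values and threshold are made up, the re-noising of the blurred sample is left out, and all function names are hypothetical.

```python
def box_blur(values):
    """3-tap moving average with clamped edges (a stand-in for a Gaussian blur)."""
    n = len(values)
    blurred = []
    for i in range(n):
        left = values[max(i - 1, 0)]
        right = values[min(i + 1, n - 1)]
        blurred.append((left + values[i] + right) / 3.0)
    return blurred


def selective_blur(values, attention, threshold):
    """Blur only the positions whose attention score exceeds the threshold."""
    blurred = box_blur(values)
    return [b if a > threshold else v for v, b, a in zip(values, blurred, attention)]


def guided_prediction(eps, eps_degraded, scale):
    """Push the prediction away from the one made on the degraded input.

    Assumed form: eps + scale * (eps - eps_degraded), applied element-wise.
    """
    return [e + scale * (e - d) for e, d in zip(eps, eps_degraded)]
```

With `scale` set to 0 the guided prediction equals the plain prediction, so in this sketch the scale acts as a strength knob for the guidance term.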

You can find additional information about Self-Attention Guidance on the project page and in the original codebase, and you can try it out in a demo or notebook.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

StableDiffusionSAGPipeline

class diffusers.StableDiffusionSAGPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: Optional = None requires_safety_checker: bool = True )

Parameters

Pipeline for text-to-image generation using Stable Diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

( prompt: Union = None height: Optional = None width: Optional = None num_inference_steps: int = 50 guidance_scale: float = 7.5 sag_scale: float = 0.75 negative_prompt: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None ip_adapter_image: Union = None ip_adapter_image_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: Optional = 1 cross_attention_kwargs: Optional = None clip_skip: Optional = None ) → StableDiffusionPipelineOutput or tuple

Parameters

Returns

StableDiffusionPipelineOutput or tuple

If return_dict is True, a StableDiffusionPipelineOutput is returned; otherwise, a tuple is returned whose first element is a list of the generated images and whose second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.
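The callback and callback_steps arguments in the signature above can be used to observe the denoising loop. The sketch below assumes the callback is invoked as callback(step, timestep, latents), which is the convention in other Stable Diffusion pipelines; `make_step_logger` is a hypothetical helper name.

```python
def make_step_logger(log):
    """Return a callback that appends (step, timestep) pairs to `log`."""

    def on_step(step, timestep, latents):
        # `latents` holds the current latent tensor; this logger ignores it.
        log.append((step, int(timestep)))

    return on_step


# Usage, assuming `pipe` is a loaded StableDiffusionSAGPipeline:
# log = []
# image = pipe(prompt, callback=make_step_logger(log), callback_steps=5).images[0]
```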

Examples:

import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, sag_scale=0.75).images[0]
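To see what sag_scale contributes, it can help to render the same prompt at several scales while restarting from the same seed each time. The helper below is a hypothetical sketch: `sweep_sag_scale` and `make_generator` are illustrative names, and a scale of 0.0 is assumed here to serve as a baseline without self-attention guidance.

```python
def sweep_sag_scale(pipe, prompt, scales, make_generator):
    """Render `prompt` once per scale, using a freshly seeded generator each time.

    `make_generator` is a zero-argument callable that returns a new generator,
    so that every run starts from the same initial noise.
    """
    images = {}
    for scale in scales:
        result = pipe(prompt, sag_scale=scale, generator=make_generator())
        images[scale] = result.images[0]
    return images


# Usage, assuming `pipe` and `prompt` are defined as in the example above:
# import torch
# images = sweep_sag_scale(
#     pipe,
#     prompt,
#     scales=(0.0, 0.75),
#     make_generator=lambda: torch.Generator("cuda").manual_seed(0),
# )
```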

encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None lora_scale: Optional = None clip_skip: Optional = None )

Parameters

Encodes the prompt into text encoder hidden states.
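One possible use of encode_prompt is to compute the embeddings once and pass them to the pipeline call through prompt_embeds and negative_prompt_embeds. The sketch below assumes that encode_prompt returns a (prompt_embeds, negative_prompt_embeds) pair, as it does in recent diffusers releases; `precompute_embeddings` is a hypothetical helper name.

```python
def precompute_embeddings(pipe, prompt, negative_prompt=None):
    """Encode a prompt once and return keyword arguments for the pipeline call."""
    # Assumption: encode_prompt returns a pair of embeddings.
    prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
        prompt,
        device=pipe.device,
        num_images_per_prompt=1,
        do_classifier_free_guidance=True,
        negative_prompt=negative_prompt,
    )
    return {
        "prompt_embeds": prompt_embeds,
        "negative_prompt_embeds": negative_prompt_embeds,
    }


# Usage, assuming `pipe` is a loaded StableDiffusionSAGPipeline:
# embeds = precompute_embeddings(pipe, "a photo of an astronaut riding a horse on mars")
# image = pipe(**embeds, sag_scale=0.75).images[0]
```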

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

( images: Union nsfw_content_detected: Optional )

Parameters

Output class for Stable Diffusion pipelines.
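Because the pipeline can return either this output class or a plain tuple, calling code may want to normalize the two forms. The helpers below are a minimal sketch with hypothetical names. They assume nsfw_content_detected may be None, as the Optional annotation above suggests, for example when no safety checker is loaded.

```python
def unpack_output(result):
    """Return (images, nsfw_flags) for either return form of the pipeline."""
    if isinstance(result, tuple):
        images, nsfw_flags = result
    else:
        images, nsfw_flags = result.images, result.nsfw_content_detected
    return images, nsfw_flags


def keep_unflagged(images, nsfw_flags):
    """Drop images that were flagged; keep everything if no flags are available."""
    if nsfw_flags is None:
        return list(images)
    return [image for image, flagged in zip(images, nsfw_flags) if not flagged]


# Usage, assuming `pipe` and `prompt` are already defined:
# images, flags = unpack_output(pipe(prompt, return_dict=False))
# safe_images = keep_unflagged(images, flags)
```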
