MultiDiffusion (original) (raw)

LoRA

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.

The abstract from the paper is:

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.

You can find additional information about MultiDiffusion on the project page, original codebase, and try it out in a demo.

Tips

While calling StableDiffusionPanoramaPipeline, it’s possible to specify the view_batch_size parameter to be > 1. For some GPUs with high performance, this can speedup the generation process and increase VRAM usage.

To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default.

Circular padding is applied to ensure there are no stitching artifacts when working with panoramas to ensure a seamless transition from the rightmost part to the leftmost part. By enabling circular padding (set circular_padding=True), the operation applies additional crops after the rightmost point of the image, allowing the model to “see” the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper “panorama” that can be viewed using 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space.

For example, without circular padding, there is a stitching artifact (default):img

But with circular padding, the right and the left parts are matching (circular_padding=True):img

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

StableDiffusionPanoramaPipeline

class diffusers.StableDiffusionPanoramaPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: DDIMScheduler safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None requires_safety_checker: bool = True )

Parameters

Pipeline for text-to-image generation using MultiDiffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = 512 width: typing.Optional[int] = 2048 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 7.5 view_batch_size: int = 1 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 circular_padding: bool = False clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs: typing.Any ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_ckpt = "stabilityai/stable-diffusion-2-base" scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler") pipe = StableDiffusionPanoramaPipeline.from_pretrained( ... model_ckpt, scheduler=scheduler, torch_dtype=torch.float16 ... )

pipe = pipe.to("cuda")

prompt = "a photo of the dolomites" image = pipe(prompt).images[0]

decode_latents_with_padding

< source >

( latents: Tensor padding: int = 8 ) → torch.Tensor

Parameters

The decoded image with padding removed.

Decode the given latents with padding for circular inference.

Notes:

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

get_views

< source >

( panorama_height: int panorama_width: int window_size: int = 64 stride: int = 8 circular_padding: bool = False ) → List[Tuple[int, int, int, int]]

Parameters

Returns

List[Tuple[int, int, int, int]]

A list of tuples representing the views. Each tuple contains four integers representing the start and end coordinates of the window in the panorama.

Generates a list of views based on the given parameters. Here, we define the mappings F_i (see Eq. 7 in the MultiDiffusion paper https://arxiv.org/abs/2302.08113). If panorama’s height/width < window_size, num_blocks of height/width should return 1.

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

< source >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

Parameters

Output class for Stable Diffusion pipelines.

< > Update on GitHub