ControlNet-XS with Stable Diffusion XL (original) (raw)

ControlNet-XS was introduced in ControlNet-XS by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the original ControlNet can be made much smaller and still produce good results.

Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that’ll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.

ControlNet-XS generates images with comparable quality to a regular ControlNet, but it is 20-25% faster (see benchmark) and uses ~45% less memory.

Here’s the overview from the project page:

With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. In this project we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with stable diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M Parameters) and StableDiffusion 2.1 [Rombach et al. 2022] (Model B, 14M Parameters), all under openrail license.

This model was contributed by UmerHA. ❤️

🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an Issue and leave us feedback on how we can improve!

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

StableDiffusionXLControlNetXSPipeline

class diffusers.StableDiffusionXLControlNetXSPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.controlnets.controlnet_xs.UNetControlNetXSModel] controlnet: ControlNetXSAdapter scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None feature_extractor: CLIPImageProcessor = None )

Parameters

Pipeline for text-to-image generation using Stable Diffusion XL with ControlNet-XS guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 control_guidance_start: float = 0.0 control_guidance_end: float = 1.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

Returns

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

If return_dict is True, ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput is returned, otherwise a tuple is returned containing the output images.

The call function to the pipeline for generation.

Examples:

from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter, AutoencoderKL from diffusers.utils import load_image import numpy as np import torch

import cv2 from PIL import Image

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" negative_prompt = "low quality, bad quality, sketches"

image = load_image( ... "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" ... )

controlnet_conditioning_scale = 0.5 vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16) controlnet = ControlNetXSAdapter.from_pretrained( ... "UmerHA/Testing-ConrolNetXS-SDXL-canny", torch_dtype=torch.float16 ... ) pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained( ... "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16 ... ) pipe.enable_model_cpu_offload()

image = np.array(image) image = cv2.Canny(image, 100, 200) image = image[:, :, None] image = np.concatenate([image, image, image], axis=2) canny_image = Image.fromarray(image)

image = pipe( ... prompt, controlnet_conditioning_scale=controlnet_conditioning_scale, image=canny_image ... ).images[0]

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

Encodes the prompt into text encoder hidden states.

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

< source >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

Parameters

Output class for Stable Diffusion pipelines.

< > Update on GitHub