ControlNet with Hunyuan-DiT (original) (raw)

HunyuanDiTControlNetPipeline is an implementation of ControlNet for Hunyuan-DiT.

ControlNet was introduced in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that’ll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.

The abstract from the paper is:

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with “zero convolutions” (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.

This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on Tencent Hunyuan.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

HunyuanDiTControlNetPipeline

class diffusers.HunyuanDiTControlNetPipeline

< source >

( vae: AutoencoderKL text_encoder: BertModel tokenizer: BertTokenizer transformer: HunyuanDiT2DModel scheduler: DDPMScheduler safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor controlnet: typing.Union[diffusers.models.controlnets.controlnet_hunyuan.HunyuanDiT2DControlNetModel, typing.List[diffusers.models.controlnets.controlnet_hunyuan.HunyuanDiT2DControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet_hunyuan.HunyuanDiT2DControlNetModel], diffusers.models.controlnets.controlnet_hunyuan.HunyuanDiT2DMultiControlNetModel] text_encoder_2: typing.Optional[transformers.models.t5.modeling_t5.T5EncoderModel] = None tokenizer_2: typing.Optional[transformers.models.mt5.tokenization_mt5.MT5Tokenizer] = None requires_safety_checker: bool = True )

Parameters

Pipeline for English/Chinese-to-image generation using HunyuanDiT.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

HunyuanDiT uses two text encoders: mT5 and [bilingual CLIP](fine-tuned by ourselves)

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 control_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None prompt_attention_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = (1024, 1024) target_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) use_resolution_binning: bool = True ) → StableDiffusionPipelineOutput or tuple

Parameters

If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation with HunyuanDiT.

Examples:

from diffusers import HunyuanDiT2DControlNetModel, HunyuanDiTControlNetPipeline import torch

controlnet = HunyuanDiT2DControlNetModel.from_pretrained( "Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Canny", torch_dtype=torch.float16 )

pipe = HunyuanDiTControlNetPipeline.from_pretrained( "Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers", controlnet=controlnet, torch_dtype=torch.float16 ) pipe.to("cuda")

from diffusers.utils import load_image

cond_image = load_image( "https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Canny/resolve/main/canny.jpg?download=true" )

prompt = "在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围"

image = pipe( prompt, height=1024, width=1024, control_image=cond_image, num_inference_steps=50, ).images[0]

encode_prompt

< source >

( prompt: str device: device = None dtype: dtype = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: typing.Optional[int] = None text_encoder_index: int = 0 )

Parameters

Encodes the prompt into text encoder hidden states.

< > Update on GitHub