Models · Hugging Face

Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models. The primary function of these models is to denoise an input sample by modeling the distribution $p_{\theta}(x_{t-1} \mid x_{t})$. The models are built on the base class ModelMixin, a torch.nn.Module with basic functionality for saving and loading models both locally and from the Hugging Face Hub.
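
For orientation, one common way to parameterize this reverse transition, the one used in the DDPM formulation, is a Gaussian whose mean and covariance are predicted by the network. The symbols $\mu_{\theta}$ and $\Sigma_{\theta}$ are not defined on this page and are introduced here only for illustration; other model families may parameterize the distribution differently.

```latex
p_{\theta}(x_{t-1} \mid x_{t}) = \mathcal{N}\!\left(x_{t-1};\ \mu_{\theta}(x_{t}, t),\ \Sigma_{\theta}(x_{t}, t)\right)
```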

ModelMixin

Base class for all models.

ModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

config_name (str) — A filename under which the model should be stored when calling save_pretrained().

disable_gradient_checkpointing

Deactivates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

disable_xformers_memory_efficient_attention

( )

Disable memory efficient attention as implemented in xformers.

enable_gradient_checkpointing

Activates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

enable_xformers_memory_efficient_attention

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.

Enable memory efficient attention as implemented in xformers.

When this option is enabled, you should observe lower GPU memory usage and a potential speed up at inference time. Speed up at training time is not guaranteed.

Warning: When Memory Efficient Attention and Sliced attention are both enabled, the Memory Efficient Attention is used.

Examples:

    import torch
    from diffusers import UNet2DConditionModel
    from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

    model = UNet2DConditionModel.from_pretrained(
        "stabilityai/stable-diffusion-2-1", subfolder="unet", torch_dtype=torch.float16
    )
    model = model.to("cuda")
    model.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] **kwargs )

Parameters

pretrained_model_name_or_path (str or os.PathLike, optional) — Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids should have an organization name, like google/ddpm-celebahq-256.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
cache_dir (Union[str, os.PathLike], optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
torch_dtype (str or torch.dtype, optional) — Override the default torch.dtype and load the model under this dtype. If "auto" is passed the dtype will be automatically derived from the model’s weights.
force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.
proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
output_loading_info (bool, optional, defaults to False) — Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.
local_files_only (bool, optional, defaults to False) — Whether or not to only look at local files (i.e., do not try to download the model).
use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running diffusers-cli login (stored in ~/.huggingface).
revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.
from_flax (bool, optional, defaults to False) — Load the model weights from a Flax checkpoint save file.
subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo (either remote in huggingface.co or downloaded locally), you can specify the folder name here.
mirror (str, optional) — Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Please refer to the mirror site for more information.
device_map (str or Dict[str, Union[int, str, torch.device]], optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device.
To have Accelerate compute the most optimized device_map automatically, set device_map="auto". For more information about each option see designing a device map.
max_memory (Dict, optional) — A dictionary device identifier to maximum memory. Will default to the maximum memory available for each GPU and the available CPU RAM if unset.
offload_folder (str or os.PathLike, optional) — If the device_map contains any value "disk", the folder where we will offload weights.
offload_state_dict (bool, optional) — If True, will temporarily offload the CPU state dict to the hard drive to avoid running out of CPU RAM if the weight of the CPU state dict plus the biggest shard of the checkpoint does not fit. Defaults to True when there is some disk offload.
low_cpu_mem_usage (bool, optional, defaults to True if torch version >= 1.9.0 else False) — Speed up model loading by not initializing the weights and only loading the pre-trained weights. This also tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. This is only supported when torch version >= 1.9.0. If you are using an older version of torch, setting this argument to True will raise an error.
variant (str, optional) — If specified, load weights from the variant filename, e.g. pytorch_model.<variant>.bin. variant is ignored when using from_flax.
use_safetensors (bool, optional, defaults to None) — If set to None, the safetensors weights will be downloaded if they’re available and if the safetensors library is installed. If set to True, the model will be forcibly loaded from safetensors weights. If set to False, loading will not use safetensors.

Instantiate a pretrained PyTorch model from a pretrained model configuration.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

It is required to be logged in (huggingface-cli login) when you want to use private or gated models.

Activate the special “offline-mode” to use this method in a firewalled environment.
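
The device_map rule in the parameter list above (an entry for a module name also covers every submodule beneath it) can be sketched as a longest-prefix lookup. This is an illustration only, not the Accelerate implementation: the helper name resolve_device and the module names in the example are hypothetical, and treating the empty key as "the whole model" is an assumption of this sketch.

```python
from typing import Dict, Union

Device = Union[int, str]


def resolve_device(module_name: str, device_map: Dict[str, Device]) -> Device:
    """Return the device for module_name using the most specific matching entry."""
    best_key = None
    for key in device_map:
        # Assumption of this sketch: the empty key "" stands for the whole model.
        matches = key == "" or module_name == key or module_name.startswith(key + ".")
        if matches and (best_key is None or len(key) > len(best_key)):
            best_key = key
    if best_key is None:
        raise KeyError(f"No device_map entry covers {module_name!r}")
    return device_map[best_key]


# Hypothetical module names, used only to exercise the lookup.
device_map = {"down_blocks": 0, "mid_block": "cpu", "up_blocks.0": 1, "up_blocks": "disk"}
print(resolve_device("down_blocks.1.resnets.0", device_map))   # 0
print(resolve_device("up_blocks.0.attentions.0", device_map))  # 1
print(resolve_device("up_blocks.2", device_map))               # disk
```

The more specific entry ("up_blocks.0") wins over the broader one ("up_blocks"), which is why a map does not need to list every parameter or buffer name.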

num_parameters

( only_trainable: bool = False exclude_embeddings: bool = False ) → int

Parameters

only_trainable (bool, optional, defaults to False) — Whether or not to return only the number of trainable parameters
exclude_embeddings (bool, optional, defaults to False) — Whether or not to return only the number of non-embeddings parameters

The number of parameters.

Get number of (optionally, trainable or non-embeddings) parameters in the module.
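
The effect of the two flags can be sketched on plain records instead of real tensors. Everything below is hypothetical scaffolding (the ParamRecord class, the count_parameters helper, and the parameter names); it only mirrors the filtering described above and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ParamRecord:
    """Stand-in for a model parameter (hypothetical; real models hold tensors)."""
    name: str
    numel: int
    requires_grad: bool = True
    is_embedding: bool = False


def count_parameters(params: Iterable[ParamRecord], only_trainable: bool = False,
                     exclude_embeddings: bool = False) -> int:
    total = 0
    for p in params:
        if only_trainable and not p.requires_grad:
            continue  # skip frozen parameters
        if exclude_embeddings and p.is_embedding:
            continue  # skip embedding weights
        total += p.numel
    return total


params = [
    ParamRecord("conv_in.weight", 1_000),
    ParamRecord("class_embedding.weight", 500, is_embedding=True),
    ParamRecord("frozen.weight", 200, requires_grad=False),
]
print(count_parameters(params))                           # 1700
print(count_parameters(params, only_trainable=True))      # 1500
print(count_parameters(params, exclude_embeddings=True))  # 1200
```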

save_pretrained

( save_directory: typing.Union[str, os.PathLike] is_main_process: bool = True save_function: typing.Callable = None safe_serialization: bool = False variant: typing.Optional[str] = None )

Parameters

save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn’t exist.
is_main_process (bool, optional, defaults to True) — Whether the process calling this is the main process or not. Useful in distributed training (for example on TPUs) when this function has to be called on all processes. In this case, set is_main_process=True only on the main process to avoid race conditions.
save_function (Callable) — The function to use to save the state dictionary. Useful in distributed training (for example on TPUs) when one needs to replace torch.save with another method. Can be configured with the environment variable DIFFUSERS_SAVE_MODE.
safe_serialization (bool, optional, defaults to False) — Whether to save the model using safetensors or the traditional PyTorch way (that uses pickle).
variant (str, optional) — If specified, weights are saved in the format pytorch_model.<variant>.bin.

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.
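
A minimal sketch of how a variant string could be spliced into a weights filename, assuming the variant is inserted just before the file extension (so pytorch_model.bin with variant fp16 becomes pytorch_model.fp16.bin). The helper name add_variant is hypothetical and is not presented as the library's own function.

```python
from typing import Optional


def add_variant(weights_name: str, variant: Optional[str] = None) -> str:
    """Insert variant before the final extension; leave the name unchanged if variant is None."""
    if variant is None:
        return weights_name
    stem, dot, extension = weights_name.rpartition(".")
    if not dot:
        # No extension to split on: append the variant instead.
        return f"{weights_name}.{variant}"
    return f"{stem}.{variant}.{extension}"


print(add_variant("pytorch_model.bin", "fp16"))  # pytorch_model.fp16.bin
print(add_variant("pytorch_model.bin"))          # pytorch_model.bin
```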

UNet2DOutput

class diffusers.models.unet_2d.UNet2DOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Hidden states output. Output of last layer of model.

UNet2DModel

class diffusers.UNet2DModel

( sample_size: typing.Union[int, typing.Tuple[int, int], NoneType] = None in_channels: int = 3 out_channels: int = 3 center_input_sample: bool = False time_embedding_type: str = 'positional' freq_shift: int = 0 flip_sin_to_cos: bool = True down_block_types: typing.Tuple[str] = ('DownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D') up_block_types: typing.Tuple[str] = ('AttnUpBlock2D', 'AttnUpBlock2D', 'AttnUpBlock2D', 'UpBlock2D') block_out_channels: typing.Tuple[int] = (224, 448, 672, 896) layers_per_block: int = 2 mid_block_scale_factor: float = 1 downsample_padding: int = 1 act_fn: str = 'silu' attention_head_dim: typing.Optional[int] = 8 norm_num_groups: int = 32 norm_eps: float = 1e-05 resnet_time_scale_shift: str = 'default' add_attention: bool = True class_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None )

Parameters

sample_size (int or Tuple[int, int], optional, defaults to None) — Height and width of input/output sample. Dimensions must be a multiple of 2 ** (len(block_out_channels) - 1).
in_channels (int, optional, defaults to 3) — Number of channels in the input image.
out_channels (int, optional, defaults to 3) — Number of channels in the output.
center_input_sample (bool, optional, defaults to False) — Whether to center the input sample.
time_embedding_type (str, optional, defaults to "positional") — Type of time embedding to use.
freq_shift (int, optional, defaults to 0) — Frequency shift for fourier time embedding.
flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip sin to cos for fourier time embedding.
down_block_types (Tuple[str], optional, defaults to ("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D")) — Tuple of downsample block types.
mid_block_type (str, optional, defaults to "UNetMidBlock2D") — The mid block type. Choose from UNetMidBlock2D or UnCLIPUNetMidBlock2D.
up_block_types (Tuple[str], optional, defaults to ("AttnUpBlock2D", "AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D")) — Tuple of upsample block types.
block_out_channels (Tuple[int], optional, defaults to (224, 448, 672, 896)) — Tuple of block output channels.
layers_per_block (int, optional, defaults to 2) — The number of layers per block.
mid_block_scale_factor (float, optional, defaults to 1) — The scale factor for the mid block.
downsample_padding (int, optional, defaults to 1) — The padding for the downsample convolution.
act_fn (str, optional, defaults to "silu") — The activation function to use.
attention_head_dim (int, optional, defaults to 8) — The attention head dimension.
norm_num_groups (int, optional, defaults to 32) — The number of groups for the normalization.
norm_eps (float, optional, defaults to 1e-5) — The epsilon for the normalization.
resnet_time_scale_shift (str, optional, defaults to "default") — Time scale shift config for resnet blocks, see ResnetBlock2D. Choose from default or scale_shift.
class_embed_type (str, optional, defaults to None) — The type of class embedding to use which is ultimately summed with the time embeddings. Choose from None, "timestep", or "identity".
num_class_embeds (int, optional, defaults to None) — Input dimension of the learnable embedding matrix to be projected to time_embed_dim, when performing class conditioning with class_embed_type equal to None.

UNet2DModel is a 2D UNet model that takes in a noisy sample and a timestep and returns sample shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
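
The sample_size constraint from the parameter list can be checked ahead of time. The sketch below assumes the rule exactly as stated there (each dimension divisible by 2 ** (len(block_out_channels) - 1)); the helper name is_valid_sample_size is hypothetical.

```python
from typing import Sequence, Tuple, Union


def is_valid_sample_size(sample_size: Union[int, Tuple[int, int]],
                         block_out_channels: Sequence[int]) -> bool:
    """Check that every spatial dimension is a positive multiple of the downsampling factor."""
    factor = 2 ** (len(block_out_channels) - 1)
    dims = (sample_size, sample_size) if isinstance(sample_size, int) else sample_size
    return all(d > 0 and d % factor == 0 for d in dims)


default_channels = (224, 448, 672, 896)  # default from the signature; the factor is 8
print(is_valid_sample_size(256, default_channels))        # True
print(is_valid_sample_size((64, 100), default_channels))  # False: 100 is not divisible by 8
```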

forward

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] class_labels: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → UNet2DOutput or tuple

Parameters

sample (torch.FloatTensor) — (batch, channel, height, width) noisy inputs tensor
timestep (torch.FloatTensor or float or int) — (batch) timesteps
class_labels (torch.FloatTensor, optional, defaults to None) — Optional class labels for conditioning. Their embeddings will be summed with the timestep embeddings.
return_dict (bool, optional, defaults to True) — Whether or not to return a UNet2DOutput instead of a plain tuple.

Returns

UNet2DOutput or tuple

UNet2DOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.
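
The return_dict convention can be sketched without any tensors. The class OutputSketch and the function forward_sketch below are hypothetical stand-ins, and the halving step is only a placeholder computation, not a real UNet.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class OutputSketch:
    """Hypothetical stand-in for UNet2DOutput, holding a plain list instead of a tensor."""
    sample: List[float]


def forward_sketch(sample: List[float],
                   return_dict: bool = True) -> Union[OutputSketch, Tuple[List[float]]]:
    prediction = [0.5 * x for x in sample]  # placeholder computation
    if not return_dict:
        return (prediction,)  # one-element tuple; the sample is element 0
    return OutputSketch(sample=prediction)


as_object = forward_sketch([2.0, 4.0])
as_tuple = forward_sketch([2.0, 4.0], return_dict=False)
print(as_object.sample == as_tuple[0])  # True
```

Code that should work with either setting can read the first tuple element when return_dict=False and the sample attribute otherwise.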

UNet1DOutput

class diffusers.models.unet_1d.UNet1DOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_channels, sample_size)) — Hidden states output. Output of last layer of model.

UNet1DModel

class diffusers.UNet1DModel

( sample_size: int = 65536 sample_rate: typing.Optional[int] = None in_channels: int = 2 out_channels: int = 2 extra_in_channels: int = 0 time_embedding_type: str = 'fourier' flip_sin_to_cos: bool = True use_timestep_embedding: bool = False freq_shift: float = 0.0 down_block_types: typing.Tuple[str] = ('DownBlock1DNoSkip', 'DownBlock1D', 'AttnDownBlock1D') up_block_types: typing.Tuple[str] = ('AttnUpBlock1D', 'UpBlock1D', 'UpBlock1DNoSkip') mid_block_type: typing.Tuple[str] = 'UNetMidBlock1D' out_block_type: str = None block_out_channels: typing.Tuple[int] = (32, 32, 64) act_fn: str = None norm_num_groups: int = 8 layers_per_block: int = 1 downsample_each_block: bool = False )

Parameters

sample_size (int, optional) — Default length of sample. Should be adaptable at runtime.
in_channels (int, optional, defaults to 2) — Number of channels in the input sample.
out_channels (int, optional, defaults to 2) — Number of channels in the output.
extra_in_channels (int, optional, defaults to 0) — Number of additional channels to be added to the input of the first down block. Useful for cases where the input data has more channels than what the model is initially designed for.
time_embedding_type (str, optional, defaults to "fourier") — Type of time embedding to use.
freq_shift (float, optional, defaults to 0.0) — Frequency shift for fourier time embedding.
flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip sin to cos for fourier time embedding.
down_block_types (Tuple[str], optional, defaults to ("DownBlock1DNoSkip", "DownBlock1D", "AttnDownBlock1D")) — Tuple of downsample block types.
up_block_types (Tuple[str], optional, defaults to ("AttnUpBlock1D", "UpBlock1D", "UpBlock1DNoSkip")) — Tuple of upsample block types.
block_out_channels (Tuple[int], optional, defaults to (32, 32, 64)) — Tuple of block output channels.
mid_block_type (str, optional, defaults to "UNetMidBlock1D") — Block type for the middle of the UNet.
out_block_type (str, optional, defaults to None) — Optional output processing of the UNet.
act_fn (str, optional, defaults to None) — Optional activation function in UNet blocks.
norm_num_groups (int, optional, defaults to 8) — Group norm member count in UNet blocks.
layers_per_block (int, optional, defaults to 1) — Added number of layers in a UNet block.
downsample_each_block (bool, optional, defaults to False) — Experimental feature for using a UNet without upsampling.

UNet1DModel is a 1D UNet model that takes in a noisy sample and a timestep and returns sample shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] return_dict: bool = True ) → UNet1DOutput or tuple

Parameters

sample (torch.FloatTensor) — (batch_size, num_channels, sample_size) noisy inputs tensor
timestep (torch.FloatTensor or float or int) — (batch) timesteps
return_dict (bool, optional, defaults to True) — Whether or not to return a UNet1DOutput instead of a plain tuple.

Returns

UNet1DOutput or tuple

UNet1DOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

UNet2DConditionOutput

class diffusers.models.unet_2d_condition.UNet2DConditionOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Hidden states conditioned on encoder_hidden_states input. Output of last layer of model.

UNet2DConditionModel

class diffusers.UNet2DConditionModel

( sample_size: typing.Optional[int] = None in_channels: int = 4 out_channels: int = 4 center_input_sample: bool = False flip_sin_to_cos: bool = True freq_shift: int = 0 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') mid_block_type: typing.Optional[str] = 'UNetMidBlock2DCrossAttn' up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: typing.Union[int, typing.Tuple[int]] = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: typing.Union[int, typing.Tuple[int]] = 1280 encoder_hid_dim: typing.Optional[int] = None encoder_hid_dim_type: typing.Optional[str] = None attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None dual_cross_attention: bool = False use_linear_projection: bool = False class_embed_type: typing.Optional[str] = None addition_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None upcast_attention: bool = False resnet_time_scale_shift: str = 'default' resnet_skip_time_act: bool = False resnet_out_scale_factor: int = 1.0 time_embedding_type: str = 'positional' time_embedding_dim: typing.Optional[int] = None time_embedding_act_fn: typing.Optional[str] = None timestep_post_act: typing.Optional[str] = None time_cond_proj_dim: typing.Optional[int] = None conv_in_kernel: int = 3 conv_out_kernel: int = 3 projection_class_embeddings_input_dim: typing.Optional[int] = None class_embeddings_concat: bool = False mid_block_only_cross_attention: typing.Optional[bool] = None cross_attention_norm: typing.Optional[str] = None addition_embed_type_num_heads = 64 )

Parameters

sample_size (int or Tuple[int, int], optional, defaults to None) — Height and width of input/output sample.
in_channels (int, optional, defaults to 4) — The number of channels in the input sample.
out_channels (int, optional, defaults to 4) — The number of channels in the output.
center_input_sample (bool, optional, defaults to False) — Whether to center the input sample.
flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip the sin to cos in the time embedding.
freq_shift (int, optional, defaults to 0) — The frequency shift to apply to the time embedding.
down_block_types (Tuple[str], optional, defaults to ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")) — The tuple of downsample blocks to use.
mid_block_type (str, optional, defaults to "UNetMidBlock2DCrossAttn") — The mid block type. Choose from UNetMidBlock2DCrossAttn or UNetMidBlock2DSimpleCrossAttn, will skip the mid block layer if None.
up_block_types (Tuple[str], optional, defaults to ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D",)) — The tuple of upsample blocks to use.
only_cross_attention (bool or Tuple[bool], optional, defaults to False) — Whether to include self-attention in the basic transformer blocks, see BasicTransformerBlock.
block_out_channels (Tuple[int], optional, defaults to (320, 640, 1280, 1280)) — The tuple of output channels for each block.
layers_per_block (int, optional, defaults to 2) — The number of layers per block.
downsample_padding (int, optional, defaults to 1) — The padding to use for the downsampling convolution.
mid_block_scale_factor (float, optional, defaults to 1.0) — The scale factor to use for the mid block.
act_fn (str, optional, defaults to "silu") — The activation function to use.
norm_num_groups (int, optional, defaults to 32) — The number of groups to use for the normalization. If None, it will skip the normalization and activation layers in post-processing
norm_eps (float, optional, defaults to 1e-5) — The epsilon to use for the normalization.
cross_attention_dim (int or Tuple[int], optional, defaults to 1280) — The dimension of the cross attention features.
encoder_hid_dim (int, optional, defaults to None) — If encoder_hid_dim_type is defined, encoder_hidden_states will be projected from encoder_hid_dim dimension to cross_attention_dim.
encoder_hid_dim_type (str, optional, defaults to None) — If given, the encoder_hidden_states and potentially other embeddings will be down-projected to text embeddings of dimension cross_attention according to encoder_hid_dim_type.
attention_head_dim (int, optional, defaults to 8) — The dimension of the attention heads.
num_attention_heads (int, optional) — The number of attention heads. If not defined, defaults to attention_head_dim
resnet_time_scale_shift (str, optional, defaults to "default") — Time scale shift config for resnet blocks, see ResnetBlock2D. Choose from default or scale_shift.
class_embed_type (str, optional, defaults to None) — The type of class embedding to use which is ultimately summed with the time embeddings. Choose from None, "timestep", "identity", "projection", or "simple_projection".
addition_embed_type (str, optional, defaults to None) — Configures an optional embedding which will be summed with the time embeddings. Choose from None or “text”. “text” will use the TextTimeEmbedding layer.
num_class_embeds (int, optional, defaults to None) — Input dimension of the learnable embedding matrix to be projected to time_embed_dim, when performing class conditioning with class_embed_type equal to None.
time_embedding_type (str, optional, defaults to positional) — The type of position embedding to use for timesteps. Choose from positional or fourier.
time_embedding_dim (int, optional, defaults to None) — An optional override for the dimension of the projected time embedding.
time_embedding_act_fn (str, optional, defaults to None) — Optional activation function to use on the time embeddings only once, before they are passed to the rest of the UNet. Choose from silu, mish, gelu, and swish.
timestep_post_act (str, optional, defaults to None) — The second activation function to use in timestep embedding. Choose from silu, mish and gelu.
time_cond_proj_dim (int, optional, defaults to None) — The dimension of cond_proj layer in timestep embedding.
conv_in_kernel (int, optional, defaults to 3) — The kernel size of conv_in layer.
conv_out_kernel (int, optional, defaults to 3) — The kernel size of conv_out layer.
projection_class_embeddings_input_dim (int, optional) — The dimension of the class_labels input when using the “projection” class_embed_type. Required when using the “projection” class_embed_type.
class_embeddings_concat (bool, optional, defaults to False) — Whether to concatenate the time embeddings with the class embeddings.
mid_block_only_cross_attention (bool, optional, defaults to None) — Whether to use cross attention with the mid block when using the UNetMidBlock2DSimpleCrossAttn. If only_cross_attention is given as a single boolean and mid_block_only_cross_attention is None, the only_cross_attention value will be used as the value for mid_block_only_cross_attention. Else, it will default to False.

UNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep and returns sample shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
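
The fallback rule described for mid_block_only_cross_attention in the parameter list can be written out as a small function. This is a sketch of the rule as documented, with a hypothetical helper name; it is not copied from the model's constructor.

```python
from typing import Optional, Sequence, Union


def resolve_mid_block_only_cross_attention(
    only_cross_attention: Union[bool, Sequence[bool]] = False,
    mid_block_only_cross_attention: Optional[bool] = None,
) -> bool:
    if mid_block_only_cross_attention is not None:
        return mid_block_only_cross_attention  # an explicit value always wins
    if isinstance(only_cross_attention, bool):
        return only_cross_attention  # inherit the single boolean
    return False  # a per-block tuple was given: fall back to False


print(resolve_mid_block_only_cross_attention(True))           # True
print(resolve_mid_block_only_cross_attention((True, False)))  # False
print(resolve_mid_block_only_cross_attention(True, False))    # False
```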

forward

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] encoder_hidden_states: Tensor class_labels: typing.Optional[torch.Tensor] = None timestep_cond: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None added_cond_kwargs: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = None down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None mid_block_additional_residual: typing.Optional[torch.Tensor] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → UNet2DConditionOutput or tuple

Parameters

sample (torch.FloatTensor) — (batch, channel, height, width) noisy inputs tensor
timestep (torch.FloatTensor or float or int) — (batch) timesteps
encoder_hidden_states (torch.FloatTensor) — (batch, sequence_length, feature_dim) encoder hidden states
encoder_attention_mask (torch.Tensor) — (batch, sequence_length) cross-attention mask, applied to encoder_hidden_states. True = keep, False = discard. Mask will be converted into a bias, which adds large negative values to attention scores corresponding to “discard” tokens.
return_dict (bool, optional, defaults to True) — Whether or not to return a models.unet_2d_condition.UNet2DConditionOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.cross_attention.
added_cond_kwargs (dict, optional) — A kwargs dictionary that if specified includes additional conditions that can be used for additional time embeddings or encoder hidden states projections. See the configurations encoder_hid_dim_type and addition_embed_type for more information.

UNet2DConditionOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.
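
The encoder_attention_mask description above says the boolean mask is converted into a bias that adds large negative values to the scores of "discard" tokens. A minimal stdlib sketch of that idea follows; the constant -10000.0 and both helper names are assumptions made for illustration, and real models operate on batched tensors rather than lists.

```python
import math
from typing import List, Sequence


def mask_to_bias(keep_mask: Sequence[bool], negative: float = -10000.0) -> List[float]:
    """Map True (keep) to 0.0 and False (discard) to a large negative bias."""
    return [0.0 if keep else negative for keep in keep_mask]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]                  # raw attention scores for three tokens
bias = mask_to_bias([True, True, False])  # the third token is discarded
weights = softmax([s + b for s, b in zip(scores, bias)])
print(weights)  # the third weight is effectively zero
```

Adding the bias before the softmax, rather than zeroing weights afterwards, keeps the remaining weights normalized to sum to one.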

set_attention_slice

( slice_size )

Parameters

slice_size (str or int or list(int), optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention will be computed in two steps. If "max", maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as num_attention_heads // slice_size. In this case, num_attention_heads must be a multiple of slice_size.

Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
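
The arithmetic behind slice_size can be sketched as follows. The helper attention_steps is hypothetical, and reading "max" as one head per step is an interpretation of the parameter description, not a statement about the library's internals.

```python
from typing import Union


def attention_steps(num_attention_heads: int, slice_size: Union[str, int] = "auto") -> int:
    """Return how many sequential steps the attention computation is split into."""
    if slice_size == "auto":
        return 2  # the input is halved, so attention runs in two steps
    if slice_size == "max":
        return num_attention_heads  # interpretation: one head-sized slice at a time
    if not isinstance(slice_size, int) or slice_size <= 0:
        raise ValueError(f"Unsupported slice_size: {slice_size!r}")
    if num_attention_heads % slice_size != 0:
        raise ValueError("num_attention_heads must be a multiple of slice_size")
    return num_attention_heads // slice_size


print(attention_steps(8))         # 2
print(attention_steps(8, "max"))  # 8
print(attention_steps(8, 2))      # 4
```

More steps mean a smaller peak memory footprint per step at the cost of more sequential work, which is the trade-off described above.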

set_attn_processor

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

processor (dict of AttentionProcessor or AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor of all Attention layers.
If processor is a dict, each key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

set_default_attn_processor

Disables custom attention processors and sets the default attention implementation.

UNet3DConditionOutput

class diffusers.models.unet_3d_condition.UNet3DConditionOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)) — Hidden states conditioned on encoder_hidden_states input. Output of last layer of model.

UNet3DConditionModel

class diffusers.UNet3DConditionModel

( sample_size: typing.Optional[int] = None in_channels: int = 4 out_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D') up_block_types: typing.Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D') block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: int = 1024 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 64 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None )

Parameters

sample_size (int or Tuple[int, int], optional, defaults to None) — Height and width of input/output sample.
in_channels (int, optional, defaults to 4) — The number of channels in the input sample.
out_channels (int, optional, defaults to 4) — The number of channels in the output.
down_block_types (Tuple[str], optional, defaults to ("CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "DownBlock3D")) — The tuple of downsample blocks to use.
up_block_types (Tuple[str], optional, defaults to ("UpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D")) — The tuple of upsample blocks to use.
block_out_channels (Tuple[int], optional, defaults to (320, 640, 1280, 1280)) — The tuple of output channels for each block.
layers_per_block (int, optional, defaults to 2) — The number of layers per block.
downsample_padding (int, optional, defaults to 1) — The padding to use for the downsampling convolution.
mid_block_scale_factor (float, optional, defaults to 1.0) — The scale factor to use for the mid block.
act_fn (str, optional, defaults to "silu") — The activation function to use.
norm_num_groups (int, optional, defaults to 32) — The number of groups to use for the normalization. If None, it will skip the normalization and activation layers in post-processing.
norm_eps (float, optional, defaults to 1e-5) — The epsilon to use for the normalization.
cross_attention_dim (int, optional, defaults to 1024) — The dimension of the cross attention features.
attention_head_dim (int, optional, defaults to 64) — The dimension of the attention heads.
num_attention_heads (int, optional) — The number of attention heads.

UNet3DConditionModel is a conditional 3D UNet model that takes in a noisy sample, a conditional state, and a timestep, and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] encoder_hidden_states: Tensor class_labels: typing.Optional[torch.Tensor] = None timestep_cond: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None mid_block_additional_residual: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → ~models.unet_3d_condition.UNet3DConditionOutput or tuple

Parameters

sample (torch.FloatTensor) — (batch, num_frames, channel, height, width) noisy inputs tensor
timestep (torch.FloatTensor or float or int) — (batch) timesteps
encoder_hidden_states (torch.FloatTensor) — (batch, sequence_length, feature_dim) encoder hidden states
return_dict (bool, optional, defaults to True) — Whether or not to return a models.unet_3d_condition.UNet3DConditionOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.cross_attention.

Returns

~models.unet_3d_condition.UNet3DConditionOutput or tuple

~models.unet_3d_condition.UNet3DConditionOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.
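The return_dict convention described above can be sketched without any model at all. ToyOutput and toy_forward below are hypothetical stand-ins, and the computation inside is a placeholder. Only the shape of the return value is meant to mirror the documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class ToyOutput:
    """Hypothetical stand-in for an output class with a single sample field."""

    sample: list


def toy_forward(sample, return_dict=True):
    """Return ToyOutput when return_dict is True, otherwise a one-element tuple."""
    denoised = [value * 0.5 for value in sample]  # placeholder computation, not a real UNet
    if not return_dict:
        return (denoised,)
    return ToyOutput(sample=denoised)


as_object = toy_forward([2.0, 4.0])
as_tuple = toy_forward([2.0, 4.0], return_dict=False)
```

With the output object, the result is read as as_object.sample. With the tuple form, it is the first element, as_tuple[0].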

set_attention_slice

( slice_size )

Parameters

slice_size (str or int or list(int), optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention will be computed in two steps. If "max", the maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as num_attention_heads // slice_size. In this case, num_attention_heads must be a multiple of slice_size.

Enable sliced attention computation.
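The slice_size rules above amount to a small piece of arithmetic. The helper below, attention_slice_plan, is a hypothetical name for a minimal sketch of those rules as they are worded in the parameter description. It does not reproduce the library's internal code.

```python
def attention_slice_plan(num_attention_heads, slice_size="auto"):
    """Return (slice_size, number_of_slices) following the rules described above."""
    if slice_size == "auto":
        size = max(num_attention_heads // 2, 1)  # halve the heads: two steps
    elif slice_size == "max":
        size = 1  # one slice at a time: most memory saved
    else:
        if num_attention_heads % slice_size != 0:
            raise ValueError("num_attention_heads must be a multiple of slice_size")
        size = slice_size
    return size, num_attention_heads // size


auto_plan = attention_slice_plan(8, "auto")
max_plan = attention_slice_plan(8, "max")
explicit_plan = attention_slice_plan(8, 2)
```

Smaller slices lower peak memory at the cost of more sequential steps, so "max" is the most memory-frugal setting and usually the slowest.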

set_attn_processor

Parameters

processor (dict of AttentionProcessor or AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor of all Attention layers. If processor is a dict, each key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Disables custom attention processors and sets the default attention implementation.

DecoderOutput

class diffusers.models.vae.DecoderOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Decoded output sample of the model. Output of the last layer of the model.

Output of decoding method.

VQEncoderOutput

class diffusers.models.vq_model.VQEncoderOutput

( latents: FloatTensor )

Parameters

latents (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Encoded output sample of the model. Output of the last layer of the model.

Output of VQModel encoding method.

VQModel

class diffusers.VQModel

( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 3 sample_size: int = 32 num_vq_embeddings: int = 256 norm_num_groups: int = 32 vq_embed_dim: typing.Optional[int] = None scaling_factor: float = 0.18215 norm_type: str = 'group' )

Parameters

in_channels (int, optional, defaults to 3) — Number of channels in the input image.
out_channels (int, optional, defaults to 3) — Number of channels in the output.
down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
act_fn (str, optional, defaults to "silu") — The activation function to use.
latent_channels (int, optional, defaults to 3) — Number of channels in the latent space.
sample_size (int, optional, defaults to 32) — TODO
num_vq_embeddings (int, optional, defaults to 256) — Number of codebook vectors in the VQ-VAE.
vq_embed_dim (int, optional) — Hidden dim of codebook vectors in the VQ-VAE.
scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula: z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.

VQ-VAE model from the paper Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

( sample: FloatTensor return_dict: bool = True )

Parameters

sample (torch.FloatTensor) — Input sample.
return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.

AutoencoderKLOutput

class diffusers.models.autoencoder_kl.AutoencoderKLOutput

( latent_dist: DiagonalGaussianDistribution )

Parameters

latent_dist (DiagonalGaussianDistribution) — Encoded outputs of Encoder represented as the mean and logvar of DiagonalGaussianDistribution. DiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.
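To make the mean and logvar representation concrete, here is a minimal sketch of sampling from a diagonal Gaussian. ToyDiagonalGaussian is a hypothetical class written for this example. It only illustrates the standard relation std = exp(0.5 * logvar) and is not the library's DiagonalGaussianDistribution.

```python
import math
import random


class ToyDiagonalGaussian:
    """Hypothetical diagonal Gaussian described by per-element mean and logvar."""

    def __init__(self, mean, logvar):
        self.mean = list(mean)
        self.logvar = list(logvar)
        # std = exp(0.5 * logvar) because logvar = log(std ** 2)
        self.std = [math.exp(0.5 * lv) for lv in self.logvar]

    def sample(self, rng):
        """Draw mean + std * eps with eps ~ N(0, 1) for every element."""
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]

    def mode(self):
        """The most likely latent is the mean itself."""
        return list(self.mean)


dist = ToyDiagonalGaussian(mean=[0.5, -1.0], logvar=[0.0, math.log(4.0)])
latents = dist.sample(random.Random(0))  # seeded generator for reproducibility
```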

AutoencoderKL

class diffusers.AutoencoderKL

Parameters

in_channels (int, optional, defaults to 3) — Number of channels in the input image.
out_channels (int, optional, defaults to 3) — Number of channels in the output.
down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
act_fn (str, optional, defaults to "silu") — The activation function to use.
latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
sample_size (int, optional, defaults to 32) — TODO
scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula: z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.

Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
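The two scaling_factor formulas quoted in the parameter list above form a round trip. The sketch below applies them to plain Python floats. scale_latents and unscale_latents are hypothetical helper names, and the latent values are made up for illustration.

```python
import math

SCALING_FACTOR = 0.18215  # default value quoted in the parameter list above


def scale_latents(latents, scaling_factor=SCALING_FACTOR):
    """z = z * scaling_factor, applied before the diffusion model sees the latents."""
    return [z * scaling_factor for z in latents]


def unscale_latents(latents, scaling_factor=SCALING_FACTOR):
    """z = 1 / scaling_factor * z, applied before decoding."""
    return [1 / scaling_factor * z for z in latents]


raw = [5.49, -2.0, 0.0]  # made-up latent values for illustration
restored = unscale_latents(scale_latents(raw))
```

Up to floating-point rounding, restored matches raw, which is why the same factor has to be used on both sides.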

Disable sliced VAE decoding. If enable_slicing was previously invoked, this method will go back to computing decoding in one step.

Disable tiled VAE decoding. If enable_vae_tiling was previously invoked, this method will go back to computing decoding in one step.

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
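The idea behind sliced decoding can be sketched as decoding a batch a few samples at a time and concatenating the results. toy_decode and decode_sliced are hypothetical names, and the decoder here is a trivial placeholder. The default of one sample per slice is an assumption of the sketch.

```python
def toy_decode(latents):
    """Hypothetical decoder: doubles every value of every latent in the batch."""
    return [[value * 2 for value in latent] for latent in latents]


def decode_sliced(latents, decode_fn, slice_size=1):
    """Decode slice_size samples at a time and concatenate the results."""
    decoded = []
    for start in range(0, len(latents), slice_size):
        decoded.extend(decode_fn(latents[start:start + slice_size]))
    return decoded


batch = [[1, 2], [3, 4], [5, 6]]
sliced = decode_sliced(batch, toy_decode)
```

Because each sample is decoded independently, the sliced result equals the one-step result. Only the peak memory needed at any moment changes.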

enable_tiling

( use_tiling: bool = True )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful to save a large amount of memory and to allow the processing of larger images.

forward

( sample: FloatTensor sample_posterior: bool = False return_dict: bool = True generator: typing.Optional[torch._C.Generator] = None )

Parameters

sample (torch.FloatTensor) — Input sample.
sample_posterior (bool, optional, defaults to False) — Whether to sample from the posterior.
return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.

set_attn_processor

Parameters

processor (dict of AttentionProcessor or AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor of all Attention layers. If processor is a dict, each key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Disables custom attention processors and sets the default attention implementation.

tiled_decode

( z: FloatTensor return_dict: bool = True )

Parameters

z (torch.FloatTensor) — Input batch of latent vectors.
return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.

Decode a batch of images using a tiled decoder.

When this option is enabled, the VAE will split the input tensor into tiles to compute decoding in several steps. This is useful to keep memory use constant regardless of image size. The end result of tiled decoding is different from non-tiled decoding because each tile is decoded separately. To avoid tiling artifacts, the tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the look of the output, but they should be much less noticeable.
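The overlap-and-blend idea described above can be sketched in one dimension. The functions blend and tiled_apply are hypothetical, the tile size and overlap are arbitrary, and the linear cross-fade is an assumption of this sketch, not a description of the library's exact blending.

```python
def blend(left, right, extent):
    """Cross-fade the start of right with the end of left over extent items."""
    extent = min(extent, len(left), len(right))
    blended = list(right)
    for i in range(extent):
        weight = i / extent  # 0 keeps the left tile, values near 1 favour the right tile
        blended[i] = left[len(left) - extent + i] * (1 - weight) + right[i] * weight
    return blended


def tiled_apply(signal, fn, tile_size=4, overlap=2):
    """Process overlapping tiles with fn, blend neighbours, and stitch the result."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must be smaller than tile_size")
    stride = tile_size - overlap
    tiles = [fn(signal[start:start + tile_size]) for start in range(0, len(signal), stride)]
    stitched = []
    for index, tile in enumerate(tiles):
        if index > 0:
            tile = blend(tiles[index - 1], tile, overlap)
        stitched.extend(tile[:stride])  # keep only the part not covered by the next tile
    return stitched


signal = [float(i) for i in range(7)]
stitched = tiled_apply(signal, lambda tile: list(tile))  # identity stands in for a decoder
```

With an identity function the stitched output equals the input. With a real decoder, neighbouring tiles would disagree slightly in the overlap, and the cross-fade hides the seam.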

tiled_encode

( x: FloatTensor return_dict: bool = True )

Parameters

x (torch.FloatTensor) — Input batch of images.
return_dict (bool, optional, defaults to True) — Whether or not to return an AutoencoderKLOutput instead of a plain tuple.

Encode a batch of images using a tiled encoder.

When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several steps. This is useful to keep memory use constant regardless of image size. The end result of tiled encoding is different from non-tiled encoding because each tile is encoded separately. To avoid tiling artifacts, the tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the look of the output, but they should be much less noticeable.

Transformer2DModel

class diffusers.Transformer2DModel

( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: typing.Optional[int] = None out_channels: typing.Optional[int] = None num_layers: int = 1 dropout: float = 0.0 norm_num_groups: int = 32 cross_attention_dim: typing.Optional[int] = None attention_bias: bool = False sample_size: typing.Optional[int] = None num_vector_embeds: typing.Optional[int] = None patch_size: typing.Optional[int] = None activation_fn: str = 'geglu' num_embeds_ada_norm: typing.Optional[int] = None use_linear_projection: bool = False only_cross_attention: bool = False upcast_attention: bool = False norm_type: str = 'layer_norm' norm_elementwise_affine: bool = True )

Parameters

num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
in_channels (int, optional) — Pass if the input is continuous. The number of channels in the input and output.
num_layers (int, optional, defaults to 1) — The number of layers of Transformer blocks to use.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.
cross_attention_dim (int, optional) — The number of encoder_hidden_states dimensions to use.
sample_size (int, optional) — Pass if the input is discrete. The width of the latent images. Note that this is fixed at training time as it is used for learning a number of position embeddings. See ImagePositionalEmbeddings.
num_vector_embeds (int, optional) — Pass if the input is discrete. The number of classes of the vector embeddings of the latent pixels. Includes the class for the masked latent pixel.
activation_fn (str, optional, defaults to "geglu") — Activation function to be used in feed-forward.
num_embeds_ada_norm (int, optional) — Pass if at least one of the norm_layers is AdaLayerNorm. The number of diffusion steps used during training. Note that this is fixed at training time as it is used to learn a number of embeddings that are added to the hidden states. During inference, you can denoise for up to, but not more than, num_embeds_ada_norm steps.
attention_bias (bool, optional) — Configure if the TransformerBlocks’ attention should contain a bias parameter.

Transformer model for image-like data. Takes either discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When input is continuous: First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image.

When input is discrete: First, input (classes of latent pixels) is converted to embeddings and has positional embeddings applied, see ImagePositionalEmbeddings. Then apply standard transformer action. Finally, predict classes of unnoised image.

Note that it is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image do not contain a prediction for the masked pixel as the unnoised image cannot be masked.
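For the continuous case, the reshape to b, t, d and back to an image can be sketched with nested lists. image_to_tokens and tokens_to_image are hypothetical helpers. The projection and the transformer blocks themselves are omitted, so this only shows the layout change, assuming t = height * width and d = channels.

```python
def image_to_tokens(images):
    """Reshape [b][c][h][w] nested lists into [b][h * w][c] token sequences."""
    return [
        [
            [image[c][i][j] for c in range(len(image))]
            for i in range(len(image[0]))
            for j in range(len(image[0][0]))
        ]
        for image in images
    ]


def tokens_to_image(tokens, height, width):
    """Inverse reshape: [b][h * w][c] back to [b][c][h][w]."""
    return [
        [
            [[sequence[i * width + j][c] for j in range(width)] for i in range(height)]
            for c in range(len(sequence[0]))
        ]
        for sequence in tokens
    ]


images = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]  # b=1, c=2, h=2, w=2
tokens = image_to_tokens(images)
round_trip = tokens_to_image(tokens, height=2, width=2)
```

Each token gathers the channel values of one pixel, so attention across tokens is attention across spatial positions.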

forward

( hidden_states: Tensor encoder_hidden_states: typing.Optional[torch.Tensor] = None timestep: typing.Optional[torch.LongTensor] = None class_labels: typing.Optional[torch.LongTensor] = None cross_attention_kwargs: typing.Dict[str, typing.Any] = None attention_mask: typing.Optional[torch.Tensor] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → Transformer2DModelOutput or tuple

Parameters

hidden_states (torch.LongTensor of shape (batch size, num latent pixels) when discrete, torch.FloatTensor of shape (batch size, channel, height, width) when continuous) — Input hidden_states.
encoder_hidden_states (torch.FloatTensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention.
timestep (torch.LongTensor, optional) — Optional timestep to be applied as an embedding in AdaLayerNorm. Used to indicate the denoising step.
class_labels (torch.LongTensor of shape (batch size, num classes), optional) — Optional class labels to be applied as an embedding in AdaLayerZeroNorm. Used to indicate class label conditioning.
encoder_attention_mask (torch.Tensor, optional) — Cross-attention mask, applied to encoder_hidden_states. Two formats are supported: a mask of shape (batch, sequence_length), where True = keep and False = discard, and a bias of shape (batch, 1, sequence_length), where 0 = keep and -10000 = discard. If ndim == 2, the input is interpreted as a mask and converted into a bias consistent with the format above. This bias is added to the cross-attention scores.
return_dict (bool, optional, defaults to True) — Whether or not to return a Transformer2DModelOutput instead of a plain tuple.

Transformer2DModelOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.
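The two encoder_attention_mask formats described in the parameter list above are related by a simple conversion. The sketch below uses nested lists and the hypothetical helpers mask_to_bias and add_bias. It follows the documented convention (0 = keep, -10000 = discard) and does not reproduce the library's tensor code.

```python
def mask_to_bias(mask, discard_value=-10000.0):
    """Convert a (batch, sequence_length) keep-mask into a (batch, 1, sequence_length) bias."""
    return [[[0.0 if keep else discard_value for keep in row]] for row in mask]


def add_bias(scores, bias_row):
    """Add one bias row to every query row of a (query_len, sequence_length) score matrix."""
    return [[score + bias for score, bias in zip(row, bias_row)] for row in scores]


mask = [[True, True, False]]  # keep the first two encoder tokens, discard the third
bias = mask_to_bias(mask)
biased_scores = add_bias([[0.2, 0.1, 0.9]], bias[0][0])
```

After a softmax, a score near -10000 contributes essentially zero weight, which is how the discarded positions are ignored.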

Transformer2DModelOutput

class diffusers.models.transformer_2d.Transformer2DModelOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size, num_channels, height, width) or (batch size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — Hidden states conditioned on encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

TransformerTemporalModel

class diffusers.models.transformer_temporal.TransformerTemporalModel

Parameters

num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
in_channels (int, optional) — Pass if the input is continuous. The number of channels in the input and output.
num_layers (int, optional, defaults to 1) — The number of layers of Transformer blocks to use.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.
cross_attention_dim (int, optional) — The number of encoder_hidden_states dimensions to use.
sample_size (int, optional) — Pass if the input is discrete. The width of the latent images. Note that this is fixed at training time as it is used for learning a number of position embeddings. See ImagePositionalEmbeddings.
activation_fn (str, optional, defaults to "geglu") — Activation function to be used in feed-forward.
attention_bias (bool, optional) — Configure if the TransformerBlocks’ attention should contain a bias parameter.
double_self_attention (bool, optional) — Configure if each TransformerBlock should contain two self-attention layers

Transformer model for video-like data.

forward

( hidden_states encoder_hidden_states = None timestep = None class_labels = None num_frames = 1 cross_attention_kwargs = None return_dict: bool = True ) → ~models.transformer_temporal.TransformerTemporalModelOutput or tuple

Parameters

hidden_states (torch.LongTensor of shape (batch size, num latent pixels) when discrete, torch.FloatTensor of shape (batch size, channel, height, width) when continuous) — Input hidden_states.
encoder_hidden_states (torch.LongTensor of shape (batch size, encoder_hidden_states dim), optional) — Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention.
timestep (torch.long, optional) — Optional timestep to be applied as an embedding in AdaLayerNorm. Used to indicate the denoising step.
class_labels (torch.LongTensor of shape (batch size, num classes), optional) — Optional class labels to be applied as an embedding in AdaLayerZeroNorm. Used to indicate class label conditioning.
return_dict (bool, optional, defaults to True) — Whether or not to return a TransformerTemporalModelOutput instead of a plain tuple.

Returns

~models.transformer_temporal.TransformerTemporalModelOutput or tuple

~models.transformer_temporal.TransformerTemporalModelOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

TransformerTemporalModelOutput

class diffusers.models.transformer_temporal.TransformerTemporalModelOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size x num_frames, num_channels, height, width)) — Hidden states conditioned on encoder_hidden_states input.

PriorTransformer

class diffusers.PriorTransformer

( num_attention_heads: int = 32 attention_head_dim: int = 64 num_layers: int = 20 embedding_dim: int = 768 num_embeddings = 77 additional_embeddings = 4 dropout: float = 0.0 )

Parameters

num_attention_heads (int, optional, defaults to 32) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 64) — The number of channels in each head.
num_layers (int, optional, defaults to 20) — The number of layers of Transformer blocks to use.
embedding_dim (int, optional, defaults to 768) — The dimension of the CLIP embeddings. Note that CLIP image embeddings and text embeddings are both the same dimension.
num_embeddings (int, optional, defaults to 77) — The max number of clip embeddings allowed. I.e. the length of the prompt after it has been tokenized.
additional_embeddings (int, optional, defaults to 4) — The number of additional tokens appended to the projected hidden_states. The actual length of the used hidden_states is num_embeddings + additional_embeddings.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.

The prior transformer from unCLIP is used to predict CLIP image embeddings from CLIP text embeddings. Note that the transformer predicts the image embeddings through a denoising diffusion process.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

For more details, see the original paper: https://arxiv.org/abs/2204.06125

forward

( hidden_states timestep: typing.Union[torch.Tensor, float, int] proj_embedding: FloatTensor encoder_hidden_states: FloatTensor attention_mask: typing.Optional[torch.BoolTensor] = None return_dict: bool = True ) → PriorTransformerOutput or tuple

Parameters

hidden_states (torch.FloatTensor of shape (batch_size, embedding_dim)) — x_t, the currently predicted image embeddings.
timestep (torch.long) — Current denoising step.
proj_embedding (torch.FloatTensor of shape (batch_size, embedding_dim)) — Projected embedding vector the denoising process is conditioned on.
encoder_hidden_states (torch.FloatTensor of shape (batch_size, num_embeddings, embedding_dim)) — Hidden states of the text embeddings the denoising process is conditioned on.
attention_mask (torch.BoolTensor of shape (batch_size, num_embeddings)) — Text mask for the text embeddings.
return_dict (bool, optional, defaults to True) — Whether or not to return a models.prior_transformer.PriorTransformerOutput instead of a plain tuple.

PriorTransformerOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

set_attn_processor

Parameters

processor (dict of AttentionProcessor or AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor of all Attention layers. If processor is a dict, each key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Disables custom attention processors and sets the default attention implementation.

PriorTransformerOutput

class diffusers.models.prior_transformer.PriorTransformerOutput

( predicted_image_embedding: FloatTensor )

Parameters

predicted_image_embedding (torch.FloatTensor of shape (batch_size, embedding_dim)) — The predicted CLIP image embedding conditioned on the CLIP text embedding input.

ControlNetOutput

class diffusers.models.controlnet.ControlNetOutput

( down_block_res_samples: typing.Tuple[torch.Tensor] mid_block_res_sample: Tensor )

ControlNetModel

class diffusers.ControlNetModel

( in_channels: int = 4 conditioning_channels: int = 3 flip_sin_to_cos: bool = True freq_shift: int = 0 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: int = 1280 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None use_linear_projection: bool = False class_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None upcast_attention: bool = False resnet_time_scale_shift: str = 'default' projection_class_embeddings_input_dim: typing.Optional[int] = None controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256) global_pool_conditions: bool = False )

from_unet

( unet: UNet2DConditionModel controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256) load_weights_from_unet: bool = True )

Parameters

unet (UNet2DConditionModel) — UNet model whose weights are copied to the ControlNet. Note that all configuration options are also copied where applicable.

Instantiate a ControlNetModel from a UNet2DConditionModel.

set_attention_slice

( slice_size )

Parameters

slice_size (str or int or list(int), optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention will be computed in two steps. If "max", the maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as num_attention_heads // slice_size. In this case, num_attention_heads must be a multiple of slice_size.

Enable sliced attention computation.

set_attn_processor

Parameters

processor (dict of AttentionProcessor or AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor of all Attention layers. If processor is a dict, each key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Disables custom attention processors and sets the default attention implementation.

FlaxModelMixin

Base class for all flax models.

FlaxModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] dtype: dtype = <class 'jax.numpy.float32'> *model_args **kwargs )

Parameters

pretrained_model_name_or_path (str or os.PathLike) — Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids are namespaced under a user or organization name, like runwayml/stable-diffusion-v1-5.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype.
Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.
If you wish to change the dtype of the model parameters, see ~FlaxModelMixin.to_fp16 and ~FlaxModelMixin.to_bf16.
model_args (sequence of positional arguments, optional) — All remaining positional arguments will be passed to the underlying model’s __init__ method.
cache_dir (Union[str, os.PathLike], optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.
proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
local_files_only (bool, optional, defaults to False) — Whether or not to only look at local files (i.e., do not try to download the model).
revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.
from_pt (bool, optional, defaults to False) — Load the model weights from a PyTorch checkpoint save file.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it has been loaded) and initiate the model (e.g., output_attentions=True). Behaves differently depending on whether a config is provided or automatically loaded:
- If a configuration is provided with config, **kwargs will be directly passed to the underlying model’s __init__ method (we assume all relevant updates to the configuration have already been done).
- If a configuration is not provided, kwargs will be first passed to the configuration class initialization function (from_config()). Each key of kwargs that corresponds to a configuration attribute will be used to override said attribute with the supplied kwargs value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model’s __init__ function.

Instantiate a pretrained flax model from a pre-trained model configuration.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

model, params = FlaxUNet2DConditionModel.from_pretrained("./test/saved_model/")
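The kwargs handling described for from_pretrained above, for the case where no configuration is provided, can be sketched as a split between configuration overrides and leftover keys. split_kwargs and the toy configuration are hypothetical. This is only an illustration of the documented rule, not the library's loading code.

```python
def split_kwargs(config, kwargs):
    """Override matching config attributes; return the updated config and leftover kwargs."""
    updated = dict(config)  # copy so the original configuration is left untouched
    unused = {}
    for key, value in kwargs.items():
        if key in updated:
            updated[key] = value  # key names a configuration attribute: override it
        else:
            unused[key] = value  # otherwise it would be forwarded to __init__
    return updated, unused


toy_config = {"in_channels": 4, "layers_per_block": 2}  # made-up configuration
new_config, init_kwargs = split_kwargs(toy_config, {"layers_per_block": 3, "verbose": True})
```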

save_pretrained

( save_directory: typing.Union[str, os.PathLike] params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] is_main_process: bool = True )

Parameters

save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn’t exist.
params (Union[Dict, FrozenDict]) — A PyTree of model parameters.
is_main_process (bool, optional, defaults to True) — Whether the process calling this is the main process or not. Useful during distributed training (for example, on TPUs) when this function must be called on all processes. In that case, set is_main_process=True only on the main process to avoid race conditions.

Save a model and its configuration file to a directory so that it can be reloaded using the [from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.FlaxModelMixin.from_pretrained) class method.
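
A minimal sketch of the multi-process pattern described for is_main_process follows. It assumes that the process with index 0 acts as the main process; the wrapper `save_from_every_process` is hypothetical, and `model` is any object exposing save_pretrained with the signature shown above.

```python
import jax


def save_from_every_process(model, params, save_directory):
    """Hypothetical wrapper: call save_pretrained on all processes,
    flagging only process 0 as the main process (an assumption)."""
    model.save_pretrained(
        save_directory,
        params=params,
        is_main_process=jax.process_index() == 0,
    )
```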

to_bf16

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

params (Union[Dict, FrozenDict]) — A PyTree of model parameters.
mask (Union[Dict, FrozenDict]) — A PyTree with same structure as the params tree. The leaves should be booleans, True for params you want to cast, and should be False for those you want to skip.

Cast the floating-point params to jax.numpy.bfloat16. This returns a new params tree and does not cast the params in place.

This method can be used on TPU to explicitly convert the model parameters to bfloat16 precision to do full half-precision training or to save weights in bfloat16 for inference in order to save memory and improve speed.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

params = model.to_bf16(params)

from flax import traverse_util

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")
flat_params = traverse_util.flatten_dict(params)
# Cast every leaf except the LayerNorm bias and scale parameters.
mask = {
    path: (path[-2:] != ("LayerNorm", "bias") and path[-2:] != ("LayerNorm", "scale"))
    for path in flat_params
}
mask = traverse_util.unflatten_dict(mask)
params = model.to_bf16(params, mask)
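
To see what the mask in the second example selects, the sketch below builds the same kind of mask over a toy parameter tree. The `flatten` helper is a hypothetical stand-in for `traverse_util.flatten_dict`, and the parameter names are made up for illustration.

```python
def flatten(tree, prefix=()):
    """Hypothetical stand-in for flatten_dict: nested dict -> {path tuple: leaf}."""
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Toy parameter tree; the leaves would normally be arrays.
toy_params = {
    "block": {
        "dense": {"kernel": 0.0, "bias": 0.0},
        "LayerNorm": {"scale": 0.0, "bias": 0.0},
    }
}

skip = {("LayerNorm", "bias"), ("LayerNorm", "scale")}
# True means "cast this leaf"; False means "leave it in its current dtype".
mask = {path: path[-2:] not in skip for path in flatten(toy_params)}
```

Only the last two path components are compared, so a LayerNorm nested at any depth is skipped while the dense kernel and bias are still cast.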

to_fp16

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

params (Union[Dict, FrozenDict]) — A PyTree of model parameters.
mask (Union[Dict, FrozenDict]) — A PyTree with same structure as the params tree. The leaves should be booleans, True for params you want to cast, and should be False for those you want to skip.

Cast the floating-point params to jax.numpy.float16. This returns a new params tree and does not cast the params in place.

This method can be used on GPU to explicitly convert the model parameters to float16 precision to do full half-precision training or to save weights in float16 for inference in order to save memory and improve speed.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

params = model.to_fp16(params)

from flax import traverse_util

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")
flat_params = traverse_util.flatten_dict(params)
# Cast every leaf except the LayerNorm bias and scale parameters.
mask = {
    path: (path[-2:] != ("LayerNorm", "bias") and path[-2:] != ("LayerNorm", "scale"))
    for path in flat_params
}
mask = traverse_util.unflatten_dict(mask)
params = model.to_fp16(params, mask)
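
Conceptually, a masked cast walks the params and mask trees together and converts only the floating-point leaves whose mask entry is True. The following is a minimal sketch of that idea using `jax.tree_util.tree_map`; `cast_floating` is a hypothetical helper and not the library's implementation.

```python
import jax
import jax.numpy as jnp


def cast_floating(params, dtype, mask=None):
    """Hypothetical helper: return a new tree with selected floating-point
    leaves cast to dtype. The input tree is not modified."""

    def cast(leaf):
        # Only floating-point arrays are converted; integer leaves pass through.
        if jnp.issubdtype(leaf.dtype, jnp.floating):
            return leaf.astype(dtype)
        return leaf

    if mask is None:
        return jax.tree_util.tree_map(cast, params)
    # mask must have the same structure as params, with boolean leaves.
    return jax.tree_util.tree_map(
        lambda leaf, m: cast(leaf) if m else leaf, params, mask
    )


# Toy parameters, for illustration only.
params = {
    "dense": {"kernel": jnp.ones((2, 2)), "steps": jnp.arange(3)},
    "LayerNorm": {"scale": jnp.ones(2)},
}
mask = {"dense": {"kernel": True, "steps": True}, "LayerNorm": {"scale": False}}
half = cast_floating(params, jnp.float16, mask)
```

The integer leaf keeps its dtype even though its mask entry is True, and the masked-out LayerNorm scale stays in float32.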

to_fp32

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

params (Union[Dict, FrozenDict]) — A PyTree of model parameters.
mask (Union[Dict, FrozenDict]) — A PyTree with same structure as the params tree. The leaves should be booleans, True for params you want to cast, and should be False for those you want to skip.

Cast the floating-point params to jax.numpy.float32. This method can be used to explicitly convert the model parameters to fp32 precision. This returns a new params tree and does not cast the params in place.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

params = model.to_fp16(params)

params = model.to_fp32(params)

FlaxUNet2DConditionOutput

class diffusers.models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

< source >

( sample: ndarray )

Parameters

sample (jnp.ndarray of shape (batch_size, num_channels, height, width)) — Hidden states conditioned on encoder_hidden_states input. Output of last layer of model.

Returns a new object replacing the specified fields with new values.

FlaxUNet2DConditionModel

class diffusers.FlaxUNet2DConditionModel

< source >

( sample_size: int = 32 in_channels: int = 4 out_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False dtype: dtype = <class 'jax.numpy.float32'> flip_sin_to_cos: bool = True freq_shift: int = 0 use_memory_efficient_attention: bool = False parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

sample_size (int, optional) — The size of the input sample.
in_channels (int, optional, defaults to 4) — The number of channels in the input sample.
out_channels (int, optional, defaults to 4) — The number of channels in the output.
down_block_types (Tuple[str], optional, defaults to ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")) — The tuple of downsample blocks to use. The corresponding class names will be: “FlaxCrossAttnDownBlock2D”, “FlaxCrossAttnDownBlock2D”, “FlaxCrossAttnDownBlock2D”, “FlaxDownBlock2D”
up_block_types (Tuple[str], optional, defaults to ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D",)) — The tuple of upsample blocks to use. The corresponding class names will be: “FlaxUpBlock2D”, “FlaxCrossAttnUpBlock2D”, “FlaxCrossAttnUpBlock2D”, “FlaxCrossAttnUpBlock2D”
block_out_channels (Tuple[int], optional, defaults to (320, 640, 1280, 1280)) — The tuple of output channels for each block.
layers_per_block (int, optional, defaults to 2) — The number of layers per block.
attention_head_dim (int or Tuple[int], optional, defaults to 8) — The dimension of the attention heads.
num_attention_heads (int or Tuple[int], optional) — The number of attention heads.
cross_attention_dim (int, optional, defaults to 1280) — The dimension of the cross attention features.
dropout (float, optional, defaults to 0) — Dropout probability for down, up and bottleneck blocks.
flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip the sin to cos in the time embedding.
freq_shift (int, optional, defaults to 0) — The frequency shift to apply to the time embedding.
use_memory_efficient_attention (bool, optional, defaults to False) — Whether to enable memory efficient attention as described in https://arxiv.org/abs/2112.05682.

FlaxUNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep and returns sample shaped output.

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
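
As a rough illustration, the sketch below applies `jax.jit`, `jax.grad`, and `jax.vmap` to a toy loss function. The function and parameter names are hypothetical and merely stand in for a real model's apply function.

```python
import jax
import jax.numpy as jnp


def toy_loss(params, sample, target):
    """Hypothetical stand-in for a model loss: affine prediction, squared error."""
    pred = sample * params["scale"] + params["shift"]
    return jnp.mean((pred - target) ** 2)


params = {"scale": jnp.array(1.0), "shift": jnp.array(0.0)}
sample = jnp.ones((4, 3))
target = jnp.zeros((4, 3))

loss_fn = jax.jit(toy_loss)                              # JIT compilation
grad_fn = jax.jit(jax.grad(toy_loss))                    # gradient w.r.t. params
per_example = jax.vmap(toy_loss, in_axes=(None, 0, 0))   # one loss per batch row

loss = loss_fn(params, sample, target)
grads = grad_fn(params, sample, target)
losses = per_example(params, sample, target)
```

The gradient has the same tree structure as `params`, and the vectorized call returns one loss value per row of the batch.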

FlaxDecoderOutput

class diffusers.models.vae_flax.FlaxDecoderOutput

< source >

( sample: ndarray )

Parameters

sample (jnp.ndarray of shape (batch_size, num_channels, height, width)) — Decoded output sample of the model. Output of the last layer of the model.
dtype (jnp.dtype, optional, defaults to jnp.float32) — Parameters dtype

Output of decoding method.

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKLOutput

class diffusers.models.vae_flax.FlaxAutoencoderKLOutput

< source >

( latent_dist: FlaxDiagonalGaussianDistribution )

Parameters

latent_dist (FlaxDiagonalGaussianDistribution) — Encoded outputs of Encoder represented as the mean and logvar of FlaxDiagonalGaussianDistribution. FlaxDiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.
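
A diagonal Gaussian described by a mean and a log-variance is commonly sampled with the reparameterization z = mean + exp(0.5 * logvar) * eps, where eps is standard normal noise. The sketch below illustrates that formula only; `sample_latents` is a hypothetical helper, not the library's FlaxDiagonalGaussianDistribution class, and the array shapes are assumed for illustration.

```python
import jax
import jax.numpy as jnp


def sample_latents(mean, logvar, key):
    """Draw z = mean + std * eps with eps ~ N(0, I) and std = exp(0.5 * logvar)."""
    std = jnp.exp(0.5 * logvar)
    eps = jax.random.normal(key, mean.shape)
    return mean + std * eps


key = jax.random.PRNGKey(0)
mean = jnp.zeros((1, 4, 8, 8))
# A very negative log-variance gives a tiny std, so samples sit near the mean.
logvar = jnp.full((1, 4, 8, 8), -60.0)
z = sample_latents(mean, logvar, key)
```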

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKL

class diffusers.FlaxAutoencoderKL

< source >

( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 dtype: dtype = <class 'jax.numpy.float32'> parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

in_channels (int, optional, defaults to 3) — Input channels
out_channels (int, optional, defaults to 3) — Output channels
down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — DownEncoder block type
up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — UpDecoder block type
block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple containing the number of output channels for each block
layers_per_block (int, optional, defaults to 1) — Number of ResNet layers for each block
act_fn (str, optional, defaults to silu) — Activation function
latent_channels (int, optional, defaults to 4) — Latent space channels
norm_num_groups (int, optional, defaults to 32) — Number of groups used for normalization
sample_size (int, optional, defaults to 32) — Sample input size
scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
dtype (jnp.dtype, optional, defaults to jnp.float32) — parameters dtype
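
The two scaling_factor formulas can be sketched as a pair of helpers. This is a minimal illustration under the default value from the signature above; the function names `scale_latents` and `unscale_latents` and the latent shape are hypothetical.

```python
import jax.numpy as jnp

SCALING_FACTOR = 0.18215  # default value from the signature above


def scale_latents(z, scaling_factor=SCALING_FACTOR):
    """Scale VAE latents before passing them to the diffusion model."""
    return z * scaling_factor


def unscale_latents(z, scaling_factor=SCALING_FACTOR):
    """Undo the scaling before passing latents to the decoder."""
    return 1 / scaling_factor * z


latents = jnp.ones((1, 4, 32, 32)) * 5.0  # assumed toy latents
scaled = scale_latents(latents)
restored = unscale_latents(scaled)
```

Applying the two helpers in sequence returns the original latents up to floating-point rounding.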

Flax Implementation of Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization

FlaxControlNetOutput

class diffusers.models.controlnet_flax.FlaxControlNetOutput

< source >

( down_block_res_samples: ndarray mid_block_res_sample: ndarray )

Returns a new object replacing the specified fields with new values.

FlaxControlNetModel

class diffusers.FlaxControlNetModel

< source >

( sample_size: int = 32 in_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False dtype: dtype = <class 'jax.numpy.float32'> flip_sin_to_cos: bool = True freq_shift: int = 0 controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Tuple[int] = (16, 32, 96, 256) parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

sample_size (int, optional) — The size of the input sample.
in_channels (int, optional, defaults to 4) — The number of channels in the input sample.
down_block_types (Tuple[str], optional, defaults to ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")) — The tuple of downsample blocks to use. The corresponding class names will be: “FlaxCrossAttnDownBlock2D”, “FlaxCrossAttnDownBlock2D”, “FlaxCrossAttnDownBlock2D”, “FlaxDownBlock2D”
block_out_channels (Tuple[int], optional, defaults to (320, 640, 1280, 1280)) — The tuple of output channels for each block.
layers_per_block (int, optional, defaults to 2) — The number of layers per block.
attention_head_dim (int or Tuple[int], optional, defaults to 8) — The dimension of the attention heads.
num_attention_heads (int or Tuple[int], optional) — The number of attention heads.
cross_attention_dim (int, optional, defaults to 1280) — The dimension of the cross attention features.
dropout (float, optional, defaults to 0) — Dropout probability for down, up and bottleneck blocks.
flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip the sin to cos in the time embedding.
freq_shift (int, optional, defaults to 0) — The frequency shift to apply to the time embedding.
controlnet_conditioning_channel_order (str, optional, defaults to rgb) — The channel order of the conditioning image. It will be converted to rgb if it is bgr.
conditioning_embedding_out_channels (tuple, optional, defaults to (16, 32, 96, 256)) — The tuple of output channels for each block in the conditioning_embedding layer.
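
Converting a bgr conditioning image to rgb amounts to reversing the channel axis. The sketch below assumes a (batch_size, num_channels, height, width) layout with channels on axis 1; `to_rgb` is a hypothetical helper shown for illustration, not the library's code path.

```python
import jax.numpy as jnp


def to_rgb(cond, channel_order="rgb"):
    """Return the conditioning image in RGB order (channels assumed on axis 1)."""
    if channel_order == "bgr":
        return jnp.flip(cond, axis=1)  # reverse B, G, R -> R, G, B
    return cond


# Toy image whose three channels hold the values 0, 1, 2.
cond_bgr = jnp.arange(3.0).reshape(1, 3, 1, 1)
cond_rgb = to_rgb(cond_bgr, "bgr")
```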

Quoting from https://arxiv.org/abs/2302.05543: “Stable Diffusion uses a pre-processing method similar to VQ-GAN [11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions … into feature maps …”

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization