Models

Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models. The primary function of these models is to denoise an input sample by modeling the distribution $p_{\theta}(x_{t-1} \mid x_{t})$. The models are built on the base class ModelMixin, a torch.nn.Module with basic functionality for saving and loading models both locally and from the Hugging Face Hub.

ModelMixin

Base class for all models.

ModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

disable_gradient_checkpointing

( )

Deactivates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

disable_xformers_memory_efficient_attention

< source >

( )

Disable memory efficient attention as implemented in xformers.

enable_gradient_checkpointing

( )

Activates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.
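As a rough illustration of the trade-off behind gradient checkpointing, the sketch below applies PyTorch's torch.utils.checkpoint.checkpoint to a toy block. It is not how Diffusers wires the feature internally; the block, tensor shapes, and variable names are assumptions made for the example.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Toy block standing in for one sub-module of a larger model (hypothetical).
block = nn.Sequential(nn.Linear(16, 16), nn.SiLU(), nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

# Regular forward pass: intermediate activations are kept for backward.
y_plain = block(x)

# Checkpointed forward pass: activations inside `block` are not stored and
# are recomputed during backward, trading extra compute for lower memory.
y_ckpt = checkpoint(block, x, use_reentrant=False)

# Gradients still flow through the checkpointed path.
y_ckpt.sum().backward()
```

The forward results are the same either way; only memory use and backward-pass compute differ. On a Diffusers model, the switch is made with enable_gradient_checkpointing() and disable_gradient_checkpointing().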

enable_xformers_memory_efficient_attention

< source >

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

Enable memory efficient attention as implemented in xformers.

When this option is enabled, you should observe lower GPU memory usage and a potential speed up at inference time. Speed up at training time is not guaranteed.

Warning: When both memory-efficient attention and sliced attention are enabled, memory-efficient attention takes precedence.

Examples:

import torch
from diffusers import UNet2DConditionModel
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

model = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet", torch_dtype=torch.float16
)
model = model.to("cuda")
model.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)

from_pretrained

< source >

( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] **kwargs )

Parameters

Instantiate a pretrained PyTorch model from a pretrained model configuration.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

You must be logged in (run huggingface-cli login) to use private or gated models.

Activate the special “offline-mode” to use this method in a firewalled environment.

num_parameters

< source >

( only_trainable: bool = False exclude_embeddings: bool = False ) → int

Parameters

The number of parameters.

Get number of (optionally, trainable or non-embeddings) parameters in the module.
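The count that num_parameters reports can be approximated by hand on any torch.nn.Module. The sketch below is a minimal stand-in, not the library's implementation: count_parameters is a hypothetical helper, the toy model is arbitrary, and the exclude_embeddings option is not covered.

```python
from torch import nn


def count_parameters(module: nn.Module, only_trainable: bool = False) -> int:
    """Hypothetical helper: sum the element counts of a module's parameters."""
    return sum(
        p.numel()
        for p in module.parameters()
        if p.requires_grad or not only_trainable
    )


# Toy model: Linear(8, 4) has 8*4 + 4 = 36 parameters, Linear(4, 2) has 4*2 + 2 = 10.
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))

# Freeze the first layer so the trainable and total counts differ.
for p in model[0].parameters():
    p.requires_grad_(False)

total = count_parameters(model)
trainable = count_parameters(model, only_trainable=True)
```

Here total is 46 and trainable is 10, since only the second layer still requires gradients.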

save_pretrained

< source >

( save_directory: typing.Union[str, os.PathLike] is_main_process: bool = True save_function: typing.Callable = None safe_serialization: bool = False variant: typing.Optional[str] = None )

Parameters

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

UNet2DOutput

class diffusers.models.unet_2d.UNet2DOutput

< source >

( sample: FloatTensor )

Parameters

UNet2DModel

class diffusers.UNet2DModel

< source >

( sample_size: typing.Union[int, typing.Tuple[int, int], NoneType] = None in_channels: int = 3 out_channels: int = 3 center_input_sample: bool = False time_embedding_type: str = 'positional' freq_shift: int = 0 flip_sin_to_cos: bool = True down_block_types: typing.Tuple[str] = ('DownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D') up_block_types: typing.Tuple[str] = ('AttnUpBlock2D', 'AttnUpBlock2D', 'AttnUpBlock2D', 'UpBlock2D') block_out_channels: typing.Tuple[int] = (224, 448, 672, 896) layers_per_block: int = 2 mid_block_scale_factor: float = 1 downsample_padding: int = 1 act_fn: str = 'silu' attention_head_dim: typing.Optional[int] = 8 norm_num_groups: int = 32 norm_eps: float = 1e-05 resnet_time_scale_shift: str = 'default' add_attention: bool = True class_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None )

Parameters

UNet2DModel is a 2D UNet model that takes a noisy sample and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

< source >

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] class_labels: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → UNet2DOutput or tuple

Parameters

Returns

UNet2DOutput or tuple

UNet2DOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

UNet1DOutput

class diffusers.models.unet_1d.UNet1DOutput

< source >

( sample: FloatTensor )

Parameters

UNet1DModel

class diffusers.UNet1DModel

< source >

( sample_size: int = 65536 sample_rate: typing.Optional[int] = None in_channels: int = 2 out_channels: int = 2 extra_in_channels: int = 0 time_embedding_type: str = 'fourier' flip_sin_to_cos: bool = True use_timestep_embedding: bool = False freq_shift: float = 0.0 down_block_types: typing.Tuple[str] = ('DownBlock1DNoSkip', 'DownBlock1D', 'AttnDownBlock1D') up_block_types: typing.Tuple[str] = ('AttnUpBlock1D', 'UpBlock1D', 'UpBlock1DNoSkip') mid_block_type: typing.Tuple[str] = 'UNetMidBlock1D' out_block_type: str = None block_out_channels: typing.Tuple[int] = (32, 32, 64) act_fn: str = None norm_num_groups: int = 8 layers_per_block: int = 1 downsample_each_block: bool = False )

Parameters

UNet1DModel is a 1D UNet model that takes a noisy sample and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

< source >

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] return_dict: bool = True ) → UNet1DOutput or tuple

Parameters

Returns

UNet1DOutput or tuple

UNet1DOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

UNet2DConditionOutput

class diffusers.models.unet_2d_condition.UNet2DConditionOutput

< source >

( sample: FloatTensor )

Parameters

UNet2DConditionModel

class diffusers.UNet2DConditionModel

< source >

( sample_size: typing.Optional[int] = None in_channels: int = 4 out_channels: int = 4 center_input_sample: bool = False flip_sin_to_cos: bool = True freq_shift: int = 0 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') mid_block_type: typing.Optional[str] = 'UNetMidBlock2DCrossAttn' up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: typing.Union[int, typing.Tuple[int]] = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: typing.Union[int, typing.Tuple[int]] = 1280 encoder_hid_dim: typing.Optional[int] = None encoder_hid_dim_type: typing.Optional[str] = None attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None dual_cross_attention: bool = False use_linear_projection: bool = False class_embed_type: typing.Optional[str] = None addition_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None upcast_attention: bool = False resnet_time_scale_shift: str = 'default' resnet_skip_time_act: bool = False resnet_out_scale_factor: int = 1.0 time_embedding_type: str = 'positional' time_embedding_dim: typing.Optional[int] = None time_embedding_act_fn: typing.Optional[str] = None timestep_post_act: typing.Optional[str] = None time_cond_proj_dim: typing.Optional[int] = None conv_in_kernel: int = 3 conv_out_kernel: int = 3 projection_class_embeddings_input_dim: typing.Optional[int] = None class_embeddings_concat: bool = False mid_block_only_cross_attention: typing.Optional[bool] = None cross_attention_norm: typing.Optional[str] = None addition_embed_type_num_heads = 64 )

Parameters

UNet2DConditionModel is a conditional 2D UNet model that takes a noisy sample, a conditional state, and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

< source >

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] encoder_hidden_states: Tensor class_labels: typing.Optional[torch.Tensor] = None timestep_cond: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None added_cond_kwargs: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = None down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None mid_block_additional_residual: typing.Optional[torch.Tensor] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → UNet2DConditionOutput or tuple

Parameters

Returns

UNet2DConditionOutput or tuple

UNet2DConditionOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

set_attention_slice

< source >

( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
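The idea behind sliced attention can be shown in plain PyTorch: attention is computed for a few entries of the leading (batch times heads) dimension at a time, so the full score tensor never has to exist at once. This is a conceptual sketch only; full_attention and sliced_attention are hypothetical helpers, not the library's attention processors.

```python
import torch


def full_attention(q, k, v):
    # Scaled dot-product attention over the whole batch in one step.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v


def sliced_attention(q, k, v, slice_size):
    # Same computation, but only `slice_size` entries of the leading
    # dimension are processed at a time, which lowers peak memory.
    out = torch.empty_like(q)
    for start in range(0, q.shape[0], slice_size):
        end = start + slice_size
        out[start:end] = full_attention(q[start:end], k[start:end], v[start:end])
    return out


torch.manual_seed(0)
# Shapes: (batch * heads, sequence length, head dimension), chosen arbitrarily.
q = torch.randn(8, 5, 16)
k = torch.randn(8, 5, 16)
v = torch.randn(8, 5, 16)

reference = full_attention(q, k, v)
sliced = sliced_attention(q, k, v, slice_size=3)
```

Both paths produce the same result up to floating-point tolerance; the sliced version runs more, smaller matrix multiplications, which is where the small speed decrease comes from.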

set_attn_processor

< source >

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor

( )

Disables custom attention processors and sets the default attention implementation.

UNet3DConditionOutput

class diffusers.models.unet_3d_condition.UNet3DConditionOutput

< source >

( sample: FloatTensor )

Parameters

UNet3DConditionModel

class diffusers.UNet3DConditionModel

< source >

( sample_size: typing.Optional[int] = None in_channels: int = 4 out_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D') up_block_types: typing.Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D') block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: int = 1024 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 64 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None )

Parameters

UNet3DConditionModel is a conditional 3D UNet model that takes a noisy sample, a conditional state, and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

< source >

( sample: FloatTensor timestep: typing.Union[torch.Tensor, float, int] encoder_hidden_states: Tensor class_labels: typing.Optional[torch.Tensor] = None timestep_cond: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None mid_block_additional_residual: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → UNet3DConditionOutput or tuple

Parameters

Returns

UNet3DConditionOutput or tuple

UNet3DConditionOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

set_attention_slice

< source >

( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.

set_attn_processor

< source >

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor

( )

Disables custom attention processors and sets the default attention implementation.

DecoderOutput

class diffusers.models.vae.DecoderOutput

< source >

( sample: FloatTensor )

Parameters

Output of decoding method.

VQEncoderOutput

class diffusers.models.vq_model.VQEncoderOutput

< source >

( latents: FloatTensor )

Parameters

Output of VQModel encoding method.

VQModel

class diffusers.VQModel

< source >

( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 3 sample_size: int = 32 num_vq_embeddings: int = 256 norm_num_groups: int = 32 vq_embed_dim: typing.Optional[int] = None scaling_factor: float = 0.18215 norm_type: str = 'group' )

Parameters

VQ-VAE model from the paper Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

< source >

( sample: FloatTensor return_dict: bool = True )

Parameters

AutoencoderKLOutput

class diffusers.models.autoencoder_kl.AutoencoderKLOutput

< source >

( latent_dist: DiagonalGaussianDistribution )

Parameters

Output of AutoencoderKL encoding method.

AutoencoderKL

class diffusers.AutoencoderKL

< source >

( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 )

Parameters

Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

disable_slicing

( )

Disable sliced VAE decoding. If enable_slicing was previously invoked, this method will go back to computing decoding in one step.

disable_tiling

( )

Disable tiled VAE decoding. If enable_tiling was previously invoked, this method will go back to computing decoding in one step.

enable_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and decodes them in several steps. This is useful to save some memory and allow larger batch sizes.

enable_tiling

< source >

( use_tiling: bool = True )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful to save a large amount of memory and to allow the processing of larger images.

forward

< source >

( sample: FloatTensor sample_posterior: bool = False return_dict: bool = True generator: typing.Optional[torch._C.Generator] = None )

Parameters

set_attn_processor

< source >

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor

( )

Disables custom attention processors and sets the default attention implementation.

tiled_decode

< source >

( z: FloatTensor return_dict: bool = True )

Parameters

Decode a batch of images using a tiled decoder.

tiled_encode

< source >

( x: FloatTensor return_dict: bool = True )

Parameters

Encode a batch of images using a tiled encoder.

Transformer2DModel

class diffusers.Transformer2DModel

< source >

( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: typing.Optional[int] = None out_channels: typing.Optional[int] = None num_layers: int = 1 dropout: float = 0.0 norm_num_groups: int = 32 cross_attention_dim: typing.Optional[int] = None attention_bias: bool = False sample_size: typing.Optional[int] = None num_vector_embeds: typing.Optional[int] = None patch_size: typing.Optional[int] = None activation_fn: str = 'geglu' num_embeds_ada_norm: typing.Optional[int] = None use_linear_projection: bool = False only_cross_attention: bool = False upcast_attention: bool = False norm_type: str = 'layer_norm' norm_elementwise_affine: bool = True )

Parameters

Transformer model for image-like data. Takes either discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When input is continuous: First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image.

When input is discrete: First, input (classes of latent pixels) is converted to embeddings and has positional embeddings applied, see ImagePositionalEmbeddings. Then apply standard transformer action. Finally, predict classes of unnoised image.

Note that it is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image do not contain a prediction for the masked pixel as the unnoised image cannot be masked.
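For the continuous case, the reshaping between an image-like tensor and a token sequence can be sketched in plain PyTorch. The input projection and the transformer blocks themselves are left out, and all sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
batch, channels, height, width = 2, 8, 4, 4
hidden_states = torch.randn(batch, channels, height, width)

# Image -> token sequence: (b, c, h, w) -> (b, h*w, c), i.e. "b, t, d".
tokens = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * width, channels)

# ... the transformer blocks would operate on `tokens` here ...

# Token sequence -> image: (b, h*w, c) -> (b, c, h, w).
restored = (
    tokens.reshape(batch, height, width, channels).permute(0, 3, 1, 2).contiguous()
)
```

Each spatial position becomes one token of dimension d (the channel count), and the inverse reshape recovers the original layout exactly.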

forward

< source >

( hidden_states: Tensor encoder_hidden_states: typing.Optional[torch.Tensor] = None timestep: typing.Optional[torch.LongTensor] = None class_labels: typing.Optional[torch.LongTensor] = None cross_attention_kwargs: typing.Dict[str, typing.Any] = None attention_mask: typing.Optional[torch.Tensor] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True ) → Transformer2DModelOutput or tuple

Parameters

Returns

Transformer2DModelOutput or tuple

Transformer2DModelOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

Transformer2DModelOutput

class diffusers.models.transformer_2d.Transformer2DModelOutput

< source >

( sample: FloatTensor )

Parameters

TransformerTemporalModel

class diffusers.models.transformer_temporal.TransformerTemporalModel

< source >

( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: typing.Optional[int] = None out_channels: typing.Optional[int] = None num_layers: int = 1 dropout: float = 0.0 norm_num_groups: int = 32 cross_attention_dim: typing.Optional[int] = None attention_bias: bool = False sample_size: typing.Optional[int] = None activation_fn: str = 'geglu' norm_elementwise_affine: bool = True double_self_attention: bool = True )

Parameters

Transformer model for video-like data.

forward

< source >

( hidden_states encoder_hidden_states = None timestep = None class_labels = None num_frames = 1 cross_attention_kwargs = None return_dict: bool = True ) → TransformerTemporalModelOutput or tuple

Parameters

Returns

TransformerTemporalModelOutput or tuple

TransformerTemporalModelOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

TransformerTemporalModelOutput

class diffusers.models.transformer_temporal.TransformerTemporalModelOutput

< source >

( sample: FloatTensor )

Parameters

PriorTransformer

class diffusers.PriorTransformer

< source >

( num_attention_heads: int = 32 attention_head_dim: int = 64 num_layers: int = 20 embedding_dim: int = 768 num_embeddings = 77 additional_embeddings = 4 dropout: float = 0.0 )

Parameters

The prior transformer from unCLIP is used to predict CLIP image embeddings from CLIP text embeddings. Note that the transformer predicts the image embeddings through a denoising diffusion process.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

For more details, see the original paper: https://arxiv.org/abs/2204.06125

forward

< source >

( hidden_states timestep: typing.Union[torch.Tensor, float, int] proj_embedding: FloatTensor encoder_hidden_states: FloatTensor attention_mask: typing.Optional[torch.BoolTensor] = None return_dict: bool = True ) → PriorTransformerOutput or tuple

Parameters

Returns

PriorTransformerOutput or tuple

PriorTransformerOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the predicted image embedding tensor.

set_attn_processor

< source >

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor

( )

Disables custom attention processors and sets the default attention implementation.

PriorTransformerOutput

class diffusers.models.prior_transformer.PriorTransformerOutput

< source >

( predicted_image_embedding: FloatTensor )

Parameters

ControlNetOutput

class diffusers.models.controlnet.ControlNetOutput

< source >

( down_block_res_samples: typing.Tuple[torch.Tensor] mid_block_res_sample: Tensor )

ControlNetModel

class diffusers.ControlNetModel

< source >

( in_channels: int = 4 conditioning_channels: int = 3 flip_sin_to_cos: bool = True freq_shift: int = 0 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: int = 1280 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None use_linear_projection: bool = False class_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None upcast_attention: bool = False resnet_time_scale_shift: str = 'default' projection_class_embeddings_input_dim: typing.Optional[int] = None controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256) global_pool_conditions: bool = False )

from_unet

< source >

( unet: UNet2DConditionModel controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256) load_weights_from_unet: bool = True )

Parameters

Instantiate a ControlNetModel from a UNet2DConditionModel.

set_attention_slice

< source >

( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.

set_attn_processor

< source >

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor

( )

Disables custom attention processors and sets the default attention implementation.

FlaxModelMixin

Base class for all flax models.

FlaxModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

from_pretrained

< source >

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] dtype: dtype = <class 'jax.numpy.float32'> *model_args **kwargs )

Parameters

Instantiate a pretrained flax model from a pre-trained model configuration.

The warning "Weights from XXX not initialized from pretrained model" means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning "Weights from XXX not used in YYY" means that the layer XXX is not used by YYY, so those weights are discarded.
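One way to picture the two warnings is as a set comparison between the parameter names the model expects and the names found in the checkpoint. The sketch below is illustrative only: `compare_keys` and the key names are hypothetical and do not reproduce the library's loading code.

```python
def compare_keys(model_keys, checkpoint_keys):
    # Hypothetical helper, not part of diffusers.
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    # Expected by the model but absent from the checkpoint:
    # these would be reported as "not initialized from pretrained model".
    missing = sorted(model_keys - checkpoint_keys)
    # Present in the checkpoint but unknown to the model:
    # these would be reported as "not used" and discarded.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


model_keys = {("conv_in", "kernel"), ("conv_in", "bias"), ("new_head", "kernel")}
checkpoint_keys = {("conv_in", "kernel"), ("conv_in", "bias"), ("old_head", "kernel")}

missing, unexpected = compare_keys(model_keys, checkpoint_keys)
```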

Examples:

from diffusers import FlaxUNet2DConditionModel

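# Download the model and configuration from the Hugging Face Hub
model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

# Load from a local directory, for example one written by save_pretrained
model, params = FlaxUNet2DConditionModel.from_pretrained("./test/saved_model/")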

save_pretrained

< source >

( save_directory: typing.Union[str, os.PathLike] params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] is_main_process: bool = True )

Parameters

Save a model and its configuration file to a directory, so that it can be re-loaded using the [from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.FlaxModelMixin.from_pretrained) class method.

to_bf16

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.bfloat16. This returns a new params tree and does not cast the params in place.

This method can be used on TPU to explicitly convert the model parameters to bfloat16 precision to do full half-precision training or to save weights in bfloat16 for inference in order to save memory and improve speed.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

params = model.to_bf16(params)

from flax import traverse_util

# To leave some parameters uncast (for example, layer norm bias and scale), pass a mask
model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

flat_params = traverse_util.flatten_dict(params)

mask = {path: (path[-2:] != ("LayerNorm", "bias") and path[-2:] != ("LayerNorm", "scale")) for path in flat_params}

mask = traverse_util.unflatten_dict(mask)

params = model.to_bf16(params, mask)

to_fp16

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.float16. This returns a new params tree and does not cast the params in place.

This method can be used on GPU to explicitly convert the model parameters to float16 precision to do full half-precision training or to save weights in float16 for inference in order to save memory and improve speed.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

params = model.to_fp16(params)

from flax import traverse_util

# To leave some parameters uncast (for example, layer norm bias and scale), pass a mask
model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

flat_params = traverse_util.flatten_dict(params)

mask = {path: (path[-2:] != ("LayerNorm", "bias") and path[-2:] != ("LayerNorm", "scale")) for path in flat_params}

mask = traverse_util.unflatten_dict(mask)

params = model.to_fp16(params, mask)
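To make the role of the mask concrete, here is a minimal sketch of masked casting over a nested parameter dictionary, written in plain NumPy rather than JAX. The helpers `flatten`, `unflatten`, and `cast_floating` are hypothetical stand-ins and not the library's implementation. The sketch assumes the mask mirrors the structure of the params tree, with `True` marking leaves to cast.

```python
import numpy as np


def flatten(tree, prefix=()):
    # Turn a nested dict into {path_tuple: leaf}.
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    # Inverse of `flatten`.
    tree = {}
    for path, value in flat.items():
        node = tree
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return tree


def cast_floating(params, dtype, mask=None):
    # Build a new tree; the input tree and its arrays are left untouched.
    flat = flatten(params)
    flat_mask = flatten(mask) if mask is not None else {p: True for p in flat}
    out = {}
    for path, leaf in flat.items():
        if flat_mask[path] and np.issubdtype(leaf.dtype, np.floating):
            out[path] = leaf.astype(dtype)
        else:
            out[path] = leaf
    return unflatten(out)


params = {
    "dense": {"kernel": np.ones((2, 2), np.float32), "bias": np.zeros(2, np.float32)},
    "LayerNorm": {"scale": np.ones(2, np.float32), "bias": np.zeros(2, np.float32)},
}

# Same rule as the example above: skip LayerNorm bias and scale.
mask = unflatten(
    {
        path: (path[-2:] != ("LayerNorm", "bias") and path[-2:] != ("LayerNorm", "scale"))
        for path in flatten(params)
    }
)

half = cast_floating(params, np.float16, mask)
```

After the call, the dense layer's arrays are float16, the LayerNorm arrays remain float32, and the original `params` tree is unchanged.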

to_fp32

< source >

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict] mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.float32. This method can be used to explicitly convert the model parameters to fp32 precision. This returns a new params tree and does not cast the params in place.

Examples:

from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5")

# Cast to fp16 first, to illustrate the method, then back to fp32
params = model.to_fp16(params)

params = model.to_fp32(params)
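Note that casting back to float32 restores the dtype but not the precision lost in float16. The following NumPy sketch, with made-up values, illustrates this general property of floating-point casts; it does not use the library.

```python
import numpy as np

x32 = np.array([0.1, 1.0001, 1000.3], dtype=np.float32)

# Down-cast: half the bytes, fewer significand bits.
x16 = x32.astype(np.float16)

# Up-cast: the dtype is float32 again, but the rounded values stay rounded.
back = x16.astype(np.float32)

# Largest relative error introduced by the round trip.
max_rel_err = float(np.max(np.abs(back - x32) / np.abs(x32)))
```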

FlaxUNet2DConditionOutput

class diffusers.models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

< source >

( sample: ndarray )

Parameters

Returns a new object replacing the specified fields with new values.
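This description belongs to the `replace` method that Flax struct dataclasses provide, which behaves like the standard library's `dataclasses.replace`. A minimal sketch of that behavior, using a plain frozen dataclass named `Output` as a hypothetical stand-in for an output class:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Output:
    # Hypothetical stand-in; the real output classes hold arrays.
    sample: tuple


original = Output(sample=(1.0, 2.0))

# Build a new instance with `sample` swapped out; `original` is not modified.
updated = dataclasses.replace(original, sample=(0.0, 0.0))
```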

FlaxUNet2DConditionModel

class diffusers.FlaxUNet2DConditionModel

< source >

( sample_size: int = 32 in_channels: int = 4 out_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False dtype: dtype = <class 'jax.numpy.float32'> flip_sin_to_cos: bool = True freq_shift: int = 0 use_memory_efficient_attention: bool = False parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

FlaxUNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep and returns sample shaped output.

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.

FlaxDecoderOutput

class diffusers.models.vae_flax.FlaxDecoderOutput

< source >

( sample: ndarray )

Parameters

Output of decoding method.

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKLOutput

class diffusers.models.vae_flax.FlaxAutoencoderKLOutput

< source >

( latent_dist: FlaxDiagonalGaussianDistribution )

Parameters

Output of AutoencoderKL encoding method.

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKL

class diffusers.FlaxAutoencoderKL

< source >

( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 dtype: dtype = <class 'jax.numpy.float32'> parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

Flax Implementation of Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.
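As a rough illustration of what "KL loss" refers to here, the sketch below samples from a diagonal Gaussian posterior with the reparameterization trick and computes its KL divergence to a standard normal. It is written in NumPy with hypothetical function names and shapes; it is not the library's `FlaxDiagonalGaussianDistribution`.

```python
import numpy as np


def sample_latent(mean, logvar, rng):
    # Reparameterization trick: z = mean + std * eps, with eps ~ N(0, I).
    std = np.exp(0.5 * logvar)
    return mean + std * rng.standard_normal(mean.shape)


def kl_to_standard_normal(mean, logvar):
    # KL(N(mean, var) || N(0, I)), summed over all non-batch axes.
    var = np.exp(logvar)
    non_batch_axes = tuple(range(1, mean.ndim))
    return 0.5 * np.sum(mean**2 + var - 1.0 - logvar, axis=non_batch_axes)


rng = np.random.default_rng(0)
mean = np.zeros((2, 4, 8, 8))    # (batch, latent_channels, height, width)
logvar = np.zeros((2, 4, 8, 8))

z = sample_latent(mean, logvar, rng)

# A posterior equal to the prior has zero divergence.
kl = kl_to_standard_normal(mean, logvar)

# Shifting every mean by 1 adds 0.5 per element: 0.5 * (4 * 8 * 8) = 128.
kl_shifted = kl_to_standard_normal(mean + 1.0, logvar)
```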

This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.

FlaxControlNetOutput

class diffusers.models.controlnet_flax.FlaxControlNetOutput

< source >

( down_block_res_samples: ndarray mid_block_res_sample: ndarray )

Returns a new object replacing the specified fields with new values.

FlaxControlNetModel

class diffusers.FlaxControlNetModel

< source >

( sample_size: int = 32 in_channels: int = 4 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: int = 2 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False dtype: dtype = <class 'jax.numpy.float32'> flip_sin_to_cos: bool = True freq_shift: int = 0 controlnet_conditioning_channel_order: str = 'rgb' conditioning_embedding_out_channels: typing.Tuple[int] = (16, 32, 96, 256) parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310> name: str = None )

Parameters

Quoting from https://arxiv.org/abs/2302.05543: “Stable Diffusion uses a pre-processing method similar to VQ-GAN [11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions … into feature maps …”

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.